Why not to use In-Memory Collections in LINQ to Entities

Published: 25th May 2015

Recently I have been looking into performance issues with one of our products. Data access uses Entity Framework talking to a SQL Server database, with many of the queries built using LINQ. For the most part this works well, but it is very important to stop and look at the SQL that these LINQ queries generate. In the example I am about to describe, the generated SQL was quite excessive.

We have an “Integration” layer that transfers employee data between two different systems. To achieve this it needs to marry up an employee in one system with the corresponding employee in the other. Because the systems have their own databases with their own data contexts, we pull data from both and merge it in memory. On a number of occasions we simply limit the data from one system to a list of valid ids from the other.

The following piece of code has been sanitised but is structurally the same. We gather a collection of valid ids from the target system, then query the data from the source system and filter it by those ids. In the example below the filter is applied directly in the LINQ to Entities statement.

var inMemoryCollection = inMemoryCollectionSource.Select(x => x.Id).ToList();

var emps = (from employee in context.Employees
            join subRecord in context.EmployeeSubRecord on employee.EmployeeId equals subRecord.EmployeeID into joinedRecords
            from joined in joinedRecords.DefaultIfEmpty()
            where employee.SomeBoolean && inMemoryCollection.Contains(employee.EmployeeId)
            select new
                {
                    EmployeeId = employee.EmployeeId,
                    DepartmentId = joined == null || !joined.PayrollDepartmentID.HasValue ? 0 : joined.PayrollDepartmentID.Value,
                    EmployeeReference = employee.EmployeeReference,
                    Forename = employee.Forename,
                    Surname = employee.Surname
                }).ToList();

For small sets of data this is perfectly performant, but it does not scale. If the collection of ids used to filter the data is large enough, the SQL that Entity Framework constructs is formed differently, resulting in a unioned set of data per item in the collection. With 10,000 ids in the collection, the constructed SQL took 45 seconds to execute and was so large that SQL Profiler would not display it.
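Seeing exactly what Entity Framework sends to the database is what makes this kind of problem obvious. If you are on Entity Framework 6 or later you don’t even need SQL Profiler for this: the DbContext exposes a Database.Log hook that receives every command it executes. A minimal sketch, assuming the same context as in the snippets above:

// Write every command EF executes to the debug output (Entity Framework 6+).
context.Database.Log = sql => System.Diagnostics.Debug.WriteLine(sql);

// Re-running the query above now dumps the generated SQL, including the
// enormous statement produced by the in-memory Contains filter.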

So it was time to change tactics on how we filtered this data. The following snippet is the change that was made:

var inMemoryCollection = inMemoryCollectionSource.Select(x => x.Id).ToList();

var emps = (from employee in context.Employees
            join subRecord in context.EmployeeSubRecord on employee.EmployeeId equals subRecord.EmployeeID into joinedRecords
            from joined in joinedRecords.DefaultIfEmpty()
            where employee.SomeBoolean
            select new
                {
                    EmployeeId = employee.EmployeeId,
                    DepartmentId = joined == null || !joined.PayrollDepartmentID.HasValue ? 0 : joined.PayrollDepartmentID.Value,
                    EmployeeReference = employee.EmployeeReference,
                    Forename = employee.Forename,
                    Surname = employee.Surname
                }).ToList();

emps = emps.Where(x => inMemoryCollection.Contains(x.EmployeeId)).ToList();

I removed the collection of ids from the LINQ to Entities statement. This meant we were potentially pulling more data back from the SQL database, but the SQL executed behind the scenes was far simpler and, as a result, far quicker. I then filtered the data from the source system in memory using the same collection of ids from the target system. The result was exactly the same set of data we wanted originally, but the difference in performance was significant: the web page went from responding in 45 seconds to responding in a fraction of a second.
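One small refinement worth mentioning (it was not part of the original change): the in-memory Where above scans the whole id list for every employee. With 10,000 ids that is still cheap compared to the 45-second query, but a HashSet turns each lookup into a constant-time check. A sketch, assuming EmployeeId is an int:

// Build the set once so each Contains check is O(1) rather than a scan of the list
// (HashSet<T> lives in System.Collections.Generic).
var idSet = new HashSet<int>(inMemoryCollection);

emps = emps.Where(x => idSet.Contains(x.EmployeeId)).ToList();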

So in summary, don’t filter LINQ to Entities queries with in-memory collections unless the collection is guaranteed to be small, and, more importantly, cache the results of large queries.
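As a rough illustration of that last point, the results of a large query can be held in something like System.Runtime.Caching.MemoryCache so repeated requests don’t hit the database at all. This is only a sketch, and it assumes the projection has been moved from the anonymous type above into a named EmployeeDto class returned by a hypothetical QueryEmployeesFromSourceSystem method:

// MemoryCache comes from the System.Runtime.Caching assembly.
var cache = MemoryCache.Default;

var emps = cache.Get("SourceSystemEmployees") as List<EmployeeDto>;
if (emps == null)
{
    // Hypothetical method wrapping the LINQ to Entities query shown earlier.
    emps = QueryEmployeesFromSourceSystem();

    // Keep the results for a short while so subsequent requests skip the database.
    cache.Add("SourceSystemEmployees", emps, DateTimeOffset.UtcNow.AddMinutes(5));
}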

Development