IAsyncEnumerable performance benefits
Asynchronous streams (i.e., the IAsyncEnumerable interface) were one of the new features in C# 8, but they didn't get as much attention as some of the others. Even today, the feature is rarely used and not well known among C# developers, although it can make code both faster and easier to understand in some cases.
Nevertheless, classes in the .NET base class library regularly expose the IAsyncEnumerable interface where it makes sense. The JsonSerializer class, for example, can return an IAsyncEnumerable when deserializing a JSON array. This can be particularly useful when you need to deserialize multiple arrays.
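As a quick illustration (a minimal, self-contained sketch; the file name and the TimeValue record are assumptions that match the examples below), the elements of a JSON array can be consumed one by one as they are deserialized:
using System;
using System.IO;
using System.Text.Json;

await using var stream = File.OpenRead("sample.json");
// DeserializeAsyncEnumerable yields the array elements as they are read from the stream.
await foreach (var item in JsonSerializer.DeserializeAsyncEnumerable<TimeValue>(stream))
{
    Console.WriteLine($"{item?.DateTime}: {item?.Value}");
}

public record TimeValue(DateTime DateTime, int Value);
Because DeserializeAsyncEnumerable returns an IAsyncEnumerable<TimeValue?>, the whole array is never materialized as a single list, which is exactly what makes it interesting for large inputs.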
Without asynchronous streams, if you didn't want the consuming code to know how the data is loaded, you would load it from all the files into a list and return that:
public static async Task<List<TimeValue>> LoadAsListAsync(string folder)
{
    var files = Directory.EnumerateFiles(folder, "*.json");
    var allItems = new List<TimeValue>();
    foreach (var file in files)
    {
        using var stream = File.OpenRead(file);
        // jsonSerializerOptions is a JsonSerializerOptions instance defined on the containing class (not shown here).
        var jsonItems = await JsonSerializer.DeserializeAsync<List<TimeValue>>(
            stream,
            jsonSerializerOptions
        );
        if (jsonItems != null)
        {
            allItems.AddRange(jsonItems);
        }
    }
    return allItems;
}
This approach isn't very efficient for large data sets, especially if you're only interested in aggregated data and don't need to keep the individual items. The following method is an example of such a consumer, only calculating the daily sums:
public static async Task<Dictionary<DateOnly, int>> AggregateFormListAsync(
    string folder
)
{
    var aggregates = new Dictionary<DateOnly, int>();
    var items = await JsonLoader.LoadAsListAsync(folder);
    foreach (var item in items)
    {
        var date = DateOnly.FromDateTime(item.DateTime);
        if (!aggregates.TryGetValue(date, out var aggregate))
        {
            aggregate = 0;
        }
        aggregates[date] = aggregate + item.Value;
    }
    return aggregates;
}
Even without asynchronous streams, you could change the loading code to load only one file at a time, so that all the data never has to be held in memory at once:
public static IEnumerable<Task<List<TimeValue>?>> LoadAsTaskList(string folder)
{
    var files = Directory.EnumerateFiles(folder, "*.json");
    foreach (var file in files)
    {
        // The stream is disposed when the iterator advances to the next file,
        // so the returned task must be awaited before the next one is requested.
        using var stream = File.OpenRead(file);
        yield return JsonSerializer
            .DeserializeAsync<List<TimeValue>>(stream, jsonSerializerOptions)
            .AsTask();
    }
}
However, this requires the consuming code to be aware that the data is loaded in multiple parts. Because of that, two loops are required to process all the data:
public static async Task<Dictionary<DateOnly, int>> AggregateFromTaskListAsync(
    string folder
)
{
    var aggregates = new Dictionary<DateOnly, int>();
    var tasks = JsonLoader.LoadAsTaskList(folder);
    foreach (var task in tasks)
    {
        var items = await task;
        if (items != null)
        {
            foreach (var item in items)
            {
                var date = DateOnly.FromDateTime(item.DateTime);
                if (!aggregates.TryGetValue(date, out var aggregate))
                {
                    aggregate = 0;
                }
                aggregates[date] = aggregate + item.Value;
            }
        }
    }
    return aggregates;
}
With asynchronous streams, this implementation detail can be hidden from the consuming code. Items are returned one by one, and the loading code simply switches to the next file when it runs out of items in the current file:
public static async IAsyncEnumerable<TimeValue> LoadAsAsyncEnumerable(string folder)
{
    var files = Directory.EnumerateFiles(folder, "*.json");
    foreach (var file in files)
    {
        using var stream = File.OpenRead(file);
        var jsonItems = JsonSerializer.DeserializeAsyncEnumerable<TimeValue>(
            stream,
            jsonSerializerOptions
        );
        await foreach (var jsonItem in jsonItems)
        {
            if (jsonItem != null)
            {
                yield return jsonItem;
            }
        }
    }
}
The consuming code is practically identical to the first one, which aggregated the data that had been fully loaded into memory in advance. It just uses await foreach instead of foreach:
public static async Task<Dictionary<DateOnly, int>> AggregateFromAsyncEnumerableAsync(
    string folder
)
{
    var aggregates = new Dictionary<DateOnly, int>();
    var items = JsonLoader.LoadAsAsyncEnumerable(folder);
    await foreach (var item in items)
    {
        var date = DateOnly.FromDateTime(item.DateTime);
        if (!aggregates.TryGetValue(date, out var aggregate))
        {
            aggregate = 0;
        }
        aggregates[date] = aggregate + item.Value;
    }
    return aggregates;
}
But there is a significant difference in performance between the two approaches: the first one loads all the data in advance, while the last one loads the data as it is being processed. I used BenchmarkDotNet to do some measurements. The actual numbers will depend on the data, but for my use case the asynchronous stream version was 20% faster and allocated over 25% less memory:
| Method                            | Mean    | Error    | StdDev   | Gen0       | Gen1       | Gen2      | Allocated |
|-----------------------------------|---------|----------|----------|------------|------------|-----------|-----------|
| AggregateFormListAsync            | 1.437 s | 0.0177 s | 0.0165 s | 22000.0000 | 14000.0000 | 6000.0000 | 355.81 MB |
| AggregateFromAsyncEnumerableAsync | 1.163 s | 0.0050 s | 0.0046 s | 16000.0000 | 1000.0000  | -         | 256.51 MB |
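For context, a BenchmarkDotNet harness along these lines can produce a table like the one above (a sketch only: the class names and data folder are placeholders, and the actual benchmarks live in the repository linked below). The MemoryDiagnoser attribute is what adds the Gen0/Gen1/Gen2 and Allocated columns:
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;

[MemoryDiagnoser]
public class AggregationBenchmarks
{
    // Placeholder path to the folder with the sample JSON files.
    private const string Folder = "data";

    // BenchmarkDotNet awaits benchmark methods that return Task<T>.
    [Benchmark]
    public Task<Dictionary<DateOnly, int>> AggregateFormListAsync() =>
        Aggregator.AggregateFormListAsync(Folder);

    [Benchmark]
    public Task<Dictionary<DateOnly, int>> AggregateFromAsyncEnumerableAsync() =>
        Aggregator.AggregateFromAsyncEnumerableAsync(Folder);
}
The benchmarks are then run with BenchmarkRunner.Run<AggregationBenchmarks>() from the entry point; Aggregator stands in for whichever class hosts the consumer methods shown earlier.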
I pushed the full source code and sample data to my GitHub repository so that you can try it out yourself. The code contains a few more benchmarks in addition to these two:
- One for the second approach from this post, which requires two loops in the consuming code. Not surprisingly, the performance falls in between the other two approaches.
- One for an alternative consumer implementation of AggregateFormListAsync using LINQ instead of the explicit loop. This makes the performance even slightly worse.
- One for an alternative consumer implementation of AggregateFromAsyncEnumerableAsync using the System.Linq.Async NuGet package (a sketch of what such code might look like is shown after this list). Surprisingly, the performance of this one is the worst of them all. However, it's perfectly possible that my code is at fault rather than the library, as I have almost no experience using it.
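For illustration, the System.Linq.Async-based consumer could look roughly like this (a sketch under my own naming, not the exact code from the repository; AggregateAsync is an extension method from the System.Linq.Async package, exposed in the System.Linq namespace):
public static async Task<Dictionary<DateOnly, int>> AggregateWithLinqAsyncSketch(
    string folder
)
{
    // Folds the asynchronous stream into a dictionary of daily sums without an explicit loop.
    return await JsonLoader.LoadAsAsyncEnumerable(folder)
        .AggregateAsync(
            new Dictionary<DateOnly, int>(),
            (aggregates, item) =>
            {
                var date = DateOnly.FromDateTime(item.DateTime);
                aggregates[date] = aggregates.TryGetValue(date, out var aggregate)
                    ? aggregate + item.Value
                    : item.Value;
                return aggregates;
            }
        );
}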
Asynchronous streams seem to be an often overlooked feature of C# and .NET. While they might not always be useful, they can have a noteworthy positive impact on the code and its performance when they are a good fit for the task at hand.