## Reducing allocations in parsing GEDCOM files using Span

tl;dr Span<T> allows for string manipulations without having to create temporary strings. On a custom GEDCOM parser this results in far fewer allocations, but string operations on ReadOnlySpan<char> may be slower for very small strings.

## What is GEDCOM?

GEDCOM is a standard interchange file for genealogy information, and most genealogy applications and website (such as ancestry, genesreunited) support import and export to this format.

A snippet of a GEDCOM file is:

0 @I303@ INDI
1 NAME James  /Shearsby/
1 SEX M
1 BIRT
2 DATE ABT 1846
0 @I380@ INDI
1 DEAT
2 PLAC Warwickshire

The file is line based, and the first item in the line is the GEDCOM level. Every line with a level number greater than the current level item relates to that item. For example, in the snippet above the DATE is obviously associated to the parent BIRT (birth) record.

I wrote a custom GEDCOM parser nearly 15 years ago, for family tree research, and used it in a family tree application http://www.taumuon.co.uk/2013/02/viewgene.html. The parser is not complete – it won’t recognise all elements present in a GEDCOM file, and it could do with a redesign. If I was to start it from scratch today I’d probably investigate using parser combinators, but it gets the job done, but it’s still an interesting exercise to see the effect of changing to Span<T>

## What is Span<T>

Span<T> allows for a view or slice to be taken into contiguous memory, holding a pointer to the beginning of the slice and a slice length. The Span<T> is limited so that it can only live on the stack. This allows for manipulating sub-buffers or sub-strings in a memory safe way (performance improvements are possible that may have previously needed unsafe code).

Being able to take a slice of memory is useful in a parser where, for instance, a substring needs to be extracted to be passed on Int.Parse(). There are no overloads to Int.Parse that take an index and a length – so a substring needs to be obtained requiring a new allocation (There are third party libraries that do allow for generating and manipulating StringViews: https://github.com/egmkang/StringView but one downside to StringViews is that as all the views hold a reference to the original string, it’s possible to inadvertently keep the large original string alive longer than intended).

There are new string manipulation methods available in .NET core that take a ReadOnlySpan<char>, allowing a slice into a string to be manipulated (no need to allocate an intermediate string). In a GEDCOM line, the level is identified by the number before the first space, and the remainder of the string is passed on to by processed by the type-specific sub-parsers.

## Parser

The code for the parser and benchmarks can be found here: www.github.com/taumuon/GedcomParser

The parser is written in a style preferring clarity over performance, it extracts out substrings where required, then compares those substring to known tag identifiers or parses numbers using Int.Parse. To improve performance, I could have written a custom Int.Parse method, along with string equality methods, but a nice side-effect of keeping the code more idiomatic is that it does provide a simple one-to-one comparison of pre-span and span code.

The changes for Span<T> were pretty trivial, instead of extracting data from the file line-by-line (using StreamReader.ReadLine()), a char array buffer is used, and StreamReader.Read() is used to populated that buffer. New-line delimited slices of ReadOnlySpan<char> s are extracted from the buffer, meaning that no allocations are necessary for the data read from the stream.

## Performance

I generated a larger 50MB GEDCOM file, by taking royal92.GED and duplicating its contents 100 times (this file is definitely not a large file by data processing standards, but it’s on the medium to large size for a GEDCOM file, and benchmarkdotnet takes a while to perform all its runs with any larger file). I generated a larger 500MB file to test performance manually.

The span and non-span versions of the code have the same API, and live in different assemblies. The classes all have the same names but live in different namespaces. This is simply so that the benchmark project was able to easily reference both versions of the parser.

I setup a benchmark to run the span and non-span version of the parser with both a small text string, and the larger GEDCOM file.

Running on .NET Core 2.2., the results from benchmarkdotnet are:

 Method IsFile Mean Ratio Gen0/1k op Alloc Mem/op FileParserSpan False 28.60 us 0.84 7.0190 14.41 KB FileParser False 31.55 us 1.00 11.6272 23.83 KB FileParserSpan True 1,727ms 1.04 124000 255 MB FileParser True 1,665ms 1.00 369000 757 MB

The IsFile flag indicates whether parsing is done of a small 70 line string constant (IsFile=False), or a larger file on disc.

For the scenario of parsing a small 70 line GEDCOM, there is a slight but insignificant performance improvement for Span<T>. For larger files (~50MB), Span<T> actually appears to be slower. I’m not sure what the reason for the slow down is, I’d expect that reducing allocations would improve performance, unless that was counteracted by string operations on Span<T> being slower than on String.

The benchmark shows that the garbage collector wasn’t overly stressed, with no gen1 or gen2 collections. To show that using Span would be beneficial if the garbage collector had more work to do, I created a huge (500MB) GEDCOM file, which was too large to benchmark, but I created a simple console application to parse the file and found a massive 2x speed up using the Span version of the parser (98 seconds to parse vs 226 seconds). This is an amazing change just for making a few changes in the code.

Of course, most real-world files aren’t anywhere near this size, so it’s questionable whether it would be worth making the change. In a real-world application, parsing may not be the only operation happening, the parsing logic may occur in the context of a larger application which is obviously creating many other allocations, and even though it can’t be measured by a standalone benchmark, anything to reduce unnecessary work for the garbage collector may be beneficial for performance in the application. In the real world it would be necessary to measure the performance impact of making the change in a realistic measurable scenario.

Even though the garbage collector doesn’t have much work to do when parsing a small GEDCOM snippet, there will still be many fewer gen 0 collections, so I was surprised that the performance improvement wasn’t greater, and was surprised that parsing a medium size file actually seemed slow (this is especially confusing as parsing a large file in the console application is so many times quicker, there may be something strange happening with the bencharkdotnet benchmark).

I suspected that the operations on MemoryExtensions, Equals() and IndexOf may be slower than the equivalent string operations. Running the parser under a performance profiler did indicate that these are hotspots in the code. To test this out in an isolated case, I created a separate benchmark to investigate this (the BenchmarkIndexOf class in the benchmark project).

The benchmark logic compares using IndexOf directly on a string, compared to obtaining a ReadOnlySpan<char> from the string, and using the IndexOf method on MemoryExtensions. The methods that use IndexOf on String still obtain the ReadOnlySpan<char> to ensure that the only difference in the methods is how the IndexOf is performed. This is done for a small, medium and large(-ish) string:

 Method Mean IndexOfStrSmall 8.966 ns IndexOfSpanSmall 10.724 ns IndexOfStrMed 23.150 ns IndexOfSpanMed 23.224 ns IndexOfStrLarge 94.046 ns IndexOfSpanLarge 86.213 ns

For small strings, it does appear that the MemoryExtensions.IndexOf() method does have a greater overhead than String.IndexOf(), but for larger strings the performance of span is better than for string, so it appears that the operations for ReadOnlySpan<char> are optimised for larger strings compared with the operations directly on String.

Still, Span<T> has shown to have potential for fantastic performance improvements.

## TPL Dataflow

This blog post is coming a couple of weeks late due to my copy of Windows becoming corrupted – and strangely the only things I didn’t have backed up were this Oscilloscope app!

Anyway, last time I took a look at charting performance, but this post will investigate TPL Dataflow.

In my original F# post, I discussed that I felt that the code would have been better implemented in RX, which I then went and did in this post. However, RX isn’t the only recent new technology coming out of Microsoft that deals with streams of data. TPL Dataflow has a confusingly similar overlap with the usages of RX, and this is what its whitepaper has to say:

“Astute readers may notice some similarities between TPL Dataflow and Reactive Extensions (Rx), currently available as a download from the MSDN Data developer center. Rx is predominantly focused on coordination and composition of event streams with a LINQ-based API, providing a rich set of combinators for manipulating IObservable<T>s of data. In contrast, TPL Dataflow is focused on providing building blocks for message passing and parallelizing CPU- and I/O-intensive applications with high-throughput and low-latency, while also providing developers explicit control over how data is buffered and moves about the system. As such, Rx and TPL Dataflow, while potentially viewed as similar at a 30,000 foot level, address distinct needs. Even so, TPL Dataflow and Rx provide a better together story.”

That does sound very interesting – who doesn’t want better performance!

### Implementation

I’ll dive in straight away and look at some code. The code is structured similarly to the RX code in my previous post.

The microphone still provides an IObservable<float[]> via the GetStream() method. The TaskSchedulers are specified explicitly throughout, so that the code can be unit tested (I’m using a MockSynchronizationContext).

The transformManyBlock behaves in the same way to the SelectMany operator, it separates the float array into its constituent parts.

The broadcast block is analogous to a connectable observable – many blocks can be linked to it to receive the same messages.

The transform many block is then explicitly linked to the broadcast block, meaning that its outputs make it into the broadcast block’s input.

#### Oscilloscope/Buffered View

The code to deal with the buffered display of data is:

The batch block behaves the same as LINQ’s Buffer operator (a bit confusingly when TPL Dataflow’s Buffer means something different). This simply takes the incoming float values and packages them into an array.

The ActionBlock is the final point on the chain – it takes the values it receives and outputs them to the screen (via oscilloscopeOutput) – this is the equivalent code to the code in the Observable Subscription in the RX version of my code.

#### Sliding Window View

Again, the sliding window view is structured similarly to the previous RX code:

A sample block is created so that every n samples are fed into the SlidingWindowBlock. This returns a new float[] array representing the new window of data on every input.

The second sample block ensures that not every window of data is drawn (so that the chart is refreshed at 40fps rather than 400fps!)

Finally, the observable is subscribed to, and the messages are posted into the transformManyBlock:

Whereas the RX guidelines recommend creating new operators based on existing operators, TPL Dataflow encourages building up custom blocks (though they can be built out of combining existing blocks).

The SlidingWindowBlock is a slightly modified version of the one provided in the white paper, and here (http://msdn.microsoft.com/en-us/library/hh228606(v=vs.110).aspx) The one in the example only posts a window once it has received a full window worth of data, whereas I want to start drawing the windows as soon as data arrives.

The SampleBlock is simple, it yields a value once it has received enough inputs, discarding the intermediate ones. This is structured more efficiently than my RX implementation – that one buffers the values, and then only take the last value out of the buffer.

#### Performance

I removed the updating of the chart, to look purely at what the performance of the app is in taking the input from the microphone, and shaping it to be in the correct format for output.

RX Performance:

TPL Dataflow:

The processor usage between the apps is in the same ballpark, but the TPL Dataflow app has many more garbage collections, and # bytes in all heaps also is running slightly higher. I had a quick look in the profiler, and it seems that there are many Task allocations from the Dataflow.Post() operation from within the SlidingWindow and Sample implementations.

The extra generation 1 collections don’t really matter for this application with the functionality that it currently has – the % time in GC barely registers, and there are no Gen 2 garbage collections over the minute or so that the app was running.

Once the app is doing more calculations and drawing more complex charts it would be interesting to see whether any latency spikes due gen 2 GCs cause similar slowdowns to the one discussed at the start of my previous post.

It would be an interesting exercise to limit the amount of GCs throughout the app, for instance there’s no need for the microphone access observable to return an IObservable<float[]> instead of IObservable<float>; currently every read from the microphone allocates a new float[]. Similarly, new List<Point> are created to more easily interface with DynamicDataDisplay – it would be better to change the types of data that D3 takes to be more observable-friendly, and to save having so many allocations. Again, there’s not much point doing this, other than an interesting theoretical exercise, until the garbage collection overhead proves to be a performance issue.

##### Conclusion

For an application as simple as this, there isn’t any benefit to using TPL Dataflow – it is a powerful library, with functionality such as blocks being able to decline offered blocks, and request them later, which would be difficult to implement in RX. As my app doesn’t (currently) need that level of functionality, there’s no benefit to using the library.

I may revisit this in the future – if I had some computationally expensive operation (FFT for instance) where I’d want greater control over the flow of data through the system.

## Charting Performance

I wasn’t going to blog about performance until a later post, but there was a performance issue I wanted to tackle before discussing some other things.

The frame rate sometimes seemed to be as expected, but sometimes slow, and the application was quite unresponsive when handling the Stop button command.

To help nail this down I created a Buffer Frame Rate performance counter, so that I monitor it along with another couple of metrics (I chose to see the count of gen1 and gen2 garbage collections, along with the % time in GC, and % processor time).

The framerate average isn’t bad, but the frame rate is not consistent; it regularly drops below 10 frames per second. This post will discuss optimising the performance of the application to keep the frame rate steady and ensure that the app is responsive.

Performance increases should generally focus on reducing unnecessary work in the whole application, and not micro-optimisations. In this case, it’s realising that the top chart is drawing at 40 frames per second (a buffer of 1000 data points into a 40KHz source), whereas the bottom chart is updating at 400 frames per second (sampling the data to 100 points, and updating the chart on each of those sample points).

However, as there are other types of charts I want to produce it’s worth investigating where the time is spent.

First off, I tried turning off anti-aliasing, and moving over to the d3future project (http://d3future.codeplex.com ), to see if that made any performance difference, but the performance was pretty similar, with the CPU pegged at 100%. A further run generated this interesting trace:

The frame count starts initially high, but after a while performance drops off dramatically, which is followed by some long spikes in the % Time in GC. The drop-off maybe seems to correspond with one of the gen 2 collections. This situation wasn’t so reproducible that I could catch it under a profiler.

My next step was to profile to see where the application was spending its time, just using the Visual Studio profiler. The application’s hot path was in the D3 LineChart’s UpdateCore method, where it was updating the chart’s Bounds using data from the EnumerableDataSource –it was iterating over the whole collection to find the x and y min and max values (to fit the chart to the data area).

It seems unnecessary to iterate over the whole data source to get the minimum and maximum x values – for a linear chart (i.e. not a scatter chart) it would be expected that the first point is at the minimum x value and the last point the maximum.

I created a chart derived from LineChart where I could instead pass in a List of points for the datasource – this means that I could get the last point without enumerating the whole collection, allowing the min and max x values to be quickly found. For the y values, I happen to want the chart to be a fixed axis and not scale to fit the data (instead scaled by a user-configurable gain factor), so there was no need at all to iterate over the collection.

The chart looked like this after those changes:

The frame rate is pretty consistent now. The gen 1 allocations are still occurring at a rate of 200 per minute, but the application is now responsive to the stop click, stopping immediately.

Profiling with memory allocations turned on showed that the types with most instances was System.Byte[] with 32% of the instances allocated in the profiling period.

ConvertByteArrayToFloatArray() previously looked like this:

And after changing this to not use LINQ:

Gives the following:

Performance is pretty similar to previous, but the gen 1 collections are occurring at a slightly lower rate. Now, profiling tells me that the types with most instances allocated is pretty much split between System.Action, System.Object, and System.Reactive.Disposables.SingleAssignmentDisposable.

Profiling at this point identifies the majority of the work now being in LineGraph.UpdateCore() where its transforming all points from their data coordinates into screen coordinates.

The transformation of coordinates for many points is embarrassingly parallel (think GPGPU). Potentially, the GPU could be put to use by simply drawing the points untransformed, adding a transformation onto the drawing context.

Further performance increases could be done for the sliding window chart: A sliding window chart redisplays the same points again and again, as they move left along the x axis. It essentially transforms the same point many times into the new viewport. Doing a transform to move an existing point to the left could be more efficient than transforming again from data to screen coordinates.

This blog so far has looked at some micro-optimisations focussing on identifying areas where work can be reduced. As I said in the beginning, if I wasn’t so interested in optimising for future requirements, I’d simply ensure that I was doing less drawing straight away.

Anyway, the sliding observable can be change to only take every 10th windowed set of values, by introducing another Sample() on the result of the WindowWithCount():

There isn’t actually that much difference in real-world performance.

The rate of gen 1 collections is lower still than previous, but the profiler shows that app is still spending the majority of its time in drawing the charts. Surprisingly the CPU usage doesn’t seem to have dropped much, although in the profiler the time can now be seen to be split between the drawing of the two charts, instead of dominated by the drawing of the sliding window chart.

The micro-optimisations were still worth investigating, as I have charts in mind that will draw multiple traces simultaneously.