Reducing allocations in parsing GEDCOM files using Span

tl;dr Span<T> allows for string manipulations without having to create temporary strings. On a custom GEDCOM parser this results in far fewer allocations, but string operations on ReadOnlySpan<char> may be slower for very small strings.

What is GEDCOM?

GEDCOM is a standard interchange file for genealogy information, and most genealogy applications and website (such as ancestry, genesreunited) support import and export to this format.

A snippet of a GEDCOM file is:

0 @I303@ INDI
1 NAME James  /Shearsby/
1 SEX M
1 BIRT 
2 DATE ABT 1846
0 @I380@ INDI
1 NAME Henry  /Adkins/
1 DEAT 
2 PLAC Warwickshire

The file is line based, and the first item in the line is the GEDCOM level. Every line with a level number greater than the current level item relates to that item. For example, in the snippet above the DATE is obviously associated to the parent BIRT (birth) record.

More information can be found at https://en.wikipedia.org/wiki/GEDCOM

I wrote a custom GEDCOM parser nearly 15 years ago, for family tree research, and used it in a family tree application http://www.taumuon.co.uk/2013/02/viewgene.html. The parser is not complete – it won’t recognise all elements present in a GEDCOM file, and it could do with a redesign. If I was to start it from scratch today I’d probably investigate using parser combinators, but it gets the job done, but it’s still an interesting exercise to see the effect of changing to Span<T>

What is Span<T>

Span<T> allows for a view or slice to be taken into contiguous memory, holding a pointer to the beginning of the slice and a slice length. The Span<T> is limited so that it can only live on the stack. This allows for manipulating sub-buffers or sub-strings in a memory safe way (performance improvements are possible that may have previously needed unsafe code).

More information available here:

Being able to take a slice of memory is useful in a parser where, for instance, a substring needs to be extracted to be passed on Int.Parse(). There are no overloads to Int.Parse that take an index and a length – so a substring needs to be obtained requiring a new allocation (There are third party libraries that do allow for generating and manipulating StringViews: https://github.com/egmkang/StringView but one downside to StringViews is that as all the views hold a reference to the original string, it’s possible to inadvertently keep the large original string alive longer than intended).

There are new string manipulation methods available in .NET core that take a ReadOnlySpan<char>, allowing a slice into a string to be manipulated (no need to allocate an intermediate string). In a GEDCOM line, the level is identified by the number before the first space, and the remainder of the string is passed on to by processed by the type-specific sub-parsers.

Parser

The code for the parser and benchmarks can be found here: www.github.com/taumuon/GedcomParser

The parser is written in a style preferring clarity over performance, it extracts out substrings where required, then compares those substring to known tag identifiers or parses numbers using Int.Parse. To improve performance, I could have written a custom Int.Parse method, along with string equality methods, but a nice side-effect of keeping the code more idiomatic is that it does provide a simple one-to-one comparison of pre-span and span code.

The changes for Span<T> were pretty trivial, instead of extracting data from the file line-by-line (using StreamReader.ReadLine()), a char array buffer is used, and StreamReader.Read() is used to populated that buffer. New-line delimited slices of ReadOnlySpan<char> s are extracted from the buffer, meaning that no allocations are necessary for the data read from the stream.

Performance

I generated a larger 50MB GEDCOM file, by taking royal92.GED and duplicating its contents 100 times (this file is definitely not a large file by data processing standards, but it’s on the medium to large size for a GEDCOM file, and benchmarkdotnet takes a while to perform all its runs with any larger file). I generated a larger 500MB file to test performance manually.

The span and non-span versions of the code have the same API, and live in different assemblies. The classes all have the same names but live in different namespaces. This is simply so that the benchmark project was able to easily reference both versions of the parser.

I setup a benchmark to run the span and non-span version of the parser with both a small text string, and the larger GEDCOM file.

Running on .NET Core 2.2., the results from benchmarkdotnet are:

MethodIsFileMeanRatioGen0/1k opAlloc Mem/op
FileParserSpanFalse28.60 us0.847.019014.41 KB
FileParserFalse31.55 us1.0011.627223.83 KB
FileParserSpanTrue1,727ms1.04124000255 MB
FileParserTrue1,665ms1.00369000757 MB

The IsFile flag indicates whether parsing is done of a small 70 line string constant (IsFile=False), or a larger file on disc.

For the scenario of parsing a small 70 line GEDCOM, there is a slight but insignificant performance improvement for Span<T>. For larger files (~50MB), Span<T> actually appears to be slower. I’m not sure what the reason for the slow down is, I’d expect that reducing allocations would improve performance, unless that was counteracted by string operations on Span<T> being slower than on String.

The benchmark shows that the garbage collector wasn’t overly stressed, with no gen1 or gen2 collections. To show that using Span would be beneficial if the garbage collector had more work to do, I created a huge (500MB) GEDCOM file, which was too large to benchmark, but I created a simple console application to parse the file and found a massive 2x speed up using the Span version of the parser (98 seconds to parse vs 226 seconds). This is an amazing change just for making a few changes in the code.

Of course, most real-world files aren’t anywhere near this size, so it’s questionable whether it would be worth making the change. In a real-world application, parsing may not be the only operation happening, the parsing logic may occur in the context of a larger application which is obviously creating many other allocations, and even though it can’t be measured by a standalone benchmark, anything to reduce unnecessary work for the garbage collector may be beneficial for performance in the application. In the real world it would be necessary to measure the performance impact of making the change in a realistic measurable scenario.

Even though the garbage collector doesn’t have much work to do when parsing a small GEDCOM snippet, there will still be many fewer gen 0 collections, so I was surprised that the performance improvement wasn’t greater, and was surprised that parsing a medium size file actually seemed slow (this is especially confusing as parsing a large file in the console application is so many times quicker, there may be something strange happening with the bencharkdotnet benchmark).

I suspected that the operations on MemoryExtensions, Equals() and IndexOf may be slower than the equivalent string operations. Running the parser under a performance profiler did indicate that these are hotspots in the code. To test this out in an isolated case, I created a separate benchmark to investigate this (the BenchmarkIndexOf class in the benchmark project).

The benchmark logic compares using IndexOf directly on a string, compared to obtaining a ReadOnlySpan<char> from the string, and using the IndexOf method on MemoryExtensions. The methods that use IndexOf on String still obtain the ReadOnlySpan<char> to ensure that the only difference in the methods is how the IndexOf is performed. This is done for a small, medium and large(-ish) string:

Method Mean
IndexOfStrSmall8.966 ns
IndexOfSpanSmall10.724 ns
IndexOfStrMed23.150 ns
IndexOfSpanMed23.224 ns
IndexOfStrLarge94.046 ns
IndexOfSpanLarge86.213 ns

For small strings, it does appear that the MemoryExtensions.IndexOf() method does have a greater overhead than String.IndexOf(), but for larger strings the performance of span is better than for string, so it appears that the operations for ReadOnlySpan<char> are optimised for larger strings compared with the operations directly on String.

Still, Span<T> has shown to have potential for fantastic performance improvements.