Njoki

It’s tradition in Croatia to eat Pašticada sa domaćim njokama (pot roast with homemade potato gnocchi), and the njoki/gnocchi can be made in advance. They’re not at all tricky to make, and are delicious covered in a thick, rich gravy.

Take some fluffy potatoes (I’ve got Maris Piper) and place them into a pan of cold salted water, and bring to the boil.

[photo: potatoes boiling in salted water]

Once boiled, the skins of the potatoes turn really soft, and can be peeled off. Unless you’ve got asbestos fingers you might find it easier to use a sharp knife.

Now, prepare the work surface with flour and rice the potatoes. Ricing instead of mashing helps the moisture in the potatoes evaporate. If you think the potatoes are soggy you could pop them into an oven for a couple of minutes after peeling. We used strong bread flour – apparently the protein in the flour helps to soak up moisture.

[photo: ricing the potatoes onto the floured surface]

Make a well in the potatoes and add an egg, some salt, and a generous knob of butter.

[photo: egg, salt and butter in a well in the potatoes]

Bring all the ingredients together to form a dough, but don’t over-work them. You’re not baking bread, so definitely don’t go kneading it!

[photo: the dough brought together]

Divide the dough into even portions, then roll each portion into a long sausage. Chop each sausage evenly.

[photo: the chopped dough]

Some recipes recommend pressing each gnocchi against a fork, but rolling them against a grater gives a better-looking finish, and helps the gnocchi soak up and hold a generous amount of the rich gravy.

Press each gnocchi flat against the grater, then roll it around to form the gnocchi shape.

[photo: shaping the gnocchi against a grater]

Binomial Tree Option Pricing in F#

This post will look at implementing the binomial tree algorithm in an imperative style, and refactoring it to be more functional.

The Wikipedia page on the binomial options pricing model provides pseudo-code for pricing an American put option.

That code can be implemented almost identically in F#.

[code screenshot: imperative F# implementation]
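The original code listing was a screenshot that hasn’t survived, so as a stand-in here’s the Wikipedia pseudo-code rendered in Python rather than F#; the parameter names (S, K, T, r, sigma, q, n) follow the Wikipedia article, not necessarily the original post:

```python
import math

def american_put(S, K, T, r, sigma, q, n):
    """Binomial tree price of an American put, after the Wikipedia pseudo-code."""
    dt = T / n
    up = math.exp(sigma * math.sqrt(dt))
    # discounted pseudo-probabilities for the up and down moves
    p0 = (up * math.exp(-q * dt) - math.exp(-r * dt)) / (up * up - 1)
    p1 = math.exp(-r * dt) - p0
    # option values at maturity (the leaves of the tree)
    p = [max(K - S * up ** (2 * i - n), 0.0) for i in range(n + 1)]
    # step back up the tree towards the root
    for j in range(n - 1, -1, -1):
        for i in range(j + 1):
            binomial = p0 * p[i + 1] + p1 * p[i]
            exercise = K - S * up ** (2 * i - j)  # early-exercise value
            p[i] = max(binomial, exercise)
    return p[0]
```

With the classic QuantLib EquityOptions example parameters (spot 36, strike 40, one year, 6% rate, 20% vol) this lands in the same ballpark as the figures discussed below.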

and is called as follows:

[code screenshot: calling the function]

Functional programmers have probably died a little on the inside on seeing that code, but one of the good things about F# being multi-paradigm is that it’s easy to implement code in this form first, and test it for correctness before refactoring.

I did check at this point that the results of the code were as expected, but I’ll cover that later in this post.

The first thing to do is to change the initial values V to be created more functionally:

[code screenshot: functional creation of the initial values]

Now, the main algorithm consists of two parts: an outer loop that calculates each ‘level’ of the tree, stepping up towards the root using the results of the previous level (the nodes nearest the leaves), and an inner loop that iterates over all the nodes at a given level, calculating the option’s value at each node.

First off, the outer loop can be replaced with a recursive method:

[code screenshot: the outer loop as a recursive function]

The next change is quite minor – it’s for each pass of the inner loop to create a new array, instead of mutating the vi array:

[code screenshot: inner loop creating a new array instead of mutating vi]

This change now means that there’s no need for any mutable state, so the initial values can be created as a sequence instead of an array, and the inner loop can be written in a functional style:

[code screenshot: the inner loop written functionally]
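Since the F# screenshots are gone, here’s a hedged Python sketch of the same shape of refactoring: the outer loop becomes a fold over levels, and each pass produces a brand new list rather than mutating one (names here are illustrative, not from the original code):

```python
import math
from functools import reduce

def american_put_functional(S, K, T, r, sigma, q, n):
    """Functional restyling of the binomial pricer: no mutable state."""
    dt = T / n
    up = math.exp(sigma * math.sqrt(dt))
    p0 = (up * math.exp(-q * dt) - math.exp(-r * dt)) / (up * up - 1)
    p1 = math.exp(-r * dt) - p0

    def step(values, j):
        # build the level-j values as a fresh list from level j+1
        return [max(p0 * values[i + 1] + p1 * values[i],
                    K - S * up ** (2 * i - j))
                for i in range(j + 1)]

    leaves = [max(K - S * up ** (2 * i - n), 0.0) for i in range(n + 1)]
    return reduce(step, range(n - 1, -1, -1), leaves)[0]
```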

I find this code style is clearer to understand than following the array mutations to see what each part of the code is doing and how those parts interact.

If I were implementing this in C# I’d probably use more descriptive variable names (e.g. strikePrice, riskFreeRate, etc.), but as one of the key advantages of F# is its succinctness, it makes more sense to keep the well-known symbols. Anyone implementing the algorithm with knowledge of the domain will know what the symbols mean without the more verbose names.

Validating the results

The calculated option value with the requested parameters is 4.486090. Running the EquityOptions example in the .NET port of QuantLib, QLNet (http://sourceforge.net/projects/qlnet), with the same parameters gives the following results:

[screenshot: QLNet results]

The figures are in the same ballpark, but the difference is curious.

Chapter 45 of GPU Gems 2 creates the initial values in this way:

[code screenshot: GPU Gems 2 initialisation]

This gives a result of 4.486375 which still doesn’t match QLNet, but is nearer.

The result of running the Cox version of Bubo’s code from here http://cs.hubfs.net/topic/None/59053 also equals 4.486375, which matches the result of using this initialisation function. As an aside, I like how that example shows F# partial function application allowing the variations on the algorithm to be composed, much more succinctly than using OO.

Bubo’s Tian model does match QuantLib’s result exactly. I’ll have to dig a bit deeper to figure out which variations on the binomial tree model these different examples are using.

The code is available here: https://github.com/taumuon/FSharp-Binomial-Tree

Metro Bullet source code

I’ve updated the demo that shows the Bullet Physics Engine in a DirectX 11 Metro application to build in the Release Preview drop of Visual Studio (original blog post: http://www.taumuon.co.uk/2012/04/bullet-physics-and-directx-11-on-windows-8-metro.html ).

The code needs some tweaks – I need to change the manifest, and change some stuff over to Platform::Agile, but it builds and runs so I thought I’d get it out there.

The code is available at: https://github.com/taumuon/Metro-Bullet

More playing with monads

This blog post will look at using the continuation monad to sum the nodes in a tree, and then look at a case where a more complex monad is needed.

The book Real World Functional Programming gave an example of an unbalanced tree, which was deeply nested such that recursing over the tree to sum the values would blow the stack:

[code screenshot: the tree type]

[code screenshot: constructing the deeply nested tree]

The solution provided was to sum the tree using continuation passing:

[code screenshot: summing the tree with continuation passing]
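The F# listing is lost, but the continuation-passing shape can be sketched in Python. Note the caveat in the comments: Python has no tail-call elimination, so unlike F# this only shows the shape of the technique, not the stack safety that motivated it. The tuple-based tree encoding is my own stand-in for the book’s type:

```python
def sum_tree_cont(tree, cont):
    """Continuation-passing sum over ('leaf', v) / ('node', l, r) tuples.

    In F#, tail calls make this stack-safe; Python lacks tail-call
    elimination, so this sketch illustrates the style only.
    """
    if tree[0] == 'leaf':
        return cont(tree[1])
    _, left, right = tree
    # sum the left branch, then the right, then hand the total to cont
    return sum_tree_cont(left, lambda left_sum:
           sum_tree_cont(right, lambda right_sum:
           cont(left_sum + right_sum)))
```

Calling it with the identity continuation, `sum_tree_cont(tree, lambda x: x)`, yields the total.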

This is now able to sum the tree without crashing, but the code doesn’t flow as well as in the original sumTree method. However, monads can come to the rescue again, in this case the continuation monad:

[code screenshot: sumTree using the continuation monad]

The code flow is now easier to understand. The continuation passing style and continuation monad turn the code flow inside out, which isn’t a big deal, but it means that the function to apply to the result needs to be passed in:

[code screenshot: passing in the continuation]

Parallel sum

This is all well and good, but to exploit multicore machines, I want to take advantage of the parallelism in the tree summation algorithm, by summing the different branches of the nodes using tasks. A naïve implementation to do this may be:

[code screenshot: naïve parallel sum]

The problem with this is that it runs slower than the sequential case for a large balanced tree – if there are only 4 cores in the system it’s counter-productive to create thousands of tasks. The current depth of the tree should be passed around, and tasks spawned only until traversal reaches that depth:

[code screenshot: depth-limited parallel sum]
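The depth-limiting idea can be sketched in Python with `concurrent.futures` in place of .NET tasks. Caveats in the comments: Python’s GIL means threads won’t actually speed up this CPU-bound sum, so this only illustrates the structure; the tuple tree encoding is my own assumption:

```python
from concurrent.futures import ThreadPoolExecutor

def sum_tree(tree):
    # plain sequential sum over ('leaf', v) / ('node', l, r) tuples
    return tree[1] if tree[0] == 'leaf' else sum_tree(tree[1]) + sum_tree(tree[2])

def sum_tree_parallel(tree, depth, pool):
    """Spawn tasks only while the depth budget lasts, then go sequential.

    Illustrative only: Python's GIL prevents a real speed-up for
    CPU-bound work, unlike the .NET TPL version in the post.
    """
    if tree[0] == 'leaf':
        return tree[1]
    _, left, right = tree
    if depth <= 0:
        return sum_tree(left) + sum_tree(right)
    # run one branch as a task, the other on the current thread
    future = pool.submit(sum_tree_parallel, left, depth - 1, pool)
    right_sum = sum_tree_parallel(right, depth - 1, pool)
    return future.result() + right_sum
```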

This is now faster for a large balanced tree, but for the deeply nested unbalanced tree it throws a StackOverflowException. Note that sumTreeParallel coped with arbitrarily nested trees, as the work to be done no longer kept being pushed onto the stack.

More state is being passed around (the current depth), which is external to the main business logic – it does feel like the state monad could be used here. First though, this is fixed up so that it has the efficiency of the parallel summation, but can cope with arbitrarily deeply nested trees:

[code screenshot: parallel continuation-passing sum]

Urghhhh. The code is definitely more convoluted now – I played with using the state monad to pass the depth around, but it’s not tail recursive. I thought that it might be possible to make use of the continuation monad here, but the problem is that it really needs aspects of both the state and continuation monads. I’m not sure how monad combinations work in F#, but I thought I’d throw this out there to see if anyone can show me the light. The post The Mother of All Monads does say that all other monads can be implemented in terms of the continuation monad, but I couldn’t find an example of anyone implementing the state monad in terms of the continuation monad.

State Monad in C# and F#

My colleagues and I have recently been trying to gain a deeper understanding of monads, beyond using LINQ and RX. Watching all the Channel 9 videos on monads, the concept on the surface seemed quite simple, but we felt that we were somehow missing some key point. After reading various blog posts, the penny is finally dropping.

Some post I read somewhere (I can’t remember where) said that it’s easier to understand the individual monads before trying to understand the whole generalised concept. Mike Hadlow’s blog series (http://mikehadlow.blogspot.co.uk/2011/01/monads-in-c1-introduction.html) explains the identity and maybe monads really well (which we already understood – just as well, as we’re using the maybe monad in our codebase). Unfortunately he didn’t cover the state monad.

The only information we could find on the state monad in C# was Brian Beckman’s Channel 9 video: http://channel9.msdn.com/Shows/Going+Deep/Brian-Beckman-The-Zen-of-Expressing-State-The-State-Monad

We found Brian Beckman’s example overly complex – there’s quite a lot going on in that video (and the code needed de-obfuscating to read more like idiomatic C#) – but it’s probably a good second or third example once the basics are understood. I wanted a simpler example, and found this article: http://ertes.de/articles/monads.html#section-6. Its explanation of the get and put methods shows pretty clearly how the state monad works, and you should probably read and understand that section before carrying on here, as I’m going to jump straight to showing how the random number generator example looks in F# and C#.

C# implementation

First off, here’s an OO random number generator:

[code screenshot: OO random number generator]

It is easy to add three random numbers together using an instance of the generator.

[code screenshot: adding three random numbers]

We can make NextShort purely functional, so that calling it has no side-effects, by passing the current state in, and returning the new state back out, along with the result:

[code screenshot: purely functional NextShort]

To add three numbers, the state has to be kept around, and passed to each of the calls in turn. This obfuscates the main business logic.

[code screenshot: threading the state explicitly]
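The C# screenshots are missing, so here’s the same idea sketched in Python. The LCG constants are an assumption on my part (the ones from the classic C runtime rand, which Brian Beckman’s example also uses), not necessarily what the post used:

```python
def next_short(state):
    """Purely functional RNG step: returns (value, new_state), no side effects."""
    new_state = (state * 214013 + 2531011) % (1 << 31)
    return (new_state >> 16) & 0x7FFF, new_state

def add_three_randoms(seed):
    # the state has to be threaded through by hand, obscuring the logic
    a, s1 = next_short(seed)
    b, s2 = next_short(s1)
    c, s3 = next_short(s2)
    return a + b + c, s3
```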

Jumping straight to the punch line, using a state monad means that the boilerplate state management can be hidden away so that the algorithm reads as clearly as it did before:

[code screenshot: the LINQ state monad version]

This is a trivial example, but hopefully it makes the point. The rest of this blog post will show the implementation, along with the same in F#.

First off, we have a State class, which holds a Func that, given a state, returns a tuple of the state and a value. The F# example doesn’t bother with the State class, and I could have made the C# version simply use the Func everywhere, but it seemed clearer to me to have the wrapper object, just to give a bit of context.

[code screenshot: the State class]

I created a non-generic State class as somewhere convenient to put the Get and Set methods:

[code screenshot: non-generic State class with Get and Set]

The meat of the monad lives in extension methods in a StateEx class:

[code screenshot: StateEx extension methods]

and SelectMany is implemented in terms of Bind the same as in Mike Hadlow’s example for the Identity and Maybe monads:

[code screenshot: SelectMany implemented in terms of Bind]

The computation to compose can be expressed in the GetRandom method:

[code screenshot: GetRandom]

This returns an operation (a Func in the State container) that invokes NextShort with the old state, and returns a computation with the new state and value.

The resulting composed operation using LINQ is described above, but this is what it looks like desugared:

[code screenshot: the desugared composition]
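As the screenshot is lost, here’s a hypothetical Python equivalent of the desugared chain, where `bind` plays the role of the C# Bind/SelectMany and `unit` wraps a plain value; `next_short` and its LCG constants are my own stand-ins for the post’s generator:

```python
def unit(value):
    # wrap a value in a stateful computation that leaves the state alone
    return lambda state: (value, state)

def bind(m, f):
    # run m, feed its value to f, then run the computation f returns
    def run(state):
        value, new_state = m(state)
        return f(value)(new_state)
    return run

def next_short(state):
    new_state = (state * 214013 + 2531011) % (1 << 31)
    return (new_state >> 16) & 0x7FFF, new_state

# the desugared composition: no state visible in the business logic
add_three = bind(next_short, lambda a:
            bind(next_short, lambda b:
            bind(next_short, lambda c:
            unit(a + b + c))))

total, final_state = add_three(0)
```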

F# implementation

The code for the random number generator and composed operation look similar to the C#:

[code screenshot: F# random number generator]

This has the same problem of explicitly passing the state around. I can ruin the surprise again and show how using monads hides the explicit state passing:

[code screenshot: composition using the state computation expression]

This code makes use of F#’s support for monads via Computation Expressions. The code for the state builder is from here: http://www.navision-blog.de/2009/10/23/using-monads-in-fsharp-part-i-the-state-monad/

[code screenshot: the StateBuilder]

The Haskell example also had the >> operator, but of course F# uses that as the forward composition operator :-)

Instead I’ve implemented:

[code screenshot: the replacement operator]

getRandom looks pretty much identical to the Haskell example:

[code screenshot: getRandom]

The composed version using the StateBuilder is shown above, but here’s the desugared version:

[code screenshot: the desugared F# version]

The F# version is much clearer to read than the C#, partly due to type inference meaning the code has less pointless fluff. Having functions live in the module rather than as part of a class helps too (you can simply call setState, instead of State.Set as in the C#).

Following through the simpler example really helped my understanding, and hopefully I’ll spot cases in the future where it will be sensible to use the state monad in my own code.

Monte Carlo C++ AMP

This blog is starting to look very inconsistent – my last post talked about starting to write a game, and now I’ve gone onto a totally different topic. Due to lack of time the game’s gone onto, if not the back burner, then a gentle simmer. I’ve got to balance doing cool stuff with keeping on top of various technologies that may be relevant as a contractor, and having a life!

This post will describe using C++ AMP to calculate Pi; calculating Pi is a fairly good example of a Monte Carlo calculation, as the algorithm’s simple and we all know the expected result.

First off, this is what the algorithm looks like in C#:

[code screenshot: the algorithm in C#]

It simply counts the x-y coordinate pairs whose vector magnitude falls within the unit circle; the ratio of points inside the circle to the total is Pi / 4.
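The C# listing is lost, so here’s a sketch of the algorithm in Python (which, conveniently, uses a Mersenne Twister in its standard random module anyway); the function and parameter names are my own:

```python
import random

def estimate_pi(iterations, seed=42):
    """Monte Carlo Pi: 4 times the fraction of random points in the unit circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(iterations):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / iterations
```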

The random class is the Mersenne Twister code from here: http://takel.jp/mt/MersenneTwister.cs. The results are:

[screenshot: results]

Replacing the Mersenne Twister class with System.Random resulted in code which was approximately 30% slower. I’m running this on a dual-core laptop, but have not parallelised it, as I didn’t fancy porting over a Parallel Random Number Generator (PRNG) myself.

Tomas Petricek has an example of using the GPU to calculate Pi using F# here, but his example generates the random numbers on the CPU and uploads them to the GPU to do the calculations and reduction.

C++ AMP Version

Microsoft have just released a C++ AMP random number generator library on CodePlex (http://amprng.codeplex.com/), but I’m using BharathM’s port of the parallel Mersenne Twister described here: http://blogs.msdn.com/b/nativeconcurrency/archive/2011/12/20/mersenne-twister-sample-using-c-amp.aspx

His example generates random numbers into a rank 2 array of size 4096 * 2, but I’ve modified g_n_per_RNG to be 256.

[code screenshot: generating the random numbers]

The first thing I do is to pair up those random numbers into x-y coordinates and to find the square of their magnitude:

[code screenshots: pairing the random numbers into coordinates and computing the squared magnitudes]

Then I determine whether the magnitude of each coordinate pair falls outside the unit circle:

[code screenshots: testing the magnitudes against the unit circle]

The final step is to do a parallel reduce of the resulting array. This is using the just-released reduction method from the C++ AMP algorithms library: http://ampalgorithms.codeplex.com/

[code screenshot: parallel reduce]

The reduce function behaves similarly to STL’s std::accumulate, in that you can pass in the binary operation to perform on each item. The function takes a rank 1 array, so I’m using view_as to change rank.

To perform the timings I run the algorithm twice, once to warm up (to ensure all shaders etc. are compiled – see http://social.msdn.microsoft.com/Forums/en-US/parallelcppnative/thread/1c6c9f04-1f9f-4f44-99f3-154d991ae5ba).

The results I obtained are:

[screenshot: timings]

Calculating the result for half a million random numbers took approximately 16 milliseconds, whereas the single-threaded C# code took 44 ms for a million iterations, so accounting for using both cores the performance seems pretty similar. This is quite disappointing: even though the GPU isn’t the most powerful (a mobile 5470), neither is the CPU, so I was hoping for nearly an order of magnitude speed-up. I wasn’t doing anything clever with tiling, and there may be other parts of the API I’m not using correctly. I’ll have to get hold of Kate Gregory’s book and try with different examples.

Bullet Physics and DirectX 11 on Windows 8 Metro

I’ve got a game idea which I think may work on a tablet so I’ve created the following ‘demo’. I’ve hooked together the Bullet Physics engine with some 3D graphics, and the falling wall of cubes could pretty much be the canonical “Hello, World” equivalent for a 3D game. I’m not sure how much time I’ll have to pursue my game idea, but it should in the worst case be a route to learning some new technologies.

The demo won’t win any awards for its graphics, and the video is pretty poor quality (I recorded the screen with my camera instead of using any screen capturing software).

Falling cubes

Microsoft definitely seem to be pushing C++ for game creation on Windows 8, with no announcement of a new version of XNA, but that makes sense; C++ is closer to the metal than C# and will get more performance out of lower-power ARM tablets. Also, there’s quite a buzz around the C++ renaissance with C++11 and modern-style C++. The corresponding new technologies in VS 2010 (PPL) and VS 11 (auto-vectorization, C++ AMP) are especially cool.

I’m also keen to get back into C++ – the past year or so I’ve been working purely in C#, but all my prior experience was using a mixture of C# and C++, and I do miss C++ (even though I am more productive in C#). Also, my previous roles were generally in a modern style (use of STL, smart pointers etc.) but my most recent experience was “C with Classes” style code, so I’ve been relearning modern style as well as picking up the new C++11 features.

Building Bullet as a static lib was pretty easy; I just had to switch it over to use the Multi-threaded DLL C runtime. Defining WINAPI_FAMILY=WINAPI_PARTITION_APP to detect any non-Metro API usages revealed a use of GetTickCount, which is unavailable to Metro apps, so I just turned off profiling by defining BT_NO_PROFILE 1 in btQuickprof.h.

The Windows 8 samples include a couple of games which are decent guides to using DirectX 11. These demos aren’t supposed to be examples of a whole game engine, and don’t necessarily follow best practices: Marble Maze has a class DirectXBase which encapsulates all rendering logic, which is great, but the game logic is in a class MarbleMaze which derives from it. It’s nothing that would trip up an experienced developer, but beginners may follow the example. The Simple3DGame sample (which also demos DirectX and XAML interop) is better structured. The only strange thing in that example is that the render primitive classes are all C++/CX ref classes, which seems to contradict what I remember Herb Sutter saying in a Channel 9 video about sticking to ISO standard C++ where possible and only using C++/CX at the boundaries.

I’m not sure yet whether I’ll try to continue to create a simple demo of my game idea purely in C++, or create a C++/CX interop layer to facilitate writing the game logic in C# – I’m tempted to do this as I’m more productive in C#, and it’ll probably lead to a deeper understanding of WinRT which will be useful even when creating C# XAML applications.

I could have alternatively stayed purely in C# and used SharpDX, or a game framework such as ANX or Ogre (both of which say they will soon support XAML), but I was keen on having control of the whole game, from C++ upwards – if profiling the C# code finds any bottlenecks, it’ll be easier to re-implement those parts in C++.

As to my experiences of developing the demo: the new version of C++ definitely feels more productive, and closer to C#. I love auto (var), lambdas (pretty much the same), for_each (foreach) and range for, and initializer lists. Also, async programming using PPL tasks feels very similar to .NET’s TPL. I’ve had to change mental gears a bit in thinking again about object ownership instead of relying on the garbage collector, and it took way longer than it should have for me to figure out how to add a std::unique_ptr into a vector.

I could share the demo code if anyone would find it useful (I need to tidy it up a little bit, and can’t be bothered otherwise) – just post in the comments!

Messing with Direct2D

As I blogged previously, the main bottleneck in the oscilloscope code was the drawing of the charts. In my previous post I modified DynamicDataDisplay to optimise this, but I wanted to make a Metro version of my app and didn’t want to convert that library over, so I thought I’d look at doing the drawing myself.

First off, on Windows 7/.NET 4 I used the SurfaceInteropHelper, which allows Direct2D to be hosted on a WPF control, to do the drawing, and found that the CPU usage dropped from 70% to just under 40% (the code’s not worth showing here, and my use of Direct2D was pretty simplistic).

For the Windows 8 Consumer Preview, I created an app that used the SwapChainBackgroundPanel. It wasn’t obvious how to do this in an app created from scratch, so I butchered the SDK’s Metro style DirectX shooting game sample (XAML).

This control allows Direct2D and XAML to be mixed in a Metro app, but it needs to be the root element. I can see how this would be OK for an app such as a game, where it could make use of overlaid controls, but for an app such as mine, where I’d like multiple charts drawn using Direct2D, it’s not ideal. Also, it seems to force the skeleton of the app to be written in C++ (it would be nice to have a C# app host the controls, with the C++ hidden).

This was awkward, but not a blocker to implementing the app on Metro… the blocker was to do with accessing the microphone. There are new Media Foundation libraries for that, which interoperate better with the new C# async features than DirectShow does (they expose awaitable types):

[code screenshot: microphone capture]

The new version of RX (2.0 beta) has a new CreateAsync method, instead of an overload of Create that takes an async lambda.

The code above is untested, as the blocker is in the new MediaCapture library:

[code screenshot: MediaCapture usage]

There’s no way to create a lossless MediaEncodingProfile from which to extract the raw waveform.

I think I’ll leave this app on the backburner until at least a later beta.

TPL Dataflow

This blog post is coming a couple of weeks late due to my copy of Windows becoming corrupted – and strangely the only thing I didn’t have backed up was this oscilloscope app!

Anyway, last time I took a look at charting performance, but this post will investigate TPL Dataflow.

In my original F# post, I discussed that I felt that the code would have been better implemented in RX, which I then went and did in this post. However, RX isn’t the only recent new technology coming out of Microsoft that deals with streams of data. TPL Dataflow has a confusingly similar overlap with the usages of RX, and this is what its whitepaper has to say:

“Astute readers may notice some similarities between TPL Dataflow and Reactive Extensions (Rx), currently available as a download from the MSDN Data developer center. Rx is predominantly focused on coordination and composition of event streams with a LINQ-based API, providing a rich set of combinators for manipulating IObservable<T>s of data. In contrast, TPL Dataflow is focused on providing building blocks for message passing and parallelizing CPU- and I/O-intensive applications with high-throughput and low-latency, while also providing developers explicit control over how data is buffered and moves about the system. As such, Rx and TPL Dataflow, while potentially viewed as similar at a 30,000 foot level, address distinct needs. Even so, TPL Dataflow and Rx provide a better together story.”

That does sound very interesting – who doesn’t want better performance!

Implementation

I’ll dive in straight away and look at some code. The code is structured similarly to the RX code in my previous post.

[code screenshot: setting up the dataflow pipeline]

The microphone still provides an IObservable<float[]> via the GetStream() method. The TaskSchedulers are specified explicitly throughout, so that the code can be unit tested (I’m using a MockSynchronizationContext).

The transformManyBlock behaves in the same way as the SelectMany operator: it separates the float array into its constituent parts.

The broadcast block is analogous to a connectable observable – many blocks can be linked to it to receive the same messages.

The transform many block is then explicitly linked to the broadcast block, meaning that its outputs make it into the broadcast block’s input.

Oscilloscope/Buffered View

The code to deal with the buffered display of data is:

[code screenshot: the buffered view blocks]

The batch block behaves the same as RX’s Buffer operator (a bit confusing, as TPL Dataflow’s Buffer means something different). This simply takes the incoming float values and packages them into an array.

The ActionBlock is the final point on the chain – it takes the values it receives and outputs them to the screen (via oscilloscopeOutput) – this is the equivalent code to the code in the Observable Subscription in the RX version of my code.

Sliding Window View

Again, the sliding window view is structured similarly to the previous RX code:

[code screenshot: the sliding window view blocks]

A sample block is created so that every nth sample is fed into the SlidingWindowBlock, which returns a new float[] representing the new window of data for every input.

The second sample block ensures that not every window of data is drawn (so that the chart is refreshed at 40fps rather than 400fps!)

Finally, the observable is subscribed to, and the messages are posted into the transformManyBlock:

[code screenshot: subscribing and posting into the transformManyBlock]

Whereas the RX guidelines recommend creating new operators by composing existing operators, TPL Dataflow encourages building custom blocks (though these can be built by combining existing blocks).

The SlidingWindowBlock is a slightly modified version of the one provided in the white paper, and here (http://msdn.microsoft.com/en-us/library/hh228606(v=vs.110).aspx). The one in the example only posts a window once it has received a full window’s worth of data, whereas I want to start drawing windows as soon as data arrives.

[code screenshot: SlidingWindowBlock]
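The behaviour of that modified block – a window emitted for every input, including partial windows before a full window has arrived – can be sketched as a plain Python generator (this is an illustration of the logic, not the TPL Dataflow block itself):

```python
from collections import deque

def sliding_window(source, size):
    """Emit a window per input item, including partial windows at the start."""
    window = deque(maxlen=size)  # oldest item falls off once full
    for item in source:
        window.append(item)
        yield list(window)
```

For example, `sliding_window([1, 2, 3], 2)` yields `[1]`, `[1, 2]`, `[2, 3]`.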

The SampleBlock is simple: it yields a value once it has received enough inputs, discarding the intermediate ones. This is structured more efficiently than my RX implementation, which buffers the values and then only takes the last value out of the buffer.

[code screenshot: SampleBlock]
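Again as an illustration of the logic rather than the block itself, the sampling behaviour – keep every nth item, discard the rest without buffering them – looks like this as a Python generator:

```python
def sample(source, n):
    """Yield every n-th item, discarding the ones in between."""
    for count, item in enumerate(source, start=1):
        if count % n == 0:
            yield item
```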

Performance

I removed the updating of the chart, to look purely at what the performance of the app is in taking the input from the microphone, and shaping it to be in the correct format for output.

RX Performance:

[performance counters: RX]

TPL Dataflow:

[performance counters: TPL Dataflow]

The processor usage of the two apps is in the same ballpark, but the TPL Dataflow app has many more garbage collections, and # bytes in all heaps is also running slightly higher. A quick look in the profiler shows many Task allocations from the Dataflow Post() operation within the SlidingWindow and Sample implementations.

The extra generation 1 collections don’t really matter for this application with the functionality that it currently has – the % time in GC barely registers, and there are no Gen 2 garbage collections over the minute or so that the app was running.

Once the app is doing more calculations and drawing more complex charts, it will be interesting to see whether latency spikes due to gen 2 GCs cause similar slowdowns to the one discussed at the start of my previous post.

It would be an interesting exercise to limit the number of GCs throughout the app: for instance, there’s no need for the microphone access observable to return an IObservable<float[]> instead of an IObservable<float>; currently every read from the microphone allocates a new float[]. Similarly, new List<Point> instances are created to interface more easily with DynamicDataDisplay – it would be better to change the types D3 takes to be more observable-friendly, and to save so many allocations. Again, there’s not much point doing this, other than as an interesting theoretical exercise, until the garbage collection overhead proves to be a performance issue.

Conclusion

For an application as simple as this, there isn’t any benefit to using TPL Dataflow. It is a powerful library, with functionality such as blocks being able to decline offered messages and request them later, which would be difficult to implement in RX; but as my app doesn’t (currently) need that level of functionality, there’s no benefit to using the library.

I may revisit this in the future, if I have some computationally expensive operation (an FFT, for instance) where I’d want greater control over the flow of data through the system.

Charting Performance

I wasn’t going to blog about performance until a later post, but there was a performance issue I wanted to tackle before discussing some other things.

The frame rate sometimes seemed to be as expected, but sometimes slow, and the application was quite unresponsive when handling the Stop button command.

To help nail this down I created a Buffer Frame Rate performance counter, so that I could monitor it along with a few other metrics (I chose the count of gen 1 and gen 2 garbage collections, the % time in GC, and % processor time).

[performance counters: initial run]

The framerate average isn’t bad, but the frame rate is not consistent; it regularly drops below 10 frames per second. This post will discuss optimising the performance of the application to keep the frame rate steady and ensure that the app is responsive.

Performance increases should generally focus on reducing unnecessary work in the whole application, and not micro-optimisations. In this case, it’s realising that the top chart is drawing at 40 frames per second (a buffer of 1000 data points into a 40KHz source), whereas the bottom chart is updating at 400 frames per second (sampling the data to 100 points, and updating the chart on each of those sample points).

However, as there are other types of charts I want to produce it’s worth investigating where the time is spent.

First off, I tried turning off anti-aliasing, and moving over to the d3future project (http://d3future.codeplex.com), to see if that made any performance difference, but the performance was pretty similar, with the CPU pegged at 100%. A further run generated this interesting trace:

[performance counters: frame rate drop-off]

The frame count starts initially high, but after a while performance drops off dramatically, which is followed by some long spikes in the % Time in GC. The drop-off maybe seems to correspond with one of the gen 2 collections. This situation wasn’t so reproducible that I could catch it under a profiler.

My next step was to profile to see where the application was spending its time, just using the Visual Studio profiler. The application’s hot path was in the D3 LineChart’s UpdateCore method, where it was updating the chart’s Bounds using data from the EnumerableDataSource – it was iterating over the whole collection to find the x and y min and max values (to fit the chart to the data area).

It seems unnecessary to iterate over the whole data source to get the minimum and maximum x values – for a linear chart (i.e. not a scatter chart) it would be expected that the first point is at the minimum x value and the last point the maximum.

I created a chart derived from LineChart where I could instead pass in a List of points for the datasource – this means that I could get the last point without enumerating the whole collection, allowing the min and max x values to be quickly found. For the y values, I happen to want the chart to be a fixed axis and not scale to fit the data (instead scaled by a user-configurable gain factor), so there was no need at all to iterate over the collection.
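The idea can be sketched in a couple of lines (Python for illustration; the function name and point representation are mine): because a line chart’s points are ordered by x, the x bounds come from the first and last points, with no full scan needed.

```python
def x_bounds(points):
    """X-axis bounds of an x-ordered list of (x, y) points.

    O(1): first and last elements, rather than scanning every point.
    """
    return points[0][0], points[-1][0]
```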

The chart looked like this after those changes:

[performance counters: after the chart changes]

The frame rate is pretty consistent now. The gen 1 allocations are still occurring at a rate of 200 per minute, but the application is now responsive to the stop click, stopping immediately.

Profiling with memory allocations turned on showed that the type with the most instances was System.Byte[], with 32% of the instances allocated in the profiling period.

ConvertByteArrayToFloatArray() previously looked like this:

[code screenshot: ConvertByteArrayToFloatArray using LINQ]

And after changing this to not use LINQ:

[code screenshot: ConvertByteArrayToFloatArray without LINQ]
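The original C# isn’t shown, but a loop-based conversion of this sort might look like the following Python sketch, assuming the microphone delivers little-endian signed 16-bit PCM (that format, and the function name, are my assumptions):

```python
def bytes_to_floats(buf):
    """Convert little-endian signed 16-bit PCM bytes to floats in [-1, 1)."""
    out = [0.0] * (len(buf) // 2)  # one float per 16-bit sample
    for i in range(len(out)):
        value = buf[2 * i] | (buf[2 * i + 1] << 8)
        if value >= 32768:          # undo two's complement
            value -= 65536
        out[i] = value / 32768.0
    return out
```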

Gives the following:

[performance counters: after the change]

Performance is pretty similar to before, but the gen 1 collections are occurring at a slightly lower rate. Now, profiling tells me that the instances allocated are pretty much split between System.Action, System.Object, and System.Reactive.Disposables.SingleAssignmentDisposable.

Profiling at this point identifies the majority of the work as being in LineGraph.UpdateCore(), where it’s transforming all points from data coordinates into screen coordinates.

The transformation of coordinates for many points is embarrassingly parallel (think GPGPU). Potentially, the GPU could be put to use by simply drawing the points untransformed, adding a transformation onto the drawing context.

Further performance increases could be done for the sliding window chart: A sliding window chart redisplays the same points again and again, as they move left along the x axis. It essentially transforms the same point many times into the new viewport. Doing a transform to move an existing point to the left could be more efficient than transforming again from data to screen coordinates.

This post so far has looked at some micro-optimisations, focussing on identifying areas where work can be reduced. As I said at the beginning, if I weren’t so interested in optimising for future requirements, I’d simply ensure I was doing less drawing straight away.

Anyway, the sliding observable can be changed to only take every 10th windowed set of values, by introducing another Sample() on the result of WindowWithCount():

[code screenshot: adding the extra Sample()]

There isn’t actually that much difference in real-world performance.

[performance counters: final run]

The rate of gen 1 collections is lower still than before, but the profiler shows that the app is still spending the majority of its time drawing the charts. Surprisingly the CPU usage doesn’t seem to have dropped much, although in the profiler the time can now be seen to be split between the drawing of the two charts, instead of dominated by the drawing of the sliding window chart.

The micro-optimisations were still worth investigating, as I have charts in mind that will draw multiple traces simultaneously.