Charting Performance

I wasn’t going to blog about performance until a later post, but there was a performance issue I wanted to tackle before discussing some other things.

The frame rate sometimes seemed to be as expected, but sometimes slow, and the application was quite unresponsive when handling the Stop button command.

To help nail this down I created a Buffer Frame Rate performance counter, so that I monitor it along with another couple of metrics (I chose to see the count of gen1 and gen2 garbage collections, along with the % time in GC, and % processor time).

image

The framerate average isn’t bad, but the frame rate is not consistent; it regularly drops below 10 frames per second. This post will discuss optimising the performance of the application to keep the frame rate steady and ensure that the app is responsive.

Performance increases should generally focus on reducing unnecessary work in the whole application, and not micro-optimisations. In this case, it’s realising that the top chart is drawing at 40 frames per second (a buffer of 1000 data points into a 40KHz source), whereas the bottom chart is updating at 400 frames per second (sampling the data to 100 points, and updating the chart on each of those sample points).

However, as there are other types of charts I want to produce it’s worth investigating where the time is spent.

First off, I tried turning off anti-aliasing, and moving over to the d3future project (http://d3future.codeplex.com ), to see if that made any performance difference, but the performance was pretty similar, with the CPU pegged at 100%. A further run generated this interesting trace:

image

The frame count starts initially high, but after a while performance drops off dramatically, which is followed by some long spikes in the % Time in GC. The drop-off maybe seems to correspond with one of the gen 2 collections. This situation wasn’t so reproducible that I could catch it under a profiler.

My next step was to profile to see where the application was spending its time, just using the Visual Studio profiler. The application’s hot path was in the D3 LineChart’s UpdateCore method, where it was updating the chart’s Bounds using data from the EnumerableDataSource –it was iterating over the whole collection to find the x and y min and max values (to fit the chart to the data area).

It seems unnecessary to iterate over the whole data source to get the minimum and maximum x values – for a linear chart (i.e. not a scatter chart) it would be expected that the first point is at the minimum x value and the last point the maximum.

I created a chart derived from LineChart where I could instead pass in a List of points for the datasource – this means that I could get the last point without enumerating the whole collection, allowing the min and max x values to be quickly found. For the y values, I happen to want the chart to be a fixed axis and not scale to fit the data (instead scaled by a user-configurable gain factor), so there was no need at all to iterate over the collection.

The chart looked like this after those changes:

image

The frame rate is pretty consistent now. The gen 1 allocations are still occurring at a rate of 200 per minute, but the application is now responsive to the stop click, stopping immediately.

Profiling with memory allocations turned on showed that the types with most instances was System.Byte[] with 32% of the instances allocated in the profiling period.

ConvertByteArrayToFloatArray() previously looked like this:

image

And after changing this to not use LINQ:

image

Gives the following:

image

Performance is pretty similar to previous, but the gen 1 collections are occurring at a slightly lower rate. Now, profiling tells me that the types with most instances allocated is pretty much split between System.Action, System.Object, and System.Reactive.Disposables.SingleAssignmentDisposable.

Profiling at this point identifies the majority of the work now being in LineGraph.UpdateCore() where its transforming all points from their data coordinates into screen coordinates.

The transformation of coordinates for many points is embarrassingly parallel (think GPGPU). Potentially, the GPU could be put to use by simply drawing the points untransformed, adding a transformation onto the drawing context.

Further performance increases could be done for the sliding window chart: A sliding window chart redisplays the same points again and again, as they move left along the x axis. It essentially transforms the same point many times into the new viewport. Doing a transform to move an existing point to the left could be more efficient than transforming again from data to screen coordinates.

This blog so far has looked at some micro-optimisations focussing on identifying areas where work can be reduced. As I said in the beginning, if I wasn’t so interested in optimising for future requirements, I’d simply ensure that I was doing less drawing straight away.

Anyway, the sliding observable can be change to only take every 10th windowed set of values, by introducing another Sample() on the result of the WindowWithCount():

image

There isn’t actually that much difference in real-world performance.

image

The rate of gen 1 collections is lower still than previous, but the profiler shows that app is still spending the majority of its time in drawing the charts. Surprisingly the CPU usage doesn’t seem to have dropped much, although in the profiler the time can now be seen to be split between the drawing of the two charts, instead of dominated by the drawing of the sliding window chart.

The micro-optimisations were still worth investigating, as I have charts in mind that will draw multiple traces simultaneously.

Oscilloscope using RX and C# Async CTP

In my last blog post I described the implementation of a simple ‘oscilloscope app’ in F#, as I wanted to see how the code would be structured using only F# idioms (http://taumuon-jabuka.blogspot.com/2012/01/visualising-sound.html )

My natural instinct would have been to implement it using the Reactive Extensions for .NET (RX Framework), but I first wanted to investigate a pure F# solution. This post describes an alternate RX implementation.

Similar to my last post, I’ll describe the code inside-out.

Creating the Observable

clip_image002

This code returns an IObservable with a float array payload, representing each read from the CaptureBuffer. The implementation could have internally started a dedicated Thread from which to push out values, but instead I’m using the new C# 5 async functionality (using the Async CTP Update 3), so that my code looks pretty similar to the previous F# example.

The observable takes a CancellationToken, whose IsCancellationRequested is set to true once all subscriptions to the observable have been disposed. The CompositeDisposable returned out from the lambda is also disposed of at that point.

The code loops while its CancellationToken has not been cancelled, asynchronously awaiting for one of the WaitHandles to be set, and then it reads from the buffer. The value is pushed out to subscribers in the OnNext.

The ConvertByteArrayToFloatArray() is trivial:

clip_image003

Subscriptions to the observable

clip_image004

First off, the float array returned from the observable is decomposed into the individual values, using a SelectMany, as sampling, buffering and windowing operations all operate on a stream of floats. Then, the observable is published to an IConnectableObservable. The microphone access returns a Cold observable, meaning that each subscriber to it would end up creating their own microphone access. This would work (the CaptureBuffer doesn’t prevent this), but the connectable observable means that instead all clients share the same observable, and ensures that they all see the same values (so that the traces on the two charts are in sync).

The RefCount() means that when all subscriptions to the observable IConnectableObservable variable have been disposed of then the subscription to the underlying observable will also be disposed.

clip_image005

The top ‘oscilloscope’ trace is a simple Observable.Buffer() over the data stream. There is no need to ObserveOn the dispatcher thread as the subscription occurs on the UI thread. Using RX, it would be easy to schedule various parts of the work onto different threads, but I’ll discuss this in a later article (I want to keep everything on the UI thread for now to compare performance with the single threaded F# implementation).

All subscriptions are added to a CompositeDisposable member in the ViewModel – the Stop button’s command implementation disposes of this, which causes all subscriptions to be disposed of, and so the microphone access loop to be terminated via its CancellationToken.

clip_image006

The windowing operation simply samples every 100 data points, and from those sampled data points takes a sliding window of 1000 values to display. The windowCount variable is closed over to allow the y axis to be continually updated.

clip_image007

The Sample operator is simple, but not particularly efficient – it takes a buffer (i.e a non-overlapping window) of count values, and then takes the last value in the buffer.

The WindowWithCount operator is the same one I discussed at http://taumuon-jabuka.blogspot.com/2011/07/rx-framework-performance-awareness.html (with implementation grabbed from http://social.msdn.microsoft.com/Forums/en-US/rx/thread/37428f58-f241-45b3-a878-c1627deb9ac4#bcdc7b79-bbde-4145-88e4-583685285682 )

clip_image008

And as I also talked about in that post, the RX guidelines recommend implementing an operator in terms of existing operator. There’s only one problem in this case, it’s quite slow, (I’ll get quantitative figures on this in a future blog post discussing performance of all approaches).

The following is faster (again, I know more specific figures are needed):

clip_image009

Comparing implementations

For me, the RX solution is cleaner than the F# solution, and easier to follow. I did implement my F# solution in quite an imperative way though, and should have perhaps used AsyncSeq or mailbox processors, but as reading from the microphone is a push-based activity, none of those solutions would be as clean as RX (of course I haven’t covered using RX in F#). The F# version is much faster, and I’ll take a big more of a dig into performance in an upcoming blog post.

Visualising Sound

This is the first in a series of blog posts about visualising sound, in a similar way to an oscilloscope. The geek in me thought it’d be a fun thing to do, as well investigate different technological approaches to implementing the app (interesting as it’s a real-time performant app).

clip_image001

An oscilloscope has a single screen, which refreshes on a given time period, displaying a number of traces. The user can control that time period, as well as the gain (the y axis scale).

The top graph is essentially the same view as an oscilloscope containing a single trace, and a fixed time period (further blog posts may investigate varying time periods).

The bottom graph is a sliding window with a longer time period – this is the advantage of implementing an oscilloscope in code, we can create charts that aren’t really feasible in a classic CRT oscilloscope.

This series of blog posts will investigate implementing this screen using F# idioms, as well as using the Reactive Extension for .NET (RX Framework), and TPL Dataflow.

There are further things that could be implemented in future blog posts which may be interesting to see how the varying approaches look like:

· Trigger, or Trigger and hold: Only refresh the oscilloscope view once a certain trigger level has been reached. This is interesting as may want to include a certain number of immediately prior to the point where the trigger was set.

· Log many traces.

· Spectrum analyser/FFT.

· Digital filtering.

· Comparing traces, or more complicated math channels.

· Heatmap.

· Adjustable time offset (delay) – useful on the oscilloscope view to centre a waveform on the screen, or for when comparing two or more channels output.

F# Implementation

This blog post covers an F# implementation of the microphone, using asynchronous workflows. The graphing is being done by a nightly build download of Dynamic Data Display (D3). I’m really impressed with the performance, but it is quite difficult to customise.

The low-latency microphone code is from this article on gamedev (http://www.gamedev.net/topic/586706-slimdx-directsound-capture-microphone-stream-with-low-latency-example/).

The implementation of this is pretty simple; I’ll start from the detail out.

The inner loop of the program is an asynchronous workflow that reads from the buffer and returns a sequence:

clip_image003

Note the slightly-strange looking code:

let task = System.Threading.Tasks.Task<int>.Factory.StartNew(fun () -> WaitHandle.WaitAny(waitHandles))

F# has an Async.AwaitWaitHandle() method, which unfortunately only waits on a single handle. We want to wait on both handles so that we get notified when the buffer is full every 4096 instead of every 8192 bytes. With 4 bytes per sample, and a sample rate off 44KHz, this is equivalent to getting notified at an approximate rate of 40 times per second instead of 20 times per second.

I could have implemented Async.AwaitAnyWaitHandle() taking an array of WaitHandles, but looking at the code in the F# PowerPack, the code was quite complex. So, the code instead creates a new future to do the waiting and let us know which WaitHandle was set (this does mean that we’ve got the minor overhead of scheduling a new task to run on the task pool).

The Async.StartImmediate method ensures that the ProcessStream call is marshalled back onto the UI thread. It may be worth in the future looking at doing more of the data processing on a dedicated thread, leaving the UI thread free for drawing and user input.

The convertByteArrayToSequence is simple, it just iterates over the buffer in 4 byte chunks, and converts the values to floats, which it yields in the sequence:

clip_image004

The ProcessStream method looks like this:

clip_image006

For completeness, this is the Seq.sample module extension:

clip_image007

The nastiest bit of the code in ProcessStream is probably towards the end, where the windowedUnderlyingData is manipulated to ensure that the window only contains 1000 samples. It would be nice to do this in a non-imperative way, using Seq.windowed, but the problem is, is that the sequence we’re operating on is only the result of one buffer read operation, whereas windows etc. should operate over the whole data stream, and the sequences can’t be combined into a single sequence using yield! as they’re all generated asynchronously. Similarly, the buffer takes non-overlapping windows over the input sequence, and without taking a buffer over the whole stream, it may miss samples off the end of the sequence portions. Tomas Petricek has written a library, AsyncSeq, http://tomasp.net/blog/async-sequences.aspx which I may investigate in a later post.

The alternative to this would be to implement mailbox processors, having different agents for the buffering, windowing and sampling operations. I did start investigating this, but didn’t feel happy with them being pull-based (they don’t return data unless asked). I could have set them up to act more like a dataflow network, but it does seem to go against their intended use of managing state between different threads. I may revisit this in a future blog post.

I feel that even though F#’s does have nice features which helped to quickly implement the app, RX would probably be a better fit. I guess I won’t know until I implement it in RX and compare the differences.