## Charting Performance

I wasn’t going to blog about performance until a later post, but there was a performance issue I wanted to tackle before discussing some other things.

The frame rate sometimes seemed to be as expected, but sometimes slow, and the application was quite unresponsive when handling the Stop button command.

To help nail this down I created a Buffer Frame Rate performance counter, so that I could monitor it along with a couple of other metrics (I chose to see the count of gen 1 and gen 2 garbage collections, along with the % time in GC, and % processor time).

The average frame rate isn’t bad, but the frame rate is not consistent; it regularly drops below 10 frames per second. This post will discuss optimising the performance of the application to keep the frame rate steady and ensure that the app is responsive.

Performance increases should generally focus on reducing unnecessary work across the whole application, not on micro-optimisations. In this case, that means realising that the top chart is drawing at 40 frames per second (a buffer of 1000 data points from a 40 kHz source), whereas the bottom chart is updating at 400 frames per second (sampling every 100th data point, and updating the chart on each of those sample points).

However, as there are other types of charts I want to produce it’s worth investigating where the time is spent.

First off, I tried turning off anti-aliasing, and moving over to the d3future project (http://d3future.codeplex.com ), to see if that made any performance difference, but the performance was pretty similar, with the CPU pegged at 100%. A further run generated this interesting trace:

The frame rate starts high, but after a while performance drops off dramatically, followed by some long spikes in the % Time in GC. The drop-off seems to correspond with one of the gen 2 collections. This situation wasn’t reproducible enough for me to catch it under a profiler.

My next step was to profile the application with the Visual Studio profiler to see where it was spending its time. The application’s hot path was in the D3 LineChart’s UpdateCore method, where it was updating the chart’s Bounds using data from the EnumerableDataSource – it was iterating over the whole collection to find the x and y min and max values (to fit the chart to the data area).

It seems unnecessary to iterate over the whole data source to get the minimum and maximum x values – for a linear chart (i.e. not a scatter chart) it would be expected that the first point is at the minimum x value and the last point the maximum.

I created a chart derived from LineChart where I could instead pass in a List of points for the datasource – this means that I could get the last point without enumerating the whole collection, allowing the min and max x values to be quickly found. For the y values, I happen to want the chart to be a fixed axis and not scale to fit the data (instead scaled by a user-configurable gain factor), so there was no need at all to iterate over the collection.
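The idea can be sketched as follows. This is a minimal illustration rather than the derived chart itself: `DataPoint`, `ChartBounds`, `XRange` and `YRange` are hypothetical names (not D3 types), and the sketch assumes that the points are already ordered by ascending x and that samples are normalised to the range -1 to 1.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical point type for illustration; not part of D3.
public readonly struct DataPoint
{
    public DataPoint(double x, double y) { X = x; Y = y; }
    public double X { get; }
    public double Y { get; }
}

public static class ChartBounds
{
    // Assumes the points are ordered by ascending X (as in a time series),
    // so the X range is read from the two ends of the list in O(1)
    // instead of enumerating the whole collection.
    public static (double MinX, double MaxX) XRange(IReadOnlyList<DataPoint> points)
    {
        if (points == null) throw new ArgumentNullException(nameof(points));
        if (points.Count == 0) throw new ArgumentException("No points.", nameof(points));
        return (points[0].X, points[points.Count - 1].X);
    }

    // Fixed Y axis: the visible range depends only on the user-set gain,
    // never on the data. Assumes samples are normalised to [-1, 1].
    public static (double MinY, double MaxY) YRange(double gain)
    {
        if (gain <= 0) throw new ArgumentOutOfRangeException(nameof(gain));
        return (-1.0 / gain, 1.0 / gain);
    }
}
```

Calling `ChartBounds.XRange(points)` on a `List<DataPoint>` touches only two elements, however many points the list holds.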

The chart looked like this after those changes:

The frame rate is pretty consistent now. The gen 1 allocations are still occurring at a rate of 200 per minute, but the application is now responsive to the stop click, stopping immediately.

Profiling with memory allocations turned on showed that the type with the most instances was System.Byte[], with 32% of the instances allocated in the profiling period.

ConvertByteArrayToFloatArray() previously looked like this:

And after changing this to not use LINQ:
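As an illustration of the kind of change involved (this is not the original listing), here is a sketch of a LINQ-based conversion next to a LINQ-free one. It assumes that each sample is a 32-bit float stored in the machine's native byte order. `SampleConversion` and both method names are hypothetical.

```csharp
using System;
using System.Linq;

public static class SampleConversion
{
    // LINQ version: concise, but each call allocates a closure, a delegate
    // and iterator objects in addition to the result array.
    public static float[] ConvertWithLinq(byte[] bytes)
    {
        return Enumerable.Range(0, bytes.Length / sizeof(float))
                         .Select(i => BitConverter.ToSingle(bytes, i * sizeof(float)))
                         .ToArray();
    }

    // LINQ-free version: one allocation (the result array) and one block copy.
    // Buffer.BlockCopy takes its count in bytes, not in elements.
    public static float[] ConvertWithBlockCopy(byte[] bytes)
    {
        var floats = new float[bytes.Length / sizeof(float)];
        Buffer.BlockCopy(bytes, 0, floats, 0, floats.Length * sizeof(float));
        return floats;
    }
}
```

Both methods ignore any trailing bytes that do not make up a whole sample.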

Gives the following:

Performance is pretty similar to before, but the gen 1 collections are occurring at a slightly lower rate. Now, profiling tells me that the types with the most instances allocated are pretty much split between System.Action, System.Object, and System.Reactive.Disposables.SingleAssignmentDisposable.

Profiling at this point identifies the majority of the work now being in LineGraph.UpdateCore(), where it’s transforming all points from their data coordinates into screen coordinates.

The transformation of coordinates for many points is embarrassingly parallel (think GPGPU). Potentially, the GPU could be put to use by simply drawing the points untransformed, adding a transformation onto the drawing context.

Further performance increases could be done for the sliding window chart: A sliding window chart redisplays the same points again and again, as they move left along the x axis. It essentially transforms the same point many times into the new viewport. Doing a transform to move an existing point to the left could be more efficient than transforming again from data to screen coordinates.

This blog so far has looked at some micro-optimisations focussing on identifying areas where work can be reduced. As I said in the beginning, if I wasn’t so interested in optimising for future requirements, I’d simply ensure that I was doing less drawing straight away.

Anyway, the sliding observable can be changed to only take every 10th windowed set of values, by introducing another Sample() on the result of the WindowWithCount():

There isn’t actually that much difference in real-world performance.

The rate of gen 1 collections is lower still than before, but the profiler shows that the app is still spending the majority of its time drawing the charts. Surprisingly, the CPU usage doesn’t seem to have dropped much, although in the profiler the time can now be seen to be split between the drawing of the two charts, instead of being dominated by the drawing of the sliding window chart.

The micro-optimisations were still worth investigating, as I have charts in mind that will draw multiple traces simultaneously.

## Oscilloscope using RX and C# Async CTP

In my last blog post I described the implementation of a simple ‘oscilloscope app’ in F#, as I wanted to see how the code would be structured using only F# idioms (http://taumuon-jabuka.blogspot.com/2012/01/visualising-sound.html )

My natural instinct would have been to implement it using the Reactive Extensions for .NET (RX Framework), but I first wanted to investigate a pure F# solution. This post describes an alternate RX implementation.

Similar to my last post, I’ll describe the code inside-out.

### Creating the Observable

This code returns an IObservable with a float array payload, representing each read from the CaptureBuffer. The implementation could have internally started a dedicated Thread from which to push out values, but instead I’m using the new C# 5 async functionality (using the Async CTP Update 3), so that my code looks pretty similar to the previous F# example.

The observable takes a CancellationToken, whose IsCancellationRequested is set to true once all subscriptions to the observable have been disposed. The CompositeDisposable returned out from the lambda is also disposed of at that point.

The code loops while its CancellationToken has not been cancelled, asynchronously awaiting for one of the WaitHandles to be set, and then it reads from the buffer. The value is pushed out to subscribers in the OnNext.

The ConvertByteArrayToFloatArray() is trivial:

### Subscriptions to the observable

First off, the float array returned from the observable is decomposed into the individual values, using a SelectMany, as sampling, buffering and windowing operations all operate on a stream of floats. Then, the observable is published to an IConnectableObservable. The microphone access returns a Cold observable, meaning that each subscriber to it would end up creating their own microphone access. This would work (the CaptureBuffer doesn’t prevent this), but the connectable observable means that instead all clients share the same observable, and ensures that they all see the same values (so that the traces on the two charts are in sync).

The RefCount() means that when all subscriptions to the IConnectableObservable have been disposed of, the subscription to the underlying observable will also be disposed.

The top ‘oscilloscope’ trace is a simple Observable.Buffer() over the data stream. There is no need to ObserveOn the dispatcher thread as the subscription occurs on the UI thread. Using RX, it would be easy to schedule various parts of the work onto different threads, but I’ll discuss this in a later article (I want to keep everything on the UI thread for now to compare performance with the single threaded F# implementation).

All subscriptions are added to a CompositeDisposable member in the ViewModel – the Stop button’s command implementation disposes of this, which causes all subscriptions to be disposed of, and so the microphone access loop to be terminated via its CancellationToken.

The windowing operation simply samples every 100 data points, and from those sampled data points takes a sliding window of 1000 values to display. The windowCount variable is closed over to allow the y axis to be continually updated.

The Sample operator is simple, but not particularly efficient – it takes a buffer (i.e. a non-overlapping window) of count values, and then takes the last value in the buffer.
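To show why the buffer-based approach costs more than it needs to, here is a pull-based analogue written against IEnumerable so that it runs without the RX package. It is a sketch rather than the operator used in the app: `SamplingSketch` and both method names are hypothetical, and trailing values that do not fill a whole buffer are simply dropped.

```csharp
using System;
using System.Collections.Generic;

public static class SamplingSketch
{
    // Mirrors the description above: cut the stream into non-overlapping
    // buffers of `count` values and keep the last value of each buffer.
    public static IEnumerable<T> SampleViaBuffers<T>(IEnumerable<T> source, int count)
    {
        if (count <= 0) throw new ArgumentOutOfRangeException(nameof(count));
        var buffer = new List<T>(count);
        foreach (var item in source)
        {
            buffer.Add(item);
            if (buffer.Count == count)
            {
                yield return buffer[count - 1];
                buffer = new List<T>(count); // a fresh buffer for every sample
            }
        }
    }

    // Same output, but only a counter is kept: no per-sample buffer allocation.
    public static IEnumerable<T> SampleViaCounter<T>(IEnumerable<T> source, int count)
    {
        if (count <= 0) throw new ArgumentOutOfRangeException(nameof(count));
        var seen = 0;
        foreach (var item in source)
        {
            if (++seen == count)
            {
                seen = 0;
                yield return item;
            }
        }
    }
}
```

Both versions yield every `count`-th value. The second never stores the values in between, which is where the buffer-based version spends its allocations.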

The WindowWithCount operator is the same one I discussed at http://taumuon-jabuka.blogspot.com/2011/07/rx-framework-performance-awareness.html (with implementation grabbed from http://social.msdn.microsoft.com/Forums/en-US/rx/thread/37428f58-f241-45b3-a878-c1627deb9ac4#bcdc7b79-bbde-4145-88e4-583685285682 )

And as I also talked about in that post, the RX guidelines recommend implementing an operator in terms of existing operators. There’s only one problem in this case: it’s quite slow (I’ll get quantitative figures on this in a future blog post discussing performance of all approaches).

The following is faster (again, I know more specific figures are needed):
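As a rough sketch of how a sliding window can be maintained directly (this is not the original listing, and it is written against IEnumerable so that it runs without the RX package), a queue can hold the most recent values and drop the oldest as each new one arrives. `SlidingWindowSketch` and `SlidingWindows` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class SlidingWindowSketch
{
    // Yields a snapshot of the most recent `count` values each time a new
    // value arrives, starting once the first full window is available.
    public static IEnumerable<IList<T>> SlidingWindows<T>(IEnumerable<T> source, int count)
    {
        if (count <= 0) throw new ArgumentOutOfRangeException(nameof(count));
        var window = new Queue<T>(count);
        foreach (var item in source)
        {
            window.Enqueue(item);
            if (window.Count > count)
            {
                window.Dequeue(); // drop the oldest value
            }
            if (window.Count == count)
            {
                yield return window.ToArray(); // copy, ordered oldest to newest
            }
        }
    }
}
```

Each incoming value costs one enqueue, at most one dequeue and one array copy, with no nested subscriptions or intermediate buffers.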

### Comparing implementations

For me, the RX solution is cleaner than the F# solution, and easier to follow. I did implement my F# solution in quite an imperative way though, and should have perhaps used AsyncSeq or mailbox processors, but as reading from the microphone is a push-based activity, none of those solutions would be as clean as RX (of course I haven’t covered using RX in F#). The F# version is much faster, and I’ll take a bit more of a dig into performance in an upcoming blog post.

## Visualising Sound

This is the first in a series of blog posts about visualising sound, in a similar way to an oscilloscope. The geek in me thought it’d be a fun thing to do, as well as a way to investigate different technological approaches to implementing the app (interesting as it’s a real-time performant app).

An oscilloscope has a single screen, which refreshes on a given time period, displaying a number of traces. The user can control that time period, as well as the gain (the y axis scale).

The top graph is essentially the same view as an oscilloscope containing a single trace, and a fixed time period (further blog posts may investigate varying time periods).

The bottom graph is a sliding window with a longer time period – this is the advantage of implementing an oscilloscope in code, we can create charts that aren’t really feasible in a classic CRT oscilloscope.

This series of blog posts will investigate implementing this screen using F# idioms, as well as using the Reactive Extension for .NET (RX Framework), and TPL Dataflow.

There are further features that could be implemented in future blog posts, where it may be interesting to see how the varying approaches compare:

· Trigger, or Trigger and hold: Only refresh the oscilloscope view once a certain trigger level has been reached. This is interesting as we may want to include a certain number of samples immediately prior to the point where the trigger was set.

· Log many traces.

· Spectrum analyser/FFT.

· Digital filtering.

· Comparing traces, or more complicated math channels.

· Heatmap.

· Adjustable time offset (delay) – useful on the oscilloscope view to centre a waveform on the screen, or for when comparing two or more channels output.

### F# Implementation

This blog post covers an F# implementation of the microphone, using asynchronous workflows. The graphing is being done by a nightly build download of Dynamic Data Display (D3). I’m really impressed with the performance, but it is quite difficult to customise.

The implementation of this is pretty simple; I’ll start from the detail out.

The inner loop of the program is an asynchronous workflow that reads from the buffer and returns a sequence:

Note the slightly-strange looking code:

F# has an Async.AwaitWaitHandle() method, which unfortunately only waits on a single handle. We want to wait on both handles so that we get notified when the buffer is full every 4096 instead of every 8192 bytes. With 4 bytes per sample, and a sample rate of 44 kHz, this is equivalent to getting notified at an approximate rate of 40 times per second instead of 20 times per second.
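The arithmetic behind those approximate rates, taking the sample rate as a round 44,000 samples per second, works out as:

```latex
\frac{4096\ \text{bytes}}{4\ \text{bytes/sample}} = 1024\ \text{samples},
\qquad
\frac{44000\ \text{samples/s}}{1024\ \text{samples}} \approx 43\ \text{notifications/s}

\frac{8192\ \text{bytes}}{4\ \text{bytes/sample}} = 2048\ \text{samples},
\qquad
\frac{44000\ \text{samples/s}}{2048\ \text{samples}} \approx 21.5\ \text{notifications/s}
```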

I could have implemented Async.AwaitAnyWaitHandle() taking an array of WaitHandles, but looking at the code in the F# PowerPack, the code was quite complex. So, the code instead creates a new future to do the waiting and let us know which WaitHandle was set (this does mean that we’ve got the minor overhead of scheduling a new task to run on the task pool).
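The same trick can be sketched in C# (the language used for the code samples elsewhere in these posts): run the blocking wait on a thread-pool task and let the task's result say which handle was signalled. `WaitHandleSketch` and `WaitAnyAsync` are hypothetical names, and this is an illustration of the approach rather than a translation of the F# code.

```csharp
using System.Threading;
using System.Threading.Tasks;

public static class WaitHandleSketch
{
    // Runs the blocking WaitHandle.WaitAny call on a thread-pool task.
    // The task's result is the index of the handle that was signalled.
    // The cost is one pool thread blocked for the duration of the wait.
    public static Task<int> WaitAnyAsync(params WaitHandle[] handles)
    {
        return Task.Run(() => WaitHandle.WaitAny(handles));
    }
}
```

A caller can then write `var index = await WaitHandleSketch.WaitAnyAsync(firstHalfFull, secondHalfFull);` (both handle names are placeholders) and read from whichever half of the buffer the index identifies.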

The Async.StartImmediate method ensures that the ProcessStream call is marshalled back onto the UI thread. It may be worth in the future looking at doing more of the data processing on a dedicated thread, leaving the UI thread free for drawing and user input.

The convertByteArrayToSequence is simple, it just iterates over the buffer in 4 byte chunks, and converts the values to floats, which it yields in the sequence:

The ProcessStream method looks like this:

For completeness, this is the Seq.sample module extension:

The nastiest bit of the code in ProcessStream is probably towards the end, where the windowedUnderlyingData is manipulated to ensure that the window only contains 1000 samples. It would be nice to do this in a non-imperative way, using Seq.windowed, but the problem is that the sequence we’re operating on is only the result of one buffer read operation, whereas windows etc. should operate over the whole data stream, and the sequences can’t be combined into a single sequence using yield! as they’re all generated asynchronously. Similarly, the buffer takes non-overlapping windows over the input sequence, and without taking a buffer over the whole stream, it may miss samples off the end of the sequence portions. Tomas Petricek has written a library, AsyncSeq, http://tomasp.net/blog/async-sequences.aspx which I may investigate in a later post.

The alternative to this would be to implement mailbox processors, having different agents for the buffering, windowing and sampling operations. I did start investigating this, but didn’t feel happy with them being pull-based (they don’t return data unless asked). I could have set them up to act more like a dataflow network, but it does seem to go against their intended use of managing state between different threads. I may revisit this in a future blog post.

I feel that even though F# does have nice features which helped to quickly implement the app, RX would probably be a better fit. I guess I won’t know until I implement it in RX and compare the differences.

## Family tree timelines

As I talked about in my previous post, I created a simple family tree visualisation program to let me know which of my ancestors I have the most interest for.

I also prototyped up a timeline view – in the above image. It’s probably pretty obvious that it’s a prototype, as there are no labels to identify any of the individuals.

The idea for this came about as genesreunited has a non-validated freeform text field for all entries of dates. There’s no way to validate whether this information is valid for GEDCOM export – the GEDCOM spec allows various kinds of dates, such as approximate dates, bounded dates, dates within a specific quarter etc., to be specified, but obviously the date has to be specified in the correct format.

I wanted to check that all dates were both in the correct format, and actually valid (i.e. check for nonsensical dates such as parents born after their children etc).

The idea behind the colouring is for green to show a GEDCOM format valid date, yellow to specify a missing date (with the date inferred), and red to indicate an invalid date.

The opacity is to indicate the ‘confidence’ in a date – with a specified date range not being at full opacity. Also, the dates can be inferred using a set of rules (e.g. parents should be at least 12 years older than their children, a child is probably born within a year of his or her baptism, the first child is born within a year of his or her parents marriage etc.). These rules could obviously get quite complicated.
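Two of those rules could be sketched as follows. This is an illustration only, not code from the app: `DateRules` and its members are hypothetical names, and the sketch assumes exact dates, whereas real GEDCOM dates may be approximate or ranges.

```csharp
using System;

public static class DateRules
{
    // Rule from the text: parents should be at least 12 years older than their children.
    public const int MinimumParentAgeYears = 12;

    // True when the parent was at least MinimumParentAgeYears old at the child's birth.
    // A parent born after the child fails this check as well.
    public static bool IsPlausibleParent(DateTime parentBirth, DateTime childBirth)
    {
        return parentBirth.AddYears(MinimumParentAgeYears) <= childBirth;
    }

    // Rule from the text: a child is probably born within a year of the baptism.
    // Returns the inferred birth range as (earliest, latest).
    public static (DateTime Earliest, DateTime Latest) InferBirthFromBaptism(DateTime baptism)
    {
        return (baptism.AddYears(-1), baptism);
    }
}
```

An inferred range such as the one returned by `InferBirthFromBaptism` is the sort of value that would be drawn in yellow at reduced opacity, while a failed `IsPlausibleParent` check would be drawn in red.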

The layout pretty much matches the layout of the ancestor chart, with ancestors being adjacent to each other. I was fussing over this a little bit – I thought it’d be nice for the descendants on the chart to be nearest to each other on the x-axis, but there’s no way for this to happen through the tree.

I was feeling pretty happy with this, and thought that it’s probably worth putting into the app (with some tweaks such as making the y-axis an adorner layer that adjusts with scale etc.), but then I found that there’s some software which has a pretty-much identical view to this (I did google around for this before implementing; I obviously didn’t google hard enough).

Progeny Genealogy has an identical layout (OK, turned on its side). It even has the same idea of using opacity to indicate what data is estimated (not varying opacity, but even still). I guess that there’s only so many ways to solve a problem, but it’s still gutting when you think you’ve had an original idea!

## Windows 8 Metro Development Experience

I’ve been running the Windows 8 developer preview for a few weeks, and thought I’d blog about my experiences in converting a Silverlight app I’ve written into a Metro one. I’ll first describe the app, then my experiences developing it.

### The App

The application is a family tree viewing application. It came about as I’ve traced a few branches of my family tree back to the 1600s (with help hooking up with other people on GenesReunited). GenesReunited isn’t very friendly if you want to see the whole tree, as it shuffles all of the individuals around to minimise the screen real estate, which is good for printing, but not so good for making sense of the tree. I was interested in seeing which of my ancestors I had the most history for, and which I wanted to research next, so I created an app to show ancestors without siblings, and to favour clarity over screen real estate.

(When I started writing this app, there didn’t seem to be any family tree apps which showed the whole family tree. I’ve since found that Gramps uses GraphViz to do exactly that).

The screenshot above shows that Metro apps really don’t have much unnecessary chrome. Any infrequently used commands are hidden in the app bar at the bottom of the screen. The next screenshot shows the app at full zoom with the application bar hidden.

### Development

Now I’ve laid out the background, I can talk about the development experience, and some of that will include talking about Windows 8. First off, Metro’s UI is a bit toy-ish, but I can see that it will feel slick on a touchscreen (I’ve got a Windows Phone 7 device, and am struck by the similarities). The green is pretty garish throughout, but I’ve attempted to follow this look in my app.

#### UI and controls

It doesn’t feel as if Microsoft has paid enough attention to using Metro apps with the mouse and keyboard yet. A case in point is in the zooming in this application – I’ve provided a slider in the search bar, but it’s unclear whether there will be a system-wide gesture mapping to pinch-to-zoom in the release. This was necessary for testing, as even though Visual Studio has a simulator which can be used to simulate gestures, the zoom gesture uses the mousewheel, which my laptop does not have.

The actual conversion was surprisingly simple (they promised that this would be the case on the Build videos), but I’m glad that I didn’t hit any blockers. My ViewModels needed minor changes (which I’ll talk about below), and most of my Xaml just got moved across.

The ScrollViewer now allows zooming and panning, so I was able to use this instead of my own ZoomPanControl. Strangely, the zooming is performed by a method instead of a dependency property (a dependency property would make it easier to animate zooms etc.). Also, even though it’s drawing visuals, the zoom seems to take a bitmap at full zoom and simply scale that down (this is speculation on my part, but to my eyes the scaled content looks a lot worse than a WPF version which uses scale transforms). Here’s a WPF version of my app at similar zoom:

The text is almost readable, and the connecting lines aren’t suffering from the ugly aliasing.

There are some further niggles: it seems that the ScrollViewer has a minimum zoom factor of 0.1. Additionally, the slider control, even though it was set with a minimum of 0.05 and maximum of 1, would only display values of 0 and 1 on the popup.

#### Coding

Now I’ve described some of the structural changes, I can describe some of the coding changes. Most of the changes were pretty trivial in the ViewModel – apart from the use of async. The Silverlight version of the app was using the Reactive Framework. Other than reading a couple of articles, I hadn’t gotten into looking into C# 5 async yet, but it was pretty trivial to switch over. Testing it is another matter – I ended up downloading the Async CTP to see their unit testing sample, and found that it was the most complex sample throughout. The Reactive Framework allows you to simply control virtual time using the scheduler, and I haven’t seen anything so simple or elegant for C# 5 async (though Observable.FromAsyncPattern methods are similarly tricky to test, as they don’t use the scheduler requested, relying instead on the underlying IO Completion ports for scheduling the work).

I originally blogged that I was concerned that there wouldn’t be a version of the Reactive Framework for .NET 4.5/WinRT, following some forum rumours. However, the guys have a .NET 4.5 RX build ready.

Other changes in the code mapped naturally across – the FileDialog now returns an IInputStream, but this has a method AsString() to map across to a .NET stream. The List class has had a few of its methods (such as the Sort overload taking a delegate) removed, annoyingly.

Also, now that asynchronous calls are so pervasive, I’m surprised that the Silverlight Toolkit BusyIndicator didn’t make it in.

#### Visual Studio

First, the unit testing tool is definitely pre-beta. I didn’t actually finish investigating how to unit test my C# 5 async method conversions, as I couldn’t stomach using the tool any longer. I’m also not overly-enamoured with the new Find dialog. Otherwise, it seems to be pretty stable.

### Metro-ising

So far I’ve spoken about how easy it was to convert over a Silverlight application to Metro, but Metro does allow many compelling features to be very easily added to the application. Charms could be provided to allow searching of family tree information, and to allow the images to be easily shared.

Once I get hold of a touchscreen device, I’d love to add some snap points to the chart.

### Other thoughts

Overall, I’m pleased with the development experience of targeting Windows 8 Metro. I haven’t spoken about the actual platform in this blog, but after the months of silence and confusion about Windows 8, it’s all good news. .NET developers are still first class citizens.

I’m happy that WinRT is back to native code, and as I have experience in C++/CLI I’m very happy that C++/CX seems pretty much identical. From a .NET coder point of view, it’s a little concerning that .NET apps will have a slight performance disadvantage from the COM Interop of the WinRT projected types, but I suppose that’s simply the whole ‘use .NET for productivity, C++ for performance’ argument. And having worked on a few apps that had all of the performant (and usually legacy) parts of the system in C++, with a .NET UI, and the subsequent marshalling layer, it’s quite heartening to think that now we can stay in C++ and write a fast and fluid UI without changing languages.

## RX Framework Performance Awareness

In a post a while ago, here, I implemented a DifferentiateWithTime operator, which was implemented in terms of other RX operators, and at the end I closed out by saying “It does say in section 6.1 of the RX Design Guidelines that new operators should be composed of existing operators, unless performance is a concern – this is something I may get around to investigating in a future blog post.” and thought I’d finally get around to looking at this.

As a recap, the differentiate operator was written using a sliding window operator (updated to use RX 1.1 experimental):

    public static IObservable<double> DifferentiateWithTime
        (this IObservable<double> obs)
    {
        return (from j in obs.SlidingWindow(2)
                select j[1] - j[0]);
    }

    // from http://social.msdn.microsoft.com/Forums/en-US/rx/thread/37428f58-f241-45b3-a878-c1627deb9ac4#bcdc7b79-bbde-4145-88e4-583685285682
    public static IObservable<IList<TSource>>
        SlidingWindow<TSource>(this IObservable<TSource> source,
        int count)
    {
        Contract.Requires(source != null);
        Contract.Requires(count >= 0);
        return source.Publish(published =>
            from x in published
            from buffer in published.StartWith(x).Buffer(count).Take(1)
            where buffer.Count == count
            select buffer
        );
    }

I created a simple test fixture to exercise this.

    const double accelerationGravity = 9.81;
    var count = 10000.0;
    var positions = Observable.Generate(0.0,
        i => i < count,
        i => i + 1.0,
        i => accelerationGravity * i * i / 2.0);
    var stopwatch = new Stopwatch();
    stopwatch.Start();
    var velocity = positions.DifferentiateWithTime();
    var sumVelocity = 0.0;
    using (velocity.Subscribe(i =>
        { sumVelocity += i; })) { };
    var acceleration = velocity.DifferentiateWithTime();
    var sumAcceleration = 0.0;
    using (acceleration.Subscribe(i =>
        { sumAcceleration += i; })) { };
    stopwatch.Stop();
    Console.WriteLine("sumVel:{0} sumAcc:{1} time:{2}",
        sumVelocity,
        sumAcceleration,
        stopwatch.ElapsedMilliseconds);

Monitoring in Perfmon, I saw that during the 10000 iterations there were 170 Generation 0 garbage collections, which does seem quite overkill. The total time was 6580 milliseconds.

I then implemented the DifferentiateWithTime operator directly:

    public static IObservable<double> DifferentiateWithTime
        (this IObservable<double> obs)
    {
        return Observable.Create<double>(o =>
        {
            double previousValue = 0.0;
            var initialized = false;
            return obs.Subscribe(i =>
            {
                if (initialized)
                {
                    o.OnNext(i - previousValue);
                }
                else
                {
                    initialized = true;
                }
                previousValue = i;
            });
        });
    }

This time there were only 4 Generation 0 garbage collections during the test run, and even more excitingly, the execution time was 379 milliseconds. A 17-fold speedup is quite impressive, and shows that it’s worth being careful!

It’s likely that other operations based on sliding windows (rolling averages, VWAP etc.) may have similar issues, and may benefit from a similar rewrite. Instead of forcing the user to manually make these changes, it may be possible to do this in a more automated way (I’ve got some ideas I’ll play with when I get time).



## Quick Play with the AMD Stream SDK

I mentioned in my last blog post that I was disappointed with the performance of Microsoft Accelerator, and wanted to play around with Brahma. I was going to do this sooner, but have been side-tracked with playing around with XNA on Windows Phone 7.

I downloaded the latest OpenCL version of Brahma, but had trouble with the nested loops and aggregation operations (force summations), so didn’t get as far as I’d hoped. It’s a shame, as the concept of LINQ to GPU is a great one.

I then took a look at running the OpenCL NBody simulation from the Stream SDK. I couldn’t get the simulation to run using the GPU despite trying various Catalyst versions; it failed with a runtime error message "This OpenCL build requires verison 1.4.879, version 1.4.696 installed". In spite of this, I was impressed with the performance of the Stream SDK, even running on the CPU.

Whereas my managed CPU-version of the nbody simulation achieved 5 fps (frames per second), (or 8 fps with the drawing disabled – as discussed earlier the WPF drawing code is slow), drawing 2000 bodies, the OpenCL SDK ran at 25 fps drawing 2048 bodies, i.e. a factor of 5 speedup. I didn’t bother to parallelise my code but the theoretical maximum speedup on my dual core machine would obviously be a factor of 2, so that’s a factor of 2.5 speedup using the Stream SDK on the same hardware.

I switched the Stream SDK NBody example to use the nBodyCPUReference() method to see whether it’s slow because of the difference between managed and native code, and it runs at 5 fps compiled native on the CPU, i.e. in the same ballpark as the managed version. As it’s not running on the GPU, the Stream version must be faster than the vanilla C++ version because it’s making use of the processor’s vector hardware, but I can’t be bothered to manually code the SSE intrinsics to see if that’s the case (but it might be cool to play around with Mono.SIMD if I get time).

Oh, I suppose I should talk about how the code looks – the guts of the algorithm don’t look much different between the vanilla C++ and the OpenCL version, but there is a lot of hideous boilerplate/setup code that differs between the two. This is why it’d be great to get a workable managed library to hide all this (alternatively, it’ll be interesting to see whether C++ AMP abstracts away the OpenCL/DirectCompute complexity).

## GPGPU–playing with Microsoft Accelerator

It’s probably screamingly obvious to some readers that the boids simulations I’ve been playing with are embarrassingly parallel, so I thought I’d have a quick play.

I’ve been reading around about OpenCL and CUDA, but as there’s a Microsoft library with a .NET API for easily programming the GPU, I thought I’d have a play with Accelerator (another interesting .NET GPGPU library is Brahma – I might get around to playing with that one day). Accelerator is higher-level, no need to worry about the low-levels of the GPU memory management.

I decided to play with a simpler example than the boids, to focus on the technology instead of the problem domain. I chose to look at the all-pairs NBody simulation (see more info here).
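For readers unfamiliar with the all-pairs approach, here is a minimal CPU sketch of the force (acceleration) calculation in two dimensions. It is not the code from my simulation: `NBodySketch`, the parameter names and the softening term (a small constant that avoids division by zero when two bodies coincide) are assumptions made for the illustration.

```csharp
using System;

public static class NBodySketch
{
    // All-pairs gravitational acceleration in 2D. For each body i, sums the
    // pull from every other body j, giving O(n^2) work per frame.
    // Results are written into ax and ay, which must have the same length as x.
    public static void ComputeAccelerations(
        double[] x, double[] y, double[] mass,
        double g, double softening,
        double[] ax, double[] ay)
    {
        int n = x.Length;
        for (int i = 0; i < n; i++)
        {
            double sumX = 0.0, sumY = 0.0;
            for (int j = 0; j < n; j++)
            {
                if (i == j) continue; // a body exerts no force on itself
                double dx = x[j] - x[i];
                double dy = y[j] - y[i];
                double distSq = dx * dx + dy * dy + softening * softening;
                double invDistCubed = 1.0 / (distSq * Math.Sqrt(distSq));
                sumX += g * mass[j] * dx * invDistCubed;
                sumY += g * mass[j] * dy * invDistCubed;
            }
            ax[i] = sumX;
            ay[i] = sumY;
        }
    }
}
```

Each iteration of the outer loop is independent of the others, which is what makes the problem such a natural candidate for the GPU.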

I quickly coded up a simple example using 1000 bodies. The CPU was able to draw at approx 15 frames per second (I didn’t bother parallelising the simulation on the CPU, as I was hoping for an order of magnitude increase in speed on the GPU). WPF is incredibly slow in drawing, and I found (unexpectedly) DrawingVisuals to be even slower. For that reason, I’m only drawing 100 bodies, but all of them are included in the simulation. I was intending to reduce the bottleneck by using Direct2D, and then getting Accelerator to write out to texture memory to save transferring data over the bus.

I didn’t get the results I expected when using Accelerator. I began by moving the main simulation routine (integration of positions) onto the GPU, and left the all-body force calculation on the CPU. I was surprised to find the simulation slower – I was hitting frame rates of 10 fps.

I guessed that it was maybe transferring too much data between the CPU and GPU, so I then moved the force calculation onto the GPU as well. I was very surprised to find that this made the simulation orders of magnitude slower (i.e. hitting frame rates < 0.01 fps). I profiled this and found that the majority of the time was spent in CompileShader. This isn’t so surprising – I was building up the same calculation for each body, for each frame.

Following the advice in the Accelerator Programmers Guide, I then moved onto using Parameter Objects. This means that it’s able to reuse the same computation graph with different input data. This did help, but only by an order of magnitude. It still doesn’t come anywhere near real-time frame rates.

I can’t remember where I read it, but the recommendation is to use input data sizes of the order of 1e6 elements to overcome the overhead of transferring data to and from the GPU. This does make sense, but I was expecting to at least get interactive frame rates (as the OpenCL simulations do). It may be that Accelerator is faster than the CPU with a large number of elements, but that may mean that e.g. instead of rendering a frame in an hour, it takes five minutes. It doesn’t seem to be suitable for interactive simulations.

This could be a simple case of user error. I’ve got the code available on taumuon. If I’m missing something obvious, or you can get faster frame rates than the CPU, please post in the comments.

(As I’m discussing performance I guess I should disclose the software and hardware specs: running Windows x64 on an HP DV3 laptop – 4 GB RAM, dual-core Pentium P6100, ATI Radeon 5470.)

## F# Flocking (boids)

A couple of years ago I was blogging about some simple boid behaviour I had implemented in C# (here). I came across this paper which had some interesting ideas I wanted to try out: "Self-organised complex aerial displays of thousands of starlings: a model" by H. Hildenbrandt, C. Carere, C-K. Hemelrijk.

As I’m getting more into F#, I thought this would be a simple, reasonably sized application to write, to get a feel for how F# works on a slightly larger project (though definitely not large-scale).

Before taking a look at the code, I’ll quickly describe which parts of the paper I implemented.

The video generated is available on http://www.youtube.com/watch?v=eil5K7Ir3i8

I haven’t implemented all of the ideas in the paper, so my simple application doesn’t exhibit the same realistic flocking behaviour, but it does have more realistic behaviour than my earlier efforts. The interesting behaviours were the desire to not stray too far from the roost, the attempt to maintain a steady cruise speed, and to use the nearest seven topological neighbours for cohesion and alignment. I implemented the simulation in 2D for simplicity.

I hardcoded a perception radius instead of implementing the continual adjustment of the perception radius for seven topological neighbours. I have also totally omitted the flight dynamics simulation (no gravity, no banked turns). The paper notes that the banked turns give a further air of realism to the simulation.
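The difference between topological neighbours (a fixed number of flockmates) and a fixed perception radius (a fixed distance) can be sketched as below. This is a standalone illustration using plain 2D tuples; the names `nearestK` and `withinRadius` are assumptions and are not the functions used in the listing further down. It also assumes a current F# version, where `Array.truncate` is available.

```fsharp
// Two ways of choosing which flockmates influence a boid (positions as 2D tuples).
// Function names are illustrative and are not the ones used in the post's source.
let distance (x1: float, y1: float) (x2: float, y2: float) =
    sqrt ((x2 - x1) ** 2.0 + (y2 - y1) ** 2.0)

/// Topological: the k closest flockmates, however far away they are.
let nearestK (k: int) (self: float * float) (others: (float * float)[]) =
    others
    |> Array.sortBy (distance self)
    |> Array.truncate k   // unlike Array.sub, copes with fewer than k flockmates

/// Metric: every flockmate within a fixed radius, however many there are.
let withinRadius (radius: float) (self: float * float) (others: (float * float)[]) =
    others |> Array.filter (fun p -> distance self p <= radius)

let flock = [| (1.0, 0.0); (2.0, 0.0); (10.0, 0.0); (50.0, 0.0) |]
let topological = nearestK 3 (0.0, 0.0) flock      // the three closest, including (10.0, 0.0)
let metric = withinRadius 5.0 (0.0, 0.0) flock     // only the two within 5 units
```

With a topological rule a boid on the sparse edge of the flock still reacts to the same number of neighbours, whereas with a fixed radius it may see very few or none.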

### The coding

This was a real pleasure to implement in F#. A lot less time was spent thinking about application design, classes and their interactions; the logic in the paper was easy to transfer directly to the keyboard. F# is very compact, so the core logic can be seen on just a couple of screens.

I didn’t feel the lack of intellisense too much when developing the core algorithm, but I did miss the ability to refactor functions and navigate to usages. I especially missed intellisense when developing the WPF visualisation part of the app, when interacting with .NET objects; I also missed the ability for Visual Studio to automatically add the required namespaces. I must have been spoilt by Visual Studio and Resharper for too long!

The actual WPF application wasn’t such a good fit for F# – there’s no generated code-behind file for the XAML, and I feel that using F# would be painful in a WPF or Silverlight application (but only for the view; it should be OK from the ViewModels down).

I implemented an F# Vector type, which can be specified with units of measure (or none). I used units of measure throughout – this was really powerful, and did let me quickly find a few bugs in the implementation.
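As a rough illustration of the kind of bug units of measure catch, here is a small standalone sketch. It declares its own units rather than opening the PowerPack `SI` module, and the `integrate` function is a hypothetical single-coordinate helper, not part of the flocking code; only the values of `mass` and `timeStep` are borrowed from the listing.

```fsharp
// Units of measure are checked at compile time and erased at run time.
// The units are declared locally so the sketch does not depend on the PowerPack SI module.
[<Measure>] type m
[<Measure>] type s
[<Measure>] type kg
[<Measure>] type N = kg m / s^2

let mass = 0.08<kg>
let timeStep = 0.005<s>

/// One Euler step for a single coordinate (hypothetical helper for illustration).
let integrate (position: float<m>) (velocity: float<m/s>) (force: float<N>) =
    let acceleration = force / mass                       // float<m/s^2>
    let newVelocity = velocity + acceleration * timeStep  // float<m/s>
    let newPosition = position + newVelocity * timeStep   // float<m>
    newPosition, newVelocity

// Forgetting a "* timeStep" is a compile-time error rather than a silent bug:
// let wrong = 10.0<m> + 20.0<m/s>   // does not compile: m/s cannot be added to m
```

The annotations on the parameters are the only units written by hand; the compiler infers the units of every intermediate value.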

(NOTE: I still haven’t found a code snippet tool I’m happy with.) Alternatively, the zipped-up solution can be downloaded from http://www.taumuon.co.uk/jabuka/FSharpFlock.zip

Vector3D.fs:

    module Vector3D

    type Vector3D<[<Measure>] 'u>(x : float<'u>, y : float<'u>, z : float<'u>) =
        static member Zero() = Vector3D<_>(0.0<_>, 0.0<_>, 0.0<_>)
        member v.X = x
        member v.Y = y
        member v.Z = z
        static member (+) (lhs:Vector3D<_>, rhs:Vector3D<_>) =
          Vector3D(lhs.X + rhs.X, lhs.Y + rhs.Y, lhs.Z + rhs.Z)
        static member (-) (lhs:Vector3D<_>, rhs:Vector3D<_>) =
          Vector3D(lhs.X - rhs.X, lhs.Y - rhs.Y, lhs.Z - rhs.Z)
        static member (*) (v:Vector3D<_>, a:float<_>) =
          Vector3D(v.X * a, v.Y * a, v.Z * a)
        static member (*) (a:float<_>, v:Vector3D<_>) =
          Vector3D(a * v.X, a * v.Y, a * v.Z)
        static member (/) (v:Vector3D<_>, a) =
          Vector3D(v.X / a, v.Y / a, v.Z / a)
        member v.DotProduct(rhs:Vector3D<_>) = (v.X * rhs.X) + (v.Y * rhs.Y) + (v.Z * rhs.Z)
        member v.magnitude = sqrt(v.DotProduct(v)) * 1.0<_>
        member lhs.CrossProduct(rhs:Vector3D<_>) =
          Vector3D((lhs.Y * rhs.Z - lhs.Z * rhs.Y) * 1.0<_>,
              (-lhs.X * rhs.Z + lhs.Z * rhs.X) * 1.0<_>,
              (lhs.X * rhs.Y - lhs.Y * rhs.X) * 1.0<_>)
        member v.normalise =
          let magnitude = float v.magnitude
          Vector3D<_>((v.X / magnitude), (v.Y / magnitude), (v.Z / magnitude))

    let sumVectors(vectors : Vector3D<_>[]) =
      let initial = Vector3D<_>(0.0<_>, 0.0<_>, 0.0<_>)
      Array.fold (+) initial vectors

BoidUtils.fs:

    module BoidUtils

    open Microsoft.FSharp.Math
    open Vector3D
    open SI

    let radiusRoost = 150.0<m>
    let hardRadius = 2.0<m> // 0.2<m>
    let mass = 0.08<kg>
    let timeStep = 0.005<s>
    let relaxationTime = 0.05<s>
    let cruiseSpeed = 20.0<m/s>
    let horizontalRoostWeighting = 0.01<N/m>
    let weightingAlignment = 0.5<kg * s^-2>
    let weightingCohesion = 1.0<kg s^-2>
    let weightingSeparation = 2.0
    let perceptionRadius = 50.0<m>

    type BodyAxes =
      { Forward:Vector3D<1>;
        Side:Vector3D<1>;
        Up:Vector3D<1> }

    type Boid =
      { Position:Vector3D<m>;
        Speed:float<m / s>;
        Orientation:BodyAxes; }

    // All parameterless functions are evaluated once, just on module opening, so pass random in.
    let InitialiseRandomPosition(rand:System.Random) =
      Vector3D<m>((300.0 * (rand.NextDouble()-0.5)) * 1.0<m>,
                 ((300.0 * (rand.NextDouble()-0.5)) * 1.0<m>),
                  0.0<m>)

    let InitialiseRandomVelocity(rand:System.Random) =
      Vector3D<m/s>((100.0 * (-0.5 + rand.NextDouble()) * 1.0<m/s>),
                    (100.0 * (-0.5 + rand.NextDouble())) * 1.0<m/s>,
                    0.0<m/s>)

    let InitialiseRandomOrientation(rand:System.Random) =
      {Forward=Vector3D(0.0, 1.0, 0.0);
       Side=Vector3D(1.0, 0.0, 0.0);
       Up=Vector3D(0.0, 0.0, 1.0)}

    let setOrientation(oldOrientation:BodyAxes, velocity:Vector3D<m/s>) =
      let normalisedVelocity = velocity.normalise
      let y = normalisedVelocity.CrossProduct(Vector3D<m / s>(0.0<m/s>, 0.0<m/s>, 1.0<m/s>))
      {oldOrientation with Forward=normalisedVelocity * 1.0<m^-1 s>; Side=y*1.0<m^-2 s^2>}

    let calculateCruiseSpeedForce (boid:Boid) =
      (mass / relaxationTime) * (cruiseSpeed - boid.Speed) * boid.Orientation.Forward

    let calculateRoostForce (boid:Boid) =
      let horizontalPosition = Vector3D(boid.Position.X, boid.Position.Y, 0.0<_>)
      let distanceFromOrigin = horizontalPosition.magnitude
      match (distanceFromOrigin) with
      | _ when distanceFromOrigin < radiusRoost -> Vector3D<N>(0.0<N>, 0.0<N>, 0.0<N>)
      | _ -> let normalRoostingArea = horizontalPosition.normalise
             let d = boid.Orientation.Forward.DotProduct normalRoostingArea
             let distanceFromRoost = distanceFromOrigin - radiusRoost
             let orientationRoostDotProduct = boid.Orientation.Side.DotProduct normalRoostingArea
             let weightingFactor = match (orientationRoostDotProduct) with
                                   | _ when orientationRoostDotProduct > 0.0<m> -> -1.0
                                   | _ -> 1.0
             weightingFactor * (radiusRoost * horizontalRoostWeighting * (0.5 + (0.5<m^-1> * d)) * (boid.Orientation.Side))

    let findDistanceBetweenBoids(boid:Boid, other:Boid) =
      (boid.Position - other.Position).magnitude

    let findNearestNeighbours(boid:Boid, boids:Boid[]) =
      let sortedByDistance = boids |> Array.sortBy(fun other -> findDistanceBetweenBoids(boid, other))
      Array.sub sortedByDistance 0 7

    let findAverageForwardDirectionDifference(boid:Boid, boids:Boid[]) =
      let differences = boids |> Array.map (fun i -> 1.0<m> * (i.Orientation.Forward - boid.Orientation.Forward))
      let sumDifferences = sumVectors(differences)
      (1.0 / (float sumDifferences.magnitude)) * sumDifferences

    let calculateAlignmentForce(boid:Boid, nearest:Boid[]) =
      let averageDifference = findAverageForwardDirectionDifference(boid, nearest)
      weightingAlignment * averageDifference

    let findAveragePosition(boid:Boid, boids:Boid[]) =
      let positions = boids |> Array.map (fun i -> i.Position)
      let sumPositions = sumVectors(positions)
      (1.0 / float boids.Length) * sumPositions

    let findNeighboursInRadius(boid:Boid, boids:Boid[], radius:float<m>) =
      boids |> Array.filter(fun other -> other <> boid && findDistanceBetweenBoids(boid, other) <= radius)

    let calculateCentrality(boid:Boid, boids:Boid[]) =
      let separations = boids |> Array.map(fun i -> (i.Position - boid.Position).normalise)
      let sumSeparations = sumVectors(separations)
      let count = boids.Length
      match (count) with
      | 0 -> 1.0
      | _ -> (1.0 / float count) * (sumSeparations.magnitude / 1.0<m>)

    let calculateCohesionForce(boid:Boid, nearest:Boid[], boidsInPerceptionRadius:Boid[]) =
      let boidsOutsideHardRadius = nearest |> Array.filter(fun i -> abs ((boid.Position - i.Position).magnitude) > hardRadius)
      let centrality = calculateCentrality(boid, boidsInPerceptionRadius)
      let averagePosition = findAveragePosition(boid, nearest)
      centrality * weightingCohesion * (averagePosition - boid.Position)

    let calculateSeparationForce(boid:Boid, boidsInPerceptionRadius:Boid[]) =
      let nearest = boidsInPerceptionRadius
      let separations = nearest |> Array.map(fun i -> i.Position - boid.Position)
      let sigma = 1.8
      let forcesToNeighbours = separations |> Array.map(fun i ->
        let magnitude = i.magnitude
        let multiplier =
            match (magnitude) with
            | _ when magnitude < hardRadius -> 1.0
            | _ -> System.Math.Exp(-((magnitude - hardRadius)*(magnitude - hardRadius)/1.0<m^2>) / (sigma * sigma))
        multiplier * magnitude * (i.normalise) * 1.0<kg * m^-1 * s^-2>)
      let sumForces = sumVectors(forcesToNeighbours)
      match (nearest.Length) with
        | _ when (nearest.Length) = 0 -> Vector3D<N>.Zero()
        | _ -> (-weightingSeparation / float nearest.Length) * sumForces

    let calculateSocialForce(boid:Boid, boids:Boid[]) =
      let nearest = findNearestNeighbours(boid, boids)
      let boidsInPerceptionRadius = findNeighboursInRadius(boid, boids, perceptionRadius)
      calculateAlignmentForce(boid, nearest)
      + calculateCohesionForce(boid, nearest, boidsInPerceptionRadius)
      + calculateSeparationForce(boid, boidsInPerceptionRadius)

    let calculateForce (boid:Boid, boids:Boid[]) =
      (boid |> calculateRoostForce)
      + (boid |> calculateCruiseSpeedForce)
      + (calculateSocialForce(boid, boids))

    let iterateBoid (boid:Boid, boids:Boid[]) =
      let originalPosition = boid.Position
      let originalVelocity = boid.Speed * boid.Orientation.Forward
      let force = calculateForce(boid, boids)
      let acceleration = force/mass
      let velocity = originalVelocity + (acceleration * timeStep)
      let position = originalPosition + (velocity * timeStep)
      let newOrientation = setOrientation(boid.Orientation, velocity)
      {Position=position;Speed=velocity.magnitude;Orientation=newOrientation}



Program.fs:

    module Boids

    open SI
    open Vector3D
    open BoidUtils
    open System
    open System.IO
    open System.Windows
    open System.Windows.Threading
    open System.Windows.Controls
    open System.Windows.Shapes
    open System.Windows.Media
    open System.Windows.Media.Imaging

    let window = Application.LoadComponent(new System.Uri("/FSharpFlock;component/MainWindow.xaml",
                                               System.UriKind.Relative)) :?> Window

    let rand = new System.Random()
    let viewWidth = 480.0
    let viewHeight = 360.0
    let mutable frameCount = 0

    let mutable boids =
      [|for i in 0 .. 300 ->
        let position = InitialiseRandomPosition(rand)
        let velocity = InitialiseRandomVelocity(rand)
        let tempOrientation = {Forward=Vector3D(0.0, 1.0, 0.0);
                               Side=Vector3D(1.0, 0.0, 0.0);
                               Up=Vector3D(0.0, 0.0, 1.0)}
        let orientation = setOrientation(tempOrientation, velocity)
        {Position = position; Speed=velocity.magnitude; Orientation=orientation}
        |];;

    let updateBoids(boids) =
      boids |> Array.map (fun boid -> iterateBoid(boid, boids))

    let (?) (fe:FrameworkElement) name : 'T =
      fe.FindName(name) :?> 'T

    let GetRotationForBoid(boid:Boid) =
      let forward = boid.Orientation.Forward
      let angleRadians = Math.Atan2(forward.X, forward.Y)
      let angleDegrees = angleRadians * 180.0 / Math.PI
      let rotateTransform = new Media.RotateTransform(angleDegrees, 0.0, 0.0)
      rotateTransform

    let color = Media.Color.FromArgb(64uy, 255uy, 0uy, 0uy)

    let saveFrame(canvas : Canvas) =
      let size = new Size(viewWidth, viewHeight)
      canvas.Measure(size)
      canvas.Arrange(new Rect(size))
      let renderTargetBitmap = new RenderTargetBitmap(int viewWidth, int viewHeight, 96.0, 96.0, PixelFormats.Pbgra32)
      let sourceBrush = new VisualBrush(canvas)
      renderTargetBitmap.Render(canvas)
      let bitmapFrame = BitmapFrame.Create(renderTargetBitmap)
      let jpegEncoder = new JpegBitmapEncoder()
      jpegEncoder.Frames.Add(bitmapFrame)
      let filename = System.String.Format("C:Anim{0:0000}.jpg", frameCount)
      use stream = new FileStream(filename, FileMode.CreateNew)
      jpegEncoder.Save(stream)

    let createBoidGraphics(boid:Boid) =
      let obj : Polygon = new Polygon()
      obj.Points <- new Media.PointCollection( [|new Point(-10.0, -5.0);
                                                 new Point(0.0, 0.0);
                                                 new Point(-10.0, 5.0);
                                                 new Point(-5.0, 0.0)|] )
      obj.RenderTransform <- GetRotationForBoid(boid)
      obj.Fill <- new Media.SolidColorBrush(color)
      obj.Stroke <- Media.Brushes.Black
      obj.StrokeThickness <- 1.0
      obj

    let drawBoids() =
      let win : Window = window
      let canvas : Canvas = win?canvas
      canvas.Children.Clear()
      for i = 0 to boids.Length - 1 do
        let graphicalBoid = createBoidGraphics(boids.[i])
        let unitlessPosition = boids.[i].Position * 1.0<m^-1>
        System.Windows.Controls.Canvas.SetTop(graphicalBoid, (viewHeight / 2.0) + (unitlessPosition.X))
        System.Windows.Controls.Canvas.SetLeft(graphicalBoid, (viewWidth / 2.0) + (unitlessPosition.Y))
        canvas.Children.Add(graphicalBoid) |> ignore
      // saveFrame(canvas)
      frameCount <- frameCount + 1

    let timer = new DispatcherTimer();
    timer.Tick.Add(fun evArgs ->
      [0..5]
      |> Seq.iter(fun x -> boids <- boids |> updateBoids)
      drawBoids()
      )
    timer.Interval <- TimeSpan.FromMilliseconds(1.0);

    let setup(win: Window) =
      win.Loaded.Add(fun _ -> timer.Start())

    [<STAThread>]
    [<EntryPoint>]
    let main args =
      setup window
      (new Application()).Run(window) |> ignore
      0

MainWindow.xaml:

    <Window xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
            xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml"
            Title="F# Flock" Height="360" Width="480">
        <Canvas Name="canvas" HorizontalAlignment="Stretch" VerticalAlignment="Stretch"
         Background="LightBlue"/>
    </Window>

The rest of the code I hope is simple to read and understand. Coding this up was some of the most fun I’ve had programming for a while. With F#, I found I was able to concentrate on the core of the algorithm and very quickly see results.

For this type of project, very algorithmic, F# was a perfect fit. I’d like to get a feel for how to use F# on a large-scale project, and to see how it feels to expose the functionality via an object-oriented layer.

## F# Units of Measure with the Reactive Framework

My last blog post dealt with manipulating incoming streams of data, such as position data, and composing those streams, and we saw how well RX handles these operations.

As I was doing this, I felt that this would benefit from the extra safety that you can get from F#’s Units of Measure feature.

We can implement an operation to find the separation between two positions (the same as in my previous C# post):

    open System
    open System.Linq
    open System.Threading
    open System.Collections.Generic
    open Microsoft.FSharp.Math
    open Microsoft.FSharp.Linq
    open Microsoft.FSharp.Linq.Query

    module ObservableEx =
      type System.IObservable<'u> with
        member this.WindowWithCount(count:int)
          = this.Publish(Func<_,_>(fun (p:IObservable<'u>) ->
            p.SelectMany(
              Func<_,_>(fun x -> p.StartWith(x).BufferWithCount(count).Take(1)),
              Func<_,_,_>(fun x buffer -> buffer)).Where(
                Func<IList<'u>,_>(fun x -> x.Count = count)).Select(
                  Func<_,_> (fun x -> x))));

    open ObservableEx

    let accelerationGravity = 9.81;
    let positions = Observable.Generate(0.0,
                      Func<_,_>(fun i -> i < 10.0),
                      Func<_,_>(fun i -> i + 1.0),
                      Func<_,_>(fun i -> accelerationGravity * i * i / 2.0));
    let positions2 = Observable.Generate(0.0,
                      Func<_,_>(fun i -> i < 10.0),
                      Func<_,_>(fun i -> i + 1.0),
                      Func<_,_>(fun i -> i));
    let separation = Observable.Zip(positions,
                       positions2, Func<_,_,_>(fun i j -> j - i));
    let res = separation.Subscribe(fun i -> i |> printfn "%f");

But let’s see what happens if we accidentally do something physically meaningless, such as subtracting a position from a velocity:

    let accelerationGravity = 9.81;
    let positions = Observable.Generate(
                      0.0,
                      Func<_,_>(fun i -> i < 10.0),
                      Func<_,_>(fun i -> i + 1.0),
                      Func<_,_>(fun i -> accelerationGravity * i * i / 2.0));
    let positions2 = Observable.Generate(0.0,
                       Func<_,_>(fun i -> i < 10.0),
                       Func<_,_>(fun i -> i + 1.0),
                       Func<_,_>(fun i -> i));
    let DifferentiateWithTime (input: IObservable<'a>) =
      input.WindowWithCount(2).Select(fun (j:IList<'a>) -> (j.[1]-j.[0]));
    let velocities = DifferentiateWithTime(positions);
    let separation = Observable.Zip(positions, velocities, Func<_,_,_>(fun i j -> j - i));
    let res = separation.Subscribe(fun i -> i |> printfn "%f");

The program compiles and runs normally (as we’d expect: all the compiler knows is that it’s dealing with some float values).

Now, let’s annotate our code with units of measure (I’m using the F# PowerPack). We can calculate the difference between two positions:

    let accelerationGravity = 9.81<SI.m SI.s^-2>
    let positions = Observable.Generate(0.0<SI.s>, Func<_,_>(fun i -> i < 10.0<SI.s>),
                      Func<_,_>(fun i -> i + 1.0<SI.s>),
                      Func<_,_>(fun i -> accelerationGravity * i * i / 2.0));
    let positions2 = Observable.Generate(0.0<SI.s>,
                      Func<_,_>(fun i -> i < 10.0<SI.s>),
                      Func<_,_>(fun i -> i + 1.0<SI.s>),
                      Func<_,_>(fun i -> 5.0 * accelerationGravity * i * i / 2.0));
    let DifferentiateWithTime (input: IObservable<float<_>>) =
      input.WindowWithCount(2).Select(
        fun (j:IList<float<_>>) -> (((j.[1]-j.[0])/1.0<SI.s>)));
    let velocities = DifferentiateWithTime(positions);
    let accelerations = DifferentiateWithTime(velocities);
    let separation = Observable.Zip(positions, positions2, Func<_,_,_>(fun i j -> j - i));
    let res = separation.Subscribe(fun i -> float i |> printfn "%f");
    // Next line will not compile
    //let wrongseparation = Observable.Zip(
    //                        accelerations, velocities, Func<_,_,_>(fun i j -> j - i));

But if we instead try to calculate the difference between streams with different units (the commented-out line subtracts an acceleration from a velocity), the code will no longer compile. This is very cool.

We can also do the same by annotating our IObservables with units:

    let accelerationGravity = 9.81<SI.m SI.s^-2>
    let DifferentiateWithTime (input: IObservable<float<_>>) =
      input.WindowWithCount(2).Select(
        fun (j:IList<float<_>>) -> (((j.[1]-j.[0])/1.0<SI.s>)));
    let positions = Observable.Generate(0.0<SI.s>,
                      Func<_,_>(fun i -> i < 10.0<SI.s>),
                      Func<_,_>(fun i -> i + 1.0<SI.s>),
                      Func<_,_>(fun i -> accelerationGravity * i * i / 2.0));
    let velocities = DifferentiateWithTime(positions);
    let accelerations = DifferentiateWithTime(velocities);
    let res = positions.Subscribe(fun i -> float i |> printfn "%f");
    printfn "-- velocities -- ";
    let res2 = velocities.Subscribe(fun i -> float i |> printfn "%f");
    printf "-- accelerations -- ";
    let res3 = accelerations.Subscribe(fun i -> float i |> printfn "%f");

I love this feature, and can see that it would be incredibly useful with RX, as RX statements can combine all sorts of streams of data into a complex operation.

(In the original version of this post I mistakenly thought that I couldn’t make the DifferentiateWithTime method generic over the units of measure, but I was saved by a posting on stackoverflow, here.)
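To show the unit-generic idea in isolation, here is a sketch of the same differentiation over a plain list instead of an `IObservable`, so it runs without RX or the PowerPack. The units are declared locally, and `differentiateWithTime` here is an illustrative list-based stand-in, not the method from the listings above. It assumes a current F# version, where `List.pairwise` is available.

```fsharp
[<Measure>] type m
[<Measure>] type s

/// Finite differences between successive samples taken dt apart.
/// Generic in the unit 'u: float<'u> samples become float<'u/s> rates.
let differentiateWithTime (dt: float<s>) (samples: float<'u> list) : float<'u/s> list =
    samples
    |> List.pairwise
    |> List.map (fun (previous, next) -> (next - previous) / dt)

let accelerationGravity = 9.81<m/s^2>
let times = [ for i in 0 .. 4 -> float i * 1.0<s> ]
let positions = times |> List.map (fun t -> accelerationGravity * t * t / 2.0)  // float<m> list
let velocities = differentiateWithTime 1.0<s> positions                        // float<m/s> list
let accelerations = differentiateWithTime 1.0<s> velocities                    // float<m/s^2> list

// Mixing quantities with different units is rejected by the compiler:
// let wrong = List.map2 (-) velocities accelerations   // does not compile
```

The same function is used for both derivatives; the compiler works out that differentiating metres gives metres per second, and differentiating those gives metres per second squared.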