Fritule are similar to mini doughnuts. In Croatia, it is traditional to eat fritule on days of fast, such as Good Friday or Christmas Eve.

Soak a handful of raisins in rum. Beat two eggs, add 50g of vanilla sugar and beat in 200g of yogurt. Then, mix in 300g of self-raising flour and then the rum and raisins. You can also add other flavours, such as cinnamon or lemon rind.


Heat a pan of oil, to a medium-high heat; you want the fritule to cook fully throughout before the outer exterior gets too dark.


Turn once while cooking so that each side is evenly cooked, before removing with a slotted spoon and leaving to dry on kitchen towel. Finish by dusting with icing sugar.




It’s tradition in Croatia to eat Pašticada sa domaćim njokama (pot roast with home made potato gnocchi), and the njoki/gnocchi can be made in advance. They’re not at all tricky to make, and are delicious covered in thick rich gravy.

Take some fluffy potatoes (I’ve got Maris Piper) and place them into a pan of cold salted water, and bring to the boil.


Once boiled, the skins of the potatoes turn really soft, and can be peeled off. Unless you’ve got asbestos fingers you might find it easier to use a sharp knife.

Now, prepare the work surface with flour and rice the potatoes. Ricing instead of mashing helps the moisture in the potatoes evaporate. If you think the potatoes are soggy you could pop them into an oven for a couple of minutes after peeling. We used strong bread flour – apparently the protein in the flour helps to soak up moisture.


Make a well in the potatoes and add an egg, some salt, and a generous knob of butter.


Bring all the ingredients together to form a dough, but don’t over-work them. You’re not baking bread, so definitely don’t go kneading it!


Roll the dough into even portions, and then take each portion and roll it into long sausages. Chop each sausage evenly.


Some recipes recommend pressing each gnocchi against a fork, but rolling out against a grater adds a better looking finish, and helps the gnocchi soak in and hold a generous amount of the rich gravy.

Press the gnocchi flat against the grater, and push downwards so that it flattens out against the grater. Then, roll it around to form the gnocchi shape.


Binomial Tree Option Pricing in F#

This post will look at implementing the binomial tree algorithm in an imperative style, and refactoring it to be more functional.

The wikipedia page on the binomial options pricing model provides pseudo-code for the pricing of an American Put option.

That code can be implemented almost identically in F#.


and is called as follows:


Functional programmers have probably died a little on the inside on seeing that code, but one of the good thing about F# being multi-paradigm is that it’s easy to take code of this form and implement it, and test it for correctness before refactoring.

I did actually look at whether the results of the code were as expected at this point, but I’ll cover this later in this blog post.

The first thing to do is to change the initial values V to be created more functionally:




Now, the main algorithm consists of two parts, an outer loop where each ‘level’ of the tree is calculated stepping up towards the root, using the results of the previous level’s nodes (the nodes nearest to the leaves). The inner loop iterates over all of the nodes at a given level calculating the option’s value at that node.

First off, the outer loop can be replaced with a recursive method:


The next change is quite minor – it’s for each pass of the inner loop to create a new array, instead of mutating the vi array:


This change now means that there’s no need for any mutable state, so the initial values can be created as a sequence instead of an array, and the inner loop can be written in a functional style:


I find this code style is clearer to understand than following the array mutations to see what each part of the code is doing and how those parts interact.

If I was implementing this in C# I’d probably change the variable names to be more descriptive (e.g. strikePrice, riskFreeRate etc) but as one of the key advantages of F# is its succinctness it makes more sense to leave the well-known symbols in. Anyone implementing the algorithm with knowledge of the domain would know what the symbols mean without the more verbose names.

Validating the results

The calculated option value with the requested parameters is 4.486090. Running the EqutyOptions example in the .NET port of QuantLib, QLNet (, with the same parameters gives the following results:


The figures are in the same ballpark, but the difference is curious.

Chapter 45 of GPU GEMS 2  creates the initial values in this way:



This gives a result of 4.486375 which still doesn’t match QLNet, but is nearer.

The result of running the Cox version of Bubo’s code from here also equals 4.486375, which matches the results of using this initialisation function. As an aside, I like how that examples shows how F# partial function application allows the variations on the algorithm to be composed, and much more succinctly than using OO.

Bubo’s Tian model does match QuantLib’s result exactly. I’ll have to dig a bit deeper to figure out which variations on the binomial tree model these different examples are using.

The code is available here:

Metro Bullet source code

I’ve updated the demo that shows the Bullet Physics Engine in a DirectX 11 Metro application to build in the Release Preview drop of Visual Studio (original blog post: ).

The code needs some tweaks – I need to change the manifest, and change some stuff over to Platform::Agile, but it builds and runs so I thought I’d get it out there.

The code is available at:

More playing with monads

This blog post will look at using the continuation monad to sum the nodes in a tree, and then look at a case where a more complex monad is needed.

The book Real World Functional Programming gave an example of an unbalanced tree, which was deeply nested such that recursing over the tree to sum the values would blow the stack:



The solution provided was to sum the tree using continuation passing:


This is now able to sum the tree without crashing, but the code doesn’t flow as well as in the original sumTree method. However, monads can come to the rescue again, in this case the continuation monad:


The code flow now is easier to understand. The continuation passing style and continuation monad turns the code flow inside out, which isn’t a big deal, but means that the func to perform on the result needs to be passed in:


Parallel sum

This is all well and good, but to exploit multicore machines, I want to take advantage of the parallelism in the tree summation algorithm, by summing the different branches of the nodes using tasks. A naïve implementation to do this may be:



The problem with this is that it runs slower than the sequential case for a large balanced tree –if there are only 4 cores in the system it’s counter-productive to create thousands of tasks. The current depth of the tree should be passed around, and tasks spawned until traversal reaches that depth:




This is now faster for a large balanced tree, but for the deeply nested unbalanced tree it throws a StackOverflowException. Note that sumTreeParallel coped with arbitrarily nested trees, as the work to be done no longer kept being pushed onto the stack.

More state is being passed around (the current depth), which is external to the main business logic – it does feel like the state monad could be used here. First though, this is fixed up so that it has the efficiency of the parallel summation, but can cope with arbitrarily deeply nested trees:


Urghhhh. The code is definitely more convoluted now – I played with using the state monad to pass the depth around, but it’s not tail recursive. I thought that it might be possible to make use of the continuation monad here, but the problem is, is that it really needs aspects of both the state and continuation monads. I’m not sure how monad combinations work in F#, but I thought I’d throw this out here to see if anyone shows me the light. The post The Mother of All Monads does say that all other monads could be implemented in terms of the continuation monad, but I couldn’t see an example of anyone implementing the state monad in terms of the continuation monad.

State Monad in C# and F#

My colleagues and I recently have been trying to gain a deeper understanding of monads, other than using LINQ and RX. Watching all the channel9 videos on monads, the concept on the surface seemed quite simple, but we were felt that we were somehow missing some key point. After reading various blog posts, the penny is finally dropping.

Some post I read somewhere (I can’t remember where) said that it’s easier to understand the individual monads before trying to understand the whole generalised concept. Mike Hadlow’s blog series ( really explain well the identity and maybe monads (which we already understood, which is just as well as we’re using the maybe monad in our codebase). Unfortunately he didn’t explain the state monad.

The only information we could find on the state monad in C# was Brian Beckman’s Channel 9 video:

We found Brian Beckman’s example overly complex, there’s quite a lot going on in that video (and the code needed de-obfuscating to read more like idiomatic C#), but it’s probably a good second or third example once the basics are understood. I wanted to look at a simpler example, and found this article: the explanation of the get and put methods show pretty clearly how the state monad works, and you should probably read and understand that section before carrying on here, as I’m going to jump straight on to show how the random number generator example looks in F# and C#.

C# implementation

First off, here’s an OO random number generator:


It is easy to add three random numbers together using an instance of the generator.


We can make NextShort purely functional, so that calling it has no side-effects, by passing the current state in, and returning the new state back out, along with the result:


To add three numbers, the state has to be kept around, and passed to each of the calls in turn. This obfuscates the main business logic.


Jumping straight to the punch line, using a state monad means that the boilerplate state management can be hidden away so that the algorithm reads as clearly as it did before:



This is a trivial example, but hopefully it makes the point. The rest of this blog post will show the implementation, along with the same in F#.

First off, we have a state class, which holds a Func that given a state, returns a tuple of the state and value. The F# example doesn’t bother with the State class, and I could have made the C# version simply use the Func everywhere, but it seemed clearer to me to have the wrapper object just to get a bit of context.


I created a non-generic State class as somewhere convenient to put the Get and Set methods:


The meat of the monad lives in extension methods in a StateEx class:


and SelectMany is implemented in terms of Bind the same as in Mike Hadlow’s example for the Identity and Maybe monads:


The computation to compose can be expressed in the GetRandom method:


This returns an operation (a Func in the State container) that invokes NextShort with the old state, and returns a computation with the new state and value.

The resulting composed operation using LINQ is described above, but this is what it looks like desugared:


F# implementation

The code for the random number generator and composed operation look similar to the C#:


This has the same problem of explicitly passing the state around. I can ruin the surprise again and show how using monads hides the explicit state passing:


This code makes use of F#’s support for monads via Computation Expressions. The code for the state builder is from here:


The Haskell example also had the >> operator, but of course F# uses that as the forward composition operator J

Instead I’ve implemented:


getRandom looks pretty much identical to the Haskell example:



The composed version using the StateBuilder is shown above, but here’s the desugared version:


The F# version is much clearer to read than the C#, in part due to the type inference meaning that the code has less pointless fluff. Having functions live in the module and not as part of a class aids that too (can simply call setState instead of State.Set in the C#)

Following through the simpler example really helped my understanding, and hopefully will see examples in the future where it will be sensible to use the state monad in my coding.

Monte Carlo C++ AMP

This blog is starting to look very inconsistent – my last blog post was talking about starting to write a game, and now I’ve gone onto a totally different topic. Due to time the game’s gone onto, if not the back burner, then a gentle simmer. I’ve got to balance doing cool stuff with keeping on top of various technologies that may be relevant as a contractor, and having a life!

This blog will describe using C++ AMP for calculating Pi; the calculation of Pi is a fairly good example of Monte Carlo calculations, as the algorithm’s simple, and we all know the result.

First off, this is what the algorithm looks like in C#:


It simply finds the number of circles representing x-y coordinate pairs, whose vector magnitude falls within the unit circle. This ratio is Pi / 4.

The random class is the Mersenne Twister code from here: The results are:


Replacing the Mersenne Twister class with System.Random resulted in code which was approximately 30% slower. I’m running this on a dual-core laptop, but have not parallelised it, as I didn’t fancy porting over a Parallel Random Number Generator (PRNG) myself.

Tomas Petricek has an example of using the GPU to calculate Pi using F# here, but his example generates the random numbers on the CPU and uploads them to the GPU to do the calculations and reduction.

C++ AMP Version

Microsoft have just released a C++ AMP Random Number generator library to codeplex, but I’m using BharathM port of the parallel Mersenne Twister described here:

His example generates random numbers into a rank 2 array of size 4096 * 2, but I’ve modified g_n_per_RNG to be 256.




The first thing I do is to pair up those random numbers into x-y coordinates and to find the square of their magnitude:



Then I determine whether the magnitude of these coordinates fall outside of the unit circle:





The final step is to do a parallel reduce of the resulting array. This is using the just-released reduction method from the C++ AMP algorithms library:


The reduce function behaves similar to STL’s std::accumulate in that you can pass in the binary operation to perform on each item. The function takes a rank 1 array, so I’m using view_as to change rank.

To perform the timings I run the algorithm twice, once to warm up (ensure all shaders etc are compiled – see )

The results I obtained are:


Calculating the result for half a million random numbers took approximately 16 milliseconds, whereas for the single threaded C# code a million iterations took 44 ms, so accounting for using both cores the performance seems pretty similar. This is quite disappointing, even though the GPU isn’t the most powerful (a mobile 5470), the CPU isn’t the most powerful either, so I was hoping for near an order of magnitude speed up. I wasn’t doing anything clever with tiling, and there may be other bits of the API I’m not using correctly. I’ll have to get hold of Kate Gregory’s book and try with different examples.

Bullet Physics and DirectX 11 on Windows 8 Metro

I’ve got a game idea which I think may work on a tablet so I’ve created the following ‘demo’. I’ve hooked together the Bullet Physics engine with some 3D graphics, and the falling wall of cubes could pretty much be the canonical “Hello, World” equivalent for a 3D game. I’m not sure how much time I’ll have to pursue my game idea, but it should in the worst case be a route to learning some new technologies.

The demo won’t win any awards for its graphics, and the video is pretty poor quality (I recorded the screen with my camera instead of using any screen capturing software).

Falling cubes

Microsoft seem to be definitely pushing for the use of C++ for game creation with Windows 8, with no announcement of a new version of XNA, but that seems to make sense; C++ is closer to the metal than C# and will get more performance out of lower power ARM tablets. Also, there’s quite a buzz with the C++ renaissance with C++11, and modern-style C++. The corresponding new technologies in VS 2010 (PPL) and VS 11 (auto-vectorization, C++ AMP) are especially cool.

I’m also keen to get back into C++ – the past year or so I’ve been working purely in C#, but all my prior experience was using a mixture of C# and C++, and I do miss C++ (even though I am more productive in C#). Also, my previous roles were generally in a modern style (use of STL, smart pointers etc.) but my most recent experience was “C with Classes” style code, so I’ve been relearning modern style as well as picking up the new C++11 features.

Building Bullet as a static lib was pretty easy, I just had to switch it over to use the Multi-threaded DLL c-runtime. Defining WINAPI_FAMILY=WINAPI_PARTITION_APP to detect any non-metro API usages revealed the usage of GetTickCount, which is unavailable on Metro apps, so I just turned off profiling by defining BT_NO_PROFILE 1 in btQuickprof.h

The Windows 8 samples including a couple of games which are decent guides on using DirectX 11. These demos aren’t supposed to be examples of a whole game engine, and don’t necessarily follow best practices: Marble Maze has a class DirectXBase which encapsulates all rendering logic, which is great, but the game logic is in a class MarbleMaze, which derives from this class. It’s nothing that would trip up an experienced developer, but beginners may follow the example. The Simple3DGame (which also demos DirectX and XAML interop) is better structured. The only strange thing in that example is that the render primitive classes are all C++/CX ref classes. This seems to contradict what I remember Herb Sutter saying in a Channel 9 video about sticking to ISO standard C++ where possible and only using C++/CX at the boundaries.

I’m not sure yet whether I’ll try to continue to create a simple demo of my game idea purely in C++, or create a C++/CX interop layer to facilitate writing the game logic in C# – I’m tempted to do this as I’m more productive in C#, and it’ll probably lead to a deeper understanding of WinRT which will be useful even when creating C# XAML applications.

I could have alternatively stayed purely in C# and used SharpDX, or a game framework such as ANX or Ogre (both of which say will soon be supporting XAML), but I was keen on having control of the whole game, C++ upwards – if after profiling any of the C# code any bottlenecks are found, it’ll be easier to re-implement those parts of the code in C++.

As to my experiences of developing the demo: The new version of C++ definitely feels more productive, and feels closer to C#. I love auto (var), lambdas (pretty much the same), for_each (foreach) and range for, and initializer lists. Also, async programming using PPL tasks feels very similar to .NET’s TPL. I’ve had to change mental gears a bit in thinking again about object ownership instead of relying on the garbage collector, and it took way longer than it should for me to figure out how to add a std::unique_ptr into a vector.

I could share the demo code if anyone would find it useful (I need to tidy it up a little bit, and can’t be bothered otherwise) – just post in the comments!

Messing with Direct2D

As I blogged about previously, the main bottleneck in the oscilloscope code was in the drawing of the charts, in my previous post I modified DynamicDataDisplay to optimise this, but I wanted to make a Metro version of my app, and didn’t want to convert that library over, so I thought I’d look at doing the drawing myself.

First off,  on Windows 7/.NET4 I used the SurfaceInteropHelper  which allows Direct2D to be hosted on a WPF control, to do the drawing, and found that the CPU usage dropped from 70% to just under 40% (the code’s not worth showing here, and my use of Direct2D was pretty simplistic).

For the Windows 8 consumer preview, I created an app that used the SwapChainBackgroundPanel . It wasn’t obvious how to do this in an app created from scratch, so I butchered the SDK sample Metro style DirectX  shooting game sample (XAML)

This control allows Direct2D and XAML to be mixed in a metro app, but the control needs to be at the root element. I can see how this would be OK for an app such as a game where it could make use of overlayed controls, but for an app such as mine where I’d like to have multiple charts drawn using Direct2D it’s not ideal. Also, it seems to force the skeleton of the app to be written in C++ (where it would be nice to have a C# app host the controls, with the C++ –hidden- ).

This was awkward, but not a blocker to implementing the app on Metro… the blocker was to do with accessing the microphone. There are new Media Foundation libraries to do that, which do interoperate with the new C# async features than DirectShow (expose awaitable types):


The new version of RX (2.0 beta) has a new CreateAsync method instead of having an overload of Create that takes the async lambda).

The code above is untested, as the blocker is in the new MediaCapture library:


There’s no way to create a lossless MediaEncodingProfile from which to extract the raw waveform.

I think I’ll leave this app on the backburner until at least a later beta.

TPL Dataflow

This blog post is coming a couple of weeks late due to my copy of Windows becoming corrupted – and strangely the only things I didn’t have backed up were this Oscilloscope app!

Anyway, last time I took a look at charting performance, but this post will investigate TPL Dataflow.

In my original F# post, I discussed that I felt that the code would have been better implemented in RX, which I then went and did in this post. However, RX isn’t the only recent new technology coming out of Microsoft that deals with streams of data. TPL Dataflow has a confusingly similar overlap with the usages of RX, and this is what its whitepaper has to say:

“Astute readers may notice some similarities between TPL Dataflow and Reactive Extensions (Rx), currently available as a download from the MSDN Data developer center. Rx is predominantly focused on coordination and composition of event streams with a LINQ-based API, providing a rich set of combinators for manipulating IObservable<T>s of data. In contrast, TPL Dataflow is focused on providing building blocks for message passing and parallelizing CPU- and I/O-intensive applications with high-throughput and low-latency, while also providing developers explicit control over how data is buffered and moves about the system. As such, Rx and TPL Dataflow, while potentially viewed as similar at a 30,000 foot level, address distinct needs. Even so, TPL Dataflow and Rx provide a better together story.”

That does sound very interesting – who doesn’t want better performance!


I’ll dive in straight away and look at some code. The code is structured similarly to the RX code in my previous post.


The microphone still provides an IObservable<float[]> via the GetStream() method. The TaskSchedulers are specified explicitly throughout, so that the code can be unit tested (I’m using a MockSynchronizationContext).

The transformManyBlock behaves in the same way to the SelectMany operator, it separates the float array into its constituent parts.

The broadcast block is analogous to a connectable observable – many blocks can be linked to it to receive the same messages.

The transform many block is then explicitly linked to the broadcast block, meaning that its outputs make it into the broadcast block’s input.

Oscilloscope/Buffered View

The code to deal with the buffered display of data is:


The batch block behaves the same as LINQ’s Buffer operator (a bit confusingly when TPL Dataflow’s Buffer means something different). This simply takes the incoming float values and packages them into an array.

The ActionBlock is the final point on the chain – it takes the values it receives and outputs them to the screen (via oscilloscopeOutput) – this is the equivalent code to the code in the Observable Subscription in the RX version of my code.

Sliding Window View

Again, the sliding window view is structured similarly to the previous RX code:


A sample block is created so that every n samples are fed into the SlidingWindowBlock. This returns a new float[] array representing the new window of data on every input.

The second sample block ensures that not every window of data is drawn (so that the chart is refreshed at 40fps rather than 400fps!)

Finally, the observable is subscribed to, and the messages are posted into the transformManyBlock:


Whereas the RX guidelines recommend creating new operators based on existing operators, TPL Dataflow encourages building up custom blocks (though they can be built out of combining existing blocks).

The SlidingWindowBlock is a slightly modified version of the one provided in the white paper, and here ( The one in the example only posts a window once it has received a full window worth of data, whereas I want to start drawing the windows as soon as data arrives.


The SampleBlock is simple, it yields a value once it has received enough inputs, discarding the intermediate ones. This is structured more efficiently than my RX implementation – that one buffers the values, and then only take the last value out of the buffer.



I removed the updating of the chart, to look purely at what the performance of the app is in taking the input from the microphone, and shaping it to be in the correct format for output.

RX Performance:


TPL Dataflow:


The processor usage between the apps is in the same ballpark, but the TPL Dataflow app has many more garbage collections, and # bytes in all heaps also is running slightly higher. I had a quick look in the profiler, and it seems that there are many Task allocations from the Dataflow.Post() operation from within the SlidingWindow and Sample implementations.

The extra generation 1 collections don’t really matter for this application with the functionality that it currently has – the % time in GC barely registers, and there are no Gen 2 garbage collections over the minute or so that the app was running.

Once the app is doing more calculations and drawing more complex charts it would be interesting to see whether any latency spikes due gen 2 GCs cause similar slowdowns to the one discussed at the start of my previous post.

It would be an interesting exercise to limit the amount of GCs throughout the app, for instance there’s no need for the microphone access observable to return an IObservable<float[]> instead of IObservable<float>; currently every read from the microphone allocates a new float[]. Similarly, new List<Point> are created to more easily interface with DynamicDataDisplay – it would be better to change the types of data that D3 takes to be more observable-friendly, and to save having so many allocations. Again, there’s not much point doing this, other than an interesting theoretical exercise, until the garbage collection overhead proves to be a performance issue.


For an application as simple as this, there isn’t any benefit to using TPL Dataflow – it is a powerful library, with functionality such as blocks being able to decline offered blocks, and request them later, which would be difficult to implement in RX. As my app doesn’t (currently) need that level of functionality, there’s no benefit to using the library.

I may revisit this in the future – if I had some computationally expensive operation (FFT for instance) where I’d want greater control over the flow of data through the system.