## Monte Carlo double barrier option pricing on GPU using C++ AMP

I converted the double barrier option pricing code from my last post to run on the GPU, using C++ AMP, and got an 80% speedup.

This is running on a low powered Acer V5-122p laptop with an AMD A6-1450 processor with an integrated Radeon HD 8250 GPU.

The gist is here: https://gist.github.com/taumuon/bbeeb9e2c1f5082a2699

To be fair, I converted the CPU code to be otherwise identical to the gpu code. Instead of populating an array of the sample path, it instead for each point just determines whether the value breaches the upper or lower barriers and uses the last value for the payoff.

This reduced the runtime from the code in my last blog post, of 2540ms, to 1250ms, i.e. a 2x speedup.

The GPU code was ran twice, the first time it ran in 140ms, the second run (after all shaders already compiled etc) was 15.6ms, i.e. a very impressive 80x speedup from the CPU code.

If anything, it shows that AMDs strategy of cheap low powered laptop chips will payoff if people start taking advantage of the relatively strong GPU.

## Monte Carlo C++ AMP

This blog is starting to look very inconsistent – my last blog post was talking about starting to write a game, and now I’ve gone onto a totally different topic. Due to time the game’s gone onto, if not the back burner, then a gentle simmer. I’ve got to balance doing cool stuff with keeping on top of various technologies that may be relevant as a contractor, and having a life!

This blog will describe using C++ AMP for calculating Pi; the calculation of Pi is a fairly good example of Monte Carlo calculations, as the algorithm’s simple, and we all know the result.

First off, this is what the algorithm looks like in C#:

It simply finds the number of circles representing x-y coordinate pairs, whose vector magnitude falls within the unit circle. This ratio is Pi / 4.

The random class is the Mersenne Twister code from here: http://takel.jp/mt/MersenneTwister.cs. The results are:

Replacing the Mersenne Twister class with System.Random resulted in code which was approximately 30% slower. I’m running this on a dual-core laptop, but have not parallelised it, as I didn’t fancy porting over a Parallel Random Number Generator (PRNG) myself.

Tomas Petricek has an example of using the GPU to calculate Pi using F# here, but his example generates the random numbers on the CPU and uploads them to the GPU to do the calculations and reduction.

### C++ AMP Version

Microsoft have just released a C++ AMP Random Number generator library to codeplex http://amprng.codeplex.com/, but I’m using BharathM port of the parallel Mersenne Twister described here: http://blogs.msdn.com/b/nativeconcurrency/archive/2011/12/20/mersenne-twister-sample-using-c-amp.aspx

His example generates random numbers into a rank 2 array of size 4096 * 2, but I’ve modified g_n_per_RNG to be 256.

The first thing I do is to pair up those random numbers into x-y coordinates and to find the square of their magnitude:

Then I determine whether the magnitude of these coordinates fall outside of the unit circle:

The final step is to do a parallel reduce of the resulting array. This is using the just-released reduction method from the C++ AMP algorithms library: http://ampalgorithms.codeplex.com/

The reduce function behaves similar to STL’s std::accumulate in that you can pass in the binary operation to perform on each item. The function takes a rank 1 array, so I’m using view_as to change rank.

To perform the timings I run the algorithm twice, once to warm up (ensure all shaders etc are compiled – see http://social.msdn.microsoft.com/Forums/en-US/parallelcppnative/thread/1c6c9f04-1f9f-4f44-99f3-154d991ae5ba )

The results I obtained are:

Calculating the result for half a million random numbers took approximately 16 milliseconds, whereas for the single threaded C# code a million iterations took 44 ms, so accounting for using both cores the performance seems pretty similar. This is quite disappointing, even though the GPU isn’t the most powerful (a mobile 5470), the CPU isn’t the most powerful either, so I was hoping for near an order of magnitude speed up. I wasn’t doing anything clever with tiling, and there may be other bits of the API I’m not using correctly. I’ll have to get hold of Kate Gregory’s book and try with different examples.

## Quick Play with the AMD Stream SDK

I mentioned in my last blog post that I was disappointed with the performance of Microsoft Accelerator, and wanted to play around with Brahma. I was going to do this sooner, but have been side-tracked with playing around with XNA on Windows Phone 7.

I downloaded the latest OpenCL version of Brahma, but had trouble with the nested loops and aggregation operations (force summations), so didn’t get as far as I’d hoped. It’s a shame, as the concept of LINQ to GPU is a great one.

I then took a look at running the OpenCL NBody simulation from the Stream SDK. I couldn’t get the simulation to run using the GPU despite trying various Catalyst versions, it failed with a runtime error message "This OpenCL build requires verison 1.4.879, version 1.4.696 installed", but in spite of this, I was impressed with the performance of using the Stream SDK, even running on the CPU.

Whereas my managed CPU-version of the nbody simulation achieved 5 fps (frames per second), (or 8 fps with the drawing disabled – as discussed earlier the WPF drawing code is slow), drawing 2000 bodies, the OpenCL SDK ran at 25 fps drawing 2048 bodies, i.e. a factor of 5 speedup. I didn’t bother to parallelise my code but the theoretical maximum speedup on my dual core machine would obviously be a factor of 2, so that’s a factor of 2.5 speedup using the Stream SDK on the same hardware.

I switched the Stream SDK NBody example to use the nBodyCPUReference() method to see whether it’s slow because of the difference between managed and native code, and it runs at 5 fps compiled native on the CPU, i.e. in the same ballpark as the managed version. As it’s not running on the GPU, the Stream version must be faster than the vanilla C++ version because it’s making use of the processor’s vector hardware, but I can’t be bothered to manually code the SSE intrinsics to see if that’s the case (but it might be cool to play around with Mono.SIMD if I get time).

Oh, I suppose I should talk about how the code looks – the guts of the algorithm doesn’t look much different between the vanilla C++ and the OpenCL version, but there is a lot of hideous boilerplate/setup code different between the two. This is why it’d be great to get a workable managed library to hide all this (alternately, it’ll be interesting to see whether C++ AMP abstracts away the OpenCL/DirectCompute complexity).

## GPGPU–playing with Microsoft Accelerator

It’s probably being screamingly obvious to some readers that the boids simulations I’ve been playing with are embarassingly parallel, so I thought I’d have a quick play.

I’ve been reading around about OpenCL and CUDA, but as there’s a Microsoft library with a .NET API for easily programming the GPU, I thought I’d have a play with Accelerator (another interesting .NET GPGPU library is Brahma – I might get around to playing with that one day). Accelerator is higher-level, no need to worry about the low-levels of the GPU memory management.

I decided to play with a simpler example than the boids, to focus on the technology instead of the problem domain. I chose to look at the all-pairs NBody simulation (see more info here).

I quickly coded up a simple example using 1000 bodies. The CPU was able to draw at approx 15 frames per second (I didn’t bother parallelising the simulation on the CPU, as I was hoping for an order of magnitude increase in speed on the GPU). WPF is incredibly slow in drawing, and I found (unexpectedly) that using DrawingVisuals to be even slower. For that reason, I’m only drawing 100 bodies, but all of them are included in the simulation. I was intending to reduce the bottleneck by using Direct2D, and then getting Accelerator to write out to texture memory to save transferring data over the bus.

I didn’t get the results I expected when using Accelerator – I first began by converting the main simulation routine (integration of positions) onto the GPU, and left the all-body force calculation on the CPU. I was surprised to find the simulation slower – I was hitting frame rates of 10 fps.

I guessed that maybe it was maybe transferring too much data between the CPU and GPU, so I then moved onto the force calculation. I was very surprised to find that this made the simulation orders of magnitude slower (i.e. hitting frame rates < 0.01 fps). I profiled this to find that the majority of the time was spent in CompileShader. This isn’t so surprising – I was building up the same calculation for each body, for each frame.

Following the advice in the Accelerator Programmers Guide, I then moved onto using Parameter Objects. This means that it’s able to use the same computation graph with different input data. This did help, but only by an order of magnitude. It’s still not approaching anywhere near real-time frame rates.

I can’t remember where I read it, but I read that it’s recommended using input data sizes of the order 1e6 elements to overcome the overhead of transferring data to and from the GPU. This does make sense, but I was expecting to be at least getting interactive frame rates (as the OpenCL simulations are obtaining). It may be that Accelerator is faster than the CPU with a large number of elements, but it may be that e.g. instead of rendering a frame in an hour, it takes five minutes. It doesn’t seem to be suitable for interactive simulations.

This could be a simple case of user-error. I’ve got the code available on taumuon. If I’m missing something obvious, or you can get faster frame rates than the CPU, please post in the comments

(As I’m discussing performance I guess I should disclose the software and hardware specs. Running Windows x64, on a HP DV3 laptop – 4GB ram, dual core Pentium P6100, ATI Radeon 5470).