This blog is starting to look very inconsistent – my last post was about starting to write a game, and now I’ve moved onto a totally different topic. Due to time constraints the game has gone, if not onto the back burner, then down to a gentle simmer. I’ve got to balance doing *cool stuff* with keeping on top of the various technologies that may be relevant to me as a contractor – and having a life!

This post describes using C++ AMP to calculate Pi. Calculating Pi is a good example of a Monte Carlo method, as the algorithm is simple and we all know the expected result.

First off, this is what the algorithm looks like in C#:

It simply counts the number of x-y coordinate pairs whose vector magnitude falls within the unit circle; the ratio of these hits to the total number of samples approaches Pi / 4.

The random class is the Mersenne Twister code from here: http://takel.jp/mt/MersenneTwister.cs. The results are:

Replacing the Mersenne Twister class with System.Random resulted in code which was approximately 30% slower. I’m running this on a dual-core laptop, but have not parallelised the C# version, as I didn’t fancy porting over a parallel random number generator myself.

Tomas Petricek has an example of using the GPU to calculate Pi using F# here, but his example generates the random numbers on the CPU and uploads them to the GPU to do the calculations and reduction.

### C++ AMP Version

Microsoft have just released a C++ AMP random number generator library on CodePlex (http://amprng.codeplex.com/), but I’m using BharathM’s port of the parallel Mersenne Twister described here: http://blogs.msdn.com/b/nativeconcurrency/archive/2011/12/20/mersenne-twister-sample-using-c-amp.aspx

His example generates random numbers into a rank 2 array of size 4096 * 2, but I’ve modified g_n_per_RNG to be 256.

The first thing I do is to pair up those random numbers into x-y coordinates and to find the square of their magnitude:

Then I determine whether the magnitude of each coordinate pair falls outside the unit circle:

The final step is to do a parallel reduce of the resulting array. This is using the just-released reduction method from the C++ AMP algorithms library: http://ampalgorithms.codeplex.com/

The reduce function behaves similarly to STL’s std::accumulate, in that you can pass in the binary operation to perform on each pair of items. The function takes a rank 1 array, so I’m using view_as to change the rank.

To perform the timings I run the algorithm twice: once to warm up (ensuring all the shaders etc. are compiled – see http://social.msdn.microsoft.com/Forums/en-US/parallelcppnative/thread/1c6c9f04-1f9f-4f44-99f3-154d991ae5ba), and once for the measured run.

The results I obtained are:

Calculating the result for half a million random numbers took approximately 16 milliseconds, whereas the single-threaded C# code took 44 ms for a million iterations; accounting for using both cores, the performance seems pretty similar. This is quite disappointing: even though the GPU isn’t the most powerful (a mobile 5470), the CPU isn’t the most powerful either, so I was hoping for close to an order-of-magnitude speed-up. I wasn’t doing anything clever with tiling, and there may be other parts of the API I’m not using correctly. I’ll have to get hold of Kate Gregory’s book and try some different examples.

Hey Gary,

This all looks very cool. You get to do all the fun stuff!

Do you think you may be suffering from some false sharing issues? All the threads are writing to the same square_distances & value_outside_range arrays, and that’s going to introduce a bit of contention. This has bitten me big time in the past. Check this out: http://www.drdobbs.com/go-parallel/article/showArticle.jhtml?articleID=217500206

Cheers

Indranil

Hi Gary

Where is the code, so we can look at it and suggest changes? Code reviewing screenshots of partial code is not efficient 🙂

Also, can you try increasing the size of the extent, so you can get more threads/parallelism out of it?

Cheers

Daniel

Hi Daniel,

Thanks for your comments.

I hate Blogger's code formatting, so I've switched to posting screengrabs – I know it sucks 😉

I actually spoke to you a few years ago at TechEd to ask you an inane question 🙂

Do you have an email address I can send the code sample to?

I did try using larger extents, but the aggregate method bombed out when my arrays got too large. I could have aggregated sub-sections of the array and then aggregated those results, but I'd already spent so long on this example that I ran out of energy.

Thanks,

Gary