Monte Carlo double barrier option pricing on GPU using C++ AMP

I converted the double barrier option pricing code from my last post to run on the GPU using C++ AMP, and got an 80x speedup.

This is running on a low-powered Acer V5-122p laptop with an AMD A6-1450 processor and an integrated Radeon HD 8250 GPU.

The gist is here: https://gist.github.com/taumuon/bbeeb9e2c1f5082a2699

To be fair, I converted the CPU code to be otherwise identical to the GPU code. Instead of populating an array with the sample path, it checks at each point whether the value breaches the upper or lower barrier, and uses the final value for the payoff.
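
In outline, each simulated path now does something like the following (a minimal sketch with hypothetical names such as next_step; the real code is in the gist):

```cpp
#include <algorithm>

// One simulated path of a double knock-out option: rather than storing the
// whole path, track whether either barrier was breached and keep the final
// value for the payoff. next_step is a hypothetical helper that advances
// the simulated asset price by one time step.
double path_payoff(double spot, double strike,
                   double low_barrier, double up_barrier,
                   int num_steps, double (*next_step)(double))
{
    bool breached = false;
    double value = spot;
    for (int i = 0; i < num_steps; ++i)
    {
        value = next_step(value);
        if (value <= low_barrier || value >= up_barrier)
            breached = true; // knocked out somewhere along the path
    }
    // worthless if a barrier was breached, otherwise a call payoff on the final value
    return breached ? 0.0 : std::max(value - strike, 0.0);
}
```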

This reduced the runtime from 2540 ms for the code in my last blog post to 1250 ms, i.e. a 2x speedup.

The GPU code was run twice: the first time it ran in 140 ms; the second run (after all the shaders had been compiled etc.) took 15.6 ms, a very impressive 80x speedup over the CPU code.

If anything, it shows that AMD's strategy of cheap, low-powered laptop chips will pay off if people start taking advantage of the relatively strong GPU.

Monte Carlo pricing of Exotic Options in Functional Style C++

My last blog post was about pricing exotic options using F#. In that post I said “Apart from the Monte Carlo code, the fact that in F# all functions are in the curried form means that composing the payoff functions using partial function application is beautiful. This would be lambda hell to replicate in C#.”

The OO version of this would probably be built via composition: a class implementing an IPayoff interface (implemented for Call and Put), which would be used by implementations of IPathPayoff (AsianArithmeticMean, DoubleBarrier etc.). The point of the OO design is to pull out the common implementation, but there’s no state associated with any of these classes – it feels much nicer to use function composition.

I’ve implemented a version in C++ using lambdas, and the function composition is actually really nice. C++ has std::bind, which means that any number of parameters can be partially applied, even for functions not in curried form.
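
For instance, a strike can be partially applied to a plain payoff function with std::bind, and the result composed into a path payoff. This is a simplified sketch of the idea rather than the code in the gist:

```cpp
#include <algorithm>
#include <functional>
#include <iostream>
#include <vector>

// a plain payoff function, not in curried form
double call_payoff(double strike, double spot) { return std::max(spot - strike, 0.0); }

// a path payoff built by composing a payoff function with a path statistic
double mean_payoff(std::function<double(double)> payoff,
                   const std::vector<double>& path)
{
    double sum = 0.0;
    for (double v : path) sum += v;
    return payoff(sum / path.size());
}

int main()
{
    using namespace std::placeholders;
    // partially apply the strike, leaving the spot price free
    auto call100 = std::bind(call_payoff, 100.0, _1);
    // compose: an Asian-style payoff on the arithmetic mean of the path
    auto asian_call = std::bind(mean_payoff, call100, _1);

    std::vector<double> path = { 95.0, 105.0, 110.0 };
    std::cout << asian_call(path) << std::endl; // mean ~103.33 -> payoff ~3.33
}
```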

The code can be found here: https://gist.github.com/taumuon/727c31f4a56518f91718

The most surprising thing was the speed increase – the equivalent algorithm in C++ is 20x faster than the F# code! For instance, running on my low-powered AMD A6-1450 laptop, the double barrier option took 45 seconds to calculate in F#, whereas it took 2.2 seconds in C++.

I did tweak the F# code so that instead of using Seq.unfold to create the asset path it creates and mutates an array, which reduced the running time by more than 25%.

EDIT: Jack Pappas has forked and modified the F# code so that its runtime is now in the same ballpark as the C++ code. The fork is here: https://gist.github.com/jack-pappas/d629a767823ca2f17967

I think there’s scope to speed up both the F# and C++, by introducing parallelism and investigating GPGPU, which I hope to do if I don’t get sidetracked.

Monte Carlo pricing of Exotic Options in F#

I’ve recently completed all the exercises for the MOOC Monte Carlo Methods in Finance, run by the University of Madrid on Iversity. The course was excellent, both in its content and its delivery.

The course started off by refreshing probability theory (CDF, PDF etc.) and quickly moved on to generating random numbers, Brownian motion and stochastic differential equations, followed by the pricing of options and the calculation of risk.
The final week’s homework is on the Monte Carlo calculation of a portfolio’s risk, using copulas with Student’s t distribution, but as that week’s homework is still running I’m blogging about the penultimate week’s homework.

This homework is about pricing path-dependent options, and demonstrating how using a control variate can provide a better result for the same number of iterations (better meaning less error, i.e. a smaller standard deviation). The homework further shows that when pricing a barrier option, using an up-and-out option as the control variate produces better results than using a European option.
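
The idea behind a control variate: if X is correlated with the payoff Y and E[X] is known analytically, then Y - c(X - E[X]) has the same expectation as Y but, for a well-chosen c, a smaller variance. A minimal sketch of the adjustment (my own illustration, not the homework code):

```cpp
#include <cstddef>
#include <vector>

// Control variate adjustment: given per-path payoffs y and control samples x
// whose analytic expectation expected_x is known, return
// mean(y) - c * (mean(x) - expected_x), with the variance-minimising
// c = Cov(x, y) / Var(x).
double control_variate_estimate(const std::vector<double>& y,
                                const std::vector<double>& x,
                                double expected_x)
{
    const std::size_t n = y.size();
    double mean_x = 0.0, mean_y = 0.0;
    for (std::size_t i = 0; i < n; ++i) { mean_x += x[i]; mean_y += y[i]; }
    mean_x /= n; mean_y /= n;

    double cov = 0.0, var_x = 0.0;
    for (std::size_t i = 0; i < n; ++i)
    {
        cov += (x[i] - mean_x) * (y[i] - mean_y);
        var_x += (x[i] - mean_x) * (x[i] - mean_x);
    }
    const double c = cov / var_x;

    return mean_y - c * (mean_x - expected_x);
}
```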

The course delivery was extremely slick. The videos were filmed specially; this wasn’t a rehashing of a lecture given in front of students (as seems to happen in some other online courses). The lectures were short and to the point, managing to cram a lot of information in with zero fluff. I really appreciated this as I was watching on a daily commute.

The assessments were multiple-choice questions, but with only one attempt allowed, so it did end up actually feeling like a challenge. The questions were a mixture of maths and coding, with the balance moving more towards coding as the weeks went on.
The code was delivered in Matlab, which I had to get reacquainted with after a decade’s break. The main downside I found is that I ended up doing a fair bit of copy-paste coding, so I completed some of the homework in .NET instead. I personally found the coding easy, and the maths more challenging (in a good way).

Now, for the code – the F# code is available here: https://gist.github.com/taumuon/11302896

Apart from the Monte Carlo code, the fact that in F# all functions are in the curried form means that composing the payoff functions using partial function application is beautiful. This would be lambda hell to replicate in C#.

If this course runs again I’d wholeheartedly recommend it.

MOOCs

I’ve just enrolled on Iversity’s Monte Carlo Methods in Finance course, and have converted some of week 1’s Matlab demo code over to F# and Deedle: https://gist.github.com/taumuon/8602365

I spent the latter half of last year diving into various online courses. I completed Coursera’s Mathematical Methods for Quantitative Finance, and also followed along with a number of other courses to varying levels of completeness.

I completed three weeks’ assignments of Udacity’s CUDA programming course. Week three was painful due to an error in the reference code, and week four was crashing due to a memory exception. I was using a GPU emulator in Ubuntu, and decided that it would be easier with real hardware. I watched the remaining videos and found the explanations of parallel algorithms useful.

I completed the first two weeks of Coursera’s Scientific Computing. These were maths exercises, which I enjoyed, and that’s what inspired me to do the Maths Methods for Quant Finance course. I was planning to do the Matlab exercises in F#, but I left the course when other attendees complained that completing the homework to the required numerical accuracy needed the exact version of Matlab the instructor was using, and that they were unable to use GNU Octave.

It is great that there’s so much free, high-quality material available. The fixed-timescale nature of the courses is a bit annoying – if work or life gets in the way one week, it may be impossible to catch up with the remainder of the course. I may get around to trying a course again next time it comes around, though.

Monte Carlo C++ AMP

This blog is starting to look very inconsistent – my last blog post was about starting to write a game, and now I’ve moved on to a totally different topic. Due to time constraints the game has gone onto, if not the back burner, then a gentle simmer. I’ve got to balance doing cool stuff with keeping on top of the various technologies that may be relevant to me as a contractor, and having a life!

This post describes using C++ AMP to calculate Pi; calculating Pi is a fairly good example of Monte Carlo methods, as the algorithm is simple and we all know the expected result.

First off, this is what the algorithm looks like in C#:

[Image: C# implementation of the Monte Carlo Pi calculation]

It simply counts the number of points (x-y coordinate pairs) whose vector magnitude falls within the unit circle; the ratio of that count to the total number of points approaches Pi / 4.
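
For reference, the same sampling loop in minimal single-threaded C++ (a sketch of the algorithm, with std::mt19937 standing in for the Mersenne Twister class used in the C# version):

```cpp
#include <iostream>
#include <random>

int main()
{
    const int iterations = 1000000;
    std::mt19937 rng(42); // Mersenne Twister, as in the original
    std::uniform_real_distribution<double> dist(0.0, 1.0);

    int inside = 0;
    for (int i = 0; i < iterations; ++i)
    {
        const double x = dist(rng);
        const double y = dist(rng);
        if (x * x + y * y <= 1.0)
            ++inside; // the point falls within the unit circle
    }

    // the ratio of hits to samples approaches Pi / 4
    std::cout << 4.0 * inside / iterations << std::endl;
}
```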

The random class is the Mersenne Twister code from here: http://takel.jp/mt/MersenneTwister.cs. The results are:

[Image: timing results for the C# version]

Replacing the Mersenne Twister class with System.Random resulted in code which was approximately 30% slower. I’m running this on a dual-core laptop, but have not parallelised it, as I didn’t fancy porting over a Parallel Random Number Generator (PRNG) myself.

Tomas Petricek has an example of using the GPU to calculate Pi using F# here, but his example generates the random numbers on the CPU and uploads them to the GPU to do the calculations and reduction.

C++ AMP Version

Microsoft have just released a C++ AMP random number generator library on CodePlex (http://amprng.codeplex.com/), but I’m using BharathM’s port of the parallel Mersenne Twister described here: http://blogs.msdn.com/b/nativeconcurrency/archive/2011/12/20/mersenne-twister-sample-using-c-amp.aspx

His example generates random numbers into a rank 2 array of size 4096 * 2, but I’ve modified g_n_per_RNG to be 256.

[Image: the random number generation code]

The first thing I do is to pair up those random numbers into x-y coordinates and to find the square of their magnitude:

[Images: code pairing the random numbers into coordinates and computing the squared magnitudes]

Then I determine whether the magnitude of each of these coordinates falls outside of the unit circle:

[Images: code testing each point against the unit circle]
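
Put together, these two steps might look something like this in C++ AMP (an illustrative sketch; names like random_numbers and inside_flags are mine, not the original code):

```cpp
#include <amp.h>
using namespace concurrency;

// random_numbers holds 2 * n uniform [0, 1) values generated on the GPU;
// inside_flags receives 1.0f for points inside the unit circle, 0.0f otherwise.
void classify_points(array_view<const float, 1> random_numbers,
                     array_view<float, 1> inside_flags)
{
    inside_flags.discard_data(); // contents will be completely overwritten
    parallel_for_each(inside_flags.extent, [=](index<1> idx) restrict(amp)
    {
        // pair up consecutive random numbers as an x-y coordinate
        const float x = random_numbers[idx[0] * 2];
        const float y = random_numbers[idx[0] * 2 + 1];
        const float magnitude_squared = x * x + y * y;
        inside_flags[idx] = magnitude_squared <= 1.0f ? 1.0f : 0.0f;
    });
}
```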

The final step is to do a parallel reduce of the resulting array. This is using the just-released reduction method from the C++ AMP algorithms library: http://ampalgorithms.codeplex.com/

[Image: the parallel reduce call]

The reduce function behaves similarly to STL’s std::accumulate, in that you can pass in the binary operation to apply to each pair of items. The function takes a rank 1 array, so I’m using view_as to change the rank.
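
Assuming the reduce overload that takes a rank 1 view and a binary functor along the lines of amp_algorithms::plus, the final step might look roughly like this (a sketch under those assumptions, not the original code):

```cpp
#include <amp.h>
#include <amp_algorithms.h>
using namespace concurrency;

// flags is the rank 2 array of 0/1 values produced by the previous step
float estimate_pi(array<float, 2>& flags, int total_points)
{
    // view_as reinterprets the rank 2 array as a rank 1 view for the reduce
    array_view<float, 1> flat = flags.view_as(extent<1>(total_points));
    // sum the flags on the GPU
    const float inside = amp_algorithms::reduce(flat, amp_algorithms::plus<float>());
    return 4.0f * inside / total_points; // hits / samples approaches Pi / 4
}
```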

To perform the timings I run the algorithm twice: once to warm up (to ensure all shaders etc. are compiled – see http://social.msdn.microsoft.com/Forums/en-US/parallelcppnative/thread/1c6c9f04-1f9f-4f44-99f3-154d991ae5ba).

The results I obtained are:

[Image: C++ AMP timing results]

Calculating the result for half a million random numbers took approximately 16 milliseconds, whereas the single-threaded C# code took 44 ms for a million iterations, so accounting for using both cores the performance seems pretty similar. This is quite disappointing: even though the GPU isn’t the most powerful (a mobile 5470), the CPU isn’t the most powerful either, so I was hoping for nearly an order of magnitude speedup. I wasn’t doing anything clever with tiling, and there may be other parts of the API I’m not using correctly. I’ll have to get hold of Kate Gregory’s book and try some different examples.