GPGPU–playing with Microsoft Accelerator

It’s probably being screamingly obvious to some readers that the boids simulations I’ve been playing with are embarassingly parallel, so I thought I’d have a quick play.

I’ve been reading around about OpenCL and CUDA, but as there’s a Microsoft library with a .NET API for easily programming the GPU, I thought I’d have a play with Accelerator (another interesting .NET GPGPU library is Brahma – I might get around to playing with that one day). Accelerator is higher-level, no need to worry about the low-levels of the GPU memory management.

I decided to play with a simpler example than the boids, to focus on the technology instead of the problem domain. I chose to look at the all-pairs NBody simulation (see more info here).

I quickly coded up a simple example using 1000 bodies. The CPU was able to draw at approx 15 frames per second (I didn’t bother parallelising the simulation on the CPU, as I was hoping for an order of magnitude increase in speed on the GPU). WPF is incredibly slow in drawing, and I found (unexpectedly) that using DrawingVisuals to be even slower. For that reason, I’m only drawing 100 bodies, but all of them are included in the simulation. I was intending to reduce the bottleneck by using Direct2D, and then getting Accelerator to write out to texture memory to save transferring data over the bus.

I didn’t get the results I expected when using Accelerator – I first began by converting the main simulation routine (integration of positions) onto the GPU, and left the all-body force calculation on the CPU. I was surprised to find the simulation slower – I was hitting frame rates of 10 fps.

I guessed that maybe it was maybe transferring too much data between the CPU and GPU, so I then moved onto the force calculation. I was very surprised to find that this made the simulation orders of magnitude slower (i.e. hitting frame rates < 0.01 fps). I profiled this to find that the majority of the time was spent in CompileShader. This isn’t so surprising – I was building up the same calculation for each body, for each frame.

Following the advice in the Accelerator Programmers Guide, I then moved onto using Parameter Objects. This means that it’s able to use the same computation graph with different input data. This did help, but only by an order of magnitude. It’s still not approaching anywhere near real-time frame rates.

I can’t remember where I read it, but I read that it’s recommended using input data sizes of the order 1e6 elements to overcome the overhead of transferring data to and from the GPU. This does make sense, but I was expecting to be at least getting interactive frame rates (as the OpenCL simulations are obtaining). It may be that Accelerator is faster than the CPU with a large number of elements, but it may be that e.g. instead of rendering a frame in an hour, it takes five minutes. It doesn’t seem to be suitable for interactive simulations.

This could be a simple case of user-error. I’ve got the code available on taumuon. If I’m missing something obvious, or you can get faster frame rates than the CPU, please post in the comments

(As I’m discussing performance I guess I should disclose the software and hardware specs. Running Windows x64, on a HP DV3 laptop – 4GB ram, dual core Pentium P6100, ATI Radeon 5470).