Profiling GPU for N-body simulations

Hi, I’ve written a second-order N-body simulation in 2-D. Each body is simulated by one thread on a GTX285 and requires 5 floats of global memory to store parameters such as location, velocity and magnitude.

Anyway, profiling this against the same code on a CPU gives odd results (see attached graph). The breakdowns occur at 4800 bodies and 9600 bodies, which are 20x and 40x the 240 cores on the card. Does anyone know why this is? Why the 20x factor?

Many thanks, David
performanceGPUOverCPU.png

It may be related to the number of blocks that you launch.

I suspect that you launch a different number of blocks for different problem sizes (instead of a constant number of blocks looping over the problem space).

Blocks are scheduled in batches. If the last batch does not have enough blocks to saturate the GPU, most of the hardware sits idle while it runs, yet that batch can take about as long as a full one.

Depending on the nature of the application, this last batch can bring down the overall speedup.

This is what I could think of.

Yeah, that is what it looks like to me too.

Basically you get predictable runtime behaviour up to 30N blocks in the simulation (where N is the number of blocks that run concurrently per multiprocessor, and the GTX285 has 30 multiprocessors). When you need to go to 30N+1 blocks, that requires up to double the time, because the GPU will run 30N blocks concurrently in one batch, then run 1 block in a second batch. As you increase the number of blocks beyond 30N, things again scale predictably as the second batch fills and the GPU reaches peak occupancy at the MP level again, until you hit 60N, and then the same problem arises.
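To make the 30N versus 30N+1 argument concrete, here is a rough sketch in Python of the batch count under that simplified model. The function name `batches` and the value N = 4 are only illustrative assumptions; the real N depends on the kernel's resource usage, and the hardware scheduler is not this rigid.

```python
import math

NUM_MPS = 30  # multiprocessors on a GTX285, as stated above


def batches(num_blocks, blocks_per_mp):
    """Batches needed if the GPU runs NUM_MPS * blocks_per_mp blocks at once.

    Simplified model: every batch is assumed to take the same time,
    however few blocks it contains.
    """
    concurrent_blocks = NUM_MPS * blocks_per_mp
    return math.ceil(num_blocks / concurrent_blocks)


if __name__ == "__main__":
    n = 4  # hypothetical blocks per multiprocessor
    for num_blocks in (30 * n, 30 * n + 1, 60 * n, 60 * n + 1):
        print(num_blocks, "blocks ->", batches(num_blocks, n), "batch(es)")
```

Under this model, one extra block past a multiple of 30N costs a whole extra batch, which is where the sudden drop in speedup would come from.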

avidday gave a good description of the issue.

Just to add my 2c, I will say that this effect is much easier to see if you plot the wall clock time needed for one time step vs the number of particles. That plot shows a characteristic stair-step pattern, with the width of each step being the number of threads that run concurrently (not necessarily 30 blocks: one MP can concurrently run more than one block).
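As a rough illustration of where the step edges would fall on such a plot, the sketch below scans particle counts and reports each count at which one more batch becomes necessary. It assumes one thread per particle; the block size of 64 and the 120 concurrent blocks (30 MPs times 4 blocks each) are made-up example values, and `batches_for` and `step_edges` are hypothetical helper names.

```python
import math


def batches_for(num_particles, block_size, concurrent_blocks):
    """Batches needed for one time step, one thread per particle."""
    num_blocks = math.ceil(num_particles / block_size)
    return math.ceil(num_blocks / concurrent_blocks)


def step_edges(max_particles, block_size, concurrent_blocks):
    """Particle counts at which the modelled time per step jumps."""
    edges = []
    previous = batches_for(1, block_size, concurrent_blocks)
    for n in range(2, max_particles + 1):
        current = batches_for(n, block_size, concurrent_blocks)
        if current != previous:
            edges.append(n)
            previous = current
    return edges


if __name__ == "__main__":
    # Hypothetical configuration: 64 threads per block, 120 concurrent blocks.
    print(step_edges(20000, block_size=64, concurrent_blocks=120))
```

In this model the spacing between edges equals the block size times the number of concurrent blocks, i.e. the number of concurrently running threads.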

Thanks all - You were right, I was using a block size of 20.

Best regards, David

That under-utilizes the hardware significantly: a block of 20 threads still occupies a full 32-thread warp, so 12 of the 32 lanes sit idle. Block sizes should always be multiples of the warp size (32).
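A quick back-of-the-envelope check in Python, as a sketch only. It assumes the documented compute capability 1.3 limits of 8 resident blocks and 1024 resident threads per multiprocessor, and it assumes registers and shared memory are not the tighter constraint for this kernel, which the thread does not confirm. The function names are hypothetical.

```python
import math

WARP_SIZE = 32
NUM_MPS = 30               # GTX285 multiprocessors, as stated above
MAX_BLOCKS_PER_MP = 8      # assumed limit (compute capability 1.x)
MAX_THREADS_PER_MP = 1024  # assumed limit (compute capability 1.3)


def lane_utilization(block_size):
    """Fraction of warp lanes doing useful work for a given block size."""
    warps = math.ceil(block_size / WARP_SIZE)
    return block_size / (warps * WARP_SIZE)


def concurrent_threads(block_size):
    """Threads resident on the whole GPU, ignoring register and
    shared-memory limits (an assumption, see the text above)."""
    blocks = min(MAX_BLOCKS_PER_MP, MAX_THREADS_PER_MP // block_size)
    return NUM_MPS * blocks * block_size


if __name__ == "__main__":
    for block_size in (20, 32, 64, 128):
        print(block_size, lane_utilization(block_size),
              concurrent_threads(block_size))
```

If those assumptions hold, a block size of 20 gives 30 x 8 x 20 = 4800 concurrent threads, which would line up with the breakdowns reported at 4800 and 9600 bodies.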