2D n-body simulation optimization

I have my own simple 2D n-body sim that I play with in my own free time and I just got a second hand Titan V (partly because it has the double precision units) to play with.

CUDA performance is limited by default to the C2 power state so I disabled this with “NVIDIA Profile Inspector” to get some extra performance, I also max the fan (till I get a water block) and add 100MHz to the core clock.

I get 77 FPS with 65536 points single precision and 60 FPS (surprisingly) with double.

For CUDA I use managedCuda in C# which uses ptx files, I then interop the buffer with OpenTK for OpenGL rendering.

My question is this: my code processes the force between each possible pair of points twice and I have not figured out how to apply the computed force to both points at the same time without creating a parallel dependence, I don’t want to use complex highly optimized algorithms, just to solve this one problem in the simplest way possible.

Here’s my code:

__global__ void Compute(double2* p0, double2* p1, double2* v, int count){
	auto i = blockIdx.x * blockDim.x + threadIdx.x;
	double2 fd = {0.0F, 0.0F};
	for(auto j = 0; j < count; ++j){
		if(i == j) continue;
		const auto dx = p0[i].x - p0[j].x;
		const auto dy = p0[i].y - p0[j].y;
		//if(dx==0 || dy==0)continue;
		const auto f = 0.000000001F/(dx*dx + dy*dy);
		fd.x += dx*f;
		fd.y += dy*f;
	p1[i].x = p0[i].x + (v[i].x -= fd.x);
	p1[i].y = p0[i].y + (v[i].y -= fd.y);


Create enough threads for all pairs: N*(N-1)/2 threads for N particles.

index shall be the absolute thread identifier < N*(N-1)/2

Note: all indices (x,y,index) count from 0 and x,y < N

the following instruction sequence maps the thread index to particle indices x,y for which to compute the forces.

// Note: extremely large numbers for index may cause precision issues with the sqrt
unsigned int y = 1 + (unsigned int)( (sqrt(1+8index)-1)/2 );
unsigned int x = (index - (y

The following mapping works for storing any pairwise information in an array - or in your case it maps x,y to a thread index for the respective particle pair

// Note: x != y must hold
unsigned int x_ = min(x,y);
unsigned int y_ = max(x,y);
unsigned index = y_*(y_-1)/2 + x_;

I use the above logic to compute, store and retrieve pairwise information in arrays using the least amount of memory or threads.

The difficulty for your force calculations would be to prevent conflicts in accessing the forces or positions of individual particles, as multiple threads may be trying to modify positions for the same particles simultaneously. Locking mechanisms or use of atomics may help.

It’s possible to do linear to 2D triangular mapping without using recursion or floating point methods or square root:


(the comb_n functor, described in section 7 of the answer)

But if there is a thread for each of 0 and N for example how would the values each thread produces add up to a total to be assigned to 0, wouldn’t they each overwrite the value the last thread produced?

Also the last time I tried a thread for each pair it was very slow.

I cant figure out any of your suggestions, this is the only CUDA I have done and its literally just copied from my CPU version.

You could have one kernel summate the forces that affect each particle pair, using atomicAdd() on double variables. This atomicAdd() is available for compute capability 6.0 or better. Each calculated force for a particle pair would be added to the respective particles, with opposing sign. Note that the atomic operation will prevent multiple threads from overwriting each other’s result.

A second kernel would apply the calculated forces onto the individual particles.

Whether the use of N*(N-1)/2 threads combined with atomicAdd() is any faster than the use of NN threads remains to be seen. The memory access patterns are probably way more benign and coalesced in the NN case

Also there are probably way better optimizations for the N body problem than just removing redundant computations.

Recently neural networks have been applied successfully to such simulations, e.g.
https://arxiv.org/pdf/1708.08819.pdf and https://www.pnas.org/content/116/28/13825


Also consider this (older) article about optimizing N body simulations

and here’s a Mini-nbody simulation by Mark Harris, that may be suitable to learn various optimization techniques


I adapted the code from the nvidia nbody sample project:

__device__ float2 bodyBodyInteraction(float2 pi, float2 pj, float2 ai){
    float2 r;
    r.x = pj.x - pi.x;
    r.y = pj.y - pi.y;
    const auto f = 0.0000000001F / (r.x * r.x + r.y * r.y);
    ai.x += r.x * f;
    ai.y += r.y * f;
    return ai;
__device__ float2 computeBodyAccel(float2 bodyPos, float2* positions, int numTiles, class cg::thread_block cta){
    extern __shared__ float2* sharedPos;
    float2 acc = {0.0f, 0.0f};
    for (auto tile = 0; tile < numTiles; ++tile){
        sharedPos[threadIdx.x] = positions[tile * blockDim.x + threadIdx.x];
#pragma unroll 128
        for (unsigned int counter = 0; counter < blockDim.x; counter++){
            acc = bodyBodyInteraction(acc, bodyPos, sharedPos[counter]);
    return acc;
__global__ void integrateBodies(float2*__restrict__ oldPos, float2*__restrict__ newPos, float2* vel, unsigned int count){
    const class cg::thread_block cta = cg::this_thread_block();
    const int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index >= count){ return; }
    auto position = oldPos[index];
    const auto accel = computeBodyAccel(position, oldPos, blockDim.x, cta);
    position.x += vel[index].x += accel.x;
    position.y += vel[index].y += accel.y;
    newPos[index] = position;

and all of the points at the new position are 0x0(middle of the screen) and if I try to access the device memory after its executed I get an exception saying something like the kernel failed, clearly the shared memory thing aint working right but I have zero experience with that stuff.

I went back to my original code and realized that if I add a very small number to the distance squared it eliminates the possibility of a divide by zero failure and it allows me to remove the if’s making it more optimization friendly which increased performance a bit.
Also unrolling by 32 increased performance a bit(32 works best) and that resulted in a 1.27x performance increase with 65536 points!

__global__ void Compute(float2* p0, float2* p1, float2* v, int count){
    const auto i = blockIdx.x * blockDim.x + threadIdx.x;
    float2 fd = {0.0F, 0.0F};
#pragma unroll 32
    for (auto j = 0; j < count; ++j){
        const auto dx = p0[i].x - p0[j].x;
        const auto dy = p0[i].y - p0[j].y;
        const auto f = 0.0000000001F/(dx*dx + dy*dy + 0.000000000001F);
        fd.x += dx * f;
        fd.y += dy * f;
    p1[i].x = p0[i].x + (v[i].x -= fd.x);
    p1[i].y = p0[i].y + (v[i].y -= fd.y);

I got the code I adapted from the toolkit sample to work and now I can do 1048576 points at 100 FPS… that’s freakin insane!!!

__global__ void Compute(float2*__restrict__ oldPos, float2*__restrict__ newPos, float2* vel, int count){
    const auto cta = cg::this_thread_block();
    const int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index >= count){ return; }
    auto me = oldPos[index];
    __shared__ float2 them[64];
    float2 acc = {0.0F, 0.0F};
    for (auto tile = 0U; tile < blockDim.x; ++tile){
        them[threadIdx.x] = oldPos[tile * blockDim.x + threadIdx.x];
#pragma unroll 32
        for (auto counter = 0U; counter < blockDim.x; ++counter){
            float2 r;
            r.x = me.x - them[counter].x;
            r.y = me.y - them[counter].y;
            const auto f = 0.00000001F / (r.x * r.x + r.y * r.y + 0.000000000001F);
            acc.x += r.x * f;
            acc.y += r.y * f;
    me.x += vel[index].x -= acc.x;
    me.y += vel[index].y -= acc.y;
    newPos[index] = me;

I figured out why the new kernel is so fast, its because the force of each point is only calculated with the points in its block which results in a very inaccurate simulation.

I can’t imagine nVidia to provide an incorrect or inaccurate code as their sample project.

What I think they might be doing is that each thread block interacts a subset of particles with another subset. That allows the block to pre-load the data for just the required subsets into shared memory, interacting these particle efficiently before writing out the resulting (partial) force results. The acceleration occurs due to saving global memory access by efficiently loading/storing data only once.

If enough thread blocks are created to process all pairs of particle subsets, the result will be accurate, given that the forces computed by each thread block are getting summated correctly (e.g. by means of atomic instructions)

I can switch between the 2 kernels in realtime and with both locked at 60 FPS with 65536 points the kernel in post #8 forms dense clumps, then upon switching to the new kernel the dense clumps explode because the gravitational forces are less and therefore the velocities are too great and they no longer clump as densely, I also notice other ways in which the faster kernel is not quite right.

The accurate simulation is much more visually pleasing.