I have a simple 2D n-body simulation that I work on in my free time, and I just got a second-hand Titan V to play with (partly because it has double-precision units).

By default, CUDA workloads are limited to the P2 power state, so I disabled that with NVIDIA Profile Inspector to get some extra performance. I also run the fan at maximum (until I get a water block) and add 100 MHz to the core clock.

I get 77 FPS with 65536 points in single precision and, surprisingly, 60 FPS in double precision.

For CUDA I use managedCuda from C#, which loads PTX files, and I then share the buffer with OpenTK for OpenGL rendering.

My question is this: my code computes the force between each pair of points twice, once from each point's side. I have not figured out how to apply the computed force to both points at the same time without creating a dependency between parallel threads. I don't want complex, highly optimized algorithms; I just want to solve this one problem in the simplest way possible.

Here’s my code:

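```
__global__ void Compute(double2* p0, double2* p1, double2* v, int count){
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if(i >= count) return; // guard threads past the end when count is not a multiple of the block size
    const double2 pi = p0[i]; // read this point once instead of on every iteration
    double2 fd = {0.0, 0.0};
    for(int j = 0; j < count; ++j){
        if(i == j) continue;
        const double dx = pi.x - p0[j].x;
        const double dy = pi.y - p0[j].y;
        //if(dx==0 || dy==0)continue;
        const double f = 0.000000001/(dx*dx + dy*dy); // double literal; the F suffix rounded the constant to float
        fd.x += dx*f;
        fd.y += dy*f;
    }
    // Update the velocity in place, then write the new position to the output buffer.
    double2 vi = v[i];
    vi.x -= fd.x;
    vi.y -= fd.y;
    v[i] = vi;
    p1[i] = make_double2(pi.x + vi.x, pi.y + vi.y);
}
```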
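
To make clear what I mean by handling each pair once, here is a minimal serial sketch in plain host-side C++ (it should also compile as host code under nvcc). `Vec2` and `AccumulatePairsOnce` are names I made up for this illustration, with `Vec2` standing in for `double2`. It uses the same force expression and constant as the kernel above. It is not a parallel solution: the writes to `fd[j]` are exactly the ones that would collide if every `i` ran in its own thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for CUDA's double2 so this compiles without CUDA headers.
struct Vec2 { double x, y; };

// Serial reference: visits each unordered pair (i, j) with i < j exactly once
// and applies the result to both points with opposite signs.
// fd must have the same length as p; it is overwritten with the accumulated sums.
void AccumulatePairsOnce(const std::vector<Vec2>& p, std::vector<Vec2>& fd) {
    assert(fd.size() == p.size());
    for (auto& f : fd) f = {0.0, 0.0};
    for (std::size_t i = 0; i < p.size(); ++i) {
        for (std::size_t j = i + 1; j < p.size(); ++j) {
            const double dx = p[i].x - p[j].x;
            const double dy = p[i].y - p[j].y;
            const double f = 0.000000001 / (dx * dx + dy * dy);
            fd[i].x += dx * f;  // contribution to point i
            fd[i].y += dy * f;
            fd[j].x -= dx * f;  // equal and opposite contribution to point j
            fd[j].y -= dy * f;
        }
    }
}
```

The inner loop runs count*(count-1)/2 times instead of count*(count-1), which is the saving I am after. What I can't see is how to keep that saving once the `fd[j]` updates come from many threads at once.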

Thanks.