A kernel's performance depends on the sortedness of its data, but sorting the data would take more time than the performance gained.

What are standard optimizations in cases like this?

For example, I have an n-body simulator with 100 million particles whose bottleneck is force sampling, which takes 12 milliseconds on an RTX 5070. The algorithm samples densities from a 2D lattice for each particle, but the particles are randomly placed on the terrain and at least 4 lattice accesses are made per particle.

I’m not sure if sorting the x, y, vx, vy data (each a float) of 100 million elements (1.6 GB) would save 12 milliseconds on all GPUs. The device has around 600 GB/s of bandwidth, so it would take at least 2.7 milliseconds just to read every element and another 2.7 milliseconds to write them back, and sorting does a bit more than just one read and one write.

Kernel: CosmosSimulationWithCuda/CosmosCuda.cuh at e580cf54805840867dad72be9bf28518d11716a4 · tugrul512bit/CosmosSimulationWithCuda · GitHub

I’ve tried coarsening other inputs/outputs, but the accesses to the ~128 MB lattice buffer are fully randomized.

Maybe shared memory can help a bit, but the particles would not be in the same tile at the same time, which makes it harder to amortize the extra time spent copying from global memory into shared memory.

The L1 cache can still help when many particles access the same location, but in sparse areas it wouldn’t.

Without touching the particle data, what kinds of optimizations can be done?

Have you looked into tree-based methods (the classical one being Barnes-Hut)? I have not worked on n-body code since the early days of CUDA, but a quick search on Google Scholar suggests that variants of Barnes-Hut are still being used on modern GPUs.

@njuffa

That still has to touch the particle data, but I guess there’s no choice other than a tree/grid. It would also improve accuracy by computing closest neighbors more precisely (better closest-neighbor calculation was the main intention for the tree; I postponed it for later). The lowest-hanging fruit was this sampling kernel, and I hoped something could be done inside the kernel alone.

Edit:

Optimized the multi-sampling (gradient) by computing the gradient before the particle-lattice interaction. Now each particle makes 2 memory accesses (the x, y force vector) instead of 5 (the gradient components), and the lattice resolution can be increased independently of the number of particles without extra slowdown.