A kernel's performance depends on the sortedness of its data, but sorting the data would take more time than the performance gained.

What are standard optimizations in cases like this?

For example, I have an n-body simulator with 100 million particles whose bottleneck is force sampling, which takes 12 milliseconds on an RTX 5070. The algorithm samples densities from a 2D lattice for each particle, but the particles are randomly placed on the terrain and at least 4 lattice accesses are made per particle.

I’m not sure if sorting the x, y, vx, vy data (each a float) of 100 million elements (1.6 GB) would save 12 milliseconds on all GPUs. The device has around 600 GB/s of bandwidth, so it would take at least 2.7 milliseconds just to read every element and another 2.7 milliseconds to write them back, and sorting does a bit more than just one read and one write.

Kernel: CosmosSimulationWithCuda/CosmosCuda.cuh at e580cf54805840867dad72be9bf28518d11716a4 · tugrul512bit/CosmosSimulationWithCuda · GitHub

I’ve tried coarsening other inputs/outputs, but the accesses to the ~128 MB lattice buffer are fully randomized.

Maybe shared memory can help a bit, but the particles would not be in the same tile at the same time, which makes it harder to amortize the extra time spent copying from global memory into shared memory.

The L1 cache can still help when many particles access the same location, but in sparse areas it wouldn’t.

Without touching the particle data, what kinds of optimizations can be done?

Have you looked into tree-based methods (the classical one being Barnes-Hut)? I have not worked on n-body code since the early days of CUDA, but a quick search on Google Scholar suggests that variants of Barnes-Hut are still being used on modern GPUs.

@njuffa

That still has to touch the particle data, but I guess there’s no choice other than a tree/grid. It would also improve accuracy by computing closest neighbors more precisely (better closest-neighbor calculation was the main intention for the tree; I postponed it for later). The lowest-hanging fruit was this sampling kernel, and I hoped something could be done inside the kernel alone.

Edit:

Optimized the multi-sampling (gradient) by computing the gradient before the particle-lattice interaction. Now each particle makes 2 memory accesses (the x, y force vector) instead of 5 (the gradient components), and the lattice resolution can be increased independently of the number of particles without extra slowdown.