"CUDA Particles" improvement

I’m trying to find nearest-neighbors using the CUDA Particles approach (I use the sorting method).
I have 3 buffers (just like in the PDF):

  • Buffer P — containing unsorted particles
  • Buffer H — containing sorted cell-id — particle-id pairs (sorted by cell-id)
  • Buffer L — containing starting and ending positions (in buffer H) of particles with certain cell id
  • Now I want to use these buffers in a kernel A to find nearest neighbors of each particle.
    Currently I run the kernel A for all particles in P. One thread for the kernel A loops though all neighbors of the current particle.

    There’s a lot of scattered read operations involved (calculating all 27 neighboring grid cell ids and looping through all particles in the grid cells).

    for(int x=-1; x<=1; x++)
        for(int y=-1; y<=1; y++)
          for(int z=-1; z<=1; z++) {
            float3 neigh_position = curr_particle_pos + float3(x,y,z)*GRID_CELL_SIDE;
            // ugly boundary checking
            if ( dot(neigh_position<0,                  float3(1)) +
                 dot(neigh_position>BOUNDARY, float3(1))   != 0)
            int neigh_hash        = spatial_hash( neigh_position ); // linearly hashed discrete grid position
            int2 particles_range  = L[ neigh_hash ]; // the buffer L
            for(int p=particles_range.x; p<particles_range.y; p++)
              processed_value += heavy_computation( P[ H[p].y ] );

    How can I optimize the neighbor searching code above by using Z-index curve as grid-position hash?
    Can I iterate through a continuous range of memory containing all of neighboring grid cells (and hence their particles),
    instead of the scattered writes above (see the ideal code example below)?

    I’d be very grateful for any answers (even hints)!

    // The ideal code – it finds current particle's grid hash and based on that computes
      // a continuous range on Z-index curve into which fall all the neighbors,
      // resulting in non-scattered reads
      int2 nearest_neighboring_cells_range = get_neighbors_range(curr_particle);
      int first_particle_id = L[ nearest_neighboring_cells_range.x ].x;
      int last_particle_id  = L[ nearest_neighboring_cells_range.y ].y;
      for(int p=first_particle_id; p<=last_particle_id; p++) {
          processed_value += heavy_computation( P[ H[p].y ] );

    If sort them by their Z-index, it means that you’ll find the neighbours in somewhat close reads.

    Thank you for reply!

    So does that mean that just by changing the grid-position hash from the current,
    linear hash (i.e. hash = particle.x + particle.yGRID_SIDE + particle.zGRID_SIDE^2) to Z-index would instantly
    boost the performance (without changing the algorithm of iteration over neighbors)?