2.5D read-modify-write on 3D volume

Dear all, I’ve been a very irregular CUDA/C++ user since 2009, so I wouldn’t exactly consider myself an expert of any sort.

For a somewhat unusual kind of point cloud renderer, I find myself in the situation of having to write to “heightfield-like” manifolds embedded in a 3D volume, i.e., read-modify-write buffer[x,y,z] where z = f(x,y). Here’s a mock-up of what I want to achieve:

#include <cuda_runtime.h>
#include "helper_math.h"  // from the CUDA samples: float3 operators, length(), fminf()

#define _width 64
#define _height 64
#define _depth 256
#define _N 100

__global__
void render_kernel(float* frameBuf, const float3* points, size_t numPoints) {
  int x = threadIdx.x + blockDim.x * blockIdx.x;
  int y = threadIdx.y + blockDim.y * blockIdx.y;
  for (size_t j = 0; j < numPoints; ++j) {  // Render contribution of point j to this thread's pixel
    float3 p = points[j];

    // Calculate z position as some function of the (x,y) pixel coordinate and the location of point j
    int z = (int)floorf(fminf((float)(_depth - 1), length(p - make_float3(x, y, 0.0f))));

    // Increment buffer at that position (gridDim.x * blockDim.x == _width, gridDim.y * blockDim.y == _height)
    int index = x + gridDim.x * blockDim.x * (y + gridDim.y * blockDim.y * z);
    frameBuf[index] += 1.0f;
  }
}

int main() {

   float3 cpuPoints[_N];
/* [initialize points somehow] */

   float3* gpuPoints;
   cudaMalloc((void **)&gpuPoints, _N * sizeof(float3));
   cudaMemcpy(gpuPoints, cpuPoints, _N * sizeof(float3), cudaMemcpyHostToDevice);

   float* gpuBuf;
   cudaMalloc((void **)&gpuBuf, sizeof(float) * _width * _height * _depth);
   cudaMemset(gpuBuf, 0, sizeof(float) * _width * _height * _depth);  // start from an empty buffer

   dim3 nblocks(_width / 16, _height / 16);
   dim3 nthreads(16, 16, 1);
   render_kernel <<<nblocks, nthreads>>> (gpuBuf, gpuPoints, _N);
   cudaDeviceSynchronize();

   cudaFree(gpuBuf);
   cudaFree(gpuPoints);
   return 0;
}

So every point contributes at all (x,y) locations in the buffer, but at an ever-changing z-depth. The approach above reflects my intuition that it seems natural to parallelize over pixels and loop over points; this way, all write accesses are guaranteed to be collision-free. However, this leads to the scattered write frameBuf[index] += 1.0f, which seems to be a real performance killer (commenting that line out, or writing to a shared memory location instead, gives a massive speedup).
The point set has some spatial coherence, so subsequent points are likely to contribute to nearby z locations. However, I haven’t had any luck trying to cache parts of the global buffer in shared memory.

I’d greatly appreciate any advice on how to best model this problem in CUDA.

Kind regards,
Matthias

It seems your approach iterates over the source data and then figures out all the target data each source datum contributes to. It is often advantageous to process the other way around: iterate over the target data and figure out all the source data that contributes to each target datum.
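To make that concrete, a gather-style version of your kernel could look roughly like the following sketch (untested; one thread per (x,y,z) voxel, reusing the macros and helper_math.h setup from your listing, at the cost of recomputing z for every depth slice):

// Gather sketch: each thread counts the points that map onto its own voxel,
// so every thread writes only its own output cell and no atomics are needed.
__global__
void render_gather_kernel(float* frameBuf, const float3* points, size_t numPoints) {
  int x = threadIdx.x + blockDim.x * blockIdx.x;
  int y = threadIdx.y + blockDim.y * blockIdx.y;
  int z = threadIdx.z + blockDim.z * blockIdx.z;

  float count = 0.0f;
  for (size_t j = 0; j < numPoints; ++j) {
    float3 p = points[j];
    int zj = (int)floorf(fminf((float)(_depth - 1), length(p - make_float3(x, y, 0.0f))));
    if (zj == z) count += 1.0f;   // this point lands in this thread's voxel
  }

  frameBuf[x + _width * (y + _height * z)] += count;
}

// Launch with e.g. dim3 nblocks(_width / 16, _height / 16, _depth), dim3 nthreads(16, 16, 1).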

I have trouble visualizing the access patterns for each approach in your context to see which may be the better approach. So for now treat my comment as food for thought.

Scattered writes are not inefficient per se if there is some spatial coherence, if you exploit that coherence, and if you take care not to put too much pressure on the atomic units. See e.g. the description of the warping routine at https://www.joanneum.at/uploads/tx_publicationlibrary/FAH-2010-GRAVISMA.pdf
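One simple way to exploit the spatial coherence of your point set (just an illustration, not the method from the paper): sort the points once on the GPU by a depth-related key, here arbitrarily p.z, using thrust::sort_by_key, so that consecutive points tend to hit nearby z slices and the scattered writes cluster in memory:

#include <thrust/device_vector.h>
#include <thrust/device_ptr.h>
#include <thrust/sort.h>
#include <thrust/transform.h>

// Functor extracting the sort key (here simply the point's z coordinate).
struct ZKey {
  __host__ __device__ float operator()(const float3& p) const { return p.z; }
};

// Reorder the points in place so that points with similar depth are adjacent;
// consecutive loop iterations (or threads) then write to nearby buffer regions.
void sort_points_by_depth(float3* gpuPoints, size_t numPoints) {
  thrust::device_ptr<float3> pts(gpuPoints);
  thrust::device_vector<float> keys(numPoints);
  thrust::transform(pts, pts + numPoints, keys.begin(), ZKey());
  thrust::sort_by_key(keys.begin(), keys.end(), pts);
}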

I am not sure whether this is just a toy problem in terms of image size or whether your images are always this 'small'. In your example the image is 64 x 64. If you parallelize only over the pixels, you have just 4096 threads in total, which is far too few to fully utilize a modern GPU (you typically need at least 10 to 20 times as many threads).

So you should parallelize over x, y and also over 'p' (the points). Of course that means the increment of frameBuf[index] must then be done atomically. To do that efficiently, you might take advantage of 'warp-aggregated atomics' - see the function 'atomicAggInc' in https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-optimized-filtering-warp-aggregated-atomics/
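As a starting point, here is a sketch of that parallelization with a plain atomicAdd; the grid's z dimension indexes the point, and warp aggregation along the lines of the blog post could be layered on top where many lanes of a warp hit the same address:

// Sketch: parallelize over pixels AND points at once; collisions between
// threads are now possible, so the increment uses atomicAdd.
__global__
void render_kernel_atomic(float* frameBuf, const float3* points, size_t numPoints) {
  int x = threadIdx.x + blockDim.x * blockIdx.x;
  int y = threadIdx.y + blockDim.y * blockIdx.y;
  size_t j = blockIdx.z;                 // one point per z layer of the grid
  if (j >= numPoints) return;

  float3 p = points[j];
  int z = (int)floorf(fminf((float)(_depth - 1), length(p - make_float3(x, y, 0.0f))));

  atomicAdd(&frameBuf[x + _width * (y + _height * z)], 1.0f);
}

// Launch, e.g.:
//   dim3 nthreads(16, 16, 1);
//   dim3 nblocks(_width / 16, _height / 16, _N);
//   render_kernel_atomic<<<nblocks, nthreads>>>(gpuBuf, gpuPoints, _N);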

Note that another possibility to increase parallelism would be to process several images in parallel, each image on a different CUDA stream. Or use both strategies.
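A rough sketch of the streams idea, assuming you have per-image buffers and point sets (gpuBufs[i], gpuPointSets[i] and pointCounts[i] are placeholder names, not from your code):

// Launch one render per image on its own stream so the kernels can overlap.
const int numImages = 4;                      // example value
cudaStream_t streams[numImages];
for (int i = 0; i < numImages; ++i)
  cudaStreamCreate(&streams[i]);

for (int i = 0; i < numImages; ++i)
  render_kernel<<<nblocks, nthreads, 0, streams[i]>>>(gpuBufs[i], gpuPointSets[i], pointCounts[i]);

for (int i = 0; i < numImages; ++i) {
  cudaStreamSynchronize(streams[i]);
  cudaStreamDestroy(streams[i]);
}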