Dear all, I’ve been a very irregular CUDA/C++ user since 2009, so I wouldn’t exactly consider myself an expert of any sort.
For a somewhat unusual kind of point cloud renderer, I find myself in the situation of having to write to “heightfield-like” manifolds embedded in a 3D volume, i.e., read-modify-write buffer[x,y,z] where z = f(x,y). Here’s a mock-up of what I want to achieve:
#define _width 64
#define _height 64
#define _depth 256
#define _N 100
__global__
void render_kernel(float* frameBuf, float3* points, size_t numPoints) {
int x = threadIdx.x + blockDim.x * blockIdx.x;
int y = threadIdx.y + blockDim.y * blockIdx.y;
for (int j = 0; j < numPoints; ++j) { // Render contribution of point j to this thread's pixel
float3 p = points[j];
// Calculate z position as some function of (x,y) pixel coordinate and location of point j
int z = floor(min(_depth - 1, length(p - make_float3(x,y,0))));
// Increment buffer at that position
int index = x + gridDim.x * blockDim.x * (y + gridDim.y * blockDim.y * z);
frameBuf[index] += 1;
}
}
int main(){
float3 cpuPoints[N];
/* [initialize points somehow] */
float3* gpuPoints;
cudaMalloc((void **)&gpuPoints, N * sizeof(float3));
cudaMemcpy(gpuPoints, cpuPoints, N * sizeof(float3), cudaMemcpyHostToDevice);
float* gpuBuf;
cudaMalloc((void **)&gpuBuf, sizeof(float) * width * height * depth);
dim3 nblocks(width / 16, height / 16);
dim3 nthreads(16, 16, 1);
render_kernel <<<nblocks, nthreads>>> (gpuBuf, gpuPoints, N);
}
So every point contributes at all (x,y) locations in the buffer, but at an ever changing z-depth. The approach above reflects my intuition that it would seem natural to parallelize over pixels and loop over points. This way, all write access are guaranteed to be collision-free. However, this leads to the scattered write in line 18, which seems to be a real performance killer (commenting this line out or writing to a shared memory location gives a massive speedup).
The point set has some spatial coherence, so subsequent points are likely to contribute to nearby z locations. However, I haven’t had any luck trying to cache parts of the global buffer in shared memory.
I’d greatly appreciate any advice on how to best model this problem in CUDA.
Kind regards,
Matthias