I have a kernel that implements a KNN search on 3D spatial data for any given K up to 127.
The problem I encountered: as soon as the kernel uses more than 8KB of shared memory to store point indices, its performance drops. It doesn't matter whether the allocation is explicit (a static __shared__ array) or implicit via the kernel launch configuration (dynamic shared memory), and in the explicit case it happens even if I don't actually use all of the allocated shared memory.
I modified the following kernel to test this issue:
__global__ void rearrangeBuffers(float4* o_pos, float4* o_norm,
                                 unsigned* io_hashes, unsigned* io_indices,
                                 unsigned realSize)
{
    // Staging buffer: the first 128 entries hold positions, the next 128
    // hold normals (the kernel assumes blockDim.x == 128).
    __shared__ float4 points[256];

    const unsigned idx = threadIdx.x + (blockIdx.x * blockDim.x);

    unsigned hash = 0;
    if (idx < realSize) {  // guard the loads so the last block stays in bounds
        hash = io_hashes[idx];
        const unsigned myIndex = io_indices[idx];
        points[threadIdx.x]       = tex1Dfetch(texPositions, myIndex);
        points[threadIdx.x + 128] = tex1Dfetch(texNormals, myIndex);
    }

    __syncthreads();

    if (idx < realSize) {
        o_pos[idx]     = points[threadIdx.x];
        o_norm[idx]    = points[threadIdx.x + 128];
        io_hashes[idx] = hash;  // the hash is just read back and rewritten in this test
    }
}
This kernel uses 256 * sizeof(float4) = 4096 bytes = 4KB of shared memory. So I changed the allocation to:
__shared__ float4 points[513];
which gives 513 * sizeof(float4) = 8208 bytes > 8KB.
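To double-check those numbers I look at the ptxas output (nvcc --ptxas-options=-v) and also query them at runtime, roughly like this (just a sketch using cudaFuncGetAttributes, which reports the static shared memory per block):

#include <cstdio>
#include <cuda_runtime.h>

// Sketch: print the static shared memory the kernel above was compiled with.
void printSharedUsage()
{
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, rearrangeBuffers);  // kernel defined above
    // Should report roughly 4096 bytes with points[256] and 8208 with points[513].
    printf("shared memory per block: %lu bytes\n", (unsigned long)attr.sharedSizeBytes);
}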
The original kernel runs in 80 microseconds (9800 GX2) for about 54K points (float4), with 128 threads per block.
The modified one runs in 90 to 100 microseconds.
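For reference, the timing runs launch the test kernel roughly like this (a sketch; the pointer names are placeholders for my actual device buffers, and texPositions/texNormals are already bound to the position and normal arrays):

// Host side (sketch): 128 threads per block, enough blocks to cover ~54K points.
void runRearrange(float4* d_pos, float4* d_norm,
                  unsigned* d_hashes, unsigned* d_indices, unsigned realSize)
{
    const unsigned threads = 128;  // must match the hard-coded 128 in the kernel
    const unsigned blocks  = (realSize + threads - 1) / threads;

    rearrangeBuffers<<<blocks, threads>>>(d_pos, d_norm, d_hashes, d_indices, realSize);
    cudaThreadSynchronize();       // only so the timer measures the whole kernel
}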
The actual KNN search kernel originally runs on the same data (54K points) in about 16ms for K=15, with 128 threads per block. With K=16, which pushes the shared memory usage past the 8KB limit, the same kernel takes 25ms. If I change the configuration to 64 threads per block, the shared memory usage drops below 8KB again and the kernel takes 19ms. I have already eliminated the shared memory bank conflicts, and the issue is still there…
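The only explanation I can think of is occupancy: as far as I know these G80/G92 parts have 16KB of shared memory per multiprocessor, so once a block needs more than 8KB only one block can stay resident per SM instead of two or more. A rough sketch of that arithmetic (the 16KB figure is my assumption about the hardware, and I'm ignoring the register and thread-count limits):

#include <cstdio>

int main()
{
    // How many blocks fit on one SM if shared memory is the only limit?
    const unsigned smemPerSM = 16 * 1024;  // assumed for G80/G92 (compute 1.x)

    const unsigned smemSmall = 4096;       // points[256]
    const unsigned smemLarge = 8208;       // points[513]

    printf("blocks per SM at %u bytes: %u\n", smemSmall, smemPerSM / smemSmall);  // 4
    printf("blocks per SM at %u bytes: %u\n", smemLarge, smemPerSM / smemLarge);  // 1
    return 0;
}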
Does anyone have a clue about this issue? I've tested it on two GPUs: an 8800 GTX and a 9800 GX2.
Thanks in advance.