Shared memory performance issue

I have a kernel that implements a KNN search on 3D spatial data for any given K up to 127.

The problem I encountered: when I use shared memory to store point indices, and the kernel uses more than 8KB of shared memory (it doesn't matter whether the allocation is explicit or implicit via the kernel launch configuration), kernel performance drops.

This happens even if I don't actually use all of the shared memory (in the case of explicit allocation).

I’ve modified the following kernel to test this issue.

// The texture references are declared at file scope (not shown here);
// presumably something like:
// texture<float4, 1, cudaReadModeElementType> texPositions;
// texture<float4, 1, cudaReadModeElementType> texNormals;

__global__ void rearrangeBuffers(float4* o_pos, float4* o_norm,
                                 unsigned* io_hashes, unsigned* io_indices,
                                 unsigned realSize)
{
	// 256 * sizeof(float4) = 4KB: positions in [0..127], normals in
	// [128..255], for a block of 128 threads.
	__shared__ float4 points[256];

	const unsigned idx = threadIdx.x + (blockIdx.x * blockDim.x);

	unsigned hash = 0;

	if (idx < realSize) {
		// Read the hash and the gather index only for valid elements,
		// so out-of-range threads never touch the input arrays.
		hash = io_hashes[idx];
		const unsigned myIndex = io_indices[idx];
		points[threadIdx.x]       = tex1Dfetch(texPositions, myIndex);
		points[threadIdx.x + 128] = tex1Dfetch(texNormals, myIndex);
	}

	__syncthreads();

	if (idx < realSize) {
		// Write the gathered data back out in the rearranged order.
		o_pos[idx]     = points[threadIdx.x];
		o_norm[idx]    = points[threadIdx.x + 128];
		io_hashes[idx] = hash;
	}
}

The shared memory usage is 256 * sizeof(float4) = 4096 bytes = 4KB. So, changing the allocation to:

__shared__ float4 points[513];

I get a usage of 513 * sizeof(float4) = 8208 bytes > 8KB.

The original kernel runs in 80 microseconds (9800 GX2) for about 54K points (float4), with 128 threads per block.

The modified one runs in between 90 and 100 microseconds.

The KNN search kernel itself originally runs over the same data (54K points) in about 16ms for K=15, with a configuration of 128 threads per block. When K=16 (which pushes the shared memory usage past the 8KB limit), the same kernel runs in 25ms. If I change the configuration to 64 threads per block, the shared memory usage drops below 8KB and the kernel runs in 19ms. I have already eliminated shared memory bank conflicts and the issue is still there…
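Roughly, here is how K and the block size interact with that 8KB boundary (taking, purely for illustration, one 4-byte entry per candidate neighbor per thread; the actual per-thread layout may differ, but it doesn't change the picture much):

// Illustrative shared-memory arithmetic (assumed layout: one 4-byte
// entry per neighbor candidate per thread).
// 128 threads * K=15 * 4 bytes = 7680 bytes  -> fits under 8KB
// 128 threads * K=16 * 4 bytes = 8192 bytes  -> right at the 8KB mark
//  64 threads * K=16 * 4 bytes = 4096 bytes  -> fits again
size_t smemPerBlock(unsigned threadsPerBlock, unsigned K)
{
	return (size_t)threadsPerBlock * K * sizeof(float);
}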

Does anyone have a clue about this issue? I've tested it on two GPUs: an 8800 GTX and a 9800 GX2.

Thanks in advance.

Using more than 8KB of shared memory per block limits execution to 1 block per multiprocessor at all times, since there is only 16KB total per MP.
So I'm guessing you're seeing the fallout of this: the scheduler is unable to hide memory fetch latency because it doesn't have any other blocks to juggle.
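A minimal sketch of that arithmetic, assuming the 16KB of shared memory per multiprocessor of these compute 1.x parts (and ignoring the small amount the runtime reserves for itself):

// How many blocks can be resident per multiprocessor, limited by shared
// memory alone (registers and thread counts impose their own limits too).
size_t smemPerMP     = 16 * 1024;  // G80/G92-class multiprocessor
size_t smemPerBlock  = 8208;       // e.g. 513 * sizeof(float4)
unsigned blocksPerMP = (unsigned)(smemPerMP / smemPerBlock);  // = 1
// With 4096 bytes per block this would be 4 blocks per MP, so the
// scheduler has other warps to run while one block waits on memory.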

Now I see! It totally makes sense. Thanks for answering me.

Anyway, I am not using shared memory anymore. I "solved" my problem by declaring as many individual variables as I need (the drawback is having to write more than one kernel to suit different K's). For example, if K=16, I'll call a kernel with 20 separate variables to hold the distances, instead of an array of 20 (which would be placed in local memory). The performance gains were awesome!
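Just to illustrate the trick (this is a minimal sketch with made-up names and a tiny K, not my actual KNN kernel):

// A dynamically indexed array like "float dist[K]" tends to be spilled to
// slow local memory, so each distance gets its own variable, which the
// compiler can keep in registers. One such kernel is written per K.
__global__ void knnK4(const float4* points, unsigned numPoints /*, ... */)
{
	// Four scalar "slots" instead of float dist[4]:
	float d0 = 1e37f, d1 = 1e37f, d2 = 1e37f, d3 = 1e37f;

	// ... for each candidate point, compute its distance and insert it
	// into the sorted set {d0, d1, d2, d3} with explicit compares/swaps ...
}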

Now I can perform a KNN search with K=50 for each of those 54K points in no more than 50ms (with a thread block configuration of 128 threads, 9800 GX2).

That seems slow, just from a rough mental estimate, comparing it to some of the tree builds and searches I do for raytracing (which isn't KNN but still does a lot of "what's near me" sorting). What kind of strategy are you using? Building a KD tree then walking it for each point with a stack?

Gridding it and examining nearby voxels?

Oh sorry, I didn't express myself clearly… It is not just one KNN search.
For each point in the point cloud, I am searching for its 50 nearest neighbors in the very same point cloud.
So, I am performing 54,000 KNN searches (with K=50) in 50ms (my previous kernel ran in 163ms).
For a small comparison, when K=10, I get around 11ms in the search stage (it’s kinda linear).
The data structure is a grid, and I am constructing it in less than 15ms.
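For reference, the usual way to map a point to a cell in such a uniform grid looks roughly like this (a generic sketch; the names and the exact indexing scheme are illustrative, not taken from my code):

// Map a point to a linear cell index in a uniform grid over the bounding box.
// gridMin, cellSize and gridRes describe the grid; all names are illustrative.
__device__ unsigned cellIndex(float4 p, float3 gridMin, float cellSize,
                              uint3 gridRes)
{
	unsigned cx = (unsigned)((p.x - gridMin.x) / cellSize);
	unsigned cy = (unsigned)((p.y - gridMin.y) / cellSize);
	unsigned cz = (unsigned)((p.z - gridMin.z) / cellSize);
	return cx + cy * gridRes.x + cz * gridRes.x * gridRes.y;
}
// After sorting the points by cell index (presumably the role of the hashes
// in the rearrangeBuffers kernel above), each KNN query only has to examine
// the points stored in its own cell and the neighboring cells.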

Actually you were clear. I was just making some rough guesses about how long I'd expect it to take, and I underestimated.

Checking with Mr. Anderson's HOOMD molecular dynamics paper, he's got a computational step which generates neighbor lists, not KNN but all neighbors within a finite distance, with an average list size of about 30. For about 50K particles, HOOMD takes about 25ms (see his figure 7), so you're in the same range. [And Mr. Anderson can glare at me for comparing two algorithms which aren't on the same hardware, aren't generating the same data, and have different domains and assumptions…] But anyway, you're "in the ballpark" for sure.