Inexplicable performance drop when scaling input size in odd-even sort algorithm


Someone posted a code snippet for an odd-even sort algorithm in the General CUDA forum, asking for help. I fixed some minor bugs in their kernel, then added an outer loop that calls the kernel multiple times so that the sort works correctly across multiple blocks and can handle large arrays.
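The original forum code isn't reproduced here, but for context, one phase of an odd-even transposition sort plus the outer host loop might look roughly like this (my own sketch, not the poster's kernel; names and launch configuration are hypothetical):

```cuda
// One phase of odd-even transposition sort: each thread compares one pair.
// phase == 0 -> even pairs (0,1), (2,3), ...; phase == 1 -> odd pairs (1,2), (3,4), ...
__global__ void oddEvenPhase(float *data, int n, int phase)
{
    int i = 2 * (blockIdx.x * blockDim.x + threadIdx.x) + phase;
    if (i + 1 < n) {
        float a = data[i];
        float b = data[i + 1];
        if (a > b) {                 // swap out-of-order neighbors
            data[i]     = b;
            data[i + 1] = a;
        }
    }
}

// Host-side outer loop: n phases guarantee a fully sorted array,
// and the global barrier between kernel launches lets values
// migrate across block boundaries.
void oddEvenSort(float *d_data, int n)
{
    int threads = 256;
    int pairs   = n / 2;
    int blocks  = (pairs + threads - 1) / threads;
    for (int phase = 0; phase < n; ++phase)
        oddEvenPhase<<<blocks, threads>>>(d_data, n, phase % 2);
}
```

The O(n) kernel launches over O(n) comparisons each are what gives the algorithm its quadratic complexity.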

For input arrays of up to ~64k float elements, the algorithm running on an 8800GT card actually beats my CPU (quicksort on a single core of an Intel Q6600) by a factor of 20. But as soon as I input, say, 131k elements, performance inexplicably drops by a factor of about 1000. I would have expected a drop of only a factor of 4, because the algorithmic complexity is expected to scale quadratically.

Have a look here,

Would any CUDA expert have an explanation for this bizarre performance collapse?

Is there any cache between the device memory and the GPU that I am not aware of, and that I accidentally started to spill by increasing the input size?

UPDATE: I narrowed it down to about 80,000 floats (about 320 KB of data) that I can sort on this 8800GT before performance begins to degrade sharply. This is clearly an effect of the NVIDIA hardware… but which part of the hardware could be responsible?


I guess that’s because much more data is being stored in the card’s local memory, not in shared memory.

I did some research into register spilling. Maybe register spill depends on the grid size somehow: more threads in flight could mean fewer available registers per thread, so registers spill into local memory.

But… apparently it is not the size of the grid at kernel invocation that causes the slowdown. I tried breaking large grid invocations into many smaller pieces, each processing at most 80,000 floats (roughly where performance starts to drop). But alas, even with many small kernel invocations, the total performance still went down the drain, by a factor somewhere between 100 and 1000.

I think shared memory use is also ruled out, because I know exactly how much shared memory I need per block (4 KB), and I am bounded only by the 768-thread limit per SM.

So I am still puzzled, and looking for an explanation.

Kernel calls are asynchronous.

I was therefore not measuring computation time, but rather the time it took the asynchronous API function to launch the kernel.

I’ve now included the memory copy operations in the measurement intervals, and the results now make sense.
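For reference, one way to time asynchronous kernel launches correctly is with CUDA events, which record timestamps on the GPU and let the host wait for completion before reading the elapsed time (a minimal sketch; `oddEvenSort` and the variable names are hypothetical):

```cuda
// Time the GPU work itself, not just the launch overhead.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
oddEvenSort(d_data, n);        // asynchronous kernel launches
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);    // block the host until the GPU is done

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
```

Alternatively, calling `cudaThreadSynchronize()` before stopping a host-side timer achieves the same effect on CUDA 1.x-era toolkits.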

Case closed, I made a beginner’s mistake.

Not a beginner’s mistake, we all forget. :-)

But this also means that your original speed estimates vs. the CPU were probably wildly optimistic. The CPU will clearly win for “small” sorts just because of the GPU memory-transfer overhead; for larger N, the GPU might win quite nicely.

Thanks for the follow-up, many people don’t bother updating a thread with their answer!