Someone posted a code snippet for an odd-even sort algorithm in the General CUDA forum, asking for help. I fixed a few minor bugs in his kernel, then added an outer host loop that calls the kernel repeatedly, so the sort works correctly across multiple blocks and can handle large arrays.
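In case it helps to see the structure, here is a minimal sketch of that approach (not the original poster's code; the kernel name, the 256-thread block size, and the host wrapper are my own). Each launch performs one odd-even phase over the whole array, and the host loop runs n phases so elements can also migrate across block boundaries:

```cuda
// One phase of odd-even transposition sort: each thread owns one
// compare-exchange pair. Even phases compare (0,1),(2,3),...; odd
// phases compare (1,2),(3,4),...
__global__ void oddEvenPhase(float *data, int n, int phase)
{
    int idx = 2 * (blockIdx.x * blockDim.x + threadIdx.x) + (phase & 1);
    if (idx + 1 < n) {
        float a = data[idx];
        float b = data[idx + 1];
        if (a > b) {                 // swap out-of-order neighbours
            data[idx]     = b;
            data[idx + 1] = a;
        }
    }
}

// Host loop: n phases guarantee a fully sorted array, and because each
// phase is a separate kernel launch, the sort crosses block boundaries.
void oddEvenSort(float *d_data, int n)
{
    int threads = 256;
    int pairs   = n / 2;
    int blocks  = (pairs + threads - 1) / threads;
    for (int phase = 0; phase < n; ++phase)
        oddEvenPhase<<<blocks, threads>>>(d_data, n, phase);
    cudaDeviceSynchronize();
}
```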
For input arrays of up to ~64k float elements, the algorithm running on an 8800GT card actually beats my CPU (Intel Q6600, quicksort on a single core) by a factor of 20. But as soon as I feed it, say, 131k elements, performance inexplicably drops by about a factor of 1000. I would have expected a drop of only about a factor of 4, since the algorithmic complexity is expected to scale quadratically (doubling the element count quadruples the work).
Have a look here:
Would any CUDA expert have an explanation for this bizarre performance collapse?
Is there any cache between the device memory and the GPU that I am not aware of and that I accidentally started to spill by increasing the input size?
UPDATE: I have narrowed it down to about 80,000 floats (roughly 320 KB of data) that I can sort on this 8800GT before performance begins to degrade sharply. This is clearly an effect of the nVidia hardware… but which part of the hardware could be responsible?
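For anyone who wants to reproduce the measurement, a timing sweep along these lines should show the collapse (a sketch that assumes the oddEvenSort host function from the snippet above; the sizes and step factor are arbitrary):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

void oddEvenSort(float *d_data, int n);   // host loop from the earlier sketch

int main()
{
    for (int n = 16384; n <= 262144; n *= 2) {
        // Fill a host buffer with random values and copy it to the device.
        float *h_data = (float *)malloc(n * sizeof(float));
        for (int i = 0; i < n; ++i)
            h_data[i] = (float)rand() / RAND_MAX;

        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));
        cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

        // Time the full sort (all kernel launches) with CUDA events.
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);
        oddEvenSort(d_data, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("n = %7d: %10.2f ms\n", n, ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d_data);
        free(h_data);
    }
    return 0;
}
```

With quadratic scaling, each doubling of n should roughly quadruple the time; the size where the printed numbers jump far beyond that is the threshold I am describing.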