Inexplicable performance drop when scaling input happening in odd-even sort algorithm

cbuchner1 · July 16, 2008, 5:36pm

Hi,

Someone has posted a code snippet for an odd-even sort algorithm in the General CUDA forum, asking for help. I fixed some minor bugs in his kernel. Then I added an outer loop that calls the kernel multiple times such that I can sort correctly across multiple blocks and sort large arrays.

For input array up to size ~64k float elements, the algorithm running on an 8800GT card actually beats my CPU (Intel Q6600 CPU, quicksort algorithm on a single core) by a factor of 20. But as soon as I input, say 131k elements the performance drops inexplicably by about a factor of 1000. I would have expected a drop of factor 4 only because the algorithmic complexity is expected to scale quadratically.

Have a look here,

[url=“The Official NVIDIA Forums | NVIDIA”]The Official NVIDIA Forums | NVIDIA

Would any CUDA expert have an explanation for this bizarre performance collapse?

Is there any cache between the device memory and the GPU that I am not aware of and that I accidentially started to spill by increasing input size?

UPDATE: I narrowed it down to about 80000 floats (about 320 kb of data) that I can sort on this 8800GT before performance begins to degrade sharply. This clearly is an effect of the nVidia hardware… but which part of the hardware could be responsible?

Christian

BarsMonster · July 16, 2008, 5:41pm

I guess that’s because much more data is being stored in card local memory, not in shared.

cbuchner1 · July 16, 2008, 5:50pm

I did some research into register spill. Maybe register spill does depend on the grid size somehow. More generated threads could mean less available registers per thread and so we spill registers into local memory.

But … apparently it is not the size of the grid when invoking the kernel that is causing the slowdown. I tried chunking down large grid invocations into many smaller pieces, each processing at most 80000 floats (that is about where performance starts to drop). But alas, even when using only many small kernel invocations the total performance went down the drain. By a factor somewhere between 100-1000.

I think shared memory use is also ruled out because I know exactly how much shared memory I need per block (4KB) and I am bounded by the 768k thread limit per SM only.

So, still, I am puzzled, looking for an explanation

cbuchner1 · July 17, 2008, 8:54am

Kernel calls are asynchronous.

I was therefore not measuring computation time but rather the time it took the asynchronous API function to start launching the kernel.

I’ve now included the memory copy operations in the measurement intervals and results now make sense.

Case closed, I made a beginner’s mistake.

SPWorley · July 17, 2008, 4:07pm

Not a beginner’s mistake, we all forget. :-)

But this also means that your original speed estimates vs the CPU were probably wildly optimistic. the CPU will clearly win for “small” sorts just because of the GPU memory transfer overhead. For larger N, the GPU might win quite nicely.

Thanks for the followup, many people don’t bother with updating a thread with their answer!

Topic		Replies	Views
Sorting performance peculiarity CUDA Programming and Performance	2	4256	March 5, 2009
help me understand `odd' performance CUDA Programming and Performance	5	1670	June 18, 2010
A few questions on CUDA performance with pictures! CUDA Programming and Performance	6	3349	January 10, 2009
Error question regarding CUDA Sort CUDA Programming and Performance	6	555	August 31, 2019
Strange performance by varying the block size CUDA Programming and Performance	6	704	February 21, 2020
Faulty sort algorithm. Please help! Odd even sort CUDA Programming and Performance	9	11536	July 26, 2008
Expected performance CUDA Programming and Performance	7	1189	April 12, 2013
Simple test, unexpected results: more calculations in each thread, less GPU occupancy time! CUDA Programming and Performance	5	1127	May 27, 2013
Increasing register usage without decreasing occupancy drops speed dramatically CUDA Programming and Performance	3	964	May 24, 2011
Parallel reduction not as fast as nVidia's no idea why - can anyone figure this one out? CUDA Programming and Performance	2	2310	August 12, 2009

Inexplicable performance drop when scaling input happening in odd-even sort algorithm

Related topics