My CUDA program fails for larger grids

Hi

I will try to make this a somewhat more general question, because my code has become too large to post here.

My problem: I have an implementation of a radix sort algorithm based on a paper by Mark Harris. The code works fine for 1D grids of up to 14 blocks. The thing is, 14 blocks is exactly the number of multiprocessors my GTX 470 has.

But when I use more than 14 blocks, my data comes out random after sorting (the sort is broken). I use 256 threads per block and each thread handles 4 elements, so 14 blocks correspond to 14336 elements in the array to be sorted. If I increase the array to 14337 elements (only one more than 14336), which requires 15 blocks, the sort completely collapses. I have also tried 15360 elements, which is exactly 15 full thread blocks, but the sort is still broken.
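
To be concrete, the launch configuration is roughly like this (the kernel name and arguments are just placeholders, not my actual code):

```
const int THREADS_PER_BLOCK = 256;
const int ELEMS_PER_THREAD  = 4;
const int ELEMS_PER_BLOCK   = THREADS_PER_BLOCK * ELEMS_PER_THREAD;  // 1024 elements per block

int numElements = 14337;  // one more than 14 * 1024 = 14336
int numBlocks   = (numElements + ELEMS_PER_BLOCK - 1) / ELEMS_PER_BLOCK;  // = 15 -> sort breaks

radixSortKernel<<<numBlocks, THREADS_PER_BLOCK>>>(d_keys, d_values, numElements);
```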

Does anyone have any idea what could cause this? I have been checking and debugging my code for almost a whole week without success. I have debugged my kernels individually and each of them seems to work.

I appreciate any help, thx!

Hard to tell, but with more blocks than multiprocessors, some blocks run after others have finished. This is a problem if there is a data dependency between blocks; in that case you need more synchronization than __syncthreads() can offer. I'm guessing that the problem somehow does not show up when all blocks execute simultaneously, but I don't see an actual explanation.
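
To illustrate what I mean (a made-up example, not your code): if a block reads something that another block is supposed to write, __syncthreads() does not help, because it only synchronizes threads within one block:

```
__global__ void dependsOnAnotherBlock(int *blockTotals)
{
    // __syncthreads() only synchronizes the threads of *this* block.
    __syncthreads();

    if (threadIdx.x == 0 && blockIdx.x > 0) {
        // Unsafe: there is no guarantee about the order in which blocks run
        // or when their global-memory writes become visible, so this read
        // may see stale data from block blockIdx.x - 1.
        blockTotals[blockIdx.x] += blockTotals[blockIdx.x - 1];
    }
}
```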

Maybe provide the code as an attachment, so we can gather some more evidence, for example whether the multiprocessor limit reproduces on other GPUs?

Further, you should take a look at http://code.google.com/p/back40computing/wiki/RadixSorting; it has also been brought up in the forum, but I can't find the topic right now.

Hi Jan, thanks for replying!

You don't mean this thread, http://forums.nvidia.com/index.php?showtopic=175238&hl= , do you? I would appreciate it if the topic could be found.

About data dependency: do you mean launching a subsequent kernel that needs data produced as the result of a previous kernel? If that is the case, aren't all blocks synchronized when a kernel finishes? I think I read that somewhere.
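
If I understand the programming model correctly, something like this (made-up kernel names, just to show what I mean) should be safe without any extra inter-block synchronization:

```
scanKernel<<<numBlocks, 256>>>(d_in, d_scanned, numElements);
// Kernels launched into the same (default) stream run one after another,
// so every block of scanKernel has finished and its global-memory writes
// are visible before scatterKernel starts.
scatterKernel<<<numBlocks, 256>>>(d_scanned, d_out, numElements);
```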

Unfortunately my code uses too much shared memory for compute capabilities other than 2.0. I think the implementation is poor, and I have found some improvements that reduce the amount of shared memory, so I can test on another GPU (compute capability 1.1), but then the code will differ from the previous version. If the problem still persists there, it will be more evident that it is the number of streaming multiprocessors that causes the sort to fail.

I have attached my inefficient code. I would appreciate it if you could read through it and see whether any logical errors have been made. The prefix_sum_kernel.cu and ScanDispatcher.cu code is tested and works. Ignore the unnecessary __syncthreads() calls (they are there because of desperate debugging).
ScanDispatcher.cu (16.9 KB)
prefix_sum_kernel.cu (37 KB)
radix_sort.cu (44.7 KB)

Thanks for posting the code. I could not compile it because some classes/header files are not included.

There is a lot of code, a bit much to grasp in a short time.

I have one hint, although perhaps you have considered this already. You use lots of __syncthreads() calls, some patently unneeded, but I didn't see a __threadfence() call to synchronize access to global memory between blocks, as used, for example, in single-pass reduction schemes. Perhaps checking section B.5 of the programming guide and the threadFenceReduction example will give you the information you need to improve cooperation between blocks.
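
Roughly, the pattern from the threadFenceReduction example looks like this (a simplified sketch from memory, not taken literally from the SDK; names are illustrative):

```
__device__ unsigned int blocksDone = 0;

__global__ void sumKernel(const float *in, float *partialSums, float *result, int n)
{
    __shared__ float sdata[256];       // assumes blockDim.x == 256
    __shared__ bool  isLastBlock;

    // Each block reduces its chunk into sdata[0].
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0) {
        partialSums[blockIdx.x] = sdata[0];
        // Make the partial sum visible to all other blocks *before*
        // signalling that this block is done.
        __threadfence();
        unsigned int done = atomicInc(&blocksDone, gridDim.x);
        isLastBlock = (done == gridDim.x - 1);
    }
    __syncthreads();

    // The last block to finish sums up all the partial results.
    if (isLastBlock && threadIdx.x == 0) {
        float total = 0.0f;
        for (int b = 0; b < gridDim.x; ++b)
            total += partialSums[b];
        *result = total;
        blocksDone = 0;   // reset for the next launch
    }
}
```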

Thanks for the hint, Jan.

I think that __threadfence() is not needed in my case, because no thread accesses elements of an array (in shared or global memory) that another thread has modified: threads from different blocks do not need data which other blocks have modified, and each thread processes a unique set of array indices. I might be thinking about this wrong. For the moment I am rewriting the sort, and I will see if the problem still persists after the rewrite. I will report my conclusions.

Thx

Hi again. I have made some progress with my radix sort implementation. The newest version works and uses less shared memory, which means fewer shared memory reads and writes.

The problem was that I had a single global key array and a single global value array that I scattered into in place, and that seemed to mess up my data (I cannot see why, because the scatter is a perfect permutation and I use registers/shared memory as temporary storage for the scatter). Scattering into these single arrays corrupted my sort whenever I had more blocks than multiprocessors. I tried it on my GeForce 9650M GT and my GeForce GTX 470, and both failed above their multiprocessor count (>4 and >14 blocks respectively).

I did the same as CUDPP and introduced temporary key and value arrays: the 4-bit split operation (a scatter at the end) writes into these temporary arrays, and at the end of one radix sort iteration I scatter the temporaries back into the original key/value arrays. This ensures that my original key and value arrays do not become corrupted.
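
Roughly, the host-side structure is now like this (simplified sketch, the kernel names and types are not the actual ones in my code):

```
unsigned int bytes = numElements * sizeof(unsigned int);

for (int bit = 0; bit < 32; bit += 4) {
    // The 4-bit split scatters from the original arrays into the temporaries,
    // so no block ever overwrites input that another block still has to read.
    splitScatterKernel<<<numBlocks, 256>>>(d_keys, d_vals,
                                           d_keysTmp, d_valsTmp,
                                           numElements, bit);

    // Copy the temporaries back before the next pass (swapping the pointers
    // every pass would avoid these copies entirely).
    cudaMemcpy(d_keys, d_keysTmp, bytes, cudaMemcpyDeviceToDevice);
    cudaMemcpy(d_vals, d_valsTmp, bytes, cudaMemcpyDeviceToDevice);
}
```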
