I will try to make this a somewhat more general question, because my code has become too large to post here.
My problem is that I have an implementation of a radix sort algorithm based on a paper by Mark Harris. The code works fine for 1D grids of up to 14 blocks, and 14 blocks is exactly the number of multiprocessors that my GTX 470 has.
But when I use more than 14 blocks, my data comes out random after sorting (the sort is broken). I use 256 threads per block and each thread handles 4 elements, so each block covers 1024 elements and 14 blocks cover 14336 elements of the array to be sorted. If I increase that to 14337 (just one more element than 14336), I need 15 blocks and the sort totally collapses. I have also tried 15360 elements, which is exactly 15 full thread blocks, but the sort is still broken.
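To make the sizes concrete, this is the arithmetic behind those numbers. It is only a minimal sketch with hypothetical names (`kThreadsPerBlock`, `kElementsPerThread`, `blocksFor`), not my actual launch code:

```cpp
#include <cstddef>

// Hypothetical constants matching the configuration described above.
constexpr std::size_t kThreadsPerBlock   = 256;
constexpr std::size_t kElementsPerThread = 4;
constexpr std::size_t kElementsPerBlock  = kThreadsPerBlock * kElementsPerThread;  // 1024

// Number of thread blocks needed to cover n elements (ceiling division).
constexpr std::size_t blocksFor(std::size_t n) {
    return (n + kElementsPerBlock - 1) / kElementsPerBlock;
}
```

With these values, `blocksFor(14336)` is 14, while both `blocksFor(14337)` and `blocksFor(15360)` are 15, so the failure shows up both with a partially filled last block and with 15 completely full blocks.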
Does anyone have any idea what could cause this? I have been checking and debugging my code for almost a whole week without success. I have debugged my kernels individually and they seem to work.
I appreciate any help, thanks!