I ported a 1-dimensional grid-based PDE solver from CPUs to GPUs with CUDA, and everything works fine except when I launch more than about 40 blocks. I’m using 128 threads per block, and the 1D grid is chopped into
int nblocks = (int)((N+((THREADSPERBLOCK-1)-1))/(THREADSPERBLOCK-1));
blocks (where N is the total grid size). Kernel launches look like
Kernel<<<nblocks,THREADSPERBLOCK>>>(...);
cudaThreadSynchronize();
if ( cudaSuccess != cudaGetLastError() )
    printf( "Error: Kernel Failure - Loader_Kernel\n" );
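The check above only reports that something failed, not what. A minimal sketch of a more informative variant follows, assuming a recent CUDA toolkit, where cudaThreadSynchronize() is deprecated in favor of cudaDeviceSynchronize(). The helper cudaOk, the placeholder kernel body, and the array d_data are hypothetical stand-ins for illustration, not the code from this post.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: print the CUDA error description if a call failed.
static bool cudaOk(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "Error: %s - %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Placeholder kernel (assumption): stands in for the real loader kernel and
// only touches elements that lie inside the array.
__global__ void Kernel(double *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = 0.0;
}

int main()
{
    const int THREADSPERBLOCK = 128;
    const int N = 6000;
    // Same block-count formula as in the post.
    const int nblocks = (N + ((THREADSPERBLOCK - 1) - 1)) / (THREADSPERBLOCK - 1);

    double *d_data = nullptr;
    if (!cudaOk(cudaMalloc(&d_data, N * sizeof(double)), "cudaMalloc")) return 1;

    Kernel<<<nblocks, THREADSPERBLOCK>>>(d_data, N);
    // Errors in the launch itself (e.g. an invalid configuration) are reported here.
    bool ok = cudaOk(cudaGetLastError(), "Kernel launch");
    // Errors raised while the kernel runs (e.g. invalid reads) are reported by the sync.
    ok = cudaOk(cudaDeviceSynchronize(), "Kernel execution") && ok;

    cudaFree(d_data);
    return ok ? 0 : 1;
}
```

Checking the launch and the synchronization separately, and printing cudaGetErrorString, should make it easier to tell a bad launch configuration from a fault inside the kernel.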
For N = 5000 and below, the code runs fine. For N = 6000 and above, the error handling routine shown above reports a failure for a kernel devoted to reading memory (I use one kernel to load memory arrays and another to handle the data, because of problems with block overlap). Does anyone know why this might occur? Earlier versions of my code, which did not perform entirely satisfactorily, did not suffer from this problem, so I do not believe the GPU (a GTX 460) is running out of memory. The program segfaults in the memory-reading kernel with the following cuda-memcheck message:
========= Invalid __global__ read of size 8
=========     at 0x00003720 in ../code_folder/code.cu:3087:Memory_Reading_Kernel
=========     by thread (127,0,0) in block (41,0,0)
=========     Address 0xffbfffff is misaligned
=========
========= ERROR SUMMARY: 1 error
The memory-loading kernel does not give an “Error: Kernel Failure” or segfault.
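To relate the grid sizes to the block index in the cuda-memcheck report, the block-count formula can be evaluated on the host. This is a small sketch in plain C++ (it also compiles as host code under nvcc); the function names numBlocks and smallestNForBlock are made up for illustration.

```cpp
// Block count from the post: each block advances by THREADSPERBLOCK-1 grid
// points, so this is a ceiling division of N by (THREADSPERBLOCK-1).
int numBlocks(int N, int threadsPerBlock)
{
    const int stride = threadsPerBlock - 1;
    return (N + (stride - 1)) / stride;
}

// Hypothetical helper: smallest N for which the zero-based block index
// `block` is launched, i.e. the smallest N with numBlocks(N) >= block + 1.
int smallestNForBlock(int block, int threadsPerBlock)
{
    const int stride = threadsPerBlock - 1;
    return block * stride + 1;
}
```

With 128 threads per block, numBlocks(5000, 128) is 40 (block indices 0 to 39) and numBlocks(6000, 128) is 48. Block 41, the one named in the report, is only launched once N reaches smallestNForBlock(41, 128) = 5208, which is consistent with N = 5000 running cleanly and N = 6000 failing.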
Does anyone know what could cause this? I am pretty well in the dark here.