Kernels fail after a certain number of blocks

Hi all,

I ported a 1-dimensional grid-based PDE solver from CPU to GPU with CUDA, and everything works just fine except when I launch more than about 40 blocks. I'm using 128 threads per block, and the 1D grid is chopped into

int nblocks = (int)((N + ((THREADSPERBLOCK - 1) - 1)) / (THREADSPERBLOCK - 1));

blocks (where N is the total grid size). Each kernel launch is followed by an error check like



if ( cudaSuccess != cudaGetLastError() )

  printf( "Error: Kernel Failure - Loader_Kernel\n" );

For N = 5000 and below, the code runs fine. For N = 6000 and above, the error-handling routine shown above reports a failure for a kernel devoted to reading memory (I use one kernel to load the memory arrays and another to handle the data, due to problems with block overlap). Does anyone know why this might occur? Previous, less satisfactory versions of the code did not suffer from this problem, so I do not believe it is a lack of memory on my GPU (a GTX 460). The program faults in the memory-reading kernel, and cuda-memcheck reports:

========= Invalid __global__ read of size 8

=========     at 0x00003720 in ../code_folder/

=========     by thread (127,0,0) in block (41,0,0)

=========     Address 0xffbfffff is misaligned


========= ERROR SUMMARY: 1 error

The memory-loading kernel itself gives no “Error: Kernel Failure” and does not fault.

Does anyone know what could cause this? I am pretty well in the dark here.



EDIT: clarity

We are pretty much in the dark as well - can you post some code (particularly line 3087, where the misaligned access is happening)?
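One other thing worth checking in the meantime: cudaGetLastError() right after a launch only catches launch-configuration problems. A fault inside a kernel is asynchronous and only surfaces at the next synchronizing CUDA call, so it can end up attributed to a later kernel. A debug-time sketch (MyKernel and the launch parameters are placeholders, not your code):

```cuda
// Debug pattern: synchronize after each launch so a device-side fault is
// reported against the kernel that actually caused it.
MyKernel<<<nblocks, THREADSPERBLOCK>>>(/* args */);

cudaError_t err = cudaGetLastError();   // launch-configuration errors
if (err == cudaSuccess)
    err = cudaDeviceSynchronize();      // faults from inside the kernel
if (err != cudaSuccess)
    printf("Error: MyKernel - %s\n", cudaGetErrorString(err));
```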

Sorry for the lack of information; I assumed this problem was more general. The offending line is

vec[n] = pV[i+1][n] - pV[i-1][n];

specifically, reading the [i+1] element of pV. Here i is the array element index and n loops over the individual members of a structure, in this case a struct of a few doubles. cuda-memcheck always reports the error for thread (127,0,0) in block (41,0,0). N = 6000 gives 48 blocks, but the same thread in the same block appears to be responsible for larger grids as well.

Note that pV is an array of pointers defined as

__constant__ double **pV;

and allocated via

double **pV_ptr;

  cudaErrorCheck(cudaMalloc((void **)&pV_ptr, gridsize * sizeof(double *)),

      "cudaMalloc - pV_ptr");

  cudaErrorCheck(cudaMemcpyToSymbol(pV, &pV_ptr,

      sizeof(pV_ptr)), "pV - cudaMemcpyToSymbol");

pV is initialized at the start of the memory-accessing kernel like this:

if (i >= istart && i <= iend)

    pV[i] = (double*)&(V[i]);

where each thread has a unique i. For N = 7000 and similar sizes, the i corresponding to the faulting thread is comfortably between istart and iend.

Hope this reveals the problem, although it has not for me…


Oh dear, how could I have been so myopic… Of course this leads to a race condition. Sorry to bother everyone.
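For anyone who hits this thread later, the pattern at fault, sketched (the index mapping and names here are illustrative, not the actual code):

```cuda
__global__ void handler(double *V, double **pV, int istart, int iend)
{
    // Illustrative mapping -- the real code's indexing may differ.
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread publishes its own pointer...
    if (i >= istart && i <= iend)
        pV[i] = (double *)&V[i];

    // ...but there is no device-wide barrier here.  __syncthreads() only
    // orders threads within one block, and for the last thread of a block
    // pV[i+1] is written by a thread in the NEXT block, which may not have
    // run yet.  The read below therefore races with that write and can see
    // an uninitialized (garbage, misaligned) pointer:
    //
    //     vec[n] = pV[i+1][n] - pV[i-1][n];
}
```

Moving the pV[i] setup into its own kernel fixes it: kernel launches issued to the same stream are implicitly ordered, which provides the grid-wide barrier that __syncthreads() cannot.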