memcheck errors on array initialization

I have a kernel in which each thread does some processing and sets values in an array. This kernel occasionally dies with an “unspecified launch failure”. Running it under cuda-memcheck reports a “Lane Illegal Address” (and sometimes other errors, such as “Device Illegal Address” or “Warp Out-of-range Address”). I have tried running in cuda-gdb with memcheck enabled and autostepping through the kernel to find the context of the failure, but so far the autostep stops on a line of code that involves only thread-local variables.

I have reduced the code to a very simple example that exhibits the problem; it is attached. This code does not produce consistent results; on each run, it does one of the following:

  • runs to completion with all array values correctly set
  • runs to completion, reporting that some array values have not been set (a different number of unset values each run)
  • dies with an unspecified launch failure

I believe the array should be initialized properly, without memory errors or crashes. Even in the cases where the program runs to completion, cuda-memcheck still reports memory errors.

Can anyone see why this code might not consistently initialize the array elements? Or why it would encounter a “Lane Illegal Address”?

Running with cuda-memcheck produces some variant of the following. The addresses it reports as being out of range are, in fact, within the range of the allocated memory.

cuda-memcheck test
========= CUDA-MEMCHECK
num threads : 10240 (512 per block, 20 blocks per grid)
error in cudaMemcpy: unspecified launch failure
========= Invalid global write of size 4
========= at 0x00000170 in test.cu:15:doit
========= by thread (20,0,0) in block (5,0,0)
========= Address 0x2000fbf4c is out of bounds

========= Invalid global write of size 4
========= at 0x00000170 in test.cu:15:doit
========= by thread (65,0,0) in block (18,0,0)
========= Address 0x0058a594 is out of bounds

========= Invalid global write of size 4
========= at 0x00000170 in test.cu:15:doit
========= by thread (78,0,0) in block (18,0,0)
========= Address 0x0058b9e4 is out of bounds

========= Invalid global write of size 4
========= at 0x00000170 in test.cu:15:doit
========= by thread (83,0,0) in block (18,0,0)
========= Address 0x0058c1b4 is out of bounds

========= Invalid global write of size 4
========= at 0x00000170 in test.cu:15:doit
========= by thread (87,0,0) in block (18,0,0)
========= Address 0x0058c7f4 is out of bounds

========= Invalid global write of size 4
========= at 0x00000170 in test.cu:15:doit
========= by thread (90,0,0) in block (18,0,0)
========= Address 0x0058cca4 is out of bounds

========= ERROR SUMMARY: 6 errors

This is on a GTX 580 with CUDA 4.2, compiled for compute capability 2.0:

nvcc -o test test.cu -arch=compute_20 -code=sm_20
test.cu (1.33 KB)

numThreads is not a multiple of the block size, so you need to explicitly disable the threads in the last block that have i >= numThreads:

__global__ void doit(int *a, int num, int numPer, int numThreads)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    if (i < numThreads) {
        int start  = i * numPer;
        int finish = start + numPer;

        for (int j = start; j < finish; j++)
            if (j < num)
                a[j] = FLAG;
    }
}