Tracking down invalid read and illegal memory access errors

Greetings,
I’m currently trying to implement a genetic algorithm using CUDA. I use the kernel below to evaluate each individual.

__global__ void evaluate(int * population,
                         int * distance, 
                         int * cost,
                         int nTowns,
                         int * d_index)
{
    int sum = 0;
    int t0, t1, idx;

    idx = threadIdx.x + blockIdx.x * blockDim.x;

    for (size_t i = 1; i < nTowns; i++) {
        t0 = idx * nTowns + (i - 1);
        t1 = idx * nTowns + i;
        
        sum = sum + distance[population[t0] * nTowns + population[t1]];
    }
    t0 = idx * nTowns + nTowns - 1;
    t1 = idx * nTowns;
    cost[idx] = sum + distance[population[t0] * nTowns + population[t1]];

    d_index[idx] = threadIdx.x;
}

This code fails occasionally, roughly 2-3 times out of 100 runs. When I ran it under cuda-memcheck, I got the following output:

GPUassert: an illegal memory access was encountered ga_tes_3a.cu 469
========= CUDA-MEMCHECK
========= Program hit cudaErrorIllegalAddress (error 77) due to "an illegal memory access was encountered" on CUDA API call to cudaDeviceSynchronize. 
...
GPUassert: unspecified launch failure ga_tes_3a.cu 469
========= CUDA-MEMCHECK
========= Invalid __global__ read of size 4
=========     at 0x000000e0 in evaluate(int*, int*, int*, int, int*)
=========     by thread (327,0,0) in block (7,0,0)
=========     Address 0x3c45c467c is out of bounds
...
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaDeviceSynchronize.

How can I track down this error? Any idea why it is happening?

I’m sorry if my English is bad.

For out-of-bounds addressing errors in kernel code (e.g. your “Invalid __global__ read of size 4”), cuda-memcheck can localize the error to a specific line of your kernel code if you compile with the -lineinfo switch.

This is an extremely useful debugging technique. A longer writeup/example is here:

http://stackoverflow.com/questions/27277365/unspecified-launch-failure-on-memcpy/27278218#27278218

Once you know the line of code, and the thread/block IDs:

=========     by thread (327,0,0) in block (7,0,0)

that may be enough for you to figure out what is going on, or you can add some directed printf statements to your kernel code:

if ((threadIdx.x == 327) && (blockIdx.x == 7)) printf("....", ...);
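Since the faulting read uses population to index into distance, it can also help to replay that one thread’s index arithmetic on the host, after copying population back with cudaMemcpy. The following is only a sketch under assumptions: firstBadRead and nIndividuals are names made up for illustration, and it assumes distance holds nTowns * nTowns entries, which the post does not state.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical host-side helper: replays the lookups that evaluate() performs
// for one global thread index `idx` and flags any town number outside [0, nTowns).
// `nIndividuals` (assumed name) is the number of tours actually stored in `population`.
// Returns -1 if every town number is in range, -2 if `idx` itself is out of range,
// otherwise the tour position i of the first offending (from, to) pair.
int firstBadRead(const std::vector<int>& population, int nTowns,
                 int nIndividuals, int idx)
{
    assert(nTowns > 0 && nIndividuals >= 0);
    assert(population.size() >=
           static_cast<std::size_t>(nTowns) * static_cast<std::size_t>(nIndividuals));

    if (idx < 0 || idx >= nIndividuals) {
        std::printf("idx %d is outside the %d stored individuals\n", idx, nIndividuals);
        return -2;
    }

    // Pair i is (tour[i], tour[(i + 1) % nTowns]); the last pair closes the tour,
    // which matches the extra lookup after the loop in the kernel.
    for (int i = 0; i < nTowns; i++) {
        int from = population[idx * nTowns + i];
        int to   = population[idx * nTowns + (i + 1) % nTowns];
        if (from < 0 || from >= nTowns || to < 0 || to >= nTowns) {
            std::printf("idx %d, position %d: towns (%d, %d) are outside [0, %d)\n",
                        idx, i, from, to, nTowns);
            return i;
        }
    }
    return -1;
}
```

Call it with idx computed as threadIdx.x + blockIdx.x * blockDim.x for the thread that cuda-memcheck reported. A return value of -2 would suggest the launch covers more threads than there are individuals, which the kernel as posted does not guard against.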

Or you can use one of the CUDA debuggers.

I haven’t tried that yet, because in other runs I got 3 errors that occurred in different threads. I’m sorry, I should have included these earlier.

========= CUDA-MEMCHECK
========= Invalid __global__ read of size 4
=========     at 0x000000e0 in evaluate(int*, int*, int*, int, int*)
=========     by thread (358,0,0) in block (7,0,0)
=========     Address 0x3061c46fc is out of bounds
...
========= CUDA-MEMCHECK
========= Invalid __global__ read of size 4
=========     at 0x000000e0 in evaluate(int*, int*, int*, int, int*)
=========     by thread (333,0,0) in block (7,0,0)
=========     Address 0x3061c46fc is out of bounds
========= CUDA-MEMCHECK
========= Invalid __global__ read of size 4
=========     at 0x000000e0 in evaluate(int*, int*, int*, int, int*)
=========     by thread (237,0,0) in block (7,0,0)
=========     Address 0x3061c46fc is out of bounds

The only things these 3 errors have in common are the memory address and the block the thread belongs to. I have updated the first post to include the memory address that is out of bounds. The errors in my first post occurred during a batch of 50 runs, and the 3 errors in this post occurred during another batch of 100 runs. In both batches I used 8 blocks of 512 threads each. The card is a GT 640.
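For reference, this is how the reported threads map to the idx value computed in the kernel. It is just the arithmetic from the kernel’s first statement, assuming 512 threads per block (the reports show thread numbers up to 358, so the block size must be larger than that). globalIndex and printFailingIndices are names made up for this sketch.

```cpp
#include <cstdio>

// Mirrors idx = threadIdx.x + blockIdx.x * blockDim.x from evaluate().
int globalIndex(int threadIdxX, int blockIdxX, int blockDimX)
{
    return threadIdxX + blockIdxX * blockDimX;
}

// Prints the global index of every failing thread reported by cuda-memcheck.
void printFailingIndices()
{
    const int blockDimX = 512;                   // assumed threads per block
    const int blockIdxX = 7;                     // every report names block 7
    const int failing[] = {327, 358, 333, 237};  // threadIdx.x values from the reports
    for (int t : failing)
        std::printf("thread %d, block %d -> idx %d\n",
                    t, blockIdxX, globalIndex(t, blockIdxX, blockDimX));
}
```

All four threads map to idx values between 3821 and 3942, out of 4096 threads in total.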

I have actually tried cuda-gdb, but I don’t know how to print an array’s contents given its memory address. Also, now that I’ve seen the error occur at a different memory address, I’m not sure where to look.
