Invalid memory read on shared memory

Hello Everybody!

I am trying to write a work-efficient parallel scan based on http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html
I use a 16-element int array containing only 1s. The algorithm in GPU Gems has two phases, an up-sweep and a down-sweep. The up-sweep works fine. The down-sweep gets stuck on the

temporary = shared_memory_ls[ai];

line. I used cuda-gdb and found an invalid memory read, so I ran memcheck. It says:

========= Invalid __shared__ read of size 4
=========     at 0x00000b70 in /home/hooger/Dokumentumok/prog/c/cuda_test.cu:49:exclusive_scan(int const *, int*, int)
=========     by thread (3,0,0) in block (0,0,0)
=========     Address 0x000000dc is out of bounds

followed by a host backtrace. It happens for threads (2,0,0) through (7,0,0).

I use the current Debian Linux (as of today), with CUDA 6 and NVIDIA driver 340. These are the stable versions, so I consider a driver bug very unlikely.

lspci |grep VGA
01:00.0 VGA compatible controller: NVIDIA Corporation GF116 [GeForce GTS 450 Rev. 2] (rev a1)

Minimal example, producing the error:

#include <iostream>
#define LENGTH 16 // set the length of the array

__global__ void true_false(int* list, int len)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    while(idx < len)
    {
        list[idx] = 1;
        idx += gridDim.x * blockDim.x;
    }
}

__global__ void exclusive_scan(const int* input, int* output, int len)
{
    extern __shared__ int shared_memory_ls[];
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    int offset = 1;
    int ai = 0, bi = 0, temporary = 0;
    shared_memory_ls[2 * idx] = input[2 * idx]; // A
    shared_memory_ls[2 * idx + 1] = input[2 * idx + 1]; // A

    for(int d = (len / 2); d > 0; d /= 2)
    {
        __syncthreads();
        if(idx < d)
        {
            ai = offset * (2 * idx + 1) - 1; // B
            bi = offset * (2 * idx + 2) - 1; // B
            shared_memory_ls[bi] += shared_memory_ls[ai];
        }
        offset *= 2;
    }

    if(idx == 0)
    {
        shared_memory_ls[len - 1] = 0;
    } // C
    for(int d = 1; d < 2; d *= 2) /* It should be d < len, but d < 2 already produces the error and makes debugging easier */
    {
        offset /= 2;
        __syncthreads();
        if(idx < d);
        {
            ai = offset * (2 * idx + 1) - 1; // D
            bi = offset * (2 * idx + 2) - 1; // D

            temporary = shared_memory_ls[ai];
            shared_memory_ls[ai] = shared_memory_ls[bi];
            shared_memory_ls[bi] += temporary;
        }
    }

    __syncthreads();
    output[2 * idx] = shared_memory_ls[2 * idx]; // E
    output[2 * idx + 1] = shared_memory_ls[2 * idx + 1]; // E
}

int main()
{
    cudaError_t errMsg;
    int* host_ls;
    int* device_ls_in;
    int* device_ls_out;
    host_ls = new int[LENGTH];
    cudaMalloc((void**)&device_ls_in, sizeof(int) * LENGTH);
    cudaMalloc((void**)&device_ls_out, sizeof(int) * LENGTH);

    for(int i = 0; i < LENGTH; ++i)
        host_ls[i] = 1;

    cudaMemcpy(device_ls_in, host_ls, sizeof(int) * LENGTH, cudaMemcpyHostToDevice);

    exclusive_scan<<<1, LENGTH/2, sizeof(int)*LENGTH>>>(device_ls_in, device_ls_out, LENGTH);
    errMsg = cudaMemcpy(host_ls, device_ls_out, sizeof(int) * LENGTH, cudaMemcpyDeviceToHost);
    if(errMsg)
        std::cout << cudaGetErrorString(errMsg) << std::endl;

    for(int i = 0; i < LENGTH; ++i)
        std::cout << host_ls[i] << ", ";
    std::cout << std::endl;

    errMsg = cudaFree(device_ls_in);
    errMsg = cudaFree(device_ls_out);
    delete[] host_ls;
    return 0;
}

Both cudaFree calls produce the same error as the cudaMemcpy after the kernel launch. Running the code on an 8-element array produces the expected result with no errors (change the LENGTH define and the down-sweep loop bound according to the comments to reproduce).

Any idea what I am doing wrong?

Wow, that took me a while to find.

You have an errant semicolon here:

if(idx < d);

it should be:

if(idx < d)

And the Sherlock Holmes Award goes to … :-)

One’s mind auto-corrects, so one reads what the code’s author intended rather than what they actually wrote. This is why I prefer to place the opening curly brace on the same line as the condition rather than on the next.

Geez… It is so embarrassing! :( Thank you very much for the help, guys!