Overwriting arrays in global memory space

Hi,

Yet another problem!

The following code example explains my problem:

[codebox]__global__ void test(float *A)
{
    // first index of the segment owned by this block
    int vecBase = __mul24(blockIdx.x, blockDim.x);

    A[vecBase + threadIdx.x] = threadIdx.x;
}

test<<<5, 5>>>(A);[/codebox]

A is a 5 by 5 matrix with random entries. I thought this code would produce five identical column vectors, (0 1 2 3 4)^T, but the matrix remains untouched.

I have no idea why. Are there any conflicts? What else do I have to consider?

Did you cudaMemcpy(…,…,…,cudaMemcpyDeviceToHost); after the execution of your kernel to copy the GPU’s results back to your system memory?

Yes, I did cudaMemcpy(). I already checked this.

The funny thing is that I can change the first five entries (A[0] - A[4]) of the matrix, so the first column is (0 1 2 3 4)^T, but the others remain unchanged?!

Could somebody check this example code?

One more thing…

If I call test<<<6, 6>>>(A), then 6 entries, A[0] - A[5], are written. It seems to depend on the block count.

Can you post more of your code, including the array declarations (host and device), the two cudaMemcpy calls, and how you printf/cout the results?

I would just run it for you now, but I am at work and don't have CUDA here.

[codebox]// device:
__global__ void test(float *A)
{
    int vecBase = __mul24(blockIdx.x, blockDim.x);
    A[vecBase + threadIdx.x] = threadIdx.x;
}

// host:
int i;
float *h_A, *d_A;

h_A = (float*)malloc(5*5*sizeof(float));
cudaMalloc((void **)&d_A, 5*5*sizeof(float));

for(i = 0; i < 5*5; i++)
    h_A[i] = 3.5f; //arbitrary

cudaMemcpy(d_A, h_A, 5*5*sizeof(float), cudaMemcpyHostToDevice);

test<<<5, 5>>>(h_A);
cudaThreadSynchronize();

cudaMemcpy(h_A, d_A, 5*5*sizeof(float), cudaMemcpyDeviceToHost);

for(i = 0; i < 5*5; i++)
    printf("%f\n", h_A[i]);
[/codebox]

Reading the entries is no problem.

For example, I can assign:

[codebox]
// This works:
if(blockIdx.x == 0 && threadIdx.x == 0)
    A[0] = A[15];

// BUT NOT this:
if(blockIdx.x == 0 && threadIdx.x == 0)
    A[5] = A[15];   // same for any index of 5 or higher
[/codebox]

[codebox]

test<<<5, 5>>>(h_A);

[/codebox]

Are you in device-emu mode? You are passing the host pointer to the kernel. That should fail when not running in emu mode. (Is that the case with integrated GPUs…?)

The above code should be:

[codebox]

test<<<5, 5>>>(d_A);

[/codebox]

Mmh, sorry, copy-and-paste mistake.

Of course, I called the kernel with d_A.

SHAME ON ME!!!

I only copied five entries back instead of the whole array! This shi… took me a day!

SORRY…

but sometimes you can’t see the forest for the trees!
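For anyone who lands here with the same symptom, here is a minimal sketch of the whole round trip with the full array copied back. It is only an illustration under a few assumptions, not the original poster's actual code: `CUDA_CHECK` and `runTest` are hypothetical helpers that do not appear in the thread, plain integer multiplication is used in place of `__mul24`, and `cudaDeviceSynchronize()` replaces the older `cudaThreadSynchronize()`. The point is that one `bytes` value, covering all n*n elements, is reused for the allocation and for both copies, so the copy back cannot silently be shorter than the array.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper (not from the thread): abort with a message on any CUDA error.
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,      \
                    cudaGetErrorString(err_));                      \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Same indexing as the kernel in the thread: block b writes the
// contiguous run A[b*blockDim.x] .. A[b*blockDim.x + blockDim.x - 1].
__global__ void test(float *A)
{
    int vecBase = blockIdx.x * blockDim.x;
    A[vecBase + threadIdx.x] = threadIdx.x;
}

// Hypothetical helper: fills the n*n host array h_A via the GPU.
void runTest(float *h_A, int n)
{
    const size_t bytes = (size_t)n * n * sizeof(float); // the WHOLE array
    float *d_A = NULL;

    for (int i = 0; i < n * n; i++)
        h_A[i] = 3.5f; // arbitrary initial value, as in the thread

    CUDA_CHECK(cudaMalloc((void **)&d_A, bytes));
    CUDA_CHECK(cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice));

    test<<<n, n>>>(d_A);                 // device pointer, not h_A
    CUDA_CHECK(cudaGetLastError());      // reports launch errors
    CUDA_CHECK(cudaDeviceSynchronize()); // reports errors raised while running

    // Copy back all n*n elements, not just the first n.
    CUDA_CHECK(cudaMemcpy(h_A, d_A, bytes, cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(d_A));
}

int main()
{
    const int n = 5;
    float *h_A = (float *)malloc((size_t)n * n * sizeof(float));
    if (h_A == NULL)
        return EXIT_FAILURE;

    runTest(h_A, n);

    for (int i = 0; i < n * n; i++) {
        printf("%f\n", h_A[i]);
        assert(h_A[i] == (float)(i % n)); // every run of n holds 0..n-1
    }

    free(h_A);
    return 0;
}
```

Checking the return value of every runtime call, plus `cudaGetLastError()` after the launch, also tends to surface mistakes such as a wrong pointer or a bad launch configuration much earlier than inspecting the output does.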