Problem with shared memory usage in variables

For the following simple case:

__global__ void Kernel(float *Accum);   // forward declaration

int main()
{
    dim3 dimGrid2(21,21), dimBlock2(3,3,33);

    /* definitions of REAL, Accum and Accum_d omitted */

    for (int i = 0; i < 21*21; i++)  Accum[i] = 0.0;
    cudaMemcpy(Accum_d, Accum, sizeof(REAL)*21*21, cudaMemcpyHostToDevice);

    Kernel<<<dimGrid2, dimBlock2>>>(Accum_d);

    cudaMemcpy(Accum, Accum_d, sizeof(REAL)*21*21, cudaMemcpyDeviceToHost);
    for (int i = 0; i < 21*21; i++) printf("Accum[%3d] = %10.8f\n", i, Accum[i]);
}

__global__ void Kernel(float *Accum)
{
    int lmIndex = blockIdx.x + blockIdx.y * gridDim.x;
    Accum[lmIndex] += 1.0f;
}

I get strange numbers printed out, and those numbers are different every time I run the program.
I thought each component of Accum should be 1.0.

Am I misunderstanding how to use threads and blocks?

Please help and many thanks in advance.

Since lmIndex is the same for every thread of a block, you have 297 threads all trying to increment the same array element. As this increment is not done atomically, you can in theory get any result between 1 and 297, depending on the exact timing. More likely, though, you will find small integers between 1 and 10, as each warp of 32 threads starts from the same value and thus increments it by one at most.

You need to incorporate threadIdx into the index calculation so that each thread works on a different index; then the code will work.
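
To illustrate (this is just a sketch, not your code, and it assumes Accum_d has been enlarged to 21*21*297 floats so that every thread gets its own element):

__global__ void Kernel(float *Accum)
{
    int threadsPerBlock = blockDim.x * blockDim.y * blockDim.z;             // 3*3*33 = 297
    int blockId  = blockIdx.x + blockIdx.y * gridDim.x;                     // 0 .. 21*21-1
    int threadId = threadIdx.x + threadIdx.y * blockDim.x
                 + threadIdx.z * blockDim.x * blockDim.y;                   // 0 .. 296
    int lmIndex  = blockId * threadsPerBlock + threadId;                    // unique across the whole grid
    Accum[lmIndex] += 1.0f;                                                 // no race: one thread per element
}

If you want to keep Accum at 21*21 elements (one per block) instead, the increments have to be made atomic, e.g. atomicAdd(&Accum[blockId], 1.0f), which for floats requires a device of compute capability 2.0 or higher.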

Just noticed the title of your thread: how is shared memory involved here? Since you omitted the definitions of Accum and Accum_d: is Accum_d declared as shared memory?
If yes, this will simply not work, as (for one) shared memory is a per-multiprocessor resource, so a single copy is not sufficient to initialize it for all blocks.
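
Just to sketch what correct usage would look like (illustrative only, assuming Accum is an ordinary global-memory array of 21*21 floats): shared memory is declared and initialized inside the kernel, once per block, and only what is written back to global memory survives the kernel.

__global__ void Kernel(float *Accum)
{
    __shared__ float blockSum;                           // one instance per block, not one per grid

    bool firstThread = (threadIdx.x == 0 && threadIdx.y == 0 && threadIdx.z == 0);

    if (firstThread)
        blockSum = 0.0f;                                 // initialize inside the kernel, not via cudaMemcpy
    __syncthreads();                                     // make the initialization visible to all threads

    // ... the threads of this block accumulate their contributions into blockSum
    //     (using atomics or a reduction) ...

    __syncthreads();
    if (firstThread)
        Accum[blockIdx.x + blockIdx.y * gridDim.x] = blockSum;   // write the per-block result to global memory
}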

Thanks, tera,

I learned that I need to make my threads work on a different index… hmmm.

Does this mean that, in my case, I have to write 'lmIndex' in terms of threadIdx.x?

In my example, the value is going to be replaced with the sum of the contributions of all threads in the block, like the following:

lmIndex = blockIdx.x + blockIdx.y*gridDim.x;

{

sum calculation of all components of the current block

}

Accum[lmIndex] += sum ;
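
Written out a bit more concretely, I mean something like this (just a rough sketch; 'myValue' stands for whatever each thread contributes, and the serial loop in thread 0 is only the simplest way I can think of to form the block sum):

__global__ void Kernel(float *Accum)
{
    __shared__ float vals[3*3*33];                       // one slot per thread of the block

    int tid = threadIdx.x + threadIdx.y * blockDim.x
            + threadIdx.z * blockDim.x * blockDim.y;     // flat thread index, 0 .. 296

    float myValue = 1.0f;                                // placeholder for this thread's contribution
    vals[tid] = myValue;
    __syncthreads();                                     // all contributions are now in shared memory

    if (tid == 0) {                                      // one thread combines them
        float sum = 0.0f;
        for (int i = 0; i < blockDim.x * blockDim.y * blockDim.z; i++)
            sum += vals[i];

        int lmIndex = blockIdx.x + blockIdx.y * gridDim.x;
        Accum[lmIndex] += sum;                           // single writer per block, no collision on Accum
    }
}

My thinking is that only thread 0 of each block writes Accum[lmIndex], and lmIndex differs between blocks, so the threads would no longer collide.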

Still, you are saying this is not going to work… am I right?

Thanks. I need to think about it: initialize inside the kernel, not outside. Thanks.
