Since [font=“Courier New”]ImIndex[/font] is the same for every thread of a block, you have 297 threads all trying to increment the same array element. As this increment is not done atomically, you can theoretically get any result between 1 and 297, depending on the exact timing. More likely, however, you will find small integers between 1 and 10: the 297 threads form 10 warps, and all 32 threads of a warp start from the same value, so each warp increments the element by one at most.
To make the code work, you need to incorporate [font=“Courier New”]threadIdx[/font] into the index calculation so that each thread works on a different index.
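Since your kernel isn't fully shown, here is only a rough sketch of what I mean, not your actual code. The kernel names, the [font=“Courier New”]int[/font] element type and the host helpers are made up for illustration. The first kernel gives every thread its own index; the second keeps one element per block but uses [font=“Courier New”]atomicAdd[/font], in case all threads of a block really have to update the same element.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread derives its own index from blockIdx and
// threadIdx, so no two threads ever write to the same element.
__global__ void incrementPerThread(int *accum, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        accum[idx] += 1;
}

// Hypothetical alternative: all threads of a block update one element,
// but the increment is atomic, so none of the updates is lost.
__global__ void incrementPerBlockAtomic(int *accum)
{
    atomicAdd(&accum[blockIdx.x], 1);
}

// Host helpers (illustrative only, error checking omitted for brevity).
std::vector<int> runPerThread(int blocks, int threads)
{
    int n = blocks * threads;
    int *accum_d = nullptr;
    cudaMalloc(&accum_d, n * sizeof(int));
    cudaMemset(accum_d, 0, n * sizeof(int));
    incrementPerThread<<<blocks, threads>>>(accum_d, n);
    std::vector<int> accum(n);
    cudaMemcpy(accum.data(), accum_d, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(accum_d);
    return accum;
}

std::vector<int> runPerBlockAtomic(int blocks, int threads)
{
    int *accum_d = nullptr;
    cudaMalloc(&accum_d, blocks * sizeof(int));
    cudaMemset(accum_d, 0, blocks * sizeof(int));
    incrementPerBlockAtomic<<<blocks, threads>>>(accum_d);
    std::vector<int> accum(blocks);
    cudaMemcpy(accum.data(), accum_d, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(accum_d);
    return accum;
}
```

With 297 threads per block, the first version should leave every element at 1 and the second should leave every per-block element at 297. Which of the two fits depends on what the array is supposed to hold.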
I just noticed the title of your thread: how is shared memory involved here? You omitted the definitions of [font=“Courier New”]Accum[/font] and [font=“Courier New”]Accum_d[/font], so I have to ask: is [font=“Courier New”]Accum_d[/font] declared as shared memory?
If yes, this will simply not work: for one, shared memory is a per-multiprocessor resource and every block gets its own instance of a shared variable, so a single copy is not sufficient to initialize it for all blocks.
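If you do want the accumulator in shared memory, the usual pattern is to initialize it inside the kernel, once per block, and write the result out to global memory at the end. A minimal sketch, assuming a single [font=“Courier New”]int[/font] counter per block; the names [font=“Courier New”]countInShared[/font], [font=“Courier New”]blockTotals[/font] and [font=“Courier New”]runCountInShared[/font] are mine, not from your code:

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: every block owns a private shared counter.
__global__ void countInShared(int *blockTotals)
{
    __shared__ int accum;          // one instance per block, not visible to the host

    if (threadIdx.x == 0)
        accum = 0;                 // initialized on the device, once per block
    __syncthreads();               // all threads wait until the counter is zeroed

    atomicAdd(&accum, 1);          // atomic, so concurrent increments are not lost
    __syncthreads();               // all increments are done before the read below

    if (threadIdx.x == 0)
        blockTotals[blockIdx.x] = accum;   // publish the block's result
}

// Host helper (illustrative only, error checking omitted for brevity).
std::vector<int> runCountInShared(int blocks, int threads)
{
    int *totals_d = nullptr;
    cudaMalloc(&totals_d, blocks * sizeof(int));
    cudaMemset(totals_d, 0, blocks * sizeof(int));
    countInShared<<<blocks, threads>>>(totals_d);
    std::vector<int> totals(blocks);
    cudaMemcpy(totals.data(), totals_d, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(totals_d);
    return totals;
}
```

Launched with 297 threads per block, each entry of the returned vector should be 297. Only the global array is ever touched from the host side.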