Hello,
Suppose we are using a <<<16,32>>> launch configuration for a kernel function. Is the table below correct?
[u]variable[/u] ------------------ [u]max value[/u]
blockIdx.x ------------------------ 15
blockDim.x ----------------------- 32
threadIdx.x ----------------------- 31
I ask because, with the above configuration, the kernel executes 16*32 times in Emulation mode,
but in Debug and Release modes only 2 blocks execute (sometimes 5 blocks, sometimes 8 blocks).
Why is it so?
How do you measure the number of blocks that execute?
Yes, that's right (though blockDim.x is the size of the block, not a maximum index).
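If you want to verify the ranges yourself, here is a small sketch (kernel and variable names are made up): every thread records its indices, and the host then looks at the maxima.

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread stores its block and thread index.
__global__ void record_indices(int* block_ids, int* thread_ids)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;   // global thread id, 0..511
    block_ids[tid]  = blockIdx.x;                      // 0..15
    thread_ids[tid] = threadIdx.x;                     // 0..31
}

int main()
{
    const int blocks = 16, threads = 32, n = blocks * threads;
    int *d_block_ids, *d_thread_ids;
    cudaMalloc(&d_block_ids,  n * sizeof(int));
    cudaMalloc(&d_thread_ids, n * sizeof(int));

    record_indices<<<blocks, threads>>>(d_block_ids, d_thread_ids);

    int h_block_ids[n], h_thread_ids[n];
    cudaMemcpy(h_block_ids,  d_block_ids,  n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_thread_ids, d_thread_ids, n * sizeof(int), cudaMemcpyDeviceToHost);

    int max_block = 0, max_thread = 0;
    for (int i = 0; i < n; ++i) {
        if (h_block_ids[i]  > max_block)  max_block  = h_block_ids[i];
        if (h_thread_ids[i] > max_thread) max_thread = h_thread_ids[i];
    }
    printf("max blockIdx.x = %d, max threadIdx.x = %d\n", max_block, max_thread);   // 15 and 31

    cudaFree(d_block_ids);
    cudaFree(d_thread_ids);
    return 0;
}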
How do you know that?
Or better: how do you count them?
It's hard to say why some of your thread blocks aren't working.
I am pretty sure that they are executed, but perhaps they do nothing.
I can only guess what is happening.
foo<<<16,32>>>( …, count );   // count points to an int in device memory

__global__ void foo( …, int* count )
{
    …
    *count = blockIdx.x;
    …
}
Now copy count back to the host and print it.
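To make the "print count" step concrete: count has to point to device memory and has to be copied back before printing. A minimal host-side sketch (simplified so the kernel takes only the counter; names are made up):

#include <cstdio>
#include <cuda_runtime.h>

// Simplified version of foo: every thread writes its block index into *count.
__global__ void foo(int* count)
{
    *count = blockIdx.x;
}

int main()
{
    int h_count = 0;
    int* d_count;
    cudaMalloc(&d_count, sizeof(int));
    cudaMemcpy(d_count, &h_count, sizeof(int), cudaMemcpyHostToDevice);

    foo<<<16, 32>>>(d_count);
    cudaMemcpy(&h_count, d_count, sizeof(int), cudaMemcpyDeviceToHost);

    printf("count = %d\n", h_count);   // which block's index you get is undefined (see below)
    cudaFree(d_count);
    return 0;
}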
All blocks run in parallel, so you have a race condition: count could end up being any value from 0 to gridDim.x-1 (here, 0 to 15).
This does not work,
because it depends on which thread was the last to write to this variable,
so it is random which value count ends up with.
Better try something like this:
if (threadIdx.x == 0) (*count)++;
This way only thread 0 of each block increments the count variable,
so at the end the result should be the number of blocks.
And of course it would be better to use an atomic add.
This still has a race condition between blocks and may not give you the correct answer.
Not just better, atomicAdd is the only way to do this correctly. :)
Can you give me some examples of how to use atomicAdd, please?
Have a look at the Programming Guide (Appendix C).
An example:
__device__ int count;      // the counter must live in global (or shared) memory
atomicAdd(&count, 1);
This is the same as count++, but it is done as one atomic operation.
Or a more general example, doing x = x + y atomically:
__device__ int x;          // again, x must be in global (or shared) memory
int y;                     // the value to add
atomicAdd(&x, y);          // atomically performs x = x + y
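To tie it all together, here is a sketch of the complete block-counting idea with atomicAdd (counter in global memory; names are made up; assumes a device with compute capability 1.1 or higher, which is needed for atomicAdd on global memory):

#include <cstdio>
#include <cuda_runtime.h>

// Thread 0 of every block atomically increments the counter.
__global__ void count_blocks(int* count)
{
    if (threadIdx.x == 0)
        atomicAdd(count, 1);
}

int main()
{
    int h_count = 0;
    int* d_count;
    cudaMalloc(&d_count, sizeof(int));
    cudaMemcpy(d_count, &h_count, sizeof(int), cudaMemcpyHostToDevice);

    count_blocks<<<16, 32>>>(d_count);
    cudaMemcpy(&h_count, d_count, sizeof(int), cudaMemcpyDeviceToHost);

    printf("blocks executed: %d\n", h_count);   // should print 16
    cudaFree(d_count);
    return 0;
}

If this prints a number smaller than 16, blocks really are being skipped; if it prints 16, the earlier counting method (not the hardware) was the problem.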