Hi everyone,
I have a basic question about the __syncthreads() function and some CUDA example code that uses it.
I know that if __syncthreads() is called inside a divergent branch of a warp, the kernel can deadlock. But in some CUDA examples, where people explain how to write correct code when the dataset size isn't a multiple of 32, I read this kind of code:
//dataset 1M
// Id of thread in the block.
int localId = threadIdx.x;
// The first index of output element computed by this block.
int startOfBlock = blockIdx.x * blockDim.x;
// The Id of the thread in the scope of the grid.
int globalId = localId + startOfBlock;
if (globalId < dataset) {
//do something and use shared memory
__syncthreads();
//do something
...
}
return;
Here too, the threads in a warp can diverge, yet this code runs fine and the final result is correct. Why? Is it related to a particular CUDA version or GPU compute capability?
In another example I read this:
//dataset 1M
// Id of thread in the block.
int localId = threadIdx.x;
// The first index of output element computed by this block.
int startOfBlock = blockIdx.x * blockDim.x;
// The Id of the thread in the scope of the grid.
int globalId = localId + startOfBlock;
if (globalId >= dataset)
return;
//do something and use shared memory
__syncthreads();
//do something
...
return;
I think this second example is better than the first, because the threads that do not meet the condition exit early. But even in this case, __syncthreads() is not called by all 32 threads of a warp.
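For comparison, here is the pattern I would have expected to be the only safe one: guard just the per-thread work, and keep __syncthreads() outside any divergent branch so every thread of the block reaches it. This is only a hypothetical sketch; the kernel name, input/output pointers, and the shared buffer are made-up names, and it assumes blockDim.x == 256:

__global__ void copyKernel(const float *input, float *output, int datasetSize)
{
    __shared__ float tile[256];   // assumes a block size of 256 threads

    int localId  = threadIdx.x;
    int globalId = blockIdx.x * blockDim.x + localId;

    // Guard only the memory access, not the barrier.
    if (globalId < datasetSize)
        tile[localId] = input[globalId];

    // Every thread of the block reaches the barrier, even the
    // out-of-range ones, so no deadlock is possible here.
    __syncthreads();

    if (globalId < datasetSize)
        output[globalId] = tile[localId]; // ...do something with shared memory
}

Is this pattern actually required, or are the guarded versions above also legal?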
Can someone explain the behaviour of __syncthreads() in these cases?
Thanks a lot.