Hi, I am a new CUDA user, and I have a question about loading data into shared memory inside an "if" statement.
What I want to do is: if all the threads in the block get result == 0, the block should stop loading data into shared memory;
if any thread in the block gets result == 1, the block should continue to load data into shared memory and do the computation.
I have written kernel code like this, but it is very slow. Is it correct?
for (i = 0; i < n; i++) {
    if (result != 0) {    // if all the threads get result == 0, will it stop here????
        // load data into shared memory
        __syncthreads();
        // do computation  -- will a thread with result == 0 also do the computation???
        //                    since all the threads are synchronized to load data again
        if (result == 0)
            break;
    }
}
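For comparison, here is one common pattern I have seen described (a sketch only, not your actual kernel: the load and compute steps are placeholders, and n is assumed to be a kernel parameter). The idea is to reduce the per-thread result into a single shared flag, so the whole block takes the same branch and __syncthreads() is never reached by only part of the block:

```cuda
__global__ void sketch_kernel(int n /* , ... */)
{
    __shared__ int any_active;      // 1 if at least one thread still has result != 0

    int result = 1;                 // per-thread status (placeholder initialization)

    for (int i = 0; i < n; i++) {
        // block-wide vote: one flag decides for everybody
        if (threadIdx.x == 0)
            any_active = 0;
        __syncthreads();
        if (result != 0)
            any_active = 1;         // benign race: every writer stores the same value
        __syncthreads();

        if (any_active == 0)        // uniform across the block, so break is safe
            break;

        // ... load data into shared memory (all threads participate) ...
        __syncthreads();

        // ... do computation; threads with result == 0 can skip the arithmetic,
        //     but they must still reach every __syncthreads() below ...
        __syncthreads();
    }
}
```

On devices of compute capability 2.0 and later, the vote can instead be written as a single call, any_active = __syncthreads_or(result).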
The problem itself has serious divergence. In a thread block, some threads finish quickly once they get result == 0, while others with result == 1 must continue to load data and compute. Do you have any good suggestions for this kind of problem?
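One way I have read about to reduce the divergence cost (a sketch; __any_sync needs a recent CUDA toolkit, while older toolkits spell it __any) is to make the skip decision per warp rather than per thread, so a whole warp drops out of the work together instead of idling lane by lane:

```cuda
// inside the kernel loop, with per-thread 'result' as before
unsigned mask = __activemask();
if (__any_sync(mask, result)) {
    // true if ANY thread in this warp still has result != 0,
    // so the whole warp executes (or skips) this body together.
    // Caution: any __syncthreads() must stay OUTSIDE this divergent region.
}
```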
At first I used only global memory, and it was very slow, since in my application the threads cannot load data from global memory in a coalesced way.
Now I am trying to use shared memory to get better data access and data reuse.
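A standard trick for this (a sketch under my own assumptions: blockDim.x == 256, and the reversed read is just a stand-in for whatever scattered access pattern the application needs): even when each thread ultimately needs data in a scattered order, the block can first copy a contiguous tile from global memory in a coalesced way, then read the tile from shared memory in any order for free:

```cuda
__global__ void staged_load(const float *g_in, float *g_out, int n)
{
    __shared__ float tile[256];          // one element per thread (assumes blockDim.x == 256)

    int base = blockIdx.x * blockDim.x;
    int idx  = base + threadIdx.x;

    // Coalesced phase: consecutive threads read consecutive global addresses.
    if (idx < n)
        tile[threadIdx.x] = g_in[idx];
    __syncthreads();

    // Scattered phase: each thread may now read the tile in an arbitrary
    // (uncoalesced) pattern without touching global memory again.
    // Here a simple reversal stands in for the real access pattern.
    int src = blockDim.x - 1 - threadIdx.x;
    if (idx < n && base + src < n)
        g_out[idx] = tile[src];
}
```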
What I hope is that if all the threads in a block get result == 0, the block will stop and return. But it seems that __syncthreads() does not let the block stop, even when every thread in the block has gotten result == 0,
as in the code above: