Dear CUDA experts,
Sorry for the stupid question, but I can’t understand where I’m going wrong.
I have a kernel in which 64 threads each compute a value and store it in shared memory. In the last of these threads I would like to read the values back from shared memory in order to compare the results.
Something like this:
__shared__ float results[64];
....
if (threadIdx.x > 31 && threadIdx.x < 96) {
    nthread = threadIdx.x - 32;
    .....(calculations)......
    results[nthread] = xx;
    __syncthreads();
    <<<<---------BREAK 1
    if (threadIdx.x == 95) {
        for (int i = 0; i < 64; i++) {
            yy = results[i];
            <<<<---------BREAK 2
        }
    }
}
.....
If I copy the contents of results into an array in global memory at the points marked BREAK 1 and BREAK 2, I get different values: in particular, the first 32 values are wrong while the last 32 are right.
Do you have any idea what is going wrong?
Thanks a lot,
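To make the setup concrete, here is a minimal, self-contained sketch of this kind of check. It is a variant under stated assumptions, not the real kernel: the block size of 128, the placeholder calculation, the buffers dbg1 and dbg2, and the names compare_kernel and snapshots_match are all hypothetical. One deliberate difference from the snippet above: the barrier sits outside the if, because as far as I understand __syncthreads() is only well defined when every thread of the block reaches it, and in the snippet above only threads 32..95 do. That may be related to the mismatch.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Assumed sizes: 64 worker threads (threadIdx.x 32..95) in one block of 128.
constexpr int kWorkers = 64;
constexpr int kBlock   = 128;

// dbg1 and dbg2 are hypothetical global buffers of kWorkers floats each.
__global__ void compare_kernel(float *dbg1, float *dbg2)
{
    __shared__ float results[kWorkers];

    const bool worker  = threadIdx.x > 31 && threadIdx.x < 96;
    const int  nthread = threadIdx.x - 32;

    if (worker) {
        float xx = 2.0f * nthread;          // placeholder for the real calculation
        results[nthread] = xx;
    }

    // Reached by every thread of the block, not only by the 64 workers.
    __syncthreads();

    if (worker) {
        dbg1[nthread] = results[nthread];   // snapshot at BREAK 1
    }

    if (threadIdx.x == 95) {
        for (int i = 0; i < kWorkers; i++) {
            float yy = results[i];
            dbg2[i] = yy;                   // snapshot at BREAK 2
        }
    }
}

// Runs the kernel once; returns true only if both snapshots agree
// and no CUDA call failed.
bool snapshots_match()
{
    float *dbg1 = nullptr;
    float *dbg2 = nullptr;
    if (cudaMalloc(&dbg1, kWorkers * sizeof(float)) != cudaSuccess) {
        return false;
    }
    if (cudaMalloc(&dbg2, kWorkers * sizeof(float)) != cudaSuccess) {
        cudaFree(dbg1);
        return false;
    }

    compare_kernel<<<1, kBlock>>>(dbg1, dbg2);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    float h1[kWorkers];
    float h2[kWorkers];
    bool ok = cudaMemcpy(h1, dbg1, sizeof(h1), cudaMemcpyDeviceToHost) == cudaSuccess
           && cudaMemcpy(h2, dbg2, sizeof(h2), cudaMemcpyDeviceToHost) == cudaSuccess;

    for (int i = 0; ok && i < kWorkers; i++) {
        ok = (h1[i] == h2[i]);
    }

    cudaFree(dbg1);
    cudaFree(dbg2);
    return ok;
}

int main()
{
    printf("%s\n", snapshots_match() ? "snapshots agree"
                                     : "snapshots differ or CUDA error");
    return 0;
}
```

If the two snapshots agree in this form but not with the barrier inside the branch, the barrier placement would be the first thing to look at.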
g.