Warps and Shared Memory

Dear CUDA experts,

Sorry for the stupid question, but I can’t see where I’m going wrong.

I have a kernel in which 64 threads each compute a value and store it in shared memory. In the last of these threads I would like to re-read the values from shared memory in order to compare the results.

Something like this:

__shared__ float results[64];

....

if(threadIdx.x>31 && threadIdx.x<96) {

 nthread=threadIdx.x-32;

 .....(calculations)......

 results[nthread]= xx;

 __syncthreads();

                   <<<<---------BREAK 1

 if(threadIdx.x==95) {

    for(int i=0;i<64;i++) {

       yy = results[i];

                   <<<<---------BREAK 2

    }

 }

}

.....

If I “extract” the contents of results by copying them to an array in global memory at the points marked BREAK 1 and BREAK 2, I get different values: at BREAK 2 the first 32 entries are wrong, while the last 32 are right.

Do you have any idea why?

thanks a lot,

g.

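Don’t call __syncthreads() inside an if-then-else unless the condition evaluates identically for every thread in the block. Otherwise the behavior is undefined: the kernel is likely to hang or produce unintended side effects.

Try the following code, which moves the barrier out of the branch:

__shared__ float results[64];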

....

if(threadIdx.x>31 && threadIdx.x<96) {

    nthread=threadIdx.x-32;

    .....(calculations)......

    results[nthread]= xx;

}

__syncthreads();

if(threadIdx.x==95) {

    for(int i=0;i<64;i++) {

       yy = results[i];

                   <<<<---------BREAK 2

    }

}

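Yeah. Technically your program shouldn’t even finish, because __syncthreads() waits until every thread in the block has reached it. Clearly not all of your threads are going to hit the __syncthreads() (only threads 32 to 95 enter the branch), which should lead to an “infinite wait”. The pattern you see also fits this: the first 32 entries are written by the warp holding threads 32 to 63, while thread 95 belongs to the other warp, so without a valid barrier nothing guarantees those writes are finished before thread 95 reads them.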
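
For anyone who wants to try this, here is a minimal self-contained sketch of the corrected pattern. It is not the original poster's code: the helper compute_value, the kernel name fill_and_check, the host function run_barrier_demo and the launch configuration (one block of 128 threads) are all assumptions made for illustration, and the arithmetic is just a placeholder for the elided calculations.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the calculations elided in the thread.
__device__ float compute_value(int n) { return 2.0f * n + 1.0f; }

__global__ void fill_and_check(float *out)
{
    __shared__ float results[64];

    // Only threads 32..95 (two full warps) produce a value.
    if (threadIdx.x > 31 && threadIdx.x < 96) {
        int nthread = threadIdx.x - 32;
        results[nthread] = compute_value(nthread);
    }

    // Barrier outside the branch: every thread in the block reaches it.
    __syncthreads();

    // Thread 95 re-reads all 64 values and copies them out for inspection.
    if (threadIdx.x == 95) {
        for (int i = 0; i < 64; i++) {
            out[i] = results[i];
        }
    }
}

// Launches the kernel and compares against the same formula on the host.
// Returns 0 if all 64 values match, nonzero otherwise.
int run_barrier_demo()
{
    float *d_out = nullptr;
    float h_out[64];

    if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) return 1;

    // Assumed launch configuration: one block, 128 threads.
    fill_and_check<<<1, 128>>>(d_out);

    // cudaMemcpy waits for the kernel and reports any launch error.
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return 2;

    for (int i = 0; i < 64; i++) {
        if (h_out[i] != 2.0f * i + 1.0f) return 3;
    }
    return 0;
}

int main()
{
    int status = run_barrier_demo();
    printf(status == 0 ? "all 64 values match\n" : "mismatch or CUDA error\n");
    return status;
}
```

The important part is that __syncthreads() sits at the top level of the kernel, so threads 0 to 31 and 96 to 127 reach it as well, even though they do no work before it.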