__syncthreads() issue

I wrote this simple test program to try out shared memory. It loads data from global memory (da) into shared memory (s), then copies s back to global memory (db).

The problem is that it prints correct results WITHOUT __syncthreads(). If I use __syncthreads(), most of the results are zero. I thought it would be the other way around.

I'd really appreciate it if anyone can help!

[codebox]#include <iostream>
#include <cstdlib>

using namespace std;

__global__ void kernel(float *, float *);

int main()
{
    float *ha, *hb, *da, *db;
    int size = 51;

    ha = (float *)malloc(size * sizeof(float));
    hb = (float *)malloc(size * sizeof(float));
    cudaMalloc((void **)&da, size * sizeof(float));
    cudaMalloc((void **)&db, size * sizeof(float));

    for (int m = 0; m < size; m++) {
        ha[m] = m;
    }

    cudaMemcpy(da, ha, size * sizeof(float), cudaMemcpyHostToDevice);
    kernel<<<1,51>>>(da,db);
    cudaMemcpy(hb, db, size * sizeof(float), cudaMemcpyDeviceToHost);

    for (int m = 0; m < size; m++) {
        cout << m << ":" << hb[m] << endl;
    }

    free(ha); free(hb);
    cudaFree(da); cudaFree(db);
}

__global__ void kernel(float *da, float *db)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;

    extern __shared__ float s[];

    s[idx] = da[idx];

    __syncthreads(); //????

    db[idx] = s[idx];
}[/codebox]

Hi,

There are two issues in the code:

a. You don't call cudaThreadSynchronize (or equivalent) after the kernel call. Therefore the values in db are not valid when you do the cudaMemcpy (because the kernel call is asynchronous, the CPU immediately goes on to the memcpy before the kernel completes).

b. The kernel doesn't do what you wanted. You simply copy data from da[idx] to s[idx] and then from s[idx] to db[idx]; you don't really use the shared memory here.

   It's as if you wrote db[idx] = da[idx]; the __syncthreads() here doesn't have any meaning.
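
For contrast with point b, here is a minimal sketch of a kernel in which __syncthreads() does matter, because each thread reads a shared-memory slot written by a different thread. The kernel name reverseKernel, the extra parameter n, and the single-block launch are assumptions made for illustration; none of them come from the program above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: reverses n floats inside a single block.
__global__ void reverseKernel(const float *in, float *out, int n)
{
    extern __shared__ float s[];   // size is supplied at launch time
    int t = threadIdx.x;

    if (t < n) s[t] = in[t];       // each thread writes its own slot

    // Barrier: every write to s must be complete before any thread
    // reads a slot that was written by another thread.
    __syncthreads();

    if (t < n) out[t] = s[n - 1 - t];   // read another thread's slot
}

int main()
{
    const int n = 51;
    const size_t bytes = n * sizeof(float);
    float h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc((void **)&d_in, bytes);
    cudaMalloc((void **)&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    // Third launch parameter: bytes of dynamic shared memory per block.
    reverseKernel<<<1, n, bytes>>>(d_in, d_out, n);

    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    // Count elements that are not in reversed order.
    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != h_in[n - 1 - i]) ++mismatches;
    printf("mismatches: %d\n", mismatches);

    cudaFree(d_in);
    cudaFree(d_out);
    return mismatches == 0 ? 0 : 1;
}
```

With 51 threads the block spans two warps, so without the barrier a thread in one warp could read a slot that a thread in the other warp has not written yet.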

eyal

Point (a) is not correct. CUDA understands the dependency between a kernel launch and cudaMemcpy, and cudaMemcpy will wait for the kernel to finish before attempting to copy. Streams can change the dependencies between invocations, but in this case there are no streams.

The problem is that this:

extern __shared__ float s[];

declares dynamically sized shared memory, where the size is determined when the kernel is launched. Programming Guide 2.1, page 21.

But this:

kernel<<<1,51>>>(da,db);

does not specify a size for shared memory, so zero dynamic shared memory is available. Programming Guide 2.1, page 23.

Try this instead:

kernel<<<1, 51, 51*sizeof(float)>>>(da,db);
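
Putting that together, a complete version of the test program might look like the sketch below. The CHECK macro, the kernel name copyViaShared, the bounds parameter n, and indexing s by threadIdx.x are my own additions for illustration, not part of the original program. Error checking is included because a wrong launch configuration is otherwise easy to miss.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t e = (call);                                       \
        if (e != cudaSuccess) {                                       \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(e));                           \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

__global__ void copyViaShared(const float *da, float *db, int n)
{
    extern __shared__ float s[];   // sized by the third launch parameter
    int idx = blockDim.x * blockIdx.x + threadIdx.x;

    // Shared memory is per block, so index it with threadIdx.x.
    if (idx < n) s[threadIdx.x] = da[idx];
    __syncthreads();
    if (idx < n) db[idx] = s[threadIdx.x];
}

int main()
{
    const int size = 51;
    const size_t bytes = size * sizeof(float);

    float *ha = (float *)malloc(bytes);
    float *hb = (float *)malloc(bytes);
    float *da, *db;
    CHECK(cudaMalloc((void **)&da, bytes));
    CHECK(cudaMalloc((void **)&db, bytes));

    for (int m = 0; m < size; m++) ha[m] = (float)m;
    CHECK(cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice));

    // One block of 51 threads, with 51 floats of dynamic shared memory.
    copyViaShared<<<1, size, bytes>>>(da, db, size);
    CHECK(cudaGetLastError());     // reports launch errors

    // No explicit synchronization: this blocking copy on the default
    // stream waits for the kernel to finish first.
    CHECK(cudaMemcpy(hb, db, bytes, cudaMemcpyDeviceToHost));

    int mismatches = 0;
    for (int m = 0; m < size; m++)
        if (hb[m] != ha[m]) mismatches++;
    printf("mismatches: %d\n", mismatches);

    free(ha);
    free(hb);
    CHECK(cudaFree(da));
    CHECK(cudaFree(db));
    return mismatches == 0 ? 0 : 1;
}
```

On a working setup this is expected to report zero mismatches. Checking the return value of the final cudaMemcpy also surfaces errors raised while the kernel was running, since that call is the first point at which the host waits for the kernel.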

Jamie,

You are the man! It worked just as you said… Thanks a lot!

eyal, still wanna thank you for your thoughts on this!