Why don't I have a problem with __syncthreads?

Hi everyone,

I often use the following kind of programming:

[code]

__global__ void MyKernel(int MaxX, int *data, int *Result)
{
    __shared__ int Mask[256];
    int ix = blockIdx.x * blockDim.x + threadIdx.x - 1;

    Mask[threadIdx.x] = 0;
    if (ix >= 0 && ix < MaxX) {
        Mask[threadIdx.x] = data[ix];

        __syncthreads();   // inside the conditional: not all threads reach it!

        if (threadIdx.x > 0 && threadIdx.x < blockDim.x - 1)
            Result[ix] = (Mask[threadIdx.x + 1] - Mask[threadIdx.x - 1]) / 2;
    }
}


And it works! That's strange, because not all the threads reach __syncthreads()...

I should use instead:


__global__ void MyKernel(int MaxX, int *data, int *Result)
{
    __shared__ int Mask[256];
    int ix = blockIdx.x * blockDim.x + threadIdx.x - 1;
    bool ok = false;

    Mask[threadIdx.x] = 0;
    if (ix >= 0 && ix < MaxX) {
        Mask[threadIdx.x] = data[ix];
        ok = true;
    }

    __syncthreads();   // outside the conditional: reached by every thread

    if (ok && threadIdx.x > 0 && threadIdx.x < blockDim.x - 1)
        Result[ix] = (Mask[threadIdx.x + 1] - Mask[threadIdx.x - 1]) / 2;
}

[/code]

I have dozens of CUDA codes written the first way. I would like to make sure that I really have to rewrite them the second way…

Any idea why the first way seems OK?

Yves

For one thing, undefined behaviour is just that - there is no guarantee that it is going to crash or visibly misbehave.


Maybe __syncthreads() only needs all warps to reach it: if a warp reaches __syncthreads(), its masked-off threads effectively arrive too, and only the writes inside this if are masked?


You need to thank your luck!


You are right: only one thread in each warp needs to reach the barrier, at least on the GT200 chips. (In the first kernel the guard typically excludes only thread 0 of the first block and the tail threads past MaxX, so every warp usually still has at least one thread arriving at the barrier.)

http://www.eecg.toronto.edu/~myrto/gpuarch-ispass2010.pdf


Nice paper! However, I don't expect this behaviour to remain valid for the next GPU architecture(s). From all the answers, I do think I have to rewrite the kernels in a safe manner…

Thanks a lot

Yves
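To make the rewrite concrete, here is a minimal self-contained sketch of the safe pattern (guarded load, then an unconditional barrier), with a hypothetical host-side main() that launches it on a small array. The kernel name CentralDiff and the test data are my own for illustration; error checking is stripped for brevity, and like the original kernels this one leaves the two halo elements at each block boundary unwritten:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Safe variant: every thread of the block reaches __syncthreads(),
// and the guard result is remembered in a per-thread flag.
__global__ void CentralDiff(int MaxX, const int *data, int *Result)
{
    __shared__ int Mask[256];
    int ix = blockIdx.x * blockDim.x + threadIdx.x - 1;
    bool ok = false;

    Mask[threadIdx.x] = 0;
    if (ix >= 0 && ix < MaxX) {
        Mask[threadIdx.x] = data[ix];
        ok = true;
    }

    __syncthreads();   // unconditional: no divergent barrier

    if (ok && threadIdx.x > 0 && threadIdx.x < blockDim.x - 1)
        Result[ix] = (Mask[threadIdx.x + 1] - Mask[threadIdx.x - 1]) / 2;
}

int main()
{
    const int MaxX = 1024, threads = 256;
    const int blocks = (MaxX + threads - 1) / threads;

    int *d_data, *d_result;
    cudaMalloc(&d_data, MaxX * sizeof(int));
    cudaMalloc(&d_result, MaxX * sizeof(int));

    int h_data[MaxX], h_result[MaxX];
    for (int i = 0; i < MaxX; ++i) h_data[i] = i * i;   // f(x) = x^2

    cudaMemcpy(d_data, h_data, MaxX * sizeof(int), cudaMemcpyHostToDevice);
    CentralDiff<<<blocks, threads>>>(MaxX, d_data, d_result);
    cudaMemcpy(h_result, d_result, MaxX * sizeof(int), cudaMemcpyDeviceToHost);

    // Central difference of x^2 is 2x away from block borders.
    printf("Result[10] = %d (expect 20)\n", h_result[10]);

    cudaFree(d_data);
    cudaFree(d_result);
    return 0;
}
```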
