Hi everyone,
I often use the following kind of programming:
[code]
__global__ void MyKernel(int MaxX, int *data, int *Result)
{
    // Central difference on a block-sized shared-memory tile
    __shared__ int Mask[256];
    int ix;
    ix = blockIdx.x * blockDim.x + threadIdx.x - 1;
    Mask[threadIdx.x] = 0;
    if (ix >= 0 && ix < MaxX) {
        Mask[threadIdx.x] = data[ix];
        __syncthreads();                // reached only by threads inside the if
        if (threadIdx.x > 0 && threadIdx.x < blockDim.x - 1)
            Result[ix] = (Mask[threadIdx.x + 1] - Mask[threadIdx.x - 1]) / 2;
    }
}
[/code]
And that works! That's strange, because not all threads reach __syncthreads()…
I should use instead:
[code]
__global__ void MyKernel(int MaxX, int *data, int *Result)
{
    __shared__ int Mask[256];
    int ix;
    bool ok;
    ix = blockIdx.x * blockDim.x + threadIdx.x - 1;
    Mask[threadIdx.x] = 0;
    ok = false;
    if (ix >= 0 && ix < MaxX) {
        Mask[threadIdx.x] = data[ix];
        ok = true;
    }
    __syncthreads();                    // now reached by every thread of the block
    if (ok && threadIdx.x > 0 && threadIdx.x < blockDim.x - 1)
        Result[ix] = (Mask[threadIdx.x + 1] - Mask[threadIdx.x - 1]) / 2;
}
[/code]
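For reference, a kernel like this assumes blocks of (at most) 256 threads to match Mask. A minimal host-side launch might look roughly like the sketch below; the sizes, variable names and grid calculation are placeholders rather than the actual host code:
[code]
// Illustrative launch fragment -- sizes and names are placeholders.
int MaxX    = 1 << 20;                        // number of valid elements (example)
int threads = 256;                            // matches the size of Mask[]
int blocks  = (MaxX + threads - 1) / threads; // illustrative; the real grid may differ

int *h_data = (int*)malloc(MaxX * sizeof(int));
// ... fill h_data ...

int *d_data, *d_result;
cudaMalloc((void**)&d_data,   MaxX * sizeof(int));
cudaMalloc((void**)&d_result, MaxX * sizeof(int));
cudaMemcpy(d_data, h_data, MaxX * sizeof(int), cudaMemcpyHostToDevice);

MyKernel<<<blocks, threads>>>(MaxX, d_data, d_result);
cudaThreadSynchronize();                      // wait for completion (CUDA 3.x-era API)
[/code]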
I have dozens of CUDA kernels written the first way. I would like to make sure whether I really have to rewrite them the second way…
Any idea why the first way seems OK?
Yves
tera
May 26, 2010, 7:54am
2
For one, undefined behaviour is just that - there is no guarantee it is going to crash or something.
plmae
May 26, 2010, 8:59am
4
Maybe __syncthreads() only needs every warp (rather than every thread) to reach it: if a warp reaches __syncthreads(), the masked-off threads take part in the barrier too, and only the writes inside the if are masked?
You need to thank your luck!
You are right: only one thread in each warp needs to reach the barrier, at least on GT200 chips:
http://www.eecg.toronto.edu/~myrto/gpuarch-ispass2010.pdf
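In the first kernel above, for typical launches the threads that skip the barrier (thread 0 of block 0, where ix is -1, plus any tail threads with ix >= MaxX) sit in warps where other threads do reach it, which is why the warp-counting barrier still releases. A small diagnostic kernel along these lines (hypothetical, not from the original code; shared-memory atomics need compute capability 1.2+) makes that visible:
[code]
// Hypothetical diagnostic: count, per block, how many threads pass the
// "ix >= 0 && ix < MaxX" guard of the first kernel. With 256-thread blocks
// the count is 255 for block 0 (its thread 0 has ix == -1) and 256 for
// every other block that lies fully below MaxX.
__global__ void CountGuarded(int MaxX, int *CountPerBlock)
{
    __shared__ int count;
    if (threadIdx.x == 0) count = 0;
    __syncthreads();

    int ix = blockIdx.x * blockDim.x + threadIdx.x - 1;
    if (ix >= 0 && ix < MaxX)
        atomicAdd(&count, 1);           // shared-memory atomic, compute capability 1.2+

    __syncthreads();
    if (threadIdx.x == 0) CountPerBlock[blockIdx.x] = count;
}
[/code]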
Nice paper! However, I don't expect this behavior to remain valid on the next GPU architectures. From all the answers, I do think I have to rewrite the kernels in a safe manner…
Thanks a lot
Yves