Hi everyone,
I often use the following kind of programming:
[code]
__global__ void MyKernel(int MaxX, int *data, int *Result)
{
    // Central difference on a block-sized shared-memory tile
    __shared__ int Mask[256];
    int ix;
    ix = blockIdx.x * blockDim.x + threadIdx.x - 1;
    Mask[threadIdx.x] = 0;
    if (ix >= 0 && ix < MaxX) {
        Mask[threadIdx.x] = data[ix];
        __syncthreads();                // reached only by threads inside the if
        if (threadIdx.x > 0 && threadIdx.x < blockDim.x - 1)
            Result[ix] = (Mask[threadIdx.x + 1] - Mask[threadIdx.x - 1]) / 2;
    }
}
[/code]
And that works! That's strange, because not all threads reach __syncthreads()…
I should use instead:
[code]
__global__ void MyKernel(int MaxX, int *data, int *Result)
{
    __shared__ int Mask[256];
    int ix;
    bool ok;
    ix = blockIdx.x * blockDim.x + threadIdx.x - 1;
    Mask[threadIdx.x] = 0;
    ok = false;
    if (ix >= 0 && ix < MaxX) {
        Mask[threadIdx.x] = data[ix];
        ok = true;
    }
    __syncthreads();                    // now reached by every thread of the block
    if (ok && threadIdx.x > 0 && threadIdx.x < blockDim.x - 1)
        Result[ix] = (Mask[threadIdx.x + 1] - Mask[threadIdx.x - 1]) / 2;
}
[/code]
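For reference, a kernel like this assumes blocks of (at most) 256 threads to match Mask. A minimal host-side launch might look roughly like the sketch below; the sizes, variable names and grid calculation are placeholders rather than the actual host code:
[code]
// Illustrative launch fragment -- sizes and names are placeholders.
int MaxX    = 1 << 20;                        // number of valid elements (example)
int threads = 256;                            // matches the size of Mask[]
int blocks  = (MaxX + threads - 1) / threads; // illustrative; the real grid may differ

int *h_data = (int*)malloc(MaxX * sizeof(int));
// ... fill h_data ...

int *d_data, *d_result;
cudaMalloc((void**)&d_data,   MaxX * sizeof(int));
cudaMalloc((void**)&d_result, MaxX * sizeof(int));
cudaMemcpy(d_data, h_data, MaxX * sizeof(int), cudaMemcpyHostToDevice);

MyKernel<<<blocks, threads>>>(MaxX, d_data, d_result);
cudaThreadSynchronize();                      // wait for completion (CUDA 3.x-era API)
[/code]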
I have dozens of CUDA kernels written the first way. I would like to make sure whether I really have to rewrite them the second way…
Any idea why the first way seems OK?
Yves
tera
May 26, 2010, 7:54am
2
For one, undefined behaviour is just that - there is no guarantee it is going to crash or something.
plmae
May 26, 2010, 8:59am
4
Maybe __syncthreads() only needs every warp (rather than every thread) to reach it: if a warp reaches __syncthreads(), the masked-off threads take part in the barrier too, and only the writes inside the if are masked?
You need to thank your luck!
You are right: only one thread in each warp needs to reach the barrier, at least on GT200 chips:
http://www.eecg.toronto.edu/~myrto/gpuarch-ispass2010.pdf
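In the first kernel above, for typical launches the threads that skip the barrier (thread 0 of block 0, where ix is -1, plus any tail threads with ix >= MaxX) sit in warps where other threads do reach it, which is why the warp-counting barrier still releases. A small diagnostic kernel along these lines (hypothetical, not from the original code; shared-memory atomics need compute capability 1.2+) makes that visible:
[code]
// Hypothetical diagnostic: count, per block, how many threads pass the
// "ix >= 0 && ix < MaxX" guard of the first kernel. With 256-thread blocks
// the count is 255 for block 0 (its thread 0 has ix == -1) and 256 for
// every other block that lies fully below MaxX.
__global__ void CountGuarded(int MaxX, int *CountPerBlock)
{
    __shared__ int count;
    if (threadIdx.x == 0) count = 0;
    __syncthreads();

    int ix = blockIdx.x * blockDim.x + threadIdx.x - 1;
    if (ix >= 0 && ix < MaxX)
        atomicAdd(&count, 1);           // shared-memory atomic, compute capability 1.2+

    __syncthreads();
    if (threadIdx.x == 0) CountPerBlock[blockIdx.x] = count;
}
[/code]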
Nice paper! However, I don't expect this behavior to remain valid on the next GPU architectures. From all the answers, I do think I have to rewrite the kernels in a safe manner…
Thanks a lot
Yves