Hello, I want to try using shuffle in my code instead of shared memory, but I am not sure how.
For example, in my code I am using:
#define TILE_WIDTH 16
__shared__ mycomplex myshared[ TILE_WIDTH + 2 ][ TILE_WIDTH ];
size_t RowIdx = threadIdx.y + blockIdx.y * TILE_WIDTH;
size_t ColIdx = threadIdx.x + blockIdx.x * TILE_WIDTH;
for ( int ImgIdx = 0; ImgIdx < NbOfImgs; ImgIdx++ )
{
    IJ = ColIdx + NbOfCols * ( RowIdx + NbOfRows * ImgIdx );

    // copy input data (plus a one-row halo above and below) to shared memory
    myshared[ threadIdx.y + 1 ][ threadIdx.x ] = *( devinputArr + IJ );
    if ( threadIdx.y == 0 )
        myshared[ threadIdx.y ][ threadIdx.x ] = *( devinputArr + ImJ );
    if ( threadIdx.y == (TILE_WIDTH - 1) )
        myshared[ threadIdx.y + 2 ][ threadIdx.x ] = *( devinputArr + IpJ );
    __syncthreads();  // the tile must be fully written before any thread reads it

    myRe = myshared[ threadIdx.y ][ threadIdx.x + 1 ].Re;
    myIm = myshared[ threadIdx.y ][ threadIdx.x + 1 ].Im;
How can I use shuffle?
I read http://devblogs.nvidia.com/parallelforall/faster-parallel-reductions-kepler/ and I am trying to understand how it works and what the limitations are (memory or otherwise)…
Can anyone give me some help with this?
I think the easiest way to get started with warp shuffles is to follow some tutorials (unrelated to your particular problem) - then afterwards try to transfer your new knowledge to your problem.
There’s some good reading material out there.
A couple of gotchas and tricks I ran into:
- Be careful with conditionals: always make sure ALL involved threads participate in the warp shuffle, both the threads that provide the lane data and the threads that need the (foreign) lane data.
- The optional width specifier (the last, optional argument) of the shuffle intrinsics can come in VERY handy for segmenting the warp into partitions, especially considering that your tile width is 16.
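To make both points concrete, here is a minimal sketch (my own illustration, not code from the posts above; it assumes CUDA 9 or later, where the `_sync` variants with an explicit participation mask are required) of broadcasting a value within 16-lane segments of a warp:

```cuda
#include <cstdio>

// Broadcast lane 0's value of each 16-lane segment to the whole segment.
// With width = 16 the warp is split into two independent halves
// (lanes 0-15 and lanes 16-31), which matches a TILE_WIDTH of 16.
__global__ void segmentBroadcast( float *data )
{
    float v = data[ threadIdx.x ];

    // Every thread of the warp reaches this line (no divergent branch
    // around it), so all lanes participate in the shuffle.
    v = __shfl_sync( 0xffffffff, v, 0, 16 );

    data[ threadIdx.x ] = v;
}

int main()
{
    float h[32], *d;
    for ( int i = 0; i < 32; i++ ) h[i] = float( i );
    cudaMalloc( &d, 32 * sizeof( float ) );
    cudaMemcpy( d, h, 32 * sizeof( float ), cudaMemcpyHostToDevice );
    segmentBroadcast<<<1, 32>>>( d );
    cudaMemcpy( h, d, 32 * sizeof( float ), cudaMemcpyDeviceToHost );
    // lanes 0-15 now all hold the old h[0], lanes 16-31 the old h[16]
    printf( "%f %f\n", h[5], h[20] );
    cudaFree( d );
    return 0;
}
```

Note that with width = 16 the source lane index is taken relative to the start of the caller's own 16-lane segment, so lane 20 reading "lane 0" actually reads lane 16.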
Thanks for your suggestions, but I have read these and they still seem quite difficult to understand/implement…
I can’t even make a start using my example…
If someone could explain how to implement shuffle in the above code, I’d appreciate it.
Now, I have tried a few things (not that I understand everything I am doing!!), but of course I get no results (NaN values everywhere).
I tried to follow the example from my first post.
__device__ mycomplex warpShuffle( mycomplex val )
{
    // broadcast lane 0's components to every lane of the warp
    val.Re = __shfl( val.Re, 0 );
    val.Im = __shfl( val.Im, 0 );
    return val;  // without this return the caller gets an undefined value
}
__device__ mycomplex blockShuffle( mycomplex val )
{
    static __shared__ mycomplex shared[ 16 ][ 16 + 2 ];
    int XlaneIdx = threadIdx.x % warpSize;
    int Xwidth   = threadIdx.x / warpSize;
    int YlaneIdx = threadIdx.y % warpSize;
    int Ywidth   = threadIdx.y / warpSize;

    val = warpShuffle( val );

    // write value to shared memory
    shared[ YlaneIdx ][ XlaneIdx + 1 ] = val;
    if ( XlaneIdx == 0 )                shared[ YlaneIdx ][ XlaneIdx ] = val;
    if ( XlaneIdx == (TILE_WIDTH - 1) ) shared[ YlaneIdx ][ XlaneIdx + 2 ] = val;

    // ensure we only grab a value from shared memory if that warp existed
    val.Re = ( threadIdx.x < blockDim.x / warpSize ) ? shared[ YlaneIdx ][ XlaneIdx ].Re : 0.f;
    val.Im = ( threadIdx.y < blockDim.y / warpSize ) ? shared[ YlaneIdx ][ XlaneIdx ].Im : 0.f;

    if ( ( Xwidth == 0 ) && ( Ywidth == 0 ) ) val = warpShuffle( val );
    return val;
}
Now, in the global function I remove the __shared__ mycomplex myshared[ TILE_WIDTH + 2 ][ TILE_WIDTH ]; declaration:
__global__ void (...)
{
    mycomplex Parts, Parts2, Parts3;

    // fetch input data through the block shuffle instead of shared memory
    Parts = blockShuffle( *( devinputArr + IJ ) );
    if ( threadIdx.x == 0 )
        Parts2 = blockShuffle( *( devinputArr + IJm1 ) );
    if ( threadIdx.x == (TILE_WIDTH - 1) )
        Parts3 = blockShuffle( *( devinputArr + IJp1 ) );

    myRe = Parts.Re;
    myIm = Parts.Im;
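For reference, here is one way the x-neighbour access from the original shared-memory tile could be expressed with shuffles. This is only a minimal sketch under assumptions: `mycomplex`, `TILE_WIDTH` and the `IJp1` index are taken from the posts above, a 16x16 thread block is used so each tile row occupies one 16-lane segment of a warp, and CUDA 9+ `_sync` intrinsics are available. The helper name `shuffleRightNeighbour` is my own.

```cuda
struct mycomplex { float Re, Im; };
#define TILE_WIDTH 16

// Fetch the value held by the thread one lane to the right (threadIdx.x + 1)
// without shared memory. width = TILE_WIDTH splits each warp into 16-lane
// segments, one per tile row, so the exchange never crosses a row boundary.
__device__ mycomplex shuffleRightNeighbour( mycomplex val,
                                            const mycomplex *devinputArr,
                                            size_t IJp1 )
{
    mycomplex nb;
    // ALL threads execute the shuffles; the boundary condition is handled
    // afterwards, never by branching around the shuffle itself.
    nb.Re = __shfl_down_sync( 0xffffffff, val.Re, 1, TILE_WIDTH );
    nb.Im = __shfl_down_sync( 0xffffffff, val.Im, 1, TILE_WIDTH );

    // the last lane of each segment has no right neighbour inside the warp,
    // so it falls back to a halo load from global memory
    if ( threadIdx.x % TILE_WIDTH == TILE_WIDTH - 1 )
        nb = devinputArr[ IJp1 ];
    return nb;
}
```

The y-neighbour is harder to fetch this way: with a 16x16 block each warp only covers two tile rows, so vertical accesses would still need shared memory or extra global loads, which is probably why the blog's 1-D reduction pattern does not transfer directly to this 2-D stencil.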
Thanks for the link.
But I just realized that it is too hard for me to understand how to do this in my code…
Anyway, thank you.