A simple shuffle example?

Hello, I want to use warp shuffle in my code instead of shared memory, but I am not sure how.

For example, in my code I am using:

#define TILE_WIDTH 16

__shared__ mycomplex myshared[ TILE_WIDTH + 2 ][ TILE_WIDTH ];

size_t RowIdx = threadIdx.y + blockIdx.y * TILE_WIDTH;
size_t ColIdx = threadIdx.x + blockIdx.x * TILE_WIDTH;

for ( int ImgIdx = 0; ImgIdx < NbOfImgs; ImgIdx++ )
{
    __syncthreads();

    IJ = ColIdx + NbOfCols * ( RowIdx + NbOfRows * ImgIdx );

    ....

    //copy input data to shared memory
    myshared[ threadIdx.y + 1 ][ threadIdx.x ] = *( devinputArr + IJ );

    if ( threadIdx.y == 0 )
        myshared[ threadIdx.y ][ threadIdx.x ] = *( devinputArr + ImJ );

    if ( threadIdx.y == (TILE_WIDTH - 1) )
        myshared[ threadIdx.y + 2 ][ threadIdx.x ] = *( devinputDArr + IpJ );

    __syncthreads();

    ....

    myRe = myshared[ threadIdx.y ][ threadIdx.x + 1 ].Re;
    myIm = myshared[ threadIdx.y ][ threadIdx.x + 1 ].Im;

    ....

How can I use shuffle?

I read http://devblogs.nvidia.com/parallelforall/faster-parallel-reductions-kepler/ and I am trying to understand how it works and what the limitations are (memory or otherwise)…

Thanks!

Can anyone give some help on this?

Thanks!

I think the easiest way to get started with warp shuffles is to follow some tutorials (unrelated to your particular problem), then try to transfer what you learned to your problem.

There’s some good reading material out there

http://lmgtfy.com/?q=warp+shuffle+tutorial

A couple of gotchas and tricks I ran into:

- Be careful with conditionals: make sure every thread involved participates in the warp shuffle, both the threads that provide the lane data AND the threads that need the (foreign) lane data.

- The optional width specifier (the last argument) of the shuffle instruction can come in VERY handy for segmenting the warp into partitions, especially considering that your tile width is 16.
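
To make the two points above concrete, here is a minimal sketch (not taken from any of the tutorials). It assumes CUDA 9 or later, where the shuffle intrinsics carry a `_sync` suffix and take a lane mask as first argument; on older toolkits the same call would be `__shfl_up( v, 1, TILE_WIDTH )`. The kernel name `leftNeighbor`, the helper `countMismatches` and the `-1` boundary marker are made up for illustration.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define TILE_WIDTH 16

// Hypothetical kernel: every thread reads the value held by its left
// neighbour inside its own 16-lane segment of the warp.
__global__ void leftNeighbor(const float *in, float *out, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    float v = (i < n) ? in[i] : 0.0f;

    // Gotcha: the shuffle is NOT inside a conditional, so all 32 lanes
    // named in the mask execute it.
    // Trick: width = 16 splits the warp into two independent segments.
    float left = __shfl_up_sync(0xffffffffu, v, 1, TILE_WIDTH);

    // The first lane of each segment has no left neighbour (the shuffle
    // hands it back its own value), so the boundary is patched afterwards.
    if (threadIdx.x % TILE_WIDTH == 0) left = -1.0f;

    if (i < n) out[i] = left;
}

// Runs the kernel on 64 values and counts wrong results on the host.
int countMismatches()
{
    const int n = 64;
    std::vector<float> h_in(n), h_out(n);
    for (int i = 0; i < n; ++i) h_in[i] = float(i);

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Block size is a multiple of 32, so every warp is full.
    leftNeighbor<<<n / 32, 32>>>(d_in, d_out, n);

    cudaError_t err = cudaMemcpy(h_out.data(), d_out, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    cudaFree(d_in);
    cudaFree(d_out);

    int bad = 0;
    for (int i = 0; i < n; ++i) {
        float expected = (i % TILE_WIDTH == 0) ? -1.0f : h_in[i - 1];
        if (h_out[i] != expected) ++bad;
    }
    return bad;
}

int main()
{
    int bad = countMismatches();
    printf("mismatches: %d\n", bad);
    return bad != 0;
}
```

The design choice to note is the order: shuffle first with every lane taking part, then apply the per-lane condition to the result.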

Thanks for your suggestions, but I have read these and they seem quite difficult to understand and implement…

I can't even get started with my example…

If someone could explain how to implement shuffle in the above code, I would appreciate it.

Thanks

Now I have tried a few things (not that I understand everything I am doing!), but of course I get no valid results (NaN values everywhere).

I tried to follow the example from my first post.

__device__ mycomplex warpShuffle( mycomplex val )
{
		val.Re = __shfl( val.Re , 0 );
		val.Im = __shfl( val.Im , 0 );


	return val;
}

__device__ mycomplex blockShuffle( mycomplex val )
{
	static __shared__ mycomplex shared[ 16 ][ 16 + 2 ];
	
	int XlaneIdx = threadIdx.x % warpSize;
	int Xwidth = threadIdx.x / warpSize;
	int YlaneIdx = threadIdx.y % warpSize;
	int Ywidth = threadIdx.y / warpSize;

	val = warpShuffle( val );

	//write value to shared memory
	shared[ YlaneIdx ][ XlaneIdx + 1 ] = val;

	if ( XlaneIdx == 0 ) shared[ YlaneIdx ][ XlaneIdx ] = val;

	if ( XlaneIdx == (TILE_WIDTH - 1) ) shared[ YlaneIdx ][ XlaneIdx + 2 ] = val;

	__syncthreads();

	//ensure we only grab a value from shared memory if that warp existed
	val.Re = ( threadIdx.x < blockDim.x / warpSize ) ? ( shared[ YlaneIdx ][ XlaneIdx ] ).Re : float( 0.f );
	val.Im = ( threadIdx.y < blockDim.y / warpSize ) ? ( shared[ YlaneIdx ][ XlaneIdx ] ).Im : float( 0.f );

	if( ( Xwidth == 0 ) && ( Ywidth == 0 ) ) val = warpShuffle( val );

	return val;

}

Now, in the global function:

I removed the declaration __shared__ mycomplex myshared[ TILE_WIDTH + 2 ][ TILE_WIDTH ];

__global__ void (...)
{
    ...
    mycomplex Parts , Parts2 , Parts3;
    ...
    
    //copy input data to shared memory
	Parts  = blockShuffle( *(devinputArr + IJ) );
	


	if ( threadIdx.x == 0 )
	{

		Parts2  = blockShuffle( *(devinputArr + IJm1) );

	}


	if ( threadIdx.x == (TILE_WIDTH - 1) )
	{

		Parts3  = blockShuffle( *(devinputArr + IJp1)  );

	}
		__syncthreads();
		
	...
	
	myRe = Parts.Re;
	myIm = Parts.Im;
	...

Any suggestions?

Thanks!
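
A few things in the attempt above are worth checking. `__shfl( x, 0 )` broadcasts lane 0's value to the whole warp, so after `warpShuffle` every thread holds lane 0's data rather than its own. `blockShuffle` contains `__syncthreads()` but is called inside `if ( threadIdx.x == 0 )`, which is undefined behaviour when only some threads of the block reach it. `Parts2` and `Parts3` are also never set for the other threads. Also, a shuffle only exchanges registers within one warp, so it cannot replace shared memory across warps.

Below is a minimal sketch of what shuffle can do for a 16x16 tile, under several assumptions: `mycomplex` is two floats (the definition is never shown), CUDA 9+ `_sync` intrinsics are available, there is a single image whose dimensions are multiples of 16, and the names `shflUpComplex`, `rowAbove` and `countMismatches` are invented. With a 16x16 block each warp holds two tile rows, so a thread in an odd row finds the thread directly above it 16 lanes lower. Even rows have no such partner in their warp and fall back to a global read.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define TILE_WIDTH 16

// Assumed layout; the thread never shows the definition of mycomplex.
struct mycomplex { float Re; float Im; };

// Shuffles a two-float struct member by member: each lane receives the
// value held `delta` lanes below it (lanes < delta keep their own value).
__device__ mycomplex shflUpComplex(mycomplex v, unsigned delta)
{
    v.Re = __shfl_up_sync(0xffffffffu, v.Re, delta);
    v.Im = __shfl_up_sync(0xffffffffu, v.Im, delta);
    return v;
}

// Hypothetical kernel: out(row, col) = in(row - 1, col), zero for row 0.
__global__ void rowAbove(const mycomplex *in, mycomplex *out, int nCols)
{
    int col = threadIdx.x + blockIdx.x * TILE_WIDTH;
    int row = threadIdx.y + blockIdx.y * TILE_WIDTH;
    int idx = col + nCols * row;

    mycomplex mine = in[idx];

    // Every thread calls the shuffle, even those that discard the result.
    mycomplex above = shflUpComplex(mine, TILE_WIDTH);

    // Even tile rows occupy lanes 0..15: the row above lives in another
    // warp (or another tile), so it is read from global memory instead.
    if (threadIdx.y % 2 == 0) {
        if (row > 0) above = in[idx - nCols];
        else { above.Re = 0.0f; above.Im = 0.0f; }
    }
    out[idx] = above;
}

// Runs the kernel on a 32x32 image and counts wrong results on the host.
int countMismatches()
{
    const int nCols = 32, nRows = 32, n = nCols * nRows;
    std::vector<mycomplex> h_in(n), h_out(n);
    for (int i = 0; i < n; ++i) { h_in[i].Re = float(i); h_in[i].Im = float(-i); }

    mycomplex *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(mycomplex));
    cudaMalloc(&d_out, n * sizeof(mycomplex));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(mycomplex), cudaMemcpyHostToDevice);

    dim3 block(TILE_WIDTH, TILE_WIDTH);
    dim3 grid(nCols / TILE_WIDTH, nRows / TILE_WIDTH);
    rowAbove<<<grid, block>>>(d_in, d_out, nCols);

    cudaError_t err = cudaMemcpy(h_out.data(), d_out, n * sizeof(mycomplex),
                                 cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    cudaFree(d_in);
    cudaFree(d_out);

    int bad = 0;
    for (int i = 0; i < n; ++i) {
        float re = (i >= nCols) ? h_in[i - nCols].Re : 0.0f;
        float im = (i >= nCols) ? h_in[i - nCols].Im : 0.0f;
        if (h_out[i].Re != re || h_out[i].Im != im) ++bad;
    }
    return bad;
}

int main()
{
    int bad = countMismatches();
    printf("mismatches: %d\n", bad);
    return bad != 0;
}
```

Only half of the neighbour reads are served by the shuffle here, so whether this beats the shared-memory version would have to be measured.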

There is sample code:

http://docs.nvidia.com/cuda/cuda-samples/index.html#cuda-parallel-prefix-sum-with-shuffle-intrinsics–shfl_scan-

Thanks for the link.
But I just realized that it is too hard for me to work out how to do it in my code…
Anyway, thank you.