Unknown warp serializing

norris.j · March 5, 2010, 2:40pm

Hey all,

I’m just getting to grips with shared memory and how to use it effectively. The profiler is reporting non-zero “warp serialize” counts for my program and I’m not sure why. I’m using typical shared memory accesses like the ones in the programming guide, and the same model (array sizes and access patterns) runs without serializing when isolated in a test kernel. So as far as I can tell, it’s a good model.

Excuse the over-use of #defines; that’s both so I can swap between shared memory and register memory easily, and so I can guarantee a single array-indexing pattern.

Block sizes are 16x16, grid size is 64x48.

I think it might be the shared dx/dy or srcx/srcy (and using them to index into the texture), but I just don’t know how that could be true if I’m always indexing into the array in the same way (which sometimes works).
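
One way to test the suspicion about the shared arrays is to work out which bank each thread of a half-warp touches. The following is a host-side C++ sketch, not profiler output. It assumes 16 banks of 32-bit words with conflicts resolved per half-warp of 16 threads (the compute capability 1.x model described in the programming guide), a row pitch of 17 elements, and a 16-wide block so that one half-warp shares a single threadIdx.y. The helper name `conflictDegree` is made up for illustration.

```cpp
// Assumptions (not taken from the post): 16 banks of 32-bit words,
// conflicts resolved per half-warp of 16 threads, LENS_BLOCK_DIM_Y == 16.
constexpr int kBanks    = 16;
constexpr int kHalfWarp = 16;
constexpr int kRowPitch = 17;  // LENS_BLOCK_DIM_Y + 1

// Largest number of threads in one half-warp that hit the same bank when
// thread tx reads the 32-bit word at `wordOffset` inside element [ty][tx]
// of an array whose elements are `wordsPerElem` words wide.
constexpr int conflictDegree(int wordsPerElem, int wordOffset, int ty) {
    int hits[kBanks] = {};
    for (int tx = 0; tx < kHalfWarp; ++tx) {
        int word = (ty * kRowPitch + tx) * wordsPerElem + wordOffset;
        ++hits[word % kBanks];
    }
    int worst = 0;
    for (int b = 0; b < kBanks; ++b) {
        if (hits[b] > worst) worst = hits[b];
    }
    return worst;
}

// float and uchar4 elements are one word wide: every thread gets its own bank.
static_assert(conflictDegree(1, 0, 0) == 1, "one-word elements: conflict-free");
// int2 elements are two words wide: .x and .y are each read with stride 2.
static_assert(conflictDegree(2, 0, 0) == 2, "int2 .x: 2-way conflict");
static_assert(conflictDegree(2, 1, 0) == 2, "int2 .y: 2-way conflict");
```

Under these assumptions the float and uchar4 arrays come out conflict-free, while reading either member of an int2 element has a stride of two words and comes out 2-way. That would make `_srcxy` the first place to look.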

The serialize value is very low, but enough to make the shared memory worthless; it’s actually quicker if all values are in global/register memory and the bilerp happens there. Maybe I’ve got the wrong concept for shared memory (again) and I’m using it in the wrong way. If so, please tell me! :)
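
A separate thing worth checking when the register version wins is how much shared memory one block asks for. The sketch below is host-side C++ that adds up the eight shared arrays declared in the kernel. It assumes LENS_BLOCK_DIM_X and LENS_BLOCK_DIM_Y are both 16 (matching the 16x16 block) and 16 KB of shared memory per multiprocessor, as on compute capability 1.x parts. The constant and function names are hypothetical.

```cpp
#include <cstddef>

// Assumptions (not stated in the post): both LENS_BLOCK_DIM_* are 16,
// and each multiprocessor has 16 KB of shared memory.
constexpr std::size_t kBlockDimX   = 16;
constexpr std::size_t kBlockDimY   = 16;
constexpr std::size_t kElems       = kBlockDimX * (kBlockDimY + 1);  // 272 per array
constexpr std::size_t kSharedPerSM = 16 * 1024;

constexpr std::size_t kFloatBytes  = 4;
constexpr std::size_t kInt2Bytes   = 8;
constexpr std::size_t kUchar4Bytes = 4;

// _dx and _dy (float), _srcxy (int2), and five uchar4 arrays
// (_x0y0, _x0y1, _x1y0, _x1y1, _result).
constexpr std::size_t sharedBytesPerBlock() {
    return 2 * kElems * kFloatBytes
         + 1 * kElems * kInt2Bytes
         + 5 * kElems * kUchar4Bytes;
}

// Upper bound on resident blocks per multiprocessor from shared memory alone.
constexpr std::size_t blocksPerSM() {
    return kSharedPerSM / sharedBytesPerBlock();
}

static_assert(sharedBytesPerBlock() == 9792, "2176 + 2176 + 5440 bytes");
static_assert(blocksPerSM() == 1, "only one 256-thread block fits");
```

If those assumptions hold, each block needs about 9.6 KB, so only one block of 256 threads can be resident per multiprocessor. That alone could make the shared memory version slower than the register version, independently of any serialization.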

// bilerp_c3 is defined after the kernel, so declare it before first use.
inline __device__ uchar4 bilerp_c3 ( uchar4 x1y1, uchar4 x1y2, uchar4 x2y2, uchar4 x2y1, float dx, float dy );

// mapXTex, mapYTex and rgbImageTex are texture references declared elsewhere (not shown).
__global__ void lens_correct_kernel (	unsigned char* imgSrc, uint imgSrcW, uint imgSrcH, uint imgSrcP,
							unsigned char* imgDst, uint imgDstW, uint imgDstH, uint imgDstP, float* mapX, float* mapY )
{
	unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
	unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;

	// All shared arrays are indexed [threadIdx.y][threadIdx.x] with one padding column.
	// The X/Y dimension names are only interchangeable because the block is square (16x16).
	__shared__ float _dx[LENS_BLOCK_DIM_X][LENS_BLOCK_DIM_Y+1];
	__shared__ float _dy[LENS_BLOCK_DIM_X][LENS_BLOCK_DIM_Y+1];
	#define dx _dx[ threadIdx.y ][ threadIdx.x ]
	#define dy _dy[ threadIdx.y ][ threadIdx.x ]

	// source coordinates for this destination pixel
	dx = tex2D ( mapXTex, x, y );
	dy = tex2D ( mapYTex, x, y );

	// integer part of the source coordinates
	__shared__ int2 _srcxy[LENS_BLOCK_DIM_X][LENS_BLOCK_DIM_Y+1];
	_srcxy[ threadIdx.y ][ threadIdx.x ] = make_int2 ( (int)dx, (int)dy );
	#define SRCX	_srcxy[ threadIdx.y ][ threadIdx.x ].x
	#define SRCY	_srcxy[ threadIdx.y ][ threadIdx.x ].y

	// fractional part, used as the bilerp weights
	dx = dx - (float)SRCX;
	dy = dy - (float)SRCY;

	// byte offset of pixel (X,Y) in a 4-bytes-per-pixel image with pitch P
	#define lookup(X,Y,P) ( (Y) * (P) + (X) * 4 )

	// bounds check for bilerp
	if (( (SRCX + 1) < imgSrcW ) && ( (SRCY + 1) < imgSrcH ))
	{
		__shared__ uchar4 _x0y0 [LENS_BLOCK_DIM_X][LENS_BLOCK_DIM_Y+1];
		__shared__ uchar4 _x0y1 [LENS_BLOCK_DIM_X][LENS_BLOCK_DIM_Y+1];
		__shared__ uchar4 _x1y0 [LENS_BLOCK_DIM_X][LENS_BLOCK_DIM_Y+1];
		__shared__ uchar4 _x1y1 [LENS_BLOCK_DIM_X][LENS_BLOCK_DIM_Y+1];
		__shared__ uchar4 _result [LENS_BLOCK_DIM_X][LENS_BLOCK_DIM_Y+1];
		#define x0y0 _x0y0 [ threadIdx.y ][ threadIdx.x ]
		#define x0y1 _x0y1 [ threadIdx.y ][ threadIdx.x ]
		#define x1y0 _x1y0 [ threadIdx.y ][ threadIdx.x ]
		#define x1y1 _x1y1 [ threadIdx.y ][ threadIdx.x ]
		#define result _result [ threadIdx.y ][ threadIdx.x ]

		// fetch the four neighbouring source pixels
		x0y0 = tex2D ( rgbImageTex, SRCX, SRCY );
		x0y1 = tex2D ( rgbImageTex, SRCX, SRCY+1 );
		x1y1 = tex2D ( rgbImageTex, SRCX+1, SRCY+1 );
		x1y0 = tex2D ( rgbImageTex, SRCX+1, SRCY );

		result = bilerp_c3 ( x0y0, x0y1, x1y1, x1y0, dx, dy );
		*(uchar4*)(imgDst + lookup(x, y, imgDstP)) = result;
	}
	else
	{
		// just copy the boundary pixels
		*(uchar4*)(imgDst + lookup(x, y, imgDstP)) = tex2D ( rgbImageTex, SRCX, SRCY );
	}

	#undef x0y0
	#undef x0y1
	#undef x1y1
	#undef x1y0
	#undef result
	#undef lookup
	#undef SRCX
	#undef SRCY
	#undef dx
	#undef dy
}

// Bilinear blend of the x, y and z channels; the w channel is set to 0.
// Argument order is (x1,y1), (x1,y2), (x2,y2), (x2,y1); dx and dy are the fractional offsets in [0,1).
inline __device__ uchar4 bilerp_c3 ( uchar4 x1y1, uchar4 x1y2, uchar4 x2y2, uchar4 x2y1, float dx, float dy )
{
	#define OMDX (1.0f - dx)
	#define OMDY (1.0f - dy)
	return make_uchar4 ((unsigned char)(( OMDX * OMDY * (float)x1y1.x) + (OMDX * dy * (float)x1y2.x) + (dx * OMDY * (float)x2y1.x) + (dx * dy * (float)x2y2.x )),
						(unsigned char)(( OMDX * OMDY * (float)x1y1.y) + (OMDX * dy * (float)x1y2.y) + (dx * OMDY * (float)x2y1.y) + (dx * dy * (float)x2y2.y )),
						(unsigned char)(( OMDX * OMDY * (float)x1y1.z) + (OMDX * dy * (float)x1y2.z) + (dx * OMDY * (float)x2y1.z) + (dx * dy * (float)x2y2.z )),
						0 );
	#undef OMDX
	#undef OMDY
}
