Reasons for Instruction Serialization apart from atomics and bank conflicts

Hello!

Analyzing the following code with Nsight, we find that this simple kernel shows an instruction-serialization rate of about 12.5%.

// rtSize, rtFloatBuffer and rtRGBABuffer are device-scope globals (declarations omitted here)
__global__ void RTPostprocessGammaKernel()
{
	unsigned int	tid = blockIdx.x*blockDim.x+threadIdx.x;
	
	if(tid<rtSize)
	{
		float4	color = rtFloatBuffer[tid];
		float	scale = __fdividef(1.0f, color.w);
		uchar4	rgba;
		
		color.x = __powf(fmax(0.0f, fmin(1.0f, color.x*scale)), 0.454545f);
		color.y = __powf(fmax(0.0f, fmin(1.0f, color.y*scale)), 0.454545f);
		color.z = __powf(fmax(0.0f, fmin(1.0f, color.z*scale)), 0.454545f);
		
		rgba.x = (unsigned char)(color.x*255.0f+0.5f);
		rgba.y = (unsigned char)(color.y*255.0f+0.5f);
		rgba.z = (unsigned char)(color.z*255.0f+0.5f);
		rgba.w = 255;
		
		rtRGBABuffer[tid] = rgba;
	}
}

Since we have no atomics and no bank conflicts in shared/constant memory here, we wonder: what other causes of instruction serialization are there in general?

The card is a GTX 680, compiled with CUDA 4.2.

Any ideas?

Thank you!!!

The two primary causes of serialization are

  • vector accesses (loads/stores wider than 32 bits per thread, e.g. float4), and
  • address divergence (the lanes of a warp accessing addresses that fall in different memory segments).
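
To illustrate the second point: address divergence shows up in data-dependent gathers, where the 32 lanes of a warp compute addresses scattered across many 128-byte segments. A hypothetical sketch (not taken from the code above):

// Hypothetical gather: if the contents of index[] are scattered, the 32
// addresses of a warp can touch up to 32 different 128-byte lines, and the
// single load instruction is replayed once per line touched.
__global__ void GatherKernel(const float* in, const unsigned int* index,
                             float* out, unsigned int n)
{
	unsigned int	tid = blockIdx.x*blockDim.x+threadIdx.x;
	
	if(tid<n)
		out[tid] = in[index[tid]];	// divergent addresses -> serialized replays
}

In contrast, the kernel in the question accesses rtFloatBuffer[tid] with consecutive tid, so its addresses are not divergent; its replays come from the access width instead.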

There are additional causes, but these are not controllable by the developer and are usually less frequent. The CUDA C Programming Guide has more information on vector accesses (see Programming Guide :: CUDA Toolkit Documentation).
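
Applied to the kernel above: the float4 load is 16 bytes per thread, i.e. 512 bytes per warp, which cannot be serviced in a single 128-byte transaction, so the load instruction is replayed. A structure-of-arrays layout keeps each load at 32 bits per thread. A sketch, where the per-channel buffers r, g, b, w and the parameter-passing style are my assumptions, not from the original code:

// Sketch: structure-of-arrays variant of the gamma kernel. Each load is
// 32 bits per thread, so a warp's 128-byte request fits one transaction
// and the load is not replayed for width reasons.
__global__ void RTPostprocessGammaKernelSoA(const float* r, const float* g,
                                            const float* b, const float* w,
                                            uchar4* rgbaOut, unsigned int n)
{
	unsigned int	tid = blockIdx.x*blockDim.x+threadIdx.x;
	
	if(tid<n)
	{
		float	scale = __fdividef(1.0f, w[tid]);
		uchar4	rgba;
		
		// __saturatef(x) clamps to [0,1], equivalent to the fmax/fmin pair
		rgba.x = (unsigned char)(__powf(__saturatef(r[tid]*scale), 0.454545f)*255.0f+0.5f);
		rgba.y = (unsigned char)(__powf(__saturatef(g[tid]*scale), 0.454545f)*255.0f+0.5f);
		rgba.z = (unsigned char)(__powf(__saturatef(b[tid]*scale), 0.454545f)*255.0f+0.5f);
		rgba.w = 255;
		
		rgbaOut[tid] = rgba;	// uchar4 is only 32 bits per thread, so this
					// store is not a wide access and is unaffected
	}
}

Note that the original uchar4 store was already fine for the same reason; only the float4 load side is the wide access here.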