Reasons for Instruction Serialization apart from atomics and bank conflicts


Analyzing the following code with Nsight, we find that this simple kernel shows an Instruction Serialization rate of about 12.5%.

__global__ void RTPostprocessGammaKernel()
{
	unsigned int	tid = blockIdx.x*blockDim.x + threadIdx.x;
	float4	color = rtFloatBuffer[tid];
	float	scale = __fdividef(1.0f, color.w);
	uchar4	rgba;
	// gamma-correct each channel (0.454545f ≈ 1/2.2) after clamping to [0,1]
	color.x = __powf(fmaxf(0.0f, fminf(1.0f, color.x*scale)), 0.454545f);
	color.y = __powf(fmaxf(0.0f, fminf(1.0f, color.y*scale)), 0.454545f);
	color.z = __powf(fmaxf(0.0f, fminf(1.0f, color.z*scale)), 0.454545f);
	rgba.x = (unsigned char)(color.x*255.0f + 0.5f);
	rgba.y = (unsigned char)(color.y*255.0f + 0.5f);
	rgba.z = (unsigned char)(color.z*255.0f + 0.5f);
	rgba.w = 255;
	rtRGBABuffer[tid] = rgba;
}

Since we have no atomics and no bank conflicts in shared/constant memory, we wonder what other causes of Instruction Serialization there are in general.

The card is a GTX 680, compiled with CUDA 4.2.

Any ideas?

Thank you!!!

The two primary reasons for serialization are

  • vector accesses (> 32-bit per element), and
  • address divergence.

There are additional causes, but they are not controllable by the developer and are usually less frequent. The CUDA C Programming Guide has additional information on vector accesses at Programming Guide :: CUDA Toolkit Documentation.
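As an illustrative sketch of these two causes (kernel and parameter names here are hypothetical, not from the thread): the 128-bit `float4` load in the posted kernel is one instance of the first bullet, and a strided index pattern is a simple instance of the second:

```cuda
// Vector access: each thread moves 128 bits per load/store, so the
// warp-wide access spans more data than a single 32-bit transaction
// can cover and is issued as multiple transactions, which the
// profiler can report as instruction serialization.
__global__ void vectorAccess(const float4* in, float4* out)
{
    unsigned int tid = blockIdx.x*blockDim.x + threadIdx.x;
    out[tid] = in[tid];           // 128-bit access per thread
}

// Address divergence: neighboring threads in a warp touch
// non-contiguous addresses, so the load is replayed once for each
// distinct memory segment the warp addresses.
__global__ void divergentAccess(const float* in, float* out, int stride)
{
    unsigned int tid = blockIdx.x*blockDim.x + threadIdx.x;
    out[tid] = in[tid * stride];  // scattered reads when stride > 1
}
```

In the posted kernel, the `float4` read from `rtFloatBuffer` falls into the first category, which by itself is enough to account for a nonzero serialization percentage even with perfectly coalesced addresses.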