Hello!
Profiling the following kernel with Nsight, we see an Instruction Serialization rate of about 12.5% for this simple kernel.
__global__ void RTPostprocessGammaKernel()
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < rtSize)
    {
        float4 color = rtFloatBuffer[tid];
        // reciprocal of the weight stored in w (fast, approximate division)
        float scale = __fdividef(1.0f, color.w);
        uchar4 rgba;
        // clamp to [0,1], then apply gamma 1/2.2 (~0.454545)
        color.x = __powf(fmax(0.0f, fmin(1.0f, color.x * scale)), 0.454545f);
        color.y = __powf(fmax(0.0f, fmin(1.0f, color.y * scale)), 0.454545f);
        color.z = __powf(fmax(0.0f, fmin(1.0f, color.z * scale)), 0.454545f);
        // convert to 8-bit with rounding
        rgba.x = (unsigned char)(color.x * 255.0f + 0.5f);
        rgba.y = (unsigned char)(color.y * 255.0f + 0.5f);
        rgba.z = (unsigned char)(color.z * 255.0f + 0.5f);
        rgba.w = 255;
        rtRGBABuffer[tid] = rgba;
    }
}
Since the kernel uses no atomics and has no bank conflicts in shared or constant memory, we are wondering: what other causes of Instruction Serialization are there in general?
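One cause we are already aware of is intra-warp branch divergence: when threads of the same warp take different paths, the hardware executes the paths one after another. In our kernel, though, we assume the `tid < rtSize` guard only diverges in the final warp, so it shouldn't account for 12.5% on its own. For reference, a minimal sketch of the kind of data-dependent divergence we mean (hypothetical kernel, not part of our code):

```cuda
__global__ void DivergentKernel(const float *in, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    // Threads within one warp may take different branches depending on
    // their data, so the two paths execute serially rather than in parallel.
    if (in[tid] > 0.0f)
        out[tid] = __powf(in[tid], 0.454545f);
    else
        out[tid] = 0.0f;
}
```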
The card is a GTX 680, compiled with CUDA 4.2.
Any ideas?
Thank you!!!