Hi all,
The question:
My kernel is basically a cast from 16-bit integer to float followed by a complex multiplication. When I analyze this kernel, the profiler reports 7 GFlops and 24 Math GOps/second. Why am I not getting more GFlops? I expected the floating-point work to make up the majority of the operations, since I have 4 multiplications for each cast.
The code is the following:
__global__ void castAndDefocus_kernel( float2 * __restrict__ pDst, const int16_t2 * __restrict__ pSrc, const float2 * __restrict__ pDefocus, int nInputSamples, int nChannels, int nMaxDefocus )
{
    // One thread per (sample, channel) pair.
    int nSample = threadIdx.x + blockIdx.x * blockDim.x;
    int nChannel = threadIdx.y + blockIdx.y * blockDim.y;
    if( nSample >= nInputSamples || nChannel >= nChannels )
        return;

    int nSampleOffset = nSample + nChannel * nInputSamples;

    // Cast the 16-bit complex sample to float.
    float2 sValue = make_float2( __int2float_rn( pSrc[nSampleOffset].x ), __int2float_rn( pSrc[nSampleOffset].y ));

    // Only the first nMaxDefocus samples of each channel are multiplied by a defocus factor.
    if( nSample < nMaxDefocus ){
        sValue = complexMult( sValue, pDefocus[nSample + nChannel * nMaxDefocus]);
    }
    pDst[nSampleOffset] = sValue;
}
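complexMult is not shown in the listing above. A minimal sketch of what such a helper might look like, assuming it is the standard complex product with .x as the real part and .y as the imaginary part (the body here is an illustration, not necessarily the version used in the kernel):

```cuda
#include <cuda_runtime.h>  // float2, make_float2

// Hypothetical helper: assumed to be the standard complex product,
// with .x holding the real part and .y the imaginary part.
__host__ __device__ inline float2 complexMult( float2 a, float2 b )
{
    // (a.x + i*a.y) * (b.x + i*b.y)
    return make_float2( a.x * b.x - a.y * b.y,    // real part
                        a.x * b.y + a.y * b.x );  // imaginary part
}
```

Written this way, one complex sample costs 4 multiplications and 2 additions/subtractions (before the compiler fuses any of them into multiply-adds), next to the two int-to-float conversions.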
The code runs on a Quadro K2000M (384 cores).
I have looked around but found no general description of how to interpret the different outputs of the Nsight kernel profiler, and I guess it is quite difficult to write an easy-to-read document on this. However, if I missed a good source explaining it, I would appreciate a link.
Thank you
Henrik Andresen