Help analysing kernel performance through Nsight

Hi All

The question:
My kernel is basically a cast from 16-bit int to float followed by a complex multiplication. When I analyze this kernel, the profiler shows 7 GFlops and 24 Math GOps/second. Why am I not getting more GFlops? I expected the floating-point part to be the majority of the work, since I have 4 multiplications for each cast.

The code is the following:

// int16_t2 and complexMult are defined elsewhere in the project (not shown in this post).
__global__ void castAndDefocus_kernel( float2 * __restrict__ pDst, const int16_t2 * __restrict__ pSrc, const float2 * __restrict__ pDefocus, int nInputSamples, int nChannels, int nMaxDefocus )
{
	int nSample = threadIdx.x + blockIdx.x * blockDim.x;
	int nChannel = threadIdx.y + blockIdx.y * blockDim.y;
	// Threads outside the data range have nothing to do.
	if( nSample >= nInputSamples || nChannel >= nChannels )
		return;

	int nSampleOffset = nSample + nChannel * nInputSamples;
	// Cast the 16-bit integer I/Q pair to float.
	float2 sValue = make_float2( __int2float_rn( pSrc[nSampleOffset].x ), __int2float_rn( pSrc[nSampleOffset].y ));
	// Only the first nMaxDefocus samples of each channel are multiplied by a defocus coefficient.
	if( nSample < nMaxDefocus ){
		sValue = complexMult( sValue, pDefocus[nSample + nChannel * nMaxDefocus]);
	}
	pDst[nSampleOffset] = sValue;
}

The code runs on a Quadro K2000M (384 cores).
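
The post does not show complexMult, so the following is only a guess at what it does: a minimal sketch of the standard complex product, written as plain C++ so it can be compiled without the CUDA toolkit. The Float2 struct is a hypothetical stand-in for CUDA's built-in float2; in a .cu file one would use float2 directly and mark the function __host__ __device__.

```cpp
// Hypothetical stand-in for CUDA's float2 (x = real part, y = imaginary part).
struct Float2 {
    float x;
    float y;
};

// Standard complex product (a.x + i*a.y) * (b.x + i*b.y).
// ASSUMPTION: this mirrors the complexMult used in the kernel, which is not shown.
inline Float2 complexMult(Float2 a, Float2 b) {
    Float2 r;
    r.x = a.x * b.x - a.y * b.y;  // real part: 2 multiplies, 1 subtract
    r.y = a.x * b.y + a.y * b.x;  // imaginary part: 2 multiplies, 1 add
    return r;
}
```

If this guess is right, each output sample costs 4 multiplies and 2 add/subtracts, that is 6 floating-point operations, on top of the two integer-to-float conversions.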

I have looked around but found no general description of how to interpret the different outputs of the Nsight kernel profiler, and I guess it is also quite difficult to write such a document in an easy-to-read way. However, if I missed a good source explaining this, I would appreciate a link.

Thank you

Henrik Andresen

Hi Henrik,

The FLOPS do seem low. However, that may not be a real issue. The first thing you need to do is determine the limiting factor of your kernel: is it limited by the memory system, by instruction throughput, or by latency?
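
One quick way to approach the memory-versus-compute question is a back-of-envelope count of bytes moved and flops executed per thread. The sketch below does this for the kernel in the question. It assumes that every thread takes the defocus branch, that every load and store goes to global memory, and that the complex multiply costs 6 flops; the struct and function names are made up for illustration and are not part of any CUDA or Nsight API.

```cpp
#include <cassert>

// Rough per-thread cost of castAndDefocus_kernel.
// ASSUMPTIONS: every thread takes the defocus branch, and every
// load/store goes to global memory (no cache hits).
struct KernelCost {
    double bytes;  // bytes moved per thread
    double flops;  // floating-point operations per thread
};

inline KernelCost castAndDefocusCost() {
    KernelCost c;
    c.bytes = 4.0    // read one int16_t2 sample (2 x 2 bytes)
            + 8.0    // read one float2 defocus coefficient
            + 8.0;   // write one float2 result
    c.flops = 6.0;   // complex multiply: 4 multiplies + 2 add/subtract
    return c;
}

// Arithmetic intensity in flop per byte.
inline double arithmeticIntensity(const KernelCost& c) {
    assert(c.bytes > 0.0);
    return c.flops / c.bytes;
}

// Memory traffic in GB/s implied by a measured GFLOP/s figure.
inline double impliedBandwidthGBs(double measuredGflops, const KernelCost& c) {
    return measuredGflops / arithmeticIntensity(c);
}
```

Under these assumptions the kernel does 0.3 flop per byte, so impliedBandwidthGBs(7.0, castAndDefocusCost()) comes out at roughly 23 GB/s of memory traffic for the reported 7 GFlops. If that figure is close to the peak memory bandwidth in your board's specification, the kernel is memory bound and the low GFlops number is expected rather than a sign of a problem.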

I would recommend three documents to help you deal with those questions: the CUDA C Best Practices Guide, which is included with the CUDA Toolkit, and two GTC presentations:

The two URLs are links to PDFs of the slides. There are also webcasts to get the explanations. Go to GTC-On-Demand:


I only just got an RSS update for this post. Don’t know why it was so delayed.

I apologize for not responding. Thank you for your reply. It is definitely memory bound.