Bandwidth limited, latency limited and compute limited: need examples for each case


I am trying to understand the performance of my CUDA programs. To that end, I need to understand the various scenarios that can limit a program's performance. In particular, I am looking for example kernels for each of the following cases:

1- A kernel which is bandwidth limited,

2- A kernel which is latency limited, and

3- A kernel which is compute limited.

4- Are there any other types of limitation that can slow down my kernel?

Just a small kernel for illustration will suffice for my understanding. I would appreciate your responses in this regard.


The best thing would probably be to run your kernels through Nexus/nVidia’s profiler. That should give you a good hint…

Personally, I think the best way to understand what takes most of the time in your kernel is to comment out portions of it and see how the overall run time changes compared to the original code.

Make sure you comment things out correctly. In particular, be careful not to comment out so much that the kernel no longer has any observable effect, or the compiler may eliminate it entirely… dead code optimizer…
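As a minimal timing sketch for such experiments (kernel name, launch configuration and arguments here are placeholders, not anything from the posts above), CUDA events can be used to measure each variant:

```cuda
// Timing sketch with CUDA events. myKernel, gridDim, blockDim and
// dData are hypothetical stand-ins for your own kernel and arguments.
cudaEvent_t start, stop;
cudaEventCreate( &start );
cudaEventCreate( &stop );

cudaEventRecord( start, 0 );
myKernel<<< gridDim, blockDim >>>( dData );
cudaEventRecord( stop, 0 );
cudaEventSynchronize( stop );   // wait until the kernel has finished

float ms = 0.0f;
cudaEventElapsedTime( &ms, start, stop );
printf( "kernel time: %f ms\n", ms );

cudaEventDestroy( start );
cudaEventDestroy( stop );
```

Run this once on the original kernel and once on each commented-out variant, and the difference tells you roughly where the time goes.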

Obviously, a kernel whose memory accesses are not coalesced, or one that does a lot of gmem reads/writes, would be bandwidth limited.
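For case 1, a small sketch (the names are mine, for illustration only): a kernel that just streams data through global memory does almost no arithmetic per byte, so its run time is set by memory bandwidth; breaking coalescing with a stride makes it even slower:

```cuda
// Bandwidth-limited sketch: one global read + one global write per
// thread and essentially no math, so memory throughput is the limiter.
__global__ void copyKernel( const float* pIn, float* pOut, int n )
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if ( idx < n )
        pOut[ idx ] = pIn[ idx ];   // coalesced: neighboring threads hit neighboring addresses

    // An uncoalesced variant would stride the accesses, e.g.:
    // if ( idx * 32 < n ) pOut[ idx * 32 ] = pIn[ idx * 32 ];
}
```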

A compute-limited kernel would be one that does a lot of arithmetic operations, such as:


float fVal = pGMEMData[ threadIdx.x ];

for ( int i = 0; i < 100000; i++ )
{
   fVal = sqrtf( fVal ) * fVal + i;
   fVal += i * cosf( (float)i );
   // you get the idea :)
}

pGMEMData[ threadIdx.x ] = fVal; // write the result back so the loop is not optimized away


Occupancy, broken coalescing, lots of __syncthreads()… (divergence and smem bank conflicts should be the last things to check…)
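For the latency-limited case (my own sketch, not from the posts above): if you launch too few threads to hide memory latency, or make every load depend on the previous one (pointer chasing), the GPU mostly sits waiting on individual loads rather than running out of bandwidth or arithmetic throughput:

```cuda
// Latency-limited sketch: each load depends on the result of the
// previous one, so loads serialize; launched with a single small
// block, there are not enough warps to hide the memory latency.
__global__ void chaseKernel( const int* pNext, int* pOut, int steps )
{
    int idx = threadIdx.x;          // assumes pNext[i] holds a valid index
    for ( int i = 0; i < steps; i++ )
        idx = pNext[ idx ];         // next address depends on this load
    pOut[ threadIdx.x ] = idx;      // keep the result observable
}
```

Here neither bandwidth (few bytes in flight) nor compute (almost no math) is the bottleneck; the serialized load latency is.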

I hope that helps a bit…