Bandwidth limited, latency limited and compute limited - need examples for each case

Hello,

I am trying to understand the performance of my CUDA programs. To that end, I need to understand the various scenarios that limit a program's performance. In particular, I am looking for an example kernel for each of the following cases:

1- A kernel which is bandwidth limited,

2- A kernel which is latency limited, and

3- A kernel which is compute limited.

4- Are there any other types of limitation that can slow down my kernel?

A small kernel for illustration will suffice for my understanding. I would appreciate your responses in this regard.

Thanks,

The best thing would probably be to run your kernels through Nexus / the NVIDIA profiler. That should give you a hint…

Personally, I think the best way to understand what takes most of the time in your kernel is to comment out portions of it and see how that affects the overall run time compared to the original code.

Make sure you comment things out correctly - in particular, try not to comment out things that make the kernel meaningless, otherwise the dead-code optimizer may eliminate the remaining work (or the whole kernel) and the timing will be misleading.

For the bandwidth-limited case: obviously, a kernel whose accesses are not coalesced or that does a lot of gmem reads/writes.
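For example, something along these lines (just a sketch I'm making up here, the kernel name and the stride parameter are arbitrary) - each thread touches gmem with a large stride, so nothing coalesces and almost all the time goes to memory traffic:

__global__ void stridedCopy( const float* pIn, float* pOut, int stride )

{

   // Assumes pIn/pOut hold at least gridDim.x * blockDim.x * stride elements.

   int tid = blockIdx.x * blockDim.x + threadIdx.x;

   // Neighbouring threads access addresses 'stride' elements apart, so the

   // hardware cannot coalesce the loads/stores - the kernel does almost no

   // math and is limited purely by global memory traffic.

   pOut[ tid * stride ] = pIn[ tid * stride ];

}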

For the compute-limited case: a kernel that does a lot of arithmetic operations, such as:

__global__ void computeBound( float* pGMEMData )

{

   float fVal = pGMEMData[ threadIdx.x ];

   for ( int i = 0; i < 100000; i++ )

   {

      fVal = sqrtf( fVal ) * fVal + i;

      fVal += i * cosf( (float)i );

      // you get the idea :)

   }

   // Write the result back so the dead-code optimizer can't throw the loop away.

   pGMEMData[ threadIdx.x ] = fVal;

}
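For the latency-limited case (your case 2) I don't have real code at hand, but a rough sketch could look like this (made-up kernel; it assumes pNext holds valid indices into itself, e.g. a random permutation) - every load depends on the previous one, so if occupancy is also low there is nothing to hide the gmem latency with:

__global__ void pointerChase( const int* pNext, int* pOut, int numHops )

{

   int tid = blockIdx.x * blockDim.x + threadIdx.x;

   int idx = tid;

   // Each load depends on the result of the previous one, so the full gmem

   // latency is paid on every hop. Launch only a few small blocks and there

   // are no other warps ready to run while we wait - i.e. latency limited.

   for ( int i = 0; i < numHops; i++ )

   {

      idx = pNext[ idx ];

   }

   pOut[ tid ] = idx;

}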

As for other limitations: occupancy, broken coalescing, lots of __syncthreads()… (divergence and smem bank conflicts should be the last things to check… there is a small bank-conflict sketch below.)
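Just to illustrate the bank-conflict point, a minimal made-up sketch (assumes a 1D block of 1024 threads):

__global__ void bankConflicts( float* pOut )

{

   __shared__ float sData[ 32 * 32 ];

   // Column-wise access: the 32 threads of a warp hit addresses 32 floats

   // apart, i.e. all in the same smem bank, so each access is serialized 32-way.

   int row = threadIdx.x % 32;

   int col = threadIdx.x / 32;

   sData[ row * 32 + col ] = (float)threadIdx.x;

   __syncthreads();

   pOut[ blockIdx.x * blockDim.x + threadIdx.x ] = sData[ row * 32 + col ];

}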

I hope that helps a bit…

eyal