Runtime occupancy

eyalhir74 · January 7, 2009, 7:15am

Hi,
Something I’ve wondered since I first started to use the GPU: I can calculate the occupancy using the calculator, but let’s say I get a 50x performance boost and
the calculator gives me 50% occupancy (or even 75%). How do I know for sure that I am getting the most performance from the GPU? I know I can use the profiler,
but again, it didn’t help me much.
Is there some better way to see how much of the GPU’s power my program uses?

thanks
eyal

E.D_Riedijk · January 7, 2009, 9:09am

There are 2 bottlenecks:

GFLOPS. For that you need to count how many FLOPs your kernel performs and divide by your kernel execution time.
Memory bandwidth. For that you have to count how many bytes you are reading and writing, and again divide by the kernel execution time.

If you are far off both theoretical peaks, you have performance to be gained.
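
A minimal sketch of the arithmetic described above, written in Python for brevity. The function names, the SAXPY-style example kernel, the timing, and both peak figures are hypothetical placeholders and not measurements from this thread; substitute the counts for your own kernel and the peaks for your own card.

```python
def achieved_rates(flop_count, bytes_moved, kernel_seconds):
    """Return (GFLOP/s, GB/s) achieved by one kernel launch."""
    if kernel_seconds <= 0:
        raise ValueError("kernel_seconds must be positive")
    gflops = flop_count / kernel_seconds / 1e9
    gbytes_per_s = bytes_moved / kernel_seconds / 1e9
    return gflops, gbytes_per_s


def fraction_of_peak(achieved, peak):
    """Achieved rate as a fraction of the theoretical peak."""
    return achieved / peak


# Hypothetical example: y[i] = a * x[i] + y[i] over n floats.
n = 1 << 24
flop_count = 2 * n        # one multiply and one add per element
bytes_moved = 3 * 4 * n   # read x, read y, write y; 4 bytes each
kernel_seconds = 2.5e-3   # placeholder for a measured kernel time

# Placeholder peaks: look up the real values for your GPU.
PEAK_GFLOPS = 500.0
PEAK_GBPS = 100.0

gflops, gbps = achieved_rates(flop_count, bytes_moved, kernel_seconds)
print(f"compute: {gflops:.1f} GFLOP/s, "
      f"{fraction_of_peak(gflops, PEAK_GFLOPS):.1%} of assumed peak")
print(f"memory:  {gbps:.1f} GB/s, "
      f"{fraction_of_peak(gbps, PEAK_GBPS):.1%} of assumed peak")
```

If one of the two fractions is close to 1, the kernel is probably limited by that resource; if both are small, there is likely room to improve.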

eyalhir74 · January 7, 2009, 9:16am

Hi,

I figured that would be the response :) However, what if I’m not sure whether I am counting either correctly? Why doesn’t NVIDIA supply a runtime monitor for this purpose?

Something like the performance monitor Microsoft provides for the CPU.

thanks

eyal

E.D_Riedijk · January 7, 2009, 12:50pm

Maybe because the hardware doesn’t have support for these kinds of counters? As you can see in the docs of the visual profiler, the profiler is actually only measuring stats for 1 multiprocessor, so I think these counters are not available.

alex_dubinsky · January 8, 2009, 10:31pm

Actually, it would be possible for NVIDIA to use the processor counters to supply this information in a rough form.

The Visual Profiler counters (when run on pre-G200 hardware) give good insight. I believe the number of coalesced reads can be converted to bandwidth by multiplying by 64 bytes and dividing by time (64 is for float loads/stores; it would be 128 for float2s or 32 for shorts, and I think 128 again for float4s). The number of uncoalesced reads can be converted to bandwidth by multiplying by 4 (or 8 or 2, depending on the element size). The ‘instructions’ field can be used the same way, although I’m not sure of the details. Probably 1 instruction counts as roughly 48 FLOPs*, except in the case of divergence. Divergence muddies the picture, since more ‘instructions’ are reported, with each accounting for fewer OPs.

* This is for the purpose of comparing to NVIDIA’s marketing GFLOPS numbers, which multiply every theoretical instruction per second by 3. This isn’t precise (sometimes dual-issue does happen; with perfect dual-issue you would multiply by 1.5), but you’re looking for order-of-magnitude trends anyway.

NOTE: I may be/probably am mistaken in the details! But it works roughly like this. You can make a simple test code to make sure.
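
A rough sketch of the counter conversions described in this post, in Python. The per-access byte sizes and the 48 FLOPs per instruction are the poster's own tentative figures, not confirmed values, and the function and table names are made up for illustration. As noted elsewhere in the thread, the profiler reportedly samples only one multiprocessor, so the raw counts may need scaling before comparing against whole-GPU peaks; that scaling is left out here.

```python
# Bytes per counted access, as suggested (tentatively) in the post.
BYTES_PER_COALESCED = {"short": 32, "float": 64, "float2": 128}
BYTES_PER_UNCOALESCED = {"short": 2, "float": 4, "float2": 8}


def estimated_bandwidth_gbps(coalesced, uncoalesced, elem_type, seconds):
    """Estimate GB/s from coalesced/uncoalesced access counts."""
    total_bytes = (coalesced * BYTES_PER_COALESCED[elem_type]
                   + uncoalesced * BYTES_PER_UNCOALESCED[elem_type])
    return total_bytes / seconds / 1e9


def estimated_gflops(instructions, seconds, flops_per_instruction=48):
    """Estimate GFLOP/s from the 'instructions' counter.

    The default of 48 is the post's rough guess and ignores divergence.
    """
    return instructions * flops_per_instruction / seconds / 1e9


# Hypothetical counter values for a kernel that ran for 1 ms.
bw = estimated_bandwidth_gbps(1_000_000, 0, "float", 1e-3)
gf = estimated_gflops(1_000_000, 1e-3)
print(f"estimated bandwidth: {bw:.1f} GB/s")
print(f"estimated compute:   {gf:.1f} GFLOP/s")
```

Running a simple kernel with a known access pattern and comparing these estimates against hand-counted bytes is one way to check the conversion factors.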

The other elements of the Visual Profiler are as follows: ‘serializations’ reflects shared memory conflicts (these decrease FLOPs), and ‘divergence’ gives a very rough report in that regard. (Divergence, if it exists, is much better estimated by carefully setting your kernel’s input to guarantee uniform execution, and then comparing total instruction counts.)
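
The instruction-count comparison suggested above could be sketched like this in Python. It assumes both runs do the same amount of useful work, so that any extra instructions in the real-input run can be attributed to divergence; the function name and the counter values are hypothetical.

```python
def divergence_overhead(instr_real_input, instr_uniform_input):
    """Fraction of extra instructions in the real run vs. the uniform run.

    Assumes both runs perform the same useful work, so the difference
    is attributed to divergent branches being executed serially.
    """
    if instr_uniform_input <= 0:
        raise ValueError("instr_uniform_input must be positive")
    return instr_real_input / instr_uniform_input - 1.0


# Hypothetical 'instructions' counter readings from two profiler runs.
overhead = divergence_overhead(1_300_000, 1_000_000)
print(f"extra instructions attributed to divergence: {overhead:.0%}")
```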

E.D_Riedijk · January 9, 2009, 6:32am

The profiler profiles only 1 MP, as far as I remember…
