Runtime occupancy

Something I’ve wondered since I first started using GPUs: I can calculate occupancy using the calculator, but let’s say I get a 50x performance boost and
the calculator gives me 50% occupancy (or even 75%) - how do I know for sure that I’m getting the most performance out of the GPU? I know I can use the profiler,
but again it didn’t help me much.
Is there a better way to see how much of the GPU’s power my program is using?


There are 2 bottlenecks:

  • GFLOPS. For that you would need to count how many FLOPs your kernel performs and divide by your kernel-execution time.

  • Memory bandwidth. For that you have to count how many bytes you are reading and writing, and again divide by the kernel-execution time.

If you are far from both theoretical peaks, there is still performance to be gained.


I figured that would be the response :) However, what if I’m not sure I’m counting either of them correctly? Why doesn’t NVIDIA supply a runtime monitor for this purpose?

Something like the performance monitor Microsoft provides for the CPU.



Maybe because the hardware doesn’t have support for these kinds of counters? As you can see in the docs of the Visual Profiler, the profiler actually measures stats for only 1 multiprocessor, so I think chip-wide counters are not available.

Actually, it would be possible for NVIDIA to use the per-multiprocessor counters to supply this information in a rough form.

The Visual Profiler counters (when run on pre-G200 hardware) give good insight. I believe the # of coalesced reads can be converted to bandwidth by multiplying by 64 bytes and dividing by time (64 bytes applies to float loads/stores; it would be 128 for float2s, 32 for shorts, and I think 128 again for float4s). The # of uncoalesced reads can be converted to bandwidth by multiplying by 4 (or 8, or 2) instead. The ‘instructions’ field can be used the same way, although I’m not sure of the details: probably 1 instruction counts as roughly 48 FLOPs*, except in the case of divergence. Divergence muddies the picture, since more ‘instructions’ are reported, each accounting for fewer ops.

  • *For the purpose of comparing to NVIDIA’s marketing GFLOPS numbers, which multiply every theoretical instruction per second by 3. This isn’t precise (sometimes dual-issue does happen; with perfect dual-issue you would multiply by 1.5), but you’re looking for order-of-magnitude trends anyway.

NOTE: I may be (probably am) mistaken in the details! But it works roughly like this. You can write a simple test kernel to make sure.

The other elements of the Visual Profiler are as follows: ‘serializations’ reflect shared memory bank conflicts (these reduce achieved FLOPs), and ‘divergence’ gives a very rough report in that regard. (If divergence exists, it is much better estimated by carefully setting your kernel’s input to guarantee uniform execution and then comparing total instruction counts.)

The profiler profiles only 1 MP, as far as I remember…