Counting Floating Point Operations with nvprof

I’m currently profiling my app running on a Tesla K80 GPU, and everything is fine when I use nvprof. The only weird thing is that the number of floating-point operations for one (yes, only one) of the many kernels I use appears as “overflow”. This might be because it is the kernel that takes the most time: essentially every thread runs a for loop and computes sines and cosines inside it. On the other hand, I suppose nvprof has been tested with big apps, and the results shouldn’t be “overflow”.

How can I get the real count of floating-point operations for this kernel?

Has anyone got the same result?

I’m using nvprof: NVIDIA ® Cuda command line profiler Copyright © 2012 - 2015 NVIDIA Corporation Release version 7.0.28

Driver: 352.63

GPU: Tesla K80

This is a hardware limitation. The hardware gods decided that 32-bit counters ought to be enough for everyone. At GHz operating frequencies those 32-bit counters overflow rather quickly! The latest GPUs have slightly wider counters (40 bits? 48 bits? I can’t remember), but I don’t know at which architecture version the wider counters were introduced. What’s your GPU?
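To see how quickly overflow happens in practice, here is a back-of-the-envelope sketch. The per-GPU figures used (13 SMX, 192 single-precision cores per SMX, 562 MHz base clock, 2 FLOPs per core per cycle via FMA) are assumptions taken from public K80 spec sheets, and this pessimistically assumes one 32-bit counter covers the whole GPU:

```python
# Back-of-the-envelope: how fast a 32-bit FLOP counter overflows on a K80.
# Assumed per-GPU figures (from public spec sheets): 13 SMX, 192 SP cores
# each, 562 MHz base clock, 2 FLOPs per core per cycle (FMA).
counter_max = 2**32 - 1

flops_per_second = 13 * 192 * 562e6 * 2  # ~2.8 TFLOP/s peak single precision

seconds_to_overflow = counter_max / flops_per_second
print(f"Counter overflows after ~{seconds_to_overflow * 1e3:.2f} ms at peak rate")
```

Even if the counters are per-SM rather than per-GPU, the overflow time at peak rate is only tens of milliseconds, so a long-running FLOP-heavy kernel easily exceeds it.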

Tesla K80

Where can I find Tesla K80’s counter bits?

I have this white paper, but it is never mentioned there.

Here is a previous answer on SO from someone who I would trust to know this:

It probably would be a good idea for NVIDIA to mention this information in their official documentation somewhere, as I cannot find it in the existing manuals (maybe txbob has a link handy, in case I missed it). Based on Greg’s answer, it would seem the K80 still has 32-bit counters, as it is a Kepler part.

Yes, I was thinking the same thing. It would be helpful if NVIDIA mentioned this somewhere, since I have to reference it in my thesis/paper. My plan when I started profiling was to calculate the total FLOP/s of my app with the biggest dataset that I have. Since I cannot do this, I will have to reduce the data :(.

Thanks njuffa!

As far as I am concerned, the use of 32-bit profiling counters was simply a Bad Idea™. I don’t know what drove that decision. I cannot imagine that incrementation of wider counters would impose frequency limitations, but maybe it would.

On the other hand, it is not too surprising, because from my past involvement with building CPUs I recall that performance counter design is often a hurried afterthought, so one winds up with too few simultaneous counters, any number of counter inaccuracies (most events counted cannot be based on one internal signal alone!), etc. In this case, the result was counters that are simply not wide enough for many real-life scenarios.

Another question came to mind. When you have the effective bandwidth of every kernel, how do you compare it to the theoretical bandwidth? Also, I have been looking for the peak GFLOPS of the Tesla K80 (only one GPU), but I couldn’t find it. Does anyone have it? (I got the results by reducing the data, so now I’m not getting the overflow.)

If you don’t mind compiling one more executable, give this routine a chance. I just call it from main and get a nice result like this:

GeForce GTX 560 Ti, CC 2.1
VRAM 1.0 GB, 2004 MHz * 256-bit = 128 GB/s
8 SM * 48 alu * 1800 MHz * 2 = 1.38 TFLOPS
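The same SMs × ALUs × clock × 2 formula from the output above can be applied to one GPU of a K80 to answer the peak-GFLOPS question. The figures used here (13 SMX, 192 SP cores per SMX, 562 MHz base clock, 875 MHz max boost clock) are assumptions taken from public spec sheets, not from the thread:

```python
# Peak single-precision throughput for one GPU of a Tesla K80, using the
# formula SMs * cores per SM * clock * 2 (FMA = 2 FLOPs per cycle).
# Assumed figures from public spec sheets: 13 SMX, 192 SP cores per SMX,
# 562 MHz base clock, 875 MHz max boost clock.
def peak_sp_tflops(num_sm, cores_per_sm, clock_hz):
    return num_sm * cores_per_sm * clock_hz * 2 / 1e12

base  = peak_sp_tflops(13, 192, 562e6)   # at base clock
boost = peak_sp_tflops(13, 192, 875e6)   # at max boost clock
print(f"K80 (one GPU): {base:.2f} TFLOPS base, {boost:.2f} TFLOPS boost")
```

Note that the number you should quote depends on whether the GPU actually sustains its boost clock under your workload.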

I am not sure that one would want to compare the bandwidth of an application with the theoretical bandwidth (computed from interface width, interface frequency, and transfers/cycle). It probably makes more sense to compare it to the maximum achievable bandwidth, which is typically 75% to 85% of the theoretical bandwidth (measured, for example, when adding very long vectors of ‘double2’ elements).

How can I calculate the maximum achievable bandwidth?

On the other hand, I have been looking at some slides from GTC and from other people that calculate the bandwidth using the DRAM Read/Write Throughput. However, in the CUDA Best Practices manual they calculate the effective bandwidth using the Requested Global Load/Store Throughput. What is the difference between these two? And how can they be compared?

For example, the kernel that takes the most time in my app has a bandwidth (calculated with DRAM Read/Write Throughput) of 0.029 GB/s. On the other hand, the bandwidth calculated with Requested Global Load/Store Throughput is 282 GB/s. Is that possible?

Is this bandwidth the theoretical one with ECC enabled or disabled?

Theoretical. I have no cards with ECC, so I don’t know how to compute that. Feel free to edit the sources :)

A proxy for this is the device-to-device bandwidth reported by bandwidthTest. There is no calculation method, unless you simply want to use a scaling factor against the peak theoretical bandwidth (which can be calculated).
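The "calculate the theoretical, then scale" approach can be sketched as follows. The K80 memory figures used here (2505 MHz GDDR5 memory clock, 384-bit interface per GPU) are assumptions from public spec sheets, and the 75–85% range is the rule of thumb quoted earlier in the thread:

```python
# Theoretical DRAM bandwidth for one GPU of a Tesla K80, plus a rough
# "maximum achievable" estimate using the 75-85% rule of thumb.
# Assumed figures: 2505 MHz GDDR5 memory clock (DDR -> 2 transfers/cycle),
# 384-bit memory interface per GPU.
mem_clock_hz   = 2505e6
bus_width_bits = 384

theoretical = mem_clock_hz * 2 * (bus_width_bits / 8) / 1e9  # GB/s
achievable_low, achievable_high = 0.75 * theoretical, 0.85 * theoretical

print(f"Theoretical: {theoretical:.0f} GB/s")
print(f"Achievable:  {achievable_low:.0f}-{achievable_high:.0f} GB/s")
```

Measuring with bandwidthTest on your own board is still preferable, since the real achievable fraction varies by part and by whether ECC is enabled.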

Global loads and stores can hit in one of the caches, so they do not necessarily represent dram device bandwidth. The dram metrics (e.g. dram_utilization, dram_read_transactions, dram_write_transactions) should represent actual activity to the DRAM.
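Turning those DRAM transaction metrics into a bandwidth figure is straightforward arithmetic. The sketch below assumes 32 bytes per DRAM transaction (typical for these parts, but worth verifying for your device); the example numbers are purely hypothetical:

```python
# Sketch: effective DRAM bandwidth from nvprof's dram_read_transactions and
# dram_write_transactions metrics. Assumes 32 bytes per DRAM transaction
# (typical for these parts; verify for your device), kernel time in seconds.
BYTES_PER_DRAM_TRANSACTION = 32

def dram_bandwidth_gbs(read_transactions, write_transactions, kernel_time_s):
    total_bytes = (read_transactions + write_transactions) * BYTES_PER_DRAM_TRANSACTION
    return total_bytes / kernel_time_s / 1e9

# Hypothetical example numbers, just to show the calculation:
print(f"{dram_bandwidth_gbs(6_000_000, 2_000_000, 0.0015):.1f} GB/s")
```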

It is 240 GB/s per GPU :).

So the approach is to look at the device-to-device bandwidth reported by bandwidthTest and use the scaling factor between that and the theoretical bandwidth.

Interesting… but then why does the CUDA C Best Practices Guide calculate the effective bandwidth using the global loads and stores?

Does anyone know why there is this HUGE difference?

Interesting… so you think you are getting 282 GB/s of DRAM bandwidth on a device that has a peak theoretical of 240 GB/s of DRAM bandwidth?

Because many of the global transactions are hitting in one of the caches, and only a small percentage actually have to be serviced by DRAM. This will be a function of your actual code, of course.
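The two throughput numbers quoted above make the point concrete: if requested global traffic moves at 282 GB/s while DRAM traffic moves at only 0.029 GB/s, almost nothing is actually reaching DRAM. A quick check:

```python
# Quick check using the two throughput numbers quoted above: the fraction
# of requested global load/store traffic actually serviced by DRAM.
requested_gbs = 282.0
dram_gbs      = 0.029

fraction_from_dram = dram_gbs / requested_gbs
print(f"~{fraction_from_dram * 100:.3f}% of requested traffic reaches DRAM")
```

In other words, essentially all of this kernel's requests are hitting in the caches, which is why the requested-throughput figure can legitimately exceed the DRAM peak.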

It doesn’t make sense to have 282 GB/s of bandwidth if the theoretical is 240 GB/s, so I assume it is because of what you said: