Counting Floating Point Operations with nvprof

carcamovski · September 21, 2016, 9:10pm

I’m currently profiling my app on a Tesla K80 GPU, and everything works fine with nvprof. The only odd thing is that the floating point operation count for one (yes, only one) of the many kernels I use is reported as “overflow”. This might be because it is the kernel that takes the most time: basically every thread runs a for loop with sine and cosine operations inside it. On the other hand, I assume nvprof has been tested with big apps, so the result shouldn’t be “overflow”.

How can I get the real floating point operation count for this kernel?

Has anyone else seen the same result?

Driver: 352.63

GPU: Tesla K80

njuffa · September 21, 2016, 9:15pm

This is a hardware limitation. The hardware gods decided that 32-bit counters ought to be enough for everyone. At GHz operating frequencies those 32-bit counters overflow rather quickly! The latest GPUs have slightly wider counters (40 bits? 48 bits? I can’t remember), but I don’t know at what architecture version the wider counters were introduced. What’s your GPU?
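To put rough numbers on how quickly a counter of a given width runs out, here is a minimal Python sketch. The event rate is an illustrative assumption (one counted event per cycle at 1 GHz), not a measured K80 figure, and `seconds_until_overflow` is a hypothetical helper name.

```python
def seconds_until_overflow(counter_bits: int, events_per_second: float) -> float:
    """Time until a counter of the given width can no longer hold the event count."""
    return (2 ** counter_bits) / events_per_second


# Assumed rate: one counted event per clock cycle at 1 GHz (illustrative only).
ASSUMED_RATE = 1e9

for bits in (32, 40, 48):
    seconds = seconds_until_overflow(bits, ASSUMED_RATE)
    print(f"{bits}-bit counter: about {seconds:,.1f} s before overflow")
```

A kernel that issues several floating point operations per cycle across many multiprocessors would exhaust the counter proportionally faster than this single-event estimate suggests.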

carcamovski · September 21, 2016, 9:29pm

Tesla K80

carcamovski · September 21, 2016, 10:03pm

Where can I find the counter width of the Tesla K80?

I have this white paper https://images.nvidia.com/content/pdf/kepler/Tesla-K80-BoardSpec-07317-001-v05.pdf, but it is never mentioned there.

njuffa · September 21, 2016, 10:07pm

Here is a previous answer on SO from someone who I would trust to know this: http://stackoverflow.com/a/27262797/780717

It probably would be a good idea for NVIDIA to mention this information in their official documentation somewhere, as I cannot find it in the existing manuals (maybe txbob has a link handy, in case I missed it). Based on Greg’s answer, it would seem the K80 still has 32-bit counters, as it is a Kepler part.

carcamovski · September 21, 2016, 10:22pm

Yes, I was thinking the same thing. It would be helpful if NVIDIA mentioned this somewhere, since I have to reference it in my thesis/paper. My idea when I started the profiling was to calculate the total FLOP/s of my app with the biggest dataset that I have. Since I cannot do this, I will have to reduce the data :(.

carcamovski · September 21, 2016, 10:36pm

Thanks njuffa!

njuffa · September 21, 2016, 10:37pm

As far as I am concerned, the use of 32-bit profiling counters was simply a Bad Idea™. I don’t know what drove that decision. I cannot imagine that incrementation of wider counters would impose frequency limitations, but maybe it would.

On the other hand, it is not too surprising, because from my past involvement with building CPUs I recall that performance counter design is often a hurried afterthought, so one winds up with too small a number of simultaneous counters, any number of counter inaccuracies (most events counted cannot be based on one internal signal alone!), etc. In this case, the result was counters that are simply not wide enough for many real-life scenarios.

carcamovski · September 22, 2016, 9:01pm

Another question came to mind. Once you have the effective bandwidth of every kernel, how do you compare it to the theoretical bandwidth? Also, I have been looking for the peak GFLOPS of the Tesla K80 (one GPU only), but I couldn’t find it. Does anyone have it? (I got the results by reducing the data, so I’m no longer getting the overflow.)

BulatZiganshin · September 22, 2016, 10:51pm

If you don’t mind compiling one more executable, give this routine a chance. I just call it from main and get a nice result like this:

GeForce GTX 560 Ti, CC 2.1
VRAM 1.0 GB, 2004 MHz * 256-bit = 128 GB/s
8 SM * 48 alu * 1800 MHz * 2 = 1.38 TFLOPS
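The arithmetic behind those two lines can be sketched in Python as follows. This is not the routine itself, only a reconstruction of the numbers it prints; the function names are hypothetical. Note one assumption: reproducing the printed 128 GB/s from 2004 MHz and a 256-bit bus requires two transfers per memory clock (double data rate), a factor the printed line does not show.

```python
def peak_bandwidth_gb_s(mem_clock_mhz: float, bus_width_bits: int,
                        transfers_per_clock: int = 2) -> float:
    """Theoretical memory bandwidth in GB/s (10^9 bytes per second)."""
    bytes_per_transfer = bus_width_bits / 8
    return mem_clock_mhz * 1e6 * transfers_per_clock * bytes_per_transfer / 1e9


def peak_tflops(sm_count: int, alus_per_sm: int, shader_clock_mhz: float,
                flops_per_alu_per_clock: int = 2) -> float:
    """Theoretical single-precision peak; the factor 2 counts an FMA as two FLOPs."""
    return sm_count * alus_per_sm * shader_clock_mhz * 1e6 * flops_per_alu_per_clock / 1e12


# Values taken from the GeForce GTX 560 Ti output above.
print(f"{peak_bandwidth_gb_s(2004, 256):.0f} GB/s")
print(f"{peak_tflops(8, 48, 1800):.2f} TFLOPS")
```

For another card, the same two functions apply once the multiprocessor count, ALUs per multiprocessor, clocks, and bus width are taken from its specification.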

njuffa · September 22, 2016, 11:33pm

I am not sure that one would want to compare the bandwidth of an application with the theoretical bandwidth (computed from interface width, interface frequency, and transfers/cycle). It probably makes more sense to compare it to the maximum achievable bandwidth, which is typically 75% to 85% of the theoretical bandwidth (measured, for example, when adding very long vectors of ‘double2’ elements).
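As a rough illustration of that rule of thumb, the sketch below applies the 75% to 85% range to a theoretical figure. The 240 GB/s input is the per-GPU K80 value quoted later in this thread; the function name is hypothetical, and the result is only an estimate, not a substitute for measuring.

```python
def achievable_bandwidth_range(theoretical_gb_s: float,
                               low: float = 0.75,
                               high: float = 0.85) -> tuple[float, float]:
    """Scale the theoretical bandwidth by the typical achievable fraction."""
    return theoretical_gb_s * low, theoretical_gb_s * high


# 240 GB/s is the per-GPU theoretical figure mentioned later in the thread.
low_gb_s, high_gb_s = achievable_bandwidth_range(240.0)
print(f"Expected achievable bandwidth: {low_gb_s:.0f} to {high_gb_s:.0f} GB/s")
```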

carcamovski · September 23, 2016, 12:04am

How can I calculate the maximum achievable bandwidth?

On the other hand, I have been seeing slides from GTC and from other people who calculate the bandwidth using DRAM Read/Write Throughput. However, the CUDA Best Practices manual calculates the effective bandwidth using the Requested Global Load/Store Throughput. What is the difference between these two? And how can they be compared?

For example, the kernel that takes more time in my app has a Bandwidth (calculated with DRAM Read/Write Throughput) of 0.029 GB/s. On the other hand the bandwidth calculated with Requested Global Load/Store Throughput is 282 GB/s. Is that possible?
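For reference, the effective bandwidth formula in the CUDA C Best Practices Guide is ((B_r + B_w) / 10^9) / time, where B_r and B_w are the bytes read and written by the kernel. A minimal Python sketch follows; the matrix size and the 1 ms kernel time are made-up inputs for illustration, and the function name is hypothetical.

```python
def effective_bandwidth_gb_s(bytes_read: int, bytes_written: int,
                             seconds: float) -> float:
    """Effective bandwidth in GB/s: ((B_r + B_w) / 10^9) / time."""
    return (bytes_read + bytes_written) / 1e9 / seconds


# Hypothetical example: copying a 2048 x 2048 matrix of 4-byte floats,
# read once and written once, in an assumed 1 ms.
n = 2048
bytes_each_way = n * n * 4
print(f"{effective_bandwidth_gb_s(bytes_each_way, bytes_each_way, 1e-3):.2f} GB/s")
```

Because this formula counts bytes the kernel asks for, not bytes that actually travel to and from DRAM, it can exceed the DRAM figures when caches serve part of the traffic.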

carcamovski · September 23, 2016, 4:52pm

Is this bandwidth the theoretical one with ECC enabled or disabled?

BulatZiganshin · September 23, 2016, 4:53pm

theoretical. i have no cards with ECC so i don’t know how to compute that. feel free to edit the sources :)

Robert_Crovella · September 23, 2016, 5:00pm

a proxy for this is the device-to-device bandwidth reported by bandwidthTest. there is no calculation method, unless you simply want to use a scaling factor against peak theoretical bandwidth (which can be calculated).

Global loads and stores can hit in one of the caches, so they do not necessarily represent dram device bandwidth. The dram metrics (e.g. dram_utilization, dram_read_transactions, dram_write_transactions) should represent actual activity to the DRAM.

carcamovski · September 23, 2016, 5:04pm

It is 240 GB/s per GPU :).

carcamovski · September 23, 2016, 5:20pm

So the approach is to look at the device-to-device bandwidth reported by bandwidthTest and take the scale factor between that and the theoretical bandwidth.

Robert_Crovella wrote: “Global loads and stores can hit in one of the caches, so they do not necessarily represent dram device bandwidth. The dram metrics (e.g. dram_utilization, dram_read_transactions, dram_write_transactions) should represent actual activity to the DRAM.”

Interesting… but then why would the CUDA C Best Practices guide calculate the effective bandwidth using the global loads and stores?

carcamovski · September 23, 2016, 5:26pm

Does anyone know why there is this HUGE difference?

Robert_Crovella · September 23, 2016, 5:46pm

Interesting… so you think you are getting 282GB/s of DRAM bandwidth on a device that has a peak theoretical of 240GB/s of DRAM bandwidth?

Because many of the global transactions are hitting in one of the caches, and only a small percentage actually have to be serviced by DRAM. This will be a function of your actual code, of course.
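Using the two figures reported earlier in the thread, a quick Python sketch shows how small the DRAM share is. Treat the ratio only as a rough indicator: the two metrics are measured at different points in the memory hierarchy, and the function name is hypothetical.

```python
def dram_fraction(dram_gb_s: float, requested_gb_s: float) -> float:
    """Rough share of requested global traffic that reaches DRAM."""
    return dram_gb_s / requested_gb_s


# Figures reported in the thread for the most expensive kernel.
fraction = dram_fraction(0.029, 282.0)
print(f"About {fraction:.4%} of the requested traffic goes to DRAM")
```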

carcamovski · September 23, 2016, 5:58pm

It doesn’t make sense to have a bandwidth of 282 GB/s if the theoretical peak is 240 GB/s, so I assume that is because of what you said.

Thanks!