Why is the peak FLOP/s in Nsight Compute much lower than the value in the white paper?

For the NVIDIA A100 PCIe 40GB GPU (I read the nvidia-ampere-architecture-whitepaper, for GA100):
I am using Nsight Compute to get the roofline and FLOP/s of a CUDA kernel.
Peak FP64 performance reported by Nsight Compute = 5.27 TFLOP/s
**Peak FP64 performance in the white paper = 9.7 TFLOPS** (This is obtained at the GPU boost clock; I found that the A100 boost clock is 1410 MHz.)

How does Nsight Compute obtain its peak FP64 performance (FLOP/s), meaning the hardware peak and not what the app/algorithm achieved?

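Usually Nsight Compute locks the GPU clock to the base frequency in order to provide run-to-run repeatability. Because the peak scales with the clock, the roofline peak it reports is lower than the white paper's boost-clock figure.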

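To remove this lock, change the Clock Control setting, which is the bottom setting in the profiling options window (on the command line, the equivalent is `ncu --clock-control none`).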

Thank you very much. Now the FLOP/s in Nsight Compute is almost the same as the white paper's figure.
I found that the NVIDIA A100's GPU boost clock is 1410 MHz (Ampere architecture white paper, page 36). The boost clock can be queried like this:

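    #include <cuda_runtime.h>

    cudaDeviceProp prop;
    // The second argument is the device index (device 1 here); CHECK_ERROR is my own error-checking macro.
    CHECK_ERROR(cudaGetDeviceProperties(&prop, 1));
    int clock_rate_khz = prop.clockRate; // Clock frequency in kilohertz (an int, not a clock_t)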

How can I get the base clock frequency of the A100 or other GPUs?

nvidia-smi -q will give you the base clock, in the “Clocks” section.

One source is the TechPowerUp database. Here is the spec for the 80GB PCIe version of the A100; note that the base clocks vary between models.

Thank you, this is very useful info, especially the TechPowerUp database website.