Roofline Model for Nvidia GTX1080

I am profiling single-precision GEMM with square matrix sizes from n=256 to n=16K, monitoring floating-point utilization, execution time, and DRAM read throughput.
The results for utilization and throughput disagree: utilization saturates at n=2048, while throughput keeps rising until n=8K. Any ideas why throughput does not saturate together with FPU utilization?
Profiling is done with nvprof, and the following were used to report:
--print-gpu-trace for kernel execution time.
I am attaching figures that describe the problem.

All metrics are measured across all SMs.

If a kernel launches 1 block (1024 threads), it will run on 1 SM. Let’s assume the kernel can achieve near 100% FP32 utilization.

  • utilization = SUM_SM(sm_pipe_active_fp32) / SUM_SM(active_cycles); maximum value is 100%
  • throughput = SUM_SM(flop) / SUM_SM(elapsed_cycles); maximum value is 1/N of peak, where N is the number of SMs

For utilization, inactive SMs add 0 to the denominator.
For throughput, the inactive SMs contribute their full elapsed cycles to the denominator.
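
To make the difference concrete, here is a minimal sketch of the single-block case in Python. The counter values are made up for illustration, and the SM count (20) and peak rate (256 flop per cycle per SM, i.e. 128 FP32 lanes times 2 flop per FMA) are assumptions about the GTX 1080, not values read from the profiler.

```python
# Hypothetical per-SM counters for a kernel that launches a single block.
NUM_SMS = 20                 # assumption: GTX 1080 SM count
FLOP_PER_CYCLE_PER_SM = 256  # assumption: 128 FP32 lanes x 2 flop per FMA
KERNEL_CYCLES = 1_000_000    # made-up kernel duration in cycles

# Only SM 0 is active, and its FP32 pipe is busy on every active cycle.
active_cycles = [KERNEL_CYCLES] + [0] * (NUM_SMS - 1)
pipe_active_fp32 = [KERNEL_CYCLES] + [0] * (NUM_SMS - 1)
# Every SM, active or not, accumulates elapsed cycles for the whole kernel.
elapsed_cycles = [KERNEL_CYCLES] * NUM_SMS
flop = [c * FLOP_PER_CYCLE_PER_SM for c in pipe_active_fp32]


def utilization(pipe_active, active):
    """SUM_SM(pipe active cycles) / SUM_SM(active cycles)."""
    return sum(pipe_active) / sum(active)


def throughput_fraction(flop_per_sm, elapsed, peak_per_sm):
    """SUM_SM(flop) / SUM_SM(elapsed cycles), as a fraction of per-SM peak."""
    return sum(flop_per_sm) / sum(elapsed) / peak_per_sm


util = utilization(pipe_active_fp32, active_cycles)
thr = throughput_fraction(flop, elapsed_cycles, FLOP_PER_CYCLE_PER_SM)
print(f"utilization = {util:.0%}, throughput = {thr:.0%} of peak")
```

With these numbers, utilization is 100% because the idle SMs add nothing to its denominator, while throughput is only 1/20 of peak because all 20 SMs contribute elapsed cycles.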

The metric sm_efficiency can be used to determine SM load balancing issues.
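
As a rough illustration of what that metric captures, the sketch below assumes sm_efficiency is the average, over all SMs, of the fraction of elapsed cycles during which each SM was active. The function name and the counter values are hypothetical; the real number comes from the profiler.

```python
def sm_efficiency(active_cycles, elapsed_cycles):
    """Average over SMs of (active cycles / elapsed cycles) for each SM."""
    fractions = [a / e for a, e in zip(active_cycles, elapsed_cycles)]
    return sum(fractions) / len(fractions)


# Hypothetical counters for 4 SMs over a 1000-cycle kernel.
balanced = sm_efficiency([1000, 1000, 1000, 1000], [1000] * 4)
# Same kernel, but the last SM runs out of work after 200 cycles.
imbalanced = sm_efficiency([1000, 1000, 1000, 200], [1000] * 4)
print(f"balanced = {balanced:.0%}, imbalanced = {imbalanced:.0%}")
```

A value well below 100% suggests that some SMs sit idle for part of the kernel, which is the case where utilization and throughput diverge.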

A full NVVP report (or Nsight on Windows) will likely help determine possible bottlenecks.
