How to understand the L1 Achieved Value in the Floating Point Operations Roofline model?

tlimit · June 12, 2025, 12:00pm

Hi ~

How should I interpret the L1 Achieved Value in the Floating Point Operations Roofline model (Single Precision)?

As I understand it, the value is the number of FLOPs divided by the number of bytes requested from the L1/TEX cache.

In this case, I expected 68719476736 (smsp__sass_thread_inst_executed_op_fmul_pred_on.sum [inst]) / 274877906944 (bytes requested from the L1/TEX cache) = 0.25, but ncu shows 0.12.
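
The arithmetic above can be reproduced with a small host-side sketch. The helper names `fmul_count` and `arithmetic_intensity` are hypothetical and only illustrate the calculation; the byte count is the value quoted from the memory table, not something the code derives.

```cpp
#include <cstdint>

// One float multiply per inner-loop iteration for each of the M*K threads.
// With M = N = K = 4096 this matches the fmul count reported by ncu.
constexpr std::uint64_t fmul_count(std::uint64_t M, std::uint64_t N,
                                   std::uint64_t K) {
    return M * N * K;
}

// Arithmetic intensity as assumed in the question: FLOPs per byte requested.
constexpr double arithmetic_intensity(std::uint64_t flops,
                                      std::uint64_t bytes) {
    return static_cast<double>(flops) / static_cast<double>(bytes);
}
```

Calling `arithmetic_intensity(fmul_count(4096, 4096, 4096), 274877906944ULL)` gives the 0.25 expected in the question, so the gap to the 0.12 reported by ncu has to come from the traffic term rather than from the instruction count.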

Could anyone help point out where my understanding is wrong?

My code is very simple:

  // M = N = K = 4096
  __global__ void matrix_multiplication_kernel_base(const float *A,
                                                    const float *B, float *C,
                                                    int M, int N, int K) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    int m = blockIdx.y * blockDim.y + threadIdx.y;

    if (k >= K || m >= M) {
      return;
    }

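    A += m * N;  // start of row m of A
    B += k;      // top of column k of B
    // The accumulator is a double, so the product A[l] * B[l * K] is a float
    // multiply, while the addition is done in double precision.
    double sum = 0.0;
    for (int l = 0; l < N; ++l) {
      sum += A[l] * B[l * K];
    }
    C[m * K + k] = sum;  // converted back to float on store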
  }

  void solve_base(const float *A, const float *B, float *C, int M, int N, int K) {
    dim3 threads_per_block(16, 16);
    dim3 blocks_per_grid((K + threads_per_block.x - 1) / threads_per_block.x,
                         (M + threads_per_block.y - 1) / threads_per_block.y);

    only_A<<<blocks_per_grid, threads_per_block>>>(A, C, M, N, K);
    only_B<<<blocks_per_grid, threads_per_block>>>(B, C, M, N, K);
    matrix_multiplication_kernel_base<<<blocks_per_grid, threads_per_block>>>(
        A, B, C, M, N, K);
    CHECK_CUDA_ERROR();
    cudaDeviceSynchronize();
  }

tlimit · June 14, 2025, 4:50am

I found the following description in a newer version of ncu:

Achieved Work: (smsp__sass_thread_inst_executed_op_fadd_pred_on.sum.per_cycle_elapsed + smsp__sass_thread_inst_executed_op_fmul_pred_on.sum.per_cycle_elapsed + derived__smsp__sass_thread_inst_executed_op_ffma_pred_on_x2) * smsp__cycles_elapsed.avg.per_second

Achieved Traffic: derived__l1tex__lsu_writeback_bytes_mem_lg.sum.per_second

Arithmetic Intensity: Achieved Work / Achieved Traffic

Performance: Achieved Work
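
The formulas above can be written out as a minimal sketch. The struct and function names are hypothetical, and the `_x2` suffix on the FFMA metric is read here as "each FFMA counts as two operations", which is an interpretation rather than something the description states.

```cpp
// Hypothetical container for the per-kernel rates used by the formulas.
struct RooflineInputs {
    double fadd_per_cycle;            // FADD thread instructions per cycle
    double fmul_per_cycle;            // FMUL thread instructions per cycle
    double ffma_per_cycle;            // FFMA thread instructions per cycle (before x2)
    double cycles_per_second;         // smsp__cycles_elapsed.avg.per_second
    double traffic_bytes_per_second;  // derived__l1tex__lsu_writeback_bytes_mem_lg.sum.per_second
};

// Achieved Work in FLOP/s: (fadd + fmul + 2 * ffma) * cycles per second.
double achieved_work(const RooflineInputs& in) {
    return (in.fadd_per_cycle + in.fmul_per_cycle + 2.0 * in.ffma_per_cycle) *
           in.cycles_per_second;
}

// Arithmetic Intensity in FLOP/byte: Achieved Work / Achieved Traffic.
double roofline_arithmetic_intensity(const RooflineInputs& in) {
    return achieved_work(in) / in.traffic_bytes_per_second;
}
```

Because both terms are rates over the same interval, the elapsed time cancels and the intensity reduces to operations per byte. As an untested hypothesis: if the traffic metric counted 8 bytes per multiply (two 4-byte loads per loop iteration, each counted per thread), a kernel with only FMULs would land at 1 / 8 = 0.125, which would display as 0.12.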

But I don't understand the meaning of derived__l1tex__lsu_writeback_bytes_mem_lg. How does it differ from the "Number of bytes requested from the L1/TEX cache" metric in the memory table?

Could anyone help explain it? Thanks very much ~
