Nvprof and Nsight returning different results for L1 and L2 cache hit rates

pandeyshweta2401 · July 8, 2019, 8:11pm

I am getting different L1 and L2 cache hit rates when I profile the same executable with nvprof and with Nsight Compute.

The machine configurations are:
GPU : P40
CUDA version : 10.1

#include <stdio.h>
#include "cuda_profiler_api.h"

__global__ void initialization(int n, float a, float *x, float *y)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i<n/2) {
    x[i] = i;
    y[i] = 2*i;
  }
}

__global__
void saxpy(int n, float a, float *x, float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
  int sum = 0;
  //if (i < n) sum = sum + y[i];
  //if (i == n) printf("%d", sum); 
}

int main(void)
{
  int N = 1<<20;
  float *x, *y, *d_x, *d_y;

  cudaHostAlloc((void **)&x,  N*sizeof(float), cudaHostAllocMapped );
  cudaHostAlloc((void **)&y,  N*sizeof(float), cudaHostAllocMapped );

  cudaHostGetDevicePointer((void **)&d_x, x, 0);
  cudaHostGetDevicePointer((void **)&d_y, y, 0);

  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  cudaDeviceSynchronize();
  // Perform SAXPY on 1M elements
  //initialization<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);
  cudaProfilerStart();
  saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);
  cudaProfilerStop();
  cudaDeviceSynchronize();

  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmaxf(maxError, fabsf(y[i]-4.0f));  // explicit float versions of max/abs
  printf("Max error: %f\n", maxError);

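  // d_x and d_y alias the mapped host buffers x and y, so they must not be
  // passed to cudaFree; cudaFreeHost below releases the allocations.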
  cudaFreeHost(x);
  cudaFreeHost(y);
}

CUDA compilation: nvcc -Xptxas -O0 -Xptxas -dlcm=cg -Xptxas -dscm=cg -o saxpy_orig_cg_cg saxpy.cu

nvvp results :
Unified cache hit rate : 50%
L2 cache hit rate : 33%

nsight compute results :
Unified cache hit rate : 0.00%
L2 cache hit rate : 42.23%
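
A note for anyone comparing the two sets of numbers: a hit rate is a ratio of hits to total accesses, so two tools can disagree if they count different sets of accesses (for example, reads only versus reads and writes). The sketch below only illustrates that arithmetic. The struct, the function names, and the counter split are hypothetical and are not taken from either profiler, and nothing here claims that this is the cause of the discrepancy above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical raw counters for one cache level (not a profiler API).
struct CacheCounters {
    std::uint64_t read_hits;
    std::uint64_t read_misses;
    std::uint64_t write_hits;
    std::uint64_t write_misses;
};

// Hit rate in percent; returns 0.0 when there were no accesses at all.
double hit_rate_pct(std::uint64_t hits, std::uint64_t misses) {
    // Guard against overflow when summing the two counters.
    assert(hits <= UINT64_MAX - misses);
    const std::uint64_t total = hits + misses;
    if (total == 0) return 0.0;
    return 100.0 * static_cast<double>(hits) / static_cast<double>(total);
}

// Rate computed over read accesses only.
double read_hit_rate_pct(const CacheCounters& c) {
    return hit_rate_pct(c.read_hits, c.read_misses);
}

// Rate computed over reads and writes together.
double combined_hit_rate_pct(const CacheCounters& c) {
    return hit_rate_pct(c.read_hits + c.write_hits,
                        c.read_misses + c.write_misses);
}
```

With made-up counters of 3 read hits, 1 read miss, 0 write hits and 4 write misses, the read-only rate is 75% while the combined rate is 37.5%, so it is worth checking which accesses each reported metric actually covers before comparing them.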

It would be really great if someone could help me with this.
Thanks in advance.
