Very confused about the number of bytes stored to HBM

Hi,

I wrote a test case that launches a kernel 10 times; each launch stores 10MB of data to a different, non-overlapping global memory address range. I expected 10MB to be written to HBM on each launch, but the metric “dram__bytes_write.sum” reports only ~2kB.
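To spell out the arithmetic behind that expectation, here is a rough host-side sketch. It assumes 10MB means 10 * 1024 * 1024 bytes and 4-byte elements; the names are illustrative and are not the constants used in my test.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: "10MB" is taken as 10 * 1024 * 1024 bytes. kBytesPerLaunch is an
// illustrative name, not a constant from the test program.
constexpr std::size_t kBytesPerLaunch = 10u * 1024u * 1024u;

// Number of 4-byte elements one launch has to store to cover `bytes` bytes.
constexpr std::size_t elements_per_launch(std::size_t bytes) {
    return bytes / sizeof(std::uint32_t);
}

// Total bytes stored by `launches` launches into non-overlapping buffers.
constexpr std::size_t expected_total_bytes(std::size_t bytes_per_launch,
                                           std::size_t launches) {
    return bytes_per_launch * launches;
}
```

Under these assumptions each launch covers 2,621,440 elements, and 10 launches add up to 104,857,600 bytes, which is far more than the ~2kB reported per launch.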

PS: I disabled Nsight Compute’s cache flush behavior with “--cache-control none”, and I am sure that all 10 * 10MB of data are written to HBM, since I did a DtoH copy after the kernel launches and checked the result.

My confusion is: why is the number of L2-to-HBM bytes reported by Nsight Compute so different from the number of bytes actually written?

Profile command:
ncu -o my_test --cache-control none --set full ./my_test

My kernel code:

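// Each thread stores the constant 10 to dst in a strided loop; src is unused.
__global__ void my_kernel(unsigned int *dst, unsigned int *src) {
  uint32_t gid = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
  // THREAD_NUM is the stride between the elements handled by one thread.
  for (int i = gid; i < ELM_NUM; i = i + THREAD_NUM)
    dst[i] = 10;
}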

And the host code snippet:

  for (int i = 0; i < KERNEL_ITER; i++) {
    my_kernel<<<gridDims, blkDims>>>((unsigned int *)d_D_array[i], (unsigned int *)d_A_array[i]);
  }
  cudaDeviceSynchronize();
  int failed = 0;
  for (int i = 0; i < KERNEL_ITER; i++) {
    cudaMemcpy(h_D_array[i], d_D_array[i], dst_sz, cudaMemcpyDeviceToHost);
    for (int j = 0; j < ELM_NUM; j++) {
      if (h_D_array[i][j] != 10) {
        printf("iter %d check failed, val = %d\n", i, h_D_array[i][j]);
        failed = 1;
      }
    }
  }

  if (failed)
    printf("check failed\n");
  else
    printf("check ok\n");