Question about bandwidth test

I wanted to test my kernel's bandwidth, but I found that what ncu measured was very different from what I calculated myself.
My test code is as follows. The GPU is an A800 and the CUDA version is 12.

#include <cuda_runtime.h>
#include <stdio.h>

#define BDIMX 256
#define BDIMY 4

__global__ void copyRow(float * in,float * out,int nx,int ny)
{
    // each thread copies 4 elements spaced blockDim.x apart (a manual 4x unroll)
    int ix=threadIdx.x+blockDim.x*blockIdx.x*4;
    int iy=threadIdx.y+blockDim.y*blockIdx.y;
    int idx=ix+iy*nx;
    if (ix<nx && iy<ny)
    {
      out[idx]=in[idx];
      out[idx+1*blockDim.x]=in[idx+1*blockDim.x];
      out[idx+2*blockDim.x]=in[idx+2*blockDim.x];
      out[idx+3*blockDim.x]=in[idx+3*blockDim.x];

    }
}
int main(int argc,char** argv)
{
  printf("strating...\n");
  int nx=1<<12;
  int ny=1<<12;
  int dimx=BDIMX;
  int dimy=BDIMY;
  int nxy=nx*ny;
  int nBytes=nxy*sizeof(float);
  int transform_kernel=0;

  //Malloc
  float* A_host=(float*)malloc(nBytes);
  float* B_host_cpu=(float*)malloc(nBytes);
  float* B_host=(float*)malloc(nBytes);
  initialData(A_host,nxy);

  //cudaMalloc
  float *A_dev=NULL;
  float *B_dev=NULL;
  CHECK(cudaMalloc((void**)&A_dev,nBytes));
  CHECK(cudaMalloc((void**)&B_dev,nBytes));

  CHECK(cudaMemcpy(A_dev,A_host,nBytes,cudaMemcpyHostToDevice));
  CHECK(cudaMemset(B_dev,0,nBytes));

  // 2d block and 2d grid
  dim3 block(dimx,dimy);
  dim3 grid_1((nx-1)/(block.x*4)+1,(ny-1)/block.y+1);
  copyRow<<<grid_1,block>>>(A_dev,B_dev,nx,ny);
  CHECK(cudaDeviceSynchronize());
  CHECK(cudaMemcpy(B_host,B_dev,nBytes,cudaMemcpyDeviceToHost));

  cudaFree(A_dev);
  cudaFree(B_dev);
  free(A_host);
  free(B_host);
  free(B_host_cpu);
  cudaDeviceReset();
  return 0;
}
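(CHECK and initialData are not shown above; they are just a standard error-checking macro and a simple data initializer, roughly like the following sketch — an assumption, not necessarily the exact code used:)

#include <stdlib.h>   // for rand() and exit()

#define CHECK(call)                                                   \
{                                                                     \
    const cudaError_t error = (call);                                 \
    if (error != cudaSuccess)                                         \
    {                                                                 \
        printf("CUDA error %s:%d: %s\n", __FILE__, __LINE__,          \
               cudaGetErrorString(error));                            \
        exit(1);                                                      \
    }                                                                 \
}

// fill the host buffer with small random values
void initialData(float *ip, int size)
{
    for (int i = 0; i < size; i++)
        ip[i] = (float)(rand() & 0xFF) / 10.0f;
}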

The kernel time measured by nsys is 91009 ns.

[6/8] Executing 'cuda_gpu_kern_sum' stats report

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                 Name                
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  -----------------------------------
    100.0            91009          1   91009.0   91009.0     91009     91009          0.0  copyRow(float *, float *, int, int)

[7/8] Executing 'cuda_gpu_mem_time_sum' stats report

 Time (%)  Total Time (ns)  Count   Avg (ns)    Med (ns)   Min (ns)  Max (ns)  StdDev (ns)      Operation     
 --------  ---------------  -----  ----------  ----------  --------  --------  -----------  ------------------
     81.2         34004415      1  34004415.0  34004415.0  34004415  34004415          0.0  [CUDA memcpy DtoH]
     18.7          7816603      1   7816603.0   7816603.0   7816603   7816603          0.0  [CUDA memcpy HtoD]
      0.1            38272      1     38272.0     38272.0     38272     38272          0.0  [CUDA memset]     

[8/8] Executing 'cuda_gpu_mem_size_sum' stats report

 Total (MB)  Count  Avg (MB)  Med (MB)  Min (MB)  Max (MB)  StdDev (MB)      Operation     
 ----------  -----  --------  --------  --------  --------  -----------  ------------------
     67.109      1    67.109    67.109    67.109    67.109        0.000  [CUDA memcpy DtoH]
     67.109      1    67.109    67.109    67.109    67.109        0.000  [CUDA memcpy HtoD]
     67.109      1    67.109    67.109    67.109    67.109        0.000  [CUDA memset]   

So the bandwidth = 67.109 MB * 2 / 1000 / 0.000091009 s ≈ 1474.77 GB/s.
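(The same number can be cross-checked inside the program by timing the kernel with CUDA events and applying the same formula; a minimal sketch using the variables from the code above, not part of the original run:)

  cudaEvent_t start, stop;
  CHECK(cudaEventCreate(&start));
  CHECK(cudaEventCreate(&stop));
  CHECK(cudaEventRecord(start));
  copyRow<<<grid_1,block>>>(A_dev,B_dev,nx,ny);
  CHECK(cudaEventRecord(stop));
  CHECK(cudaEventSynchronize(stop));
  float ms = 0.0f;
  CHECK(cudaEventElapsedTime(&ms, start, stop));
  // the kernel reads nBytes and writes nBytes, so 2 * nBytes moved in total
  double gbps = 2.0 * nBytes / (ms / 1000.0) / 1e9;   // decimal GB/s
  printf("event-timed bandwidth: %.2f GB/s\n", gbps);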

But the bandwidth measured by ncu is as follows:

Using device 0: NVIDIA A800 80GB PCIe
CPU Execution Time elapsed 0.488664 sec
==PROF== Profiling "copyRow" - 0: 0%....50%....100% - 1 pass
copyRow  Time elapsed 0.588242 sec
Results don't match!
115.000000(gpu_result[1] )!= 126.000000(cpu_result[1])
==PROF== Disconnected from process 3128802
[3128802] a.out@127.0.0.1
  copyRow(float *, float *, int, int) (4, 1024, 1)x(256, 4, 1), Context 1, Stream 7, Device 0, CC 8.0
    Section: Command line profiler metrics
    -------------------------------- ------------ ------------
    Metric Name                       Metric Unit Metric Value
    -------------------------------- ------------ ------------
    dram__bytes_read.sum.per_second  Gbyte/second       729.49
    dram__bytes_write.sum.per_second Gbyte/second       594.29
    -------------------------------- ------------ ------------

The bandwidth measured by ncu is 729.49 + 594.29 = 1323.78 GB/s.
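(For reference, these DRAM throughput numbers come from the command-line profiler metrics option, presumably collected with something along the lines of: ncu --metrics dram__bytes_read.sum.per_second,dram__bytes_write.sum.per_second ./a.out)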

1323.78 / 1474.77 = 89.76%. This is not a negligible error. What went wrong?

You probably need to consider L2 traffic as well; that could explain why the DRAM traffic is lower. The DRAM write bandwidth is noticeably lower, whereas if you doubled the read bandwidth you would be pretty close to your calculated number.

When the kernel is complete, I’m not sure it is a given that all L2 cache lines have been flushed to DRAM.

I’m sorry, I don’t quite understand what that means.

Please refer to the memory hierarchy diagram from the profiler.

When your kernel code writes to the logical global space (issuing write operations in the upper left hand green box labelled global), that activity will land in the L2 cache. The L2 cache is often described as a “write back” cache, which means that write traffic to the L2 does not immediately/automatically generate DRAM traffic. DRAM write traffic gets generated when L2 cache lines get flushed (“evicted”) to DRAM.

In one of your measurements, you are calculating the total activity passing through the green box in the upper left hand corner. In the other measurement, you are measuring the traffic flowing on the path that connects to “Device Memory” blue box on the right hand side of the diagram. These two traffic measurements are not guaranteed to be automatically, instantaneously, and identically equal.
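One way to see both at once (assuming these metric names are available on your setup; ncu --query-metrics will confirm) is to collect L2 write traffic next to DRAM write traffic:

ncu --metrics lts__t_sectors_op_write.sum,dram__bytes_write.sum ./a.out

Each L2 sector is 32 bytes, so if 32 * lts__t_sectors_op_write.sum comes out noticeably larger than dram__bytes_write.sum, some of the written lines were still sitting in L2 when the kernel ended.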

The L2 cache has a fixed size, so my guess would be that the discrepancy (expressed as a percentage) in the two measurements decreases as you increase the amount of data moved. Try moving 10GB instead of 67MB.
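For example, sizing the problem like this (just a sketch; note that nBytes has to become a size_t, because the int used now would overflow past 2 GB):

  size_t nx = 1 << 15;                      // 32768
  size_t ny = 1 << 15;                      // 32768
  size_t nBytes = nx * ny * sizeof(float);  // 4 GiB per buffer
  float *A_dev = NULL, *B_dev = NULL;
  cudaMalloc((void**)&A_dev, nBytes);
  cudaMalloc((void**)&B_dev, nBytes);
  // one full read plus one full write in the kernel => ~8.6 GB of traffic,
  // which dwarfs the ~40 MB L2 on this class of GPU
  // (the kernel's int indexing still fits at 32768 x 32768, but not much beyond)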

@Robert_Crovella I tried to move more data (about 1 GB) to test bandwidth; code as follows:

#include <cuda_runtime.h>
#include <stdio.h>

#define BDIMX 256
#define BDIMY 4 

__global__ void copyRow(float * in,float * out,int nx,int ny)
{
    int ix=threadIdx.x+blockDim.x*blockIdx.x*4;
    int iy=threadIdx.y+blockDim.y*blockIdx.y;
    int idx=ix+iy*nx;
    if (ix<nx && iy<ny)
    {
      out[idx]=in[idx]+1;
      out[idx+blockDim.x*1]=in[idx+blockDim.x*1]+1;
      out[idx+blockDim.x*2]=in[idx+blockDim.x*2]+1;
      out[idx+blockDim.x*3]=in[idx+blockDim.x*3]+1;

    }
}

int main(int argc,char** argv)
{
  printf("strating...\n");
  int nx=1<<14;
  int ny=1<<14;
  int dimx=BDIMX;
  int dimy=BDIMY;
  int nxy=nx*ny;
  int nBytes=nxy*sizeof(float);

  //Malloc
  float* A_host=(float*)malloc(nBytes);
  float* B_host_cpu=(float*)malloc(nBytes);
  float* B_host=(float*)malloc(nBytes);

  //cudaMalloc
  float *A_dev=NULL;
  float *B_dev=NULL;
  cudaMalloc((void**)&A_dev,nBytes);
  cudaMalloc((void**)&B_dev,nBytes);

  // 2d block and 2d grid
  dim3 block(dimx,dimy);
  dim3 grid_1((nx-1)/(block.x*4)+1,(ny-1)/block.y+1);

  cudaMemcpy(A_dev,A_host,nBytes,cudaMemcpyHostToDevice);
  cudaMemset(B_dev,0,nBytes);
  copyRow<<<grid_1,block>>>(A_dev,B_dev,nx,ny);
  cudaMemcpy(B_host,B_dev,nBytes,cudaMemcpyDeviceToHost);

  cudaFree(A_dev);
  cudaFree(B_dev);
  free(A_host);
  free(B_host);
  free(B_host_cpu);
  cudaDeviceReset();
  return 0;
}

I got nsys time as follows:

[6/8] Executing 'cuda_gpu_kern_sum' stats report

 Time (%)  Total Time (ns)  Instances  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)                 Name
 --------  ---------------  ---------  ---------  ---------  --------  --------  -----------  -----------------------------------
    100.0          1381866          1  1381866.0  1381866.0   1381866   1381866          0.0  copyRow(float *, float *, int, int)

[7/8] Executing 'cuda_gpu_mem_time_sum' stats report

 Time (%)  Total Time (ns)  Count   Avg (ns)     Med (ns)    Min (ns)   Max (ns)   StdDev (ns)      Operation
 --------  ---------------  -----  -----------  -----------  ---------  ---------  -----------  ------------------
     74.8        933621105      1  933621105.0  933621105.0  933621105  933621105          0.0  [CUDA memcpy DtoH]
     25.2        314453684      1  314453684.0  314453684.0  314453684  314453684          0.0  [CUDA memcpy HtoD]
      0.0           563077      1     563077.0     563077.0     563077     563077          0.0  [CUDA memset]

[8/8] Executing 'cuda_gpu_mem_size_sum' stats report

 Total (MB)  Count  Avg (MB)  Med (MB)  Min (MB)  Max (MB)  StdDev (MB)      Operation
 ----------  -----  --------  --------  --------  --------  -----------  ------------------
   1073.742      1  1073.742  1073.742  1073.742  1073.742        0.000  [CUDA memcpy DtoH]
   1073.742      1  1073.742  1073.742  1073.742  1073.742        0.000  [CUDA memcpy HtoD]
   1073.742      1  1073.742  1073.742  1073.742  1073.742        0.000  [CUDA memset]

I got ncu bandwidth as follows:

  copyRow(float *, float *, int, int) (16, 4096, 1)x(256, 4, 1), Context 1, Stream 7, Device 0, CC 8.0
    Section: Command line profiler metrics
    -------------------------------- ------------ ------------
    Metric Name                       Metric Unit Metric Value
    -------------------------------- ------------ ------------
    dram__bytes_read.sum.per_second  Gbyte/second       736.29
    dram__bytes_write.sum.per_second Gbyte/second       727.80
    -------------------------------- ------------ ------------

The bandwidth measured by nsys is 1073.742 MB * 2 / 1000 / 0.001381866 s ≈ 1554 GB/s, and the bandwidth measured by ncu is 736 + 727 = 1463 GB/s.

There is still a difference of nearly 100 GB/s.

I tried to minimize the impact of this part by increasing the data size, but I can't understand why there is still a gap in bandwidth.

I really wonder which method gives the more accurate bandwidth figure.

1. Check to make sure the kernel duration in nsys is the same as the kernel duration in ncu.

2. Check on the nsight compute or nsight systems forums whether the accounting for all these gigabytes and megabytes is the same. Your calculation seems to assume 1000 megabytes in a gigabyte, but I think that number is actually 1024. And yes, I understand that does not account for a full 10% difference, but it may be that in one case the gigabyte is 1000x1000x1000 bytes and in the other case it is 1024x1024x1024 bytes. I don't happen to recall, and if I were trying to resolve this difference, I would want to rule out a calculation issue like that.

In the situation now, the metric numbers are all consistent with each other: both read and write bandwidth are around 730 GB/s. But 40 MB of unwritten data could still account for a ~4% difference out of a gigabyte (40 MB / 1073.7 MB ≈ 3.7%), which is why I suggested 10 GB.

Calculation-wise, the difference between 1000x1000x1000 bytes and 1024x1024x1024 bytes as the unit basis gives a fairly close match to the numbers presented.

2*1073741824 bytes / 0.001381866 s / (1000*1000*1000) = 1554 GB/s
2*1073741824 bytes / 0.001381866 s / (1024*1024*1024) = 1447 GB/s

So if a GB = 1000x1000x1000 bytes, then the calculated measurement would be 1554 GB/s. If a GB = 1024x1024x1024 bytes, the calculated measurement would be 1447 GB/s.
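Note that 1554 / 1447 ≈ 1.074, which is just the 1024x1024x1024 / 1000x1000x1000 ≈ 1.0737 conversion factor, so the choice of unit basis fully explains this particular pair of numbers.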

I think we can get the answer from the nsys output, as follows:

In this test, 2^30 bytes of data were transferred from host to device. If nsys used 1 GB = 1024*1024*1024 bytes (and so 1 MB = 1024*1024 bytes), the Total (MB) in the table above should read 1024 MB, not 1073.742 MB.

With 1 GB = 1000*1000*1000 bytes (1 MB = 1000*1000 bytes), 2^30 bytes = 1073741824 / 1000 / 1000 = 1073.74 MB, which matches the Total (MB) column in the table exactly.

In ncu, I find that 1.28 TB/s corresponds to 1277008927545 bytes/s, so I believe ncu also uses 1 GB = 1000*1000*1000 bytes.
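(As a check: 1277008927545 / 10^12 ≈ 1.28, whereas 1277008927545 / 2^40 ≈ 1.16, so the displayed 1.28 TB/s only works out with decimal units.)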

So I think the kernel duration reported by nsys versus ncu is the key to the problem, but I don't know how to confirm it.

For ncu, the kernel duration is listed in the SOL (Speed Of Light) report section, which is usually the first report section (above the memory workload analysis section that you have depicted).
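(If it helps, the duration can also be pulled out directly as a metric, e.g. with something like ncu --metrics gpu__time_duration.sum ./a.out; I'd confirm the exact metric name on your setup with ncu --query-metrics.)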

You already have the nsys-measured kernel duration; that is what you are using in your calculations.