Measuring NVLink utilization for peer-to-peer access through Unified Virtual Addressing (UVA)

Hi there,

I’m trying to measure NVLink utilization when one GPU reads memory located on another GPU. I’ve enabled P2P access via UVA. My tests are on an 8-GPU Hopper system with NVLink interconnects, and I use Nsight Systems for profiling.

My command for profiling is as follows:

/tmp/nsight-systems-2023.3.1/bin/nsys profile --gpu-metrics-device=all
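
(The application and its argument come at the end of that command line; the binary name and exponent below are just placeholders:)

/tmp/nsight-systems-2023.3.1/bin/nsys profile --gpu-metrics-device=all ./p2p_read 28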

My test bench is below. I’m hardly seeing any NVLink utilization even though the program output is correct. I’ve used very large arrays, so I expect to see sizeable memory transfers. Any help in understanding this behavior would be really appreciated. Thanks!

#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <cuda_runtime.h>

#define WARP_SIZE 32

__global__
void saxpy(unsigned long long n, float a, float *x, float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}

__global__
void read(unsigned long long n, float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  float x;
  
  if (i < n && i%WARP_SIZE==0) {
      x= y[0];
  }
}

int main(int argc, char **argv)
{ 
  if (argc < 2) {
        std::cerr << "Usage: " << argv[0] << " <length array in exponent>" << std::endl;
        return -1;
  }
   
  unsigned long long exp = atoi(argv[1]);
  unsigned long long N = ((unsigned long long) 1) << exp;

  std::cout<<"N "<<N<<"\n";
  
  float *x, *y;
  float *d_x, *d_y;

  x = (float*)malloc(N*sizeof(float));
  y = (float*)malloc(N*sizeof(float));

  for (unsigned long long i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }
   
  cudaError_t err = cudaSuccess;
  err = cudaSetDevice(0);

  if (err != cudaSuccess) {
    printf("ERROR: %s\n", cudaGetErrorString(err));
  }

  printf("\n Now in device 0\n");
 
  cudaMalloc(&d_x, N*sizeof(float)); 
  cudaMalloc(&d_y, N*sizeof(float));

  cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);

  printf("Initializing vectors ");
  // Perform SAXPY on the elements
  saxpy<<<(N+1023)/1024, 1024>>>(N, 2.0f, d_x, d_y);
  
  
  // Moving to device 1
  err = cudaSetDevice(1);

  if (err != cudaSuccess) {
    printf("ERROR: %s\n", cudaGetErrorString(err));
  }

  printf("\n Now in device 1\n");

  err = cudaDeviceEnablePeerAccess(0, 0);
  if (err != cudaSuccess) {
    printf("ERROR: %s\n", cudaGetErrorString(err));
  }
  printf("Doing remote reads \n");
  read<<<(N+1023)/1024, 1024>>>(N, d_y);
  // make sure the remote reads have finished before switching devices and freeing d_y
  cudaDeviceSynchronize();
  
  // Moving to device 0
  err = cudaSetDevice(0);

  if (err != cudaSuccess) {
    printf("ERROR: %s\n", cudaGetErrorString(err));
  }
  
  cudaMemcpy(x, d_x, N*sizeof(float), cudaMemcpyDeviceToHost);
  cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);


  for (int i =0; i<5; i++)
    printf("\ny[%d]=%f\n", i, y[i]);

  cudaFree(d_x);
  cudaFree(d_y);

  free(x);
  free(y);

  return 0;

}

Your read kernel does not affect any visible global state. The compiler will likely optimize all the code out of it.

Please format code properly. You can do that in a simple fashion by editing your post - click the pencil icon below it, then select the code in the edit window, then click the </> button at the top of the edit window, then save your changes.

Ohh, right… thanks, that makes sense.

Thanks for your suggestions earlier. I’ve recompiled my code with -O0 and also stored the value of the read in the read kernel in a volatile variable. Looking at the PTX, I noticed that the loads were not optimized out by the compiler.

However, I still hardly see any NVLink traffic. Any suggestions would be great.

The following is my updated test bench.

#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <cuda_runtime.h>

#define WARP_SIZE 32

__global__
void saxpy(unsigned long long n, float a, float *x, float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}

__global__
void read(unsigned long long n, float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  volatile float x;
  
  if (i < n && i%WARP_SIZE==0) {
      x= y[0];
  }
}

int main(int argc, char **argv)
{ 
  if (argc < 2) {
        std::cerr << "Usage: " << argv[0] << " <length array in exponent>" << std::endl;
        return -1;
  }
   
  unsigned long long exp = atoi(argv[1]);
  unsigned long long N = ((unsigned long long) 1) << exp;

  std::cout<<"N "<<N<<"\n";
  
  float *x, *y;
  float *d_x, *d_y;

  x = (float*)malloc(N*sizeof(float));
  y = (float*)malloc(N*sizeof(float));

  for (unsigned long long i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }
   
  cudaError_t err = cudaSuccess;
  err = cudaSetDevice(0);

  if (err != cudaSuccess) {
    printf("ERROR: %s\n", cudaGetErrorString(err));
  }

  printf("\n Now in device 0\n");
 
  cudaMalloc(&d_x, N*sizeof(float)); 
  cudaMalloc(&d_y, N*sizeof(float));

  cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);

  printf("Initializing vectors ");
  // Perform SAXPY on the elements
  saxpy<<<(N+1023)/1024, 1024>>>(N, 2.0f, d_x, d_y);
  
  
  // Moving to device 1
  err = cudaSetDevice(1);

  if (err != cudaSuccess) {
    printf("ERROR: %s\n", cudaGetErrorString(err));
  }

  printf("\n Now in device 1\n");

  err = cudaDeviceEnablePeerAccess(0, 0);
  if (err != cudaSuccess) {
    printf("ERROR: %s\n", cudaGetErrorString(err));
  }
  printf("Doing remote reads \n");
  read<<<(N+1023)/1024, 1024>>>(N, d_y);
  // make sure the remote reads have finished before switching devices and freeing d_y
  cudaDeviceSynchronize();
  
  // Moving to device 0
  err = cudaSetDevice(0);

  if (err != cudaSuccess) {
    printf("ERROR: %s\n", cudaGetErrorString(err));
  }
  
  cudaMemcpy(x, d_x, N*sizeof(float), cudaMemcpyDeviceToHost);
  cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);


  for (int i =0; i<5; i++)
    printf("\ny[%d]=%f\n", i, y[i]);

  cudaFree(d_x);
  cudaFree(d_y);

  free(x);
  free(y);

  return 0;

}

That will mislead you. PTX goes through an optimizing compiler stage (ptxas) before it becomes the code the machine actually executes.

When I compile your code with -O0 (which is meaningless for the GPU device code, anyway), I observe that the SASS code for the read kernel has no global loads in it.
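
If you want to verify this yourself, you can dump both the PTX and the SASS from the compiled binary and look for global loads (LDG instructions) in the read kernel; the binary name here is just an example:

nvcc -o p2p_read p2p_read.cu
cuobjdump -ptx p2p_read    (an ld.global may still appear at the PTX level)
cuobjdump -sass p2p_read   (this is what the GPU actually executes)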

Thanks for your comment. Very interesting. Sorry, these are probably newbie mistakes for multi-GPU programming.

Doesn’t the use of volatile indicate that the load shouldn’t be optimized out?

Can you maybe suggest how I can get the LOAD to show up after compilation? Should I insert a PTX instruction through asm volatile?

perhaps do:

  if (i < (n-1)) {
      y[i+1] = y[i];
  }

This won’t necessarily preserve the data in the pattern you expect, of course. If you’re unable to parse what is going on here, or the change that I have made, you may wish to start with an orderly introduction to CUDA.
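
Put back into your read kernel, that change would look something like this (just a sketch; you could keep the i%WARP_SIZE==0 condition as well if you only want one access per warp, as long as the store stays):

__global__
void read(unsigned long long n, float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;

  // the store to y[i+1] makes the load of y[i] observable,
  // so the compiler cannot eliminate it as dead code
  if (i < (n-1)) {
      y[i+1] = y[i];
  }
}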

Okay, yeah, the loads need a write for them to show up. I was wondering if there was a way for reads to show up even if they don’t impact global state in any way.

I’m not aware of any way. It might be the case that if you compile with -G (which does affect device code generation) the reads may propagate all the way to SASS. However, I don’t consider that a guarantee, and -G often has other side effects that people who are doing this kind of work may find objectionable (more instructions, reduced performance, etc.).

When working in this area, I am of the opinion that it is often best to verify your expectations of code generation by studying the SASS. PTX is mostly useless for this kind of verification.

Thanks… really appreciate your advice on this. I’m trying something out.