P2P Bandwidth 150% higher than maximum achievable

Hello,

I have an issue regarding the bandwidth between my 2 GPUs (RTX A4500).

They are connected via PCIe 4.0 x16, and my motherboard is an MBD-X12DPG-OA6.

Here is the output of the CUDA sample p2pBandwidthLatencyTest:

[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA RTX A4500, pciBusID: 4f, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA RTX A4500, pciBusID: 52, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1
     0	     1     1
     1	     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 560.44  17.37 
     1  17.94 562.05 
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1 
     0 562.25  46.23 
     1  39.22 561.44 
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 566.12  19.75 
     1  19.22 566.74 
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 566.43  92.45 
     1  92.37 566.84 
P2P=Disabled Latency Matrix (us)
   GPU     0      1 
     0   2.29  20.54 
     1  11.64   2.31 

   CPU     0      1 
     0   2.66   6.89 
     1   6.85   2.66 
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1 
     0   2.29   1.31 
     1   1.37   2.30 

   CPU     0      1 
     0   2.75   2.01 
     1   2.07   2.69 

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

NB: I get very similar values with other measurement methods, such as nvbandwidth and hand-written code.

NB2: If I use the --sm_copy option, I reach ~250 GB/s for unidirectional P2P between Device 1 and Device 0.
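
(As I understand it, the SM-copy mode performs the copy with a kernel running on the SMs instead of the copy engines. Just to illustrate the idea, and definitely not nvbandwidth’s actual code, here is a rough sketch of such a copy kernel; it assumes peer access is already enabled and that src is a device pointer resident on the other GPU, and it would be launched with enough blocks to keep the link busy:)

__global__ void sm_peer_copy(const int4 *src, int4 *dst, size_t n)
{
    // Grid-stride loop: each thread streams elements from the peer buffer
    // (src, resident on the other GPU) into the local buffer (dst), so the
    // traffic is generated by the SMs rather than the copy engines.
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        dst[i] = src[i];
}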

The maximum throughput of PCIe 4.0 x16 is about 32 GB/s, yet I measure ~45 GB/s for unidirectional P2P between Device 1 and Device 0, which should be impossible.
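
(For reference, the arithmetic behind that 32 GB/s figure: PCIe 4.0 signals at 16 GT/s per lane with 128b/130b encoding, so a x16 link gives roughly 16 × 16 × 128/130 ÷ 8 ≈ 31.5 GB/s of payload bandwidth per direction, and real transfers land a bit below that once protocol overhead is included.)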

Here is a copy of my nvidia-smi:

Thu Apr  6 17:19:02 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.86.01    Driver Version: 515.86.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A4500    On   | 00000000:4F:00.0 Off |                  Off |
| 30%   28C    P8    20W / 200W |      0MiB / 20470MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A4500    On   | 00000000:52:00.0 Off |                  Off |
| 30%   27C    P8     7W / 200W |      0MiB / 20470MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

And a copy of my nvidia-smi topo -m:

	GPU0	GPU1	mlx5_0	mlx5_1	CPU Affinity	NUMA Affinity
GPU0	 X 	PXB	PXB	PXB	0-11,24-35	0
GPU1	PXB	 X 	PXB	PXB	0-11,24-35	0
mlx5_0	PXB	PXB	 X 	PIX		
mlx5_1	PXB	PXB	PIX	 X 		

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

Do you know what may cause this? I suspect CUDA is somehow optimizing the transfers by not really doing a true Device 0 → Device 1 copy, but I don’t truly understand how.

Thank you in advance !

Wild speculation here: in an earlier post (assuming this thread is about the same system), you had NVLink enabled.

Assuming you have removed the NVLink bridge, do you still have the --sli="mosaic" entry in your xorg.conf?

If so, I wondered if it could be contributing to the anomalous results.

I see your cards are capable of 112 GB/s over NVLink (presumably 56 GB/s each way). Out of curiosity, why are you no longer using it?

Thank you for replying.

The --sli="mosaic" option is indeed still enabled in my xorg.conf, because as far as I know it is what enables SLI, and to my understanding NVLink needs SLI to work. Also, I suspect that if I disable SLI, I will go back to the previous problem you pointed out.

And I’m currently trying to understand what’s happening for multiple reasons:

1- I had exactly the same bandwidth with and without NVLink, which does not make sense if the NVLink was functional.

2- In the output of nvidia-smi nvlink -c, the line about NVLink (I presume) told me that the “Link” was not functional.

3- If the numbers I’m currently seeing do not make sense, how could I trust those measured with NVLink activated?

So I would like to use my NVLink; the issue is more that it is not working as expected.

My misunderstanding. What made me think you were no longer using NVLink is that in the previous thread the output from nvidia-smi topo -m showed NV4 for both cards, whereas in the output above it shows PXB for both, so I assumed you had removed the NVLink bridge.

I guess you’ve seen Robert’s recent remarks on this topic.

Yes I have, and as you can see in the output of p2pBandwidthLatencyTest, both devices can peer-access each other.
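
For reference, that is also what the runtime reports when queried directly. Here is a minimal sketch (assuming just devices 0 and 1) using cudaDeviceCanAccessPeer and cudaDeviceGetP2PAttribute:

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    // Ask the runtime whether each device can access the other's memory.
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);   // can device 0 access device 1?
    cudaDeviceCanAccessPeer(&can10, 1, 0);   // can device 1 access device 0?
    printf("peer access 0 -> 1: %d, 1 -> 0: %d\n", can01, can10);

    // Relative performance rank the runtime assigns to the 0 -> 1 P2P link.
    int rank = 0;
    cudaDeviceGetP2PAttribute(&rank, cudaDevP2PAttrPerformanceRank, 0, 1);
    printf("P2P performance rank 0 -> 1: %d\n", rank);
    return 0;
}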


That’s not correct. Furthermore, I generally recommend disabling SLI when using CUDA, unless you have a valid, specific reason to have it enabled, for the reasons covered in the programming guide. However, I can’t explain what transpired in your previous post, nor what is happening here.

Thank you for correcting me. I am not sure what SLI really is, or what --sli="mosaic" really does.
Anyway, after a day of investigation, I ran the simpleP2P CUDA sample, and this happened:

[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 2

Checking GPU(s) for support of peer to peer memory access...
> Peer access from NVIDIA RTX A4500 (GPU0) -> NVIDIA RTX A4500 (GPU1) : Yes
> Peer access from NVIDIA RTX A4500 (GPU1) -> NVIDIA RTX A4500 (GPU0) : Yes
Enabling peer access between GPU0 and GPU1...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 36.23GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Verification error @ element 1: val = 0.000000, ref = 4.000000
Verification error @ element 2: val = 0.000000, ref = 8.000000
Verification error @ element 3: val = 0.000000, ref = 12.000000
Verification error @ element 4: val = 0.000000, ref = 16.000000
Verification error @ element 5: val = 0.000000, ref = 20.000000
Verification error @ element 6: val = 0.000000, ref = 24.000000
Verification error @ element 7: val = 0.000000, ref = 28.000000
Verification error @ element 8: val = 0.000000, ref = 32.000000
Verification error @ element 9: val = 0.000000, ref = 36.000000
Verification error @ element 10: val = 0.000000, ref = 40.000000
Verification error @ element 11: val = 0.000000, ref = 44.000000
Verification error @ element 12: val = 0.000000, ref = 48.000000
Disabling peer access...
Shutting down...
Test failed!

Before finding this issue, I was trying to write something myself and ended up with this:

#include <cuda_runtime.h>

#include <iostream>

// Macros to check cuda errors

#define CUDA_CALL(call)                                                                     \
  {                                                                                         \
    cudaError_t result = call;                                                              \
    if (cudaSuccess != result) {                                                            \
      std::cerr << "CUDA error " << result << " in " << __FILE__ << ":" << __LINE__ << ": " \
                << cudaGetErrorString(result) << " (" << #call << ")" << std::endl;         \
      exit(EXIT_FAILURE);                                                                   \
    }                                                                                       \
  }

#define CHK_CUDA_ERR                                                                              \
    {                                                                                             \
        cudaError_t error = cudaGetLastError();                                                   \
        if (cudaSuccess != error) {                                                              \
            std::cerr << "CUDA error " << error << " in " << __FILE__ << ":" << __LINE__ << ": " \
                    << cudaGetErrorString(error) << std::endl;           \
            exit(EXIT_FAILURE);                                                                   \
        }                                                                                         \
    }

// Kernels to increment

__global__ void plus_one(int *m){

    *m = *m + 1;

}

__global__ void plus(int *in, int *out){  // unused in this test
    *out = in[0] + 1;
}

int main(){

    // Constants init

    int N = 1;

    int dimGrid = 1;
    int dimBlock = N;

    int size = N * sizeof(int);

    // Value storages

    int *d0, *d1;

    int verif[2];
    
    // Initialization for device 0

    CUDA_CALL(cudaSetDevice(0))
    CUDA_CALL(cudaMalloc(&d0, size))
    CUDA_CALL(cudaMemset(d0, 0, size))
    CUDA_CALL(cudaDeviceEnablePeerAccess(1,0))  // let device 0 (current) access device 1's memory

    // Initialization for device 1

    CUDA_CALL(cudaSetDevice(1))
    CUDA_CALL(cudaMalloc(&d1, size))
    CUDA_CALL(cudaMemset(d1, 0, size))
    CUDA_CALL(cudaDeviceEnablePeerAccess(0,0))  // let device 1 (current) access device 0's memory

    // Incrementation on Device 0

    CUDA_CALL(cudaSetDevice(0))
    
    plus_one<<<dimGrid, dimBlock>>>(d0);
    CHK_CUDA_ERR

    // Incrementation on top of the previous one on Device 1

    CUDA_CALL(cudaMemcpyPeer(d1, 1, d0, 0, size))

    CUDA_CALL(cudaSetDevice(1))
    plus_one<<<dimGrid, dimBlock>>>(d1);
    CHK_CUDA_ERR
    CUDA_CALL(cudaSetDevice(0))

    // Pull values for verification

    CUDA_CALL(cudaMemcpy(verif, d0, size, cudaMemcpyDeviceToHost))
    
    CUDA_CALL(cudaMemcpy(verif + 1, d1, size, cudaMemcpyDeviceToHost))

    // Print in stdout

    std::cout << "On Device 0 (should be 1): " << verif[0] << std::endl;
    std::cout << "On Device 1 (should be 2): " << verif[1] << std::endl;

    // Frees + disable peer access

    CUDA_CALL(cudaSetDevice(0))
    CUDA_CALL(cudaFree(d0))
    CUDA_CALL(cudaDeviceDisablePeerAccess(1))

    CUDA_CALL(cudaSetDevice(1))
    CUDA_CALL(cudaFree(d1))
    CUDA_CALL(cudaDeviceDisablePeerAccess(0))

}

And this output:

On Device 0 (should be 1): 1
On Device 1 (should be 2): 1

Important fact: commenting out all the cudaDeviceEnablePeerAccess and cudaDeviceDisablePeerAccess calls changes the output to the expected one. So peer-to-peer is definitely broken here.
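
In case it helps anyone reproduce this, here is a smaller sketch that isolates just the copy path, with no kernels involved: it writes a known pattern on device 0, peer-copies it to device 1, reads it back on the host and counts mismatches. It assumes devices 0 and 1 and enables peer access the same way as above (error checking omitted for brevity):

#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main() {
    const size_t n = 1 << 20;                      // 1M ints (~4 MB)
    std::vector<int> host(n), back(n, 0);
    for (size_t i = 0; i < n; ++i) host[i] = (int)i;

    int *d0 = nullptr, *d1 = nullptr;

    // Device 0: upload the reference pattern and enable access to device 1.
    cudaSetDevice(0);
    cudaMalloc(&d0, n * sizeof(int));
    cudaMemcpy(d0, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaDeviceEnablePeerAccess(1, 0);

    // Device 1: zero-filled destination buffer, enable access to device 0.
    cudaSetDevice(1);
    cudaMalloc(&d1, n * sizeof(int));
    cudaMemset(d1, 0, n * sizeof(int));
    cudaDeviceEnablePeerAccess(0, 0);

    // Peer copy device 0 -> device 1, then pull the result back to the host.
    cudaMemcpyPeer(d1, 1, d0, 0, n * sizeof(int));
    cudaMemcpy(back.data(), d1, n * sizeof(int), cudaMemcpyDeviceToHost);

    size_t mismatches = 0;
    for (size_t i = 0; i < n; ++i)
        if (back[i] != host[i]) ++mismatches;
    printf("mismatches: %zu of %zu\n", mismatches, n);

    cudaFree(d0);
    cudaFree(d1);
    return 0;
}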

If you have any clues, that would help a lot. I will try disabling SLI on Tuesday, but I suspect the issue from my previous post will come back.

In any case, thank you for the time you invest helping me.

Hi Antoine,

Just out of interest, does your system pass the PCI bridge test as outlined at the very end of this document?

I’m guessing it probably does, given other results you’ve presented so far.

I know that you disabled VT-d and ACS in the BIOS, but if you haven’t already tried, it might be worth testing with “SR-IOV Support” and “ARI Support” disabled in the “PCIe/PCI/PnP Configuration”.

Just out of interest, does your system pass the PCI bridge test as outlined at the very end of [this] document?

Yes, the output is pretty similar to the “good” version.

I know that you disabled VT-d and ACS in the BIOS, but if you haven’t already tried, it might be worth testing with “SR-IOV Support” and “ARI Support” disabled in the “PCIe/PCI/PnP Configuration”.

I just tried and it does not help.

I will try to use my GPUs in another computer and see if I can reproduce these numbers.

Thank you anyway.

So, I have tried my setup in another computer and the results seem better:

[./workspace/antoine/cuda-samples/Samples/0_Introduction/simpleP2P/simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 2

Checking GPU(s) for support of peer to peer memory access...
> Peer access from NVIDIA RTX A4500 (GPU0) -> NVIDIA RTX A4500 (GPU1) : Yes
> Peer access from NVIDIA RTX A4500 (GPU1) -> NVIDIA RTX A4500 (GPU0) : Yes
Enabling peer access between GPU0 and GPU1...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 48.92GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Disabling peer access...
Shutting down...
Test passed

The output of my previous code:

On Device 0 (should be 1): 1
On Device 1 (should be 2): 2

The driver version installed there is 515.43.04 and the motherboard is a Dell 02M8NY A04.

I do not think I will investigate further for the moment; I just do not understand what is happening behind the scenes. Even though my issue is not resolved on my main motherboard, I will deal with it.

Thank you.