Question about PCI-E transfer throughput

I am trying the code below on an A100 with PCIe 4.0, and I want to measure the H2D and D2H throughput.

I get about 24 GB/s in each direction when running the transfers individually, but the throughput drops considerably when running D2H and H2D together. Can you give some advice?

H2D Transfer Time: 0.0783002 seconds
D2H Transfer Time: 0.0808056 seconds
Parallel H2D and D2H Transfer Time: 0.122456 seconds
H2D Bandwidth: 24.944 GB/s
D2H Bandwidth: 24.1707 GB/s
Parallel Bandwidth: 15.9496 GB/s
#include <iostream>
#include <cuda_runtime.h>
#include <chrono>

void checkCudaError(cudaError_t err, const char* msg) {
    if (err != cudaSuccess) {
        std::cerr << msg << ": " << cudaGetErrorString(err) << std::endl;
        exit(-1);
    }
}

int main() {
    size_t dataSize = 500 * (1 << 20);  // data size: 50 MB, can be changed to another size

    // 1. Allocate host and device memory
    float *h_data = nullptr, *d_data = nullptr;

    // allocate pinned host memory and device memory
    checkCudaError(cudaMallocHost(&h_data, dataSize * sizeof(float)), "cudaMallocHost failed");
    checkCudaError(cudaMalloc(&d_data, dataSize * sizeof(float)), "cudaMalloc failed");

    // initialize host data
    for (size_t i = 0; i < dataSize; ++i) {
        h_data[i] = static_cast<float>(i);
    }

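    // untimed warm-up copies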
    checkCudaError(cudaMemcpy(d_data, h_data, dataSize * sizeof(float), cudaMemcpyHostToDevice), "H2D memcpy failed");
    checkCudaError(cudaMemcpy(h_data, d_data, dataSize * sizeof(float), cudaMemcpyDeviceToHost), "D2H memcpy failed");

    // 2. Measure H2D (host-to-device) transfer time
    auto start = std::chrono::high_resolution_clock::now();
    checkCudaError(cudaMemcpy(d_data, h_data, dataSize * sizeof(float), cudaMemcpyHostToDevice), "H2D memcpy failed");
    auto end = std::chrono::high_resolution_clock::now();
    auto durationH2D = std::chrono::duration<double>(end - start).count();
    std::cout << "H2D Transfer Time: " << durationH2D << " seconds" << std::endl;

    // 3. Measure D2H (device-to-host) transfer time
    start = std::chrono::high_resolution_clock::now();
    checkCudaError(cudaMemcpy(h_data, d_data, dataSize * sizeof(float), cudaMemcpyDeviceToHost), "D2H memcpy failed");
    end = std::chrono::high_resolution_clock::now();
    auto durationD2H = std::chrono::duration<double>(end - start).count();
    std::cout << "D2H Transfer Time: " << durationD2H << " seconds" << std::endl;

    // 4. Measure concurrent (bidirectional) transfer performance
    cudaStream_t stream1, stream2;
    checkCudaError(cudaStreamCreate(&stream1), "cudaStreamCreate failed");
    checkCudaError(cudaStreamCreate(&stream2), "cudaStreamCreate failed");

    cudaDeviceSynchronize();
    // H2D transfer on stream1
    start = std::chrono::high_resolution_clock::now();
    checkCudaError(cudaMemcpyAsync(d_data, h_data, dataSize * sizeof(float), cudaMemcpyHostToDevice, stream1), "Async H2D failed");

    // D2H transfer on stream2
    checkCudaError(cudaMemcpyAsync(h_data, d_data, dataSize * sizeof(float), cudaMemcpyDeviceToHost, stream2), "Async D2H failed");

    // wait for both transfers to complete
    checkCudaError(cudaStreamSynchronize(stream1), "Stream1 sync failed");
    checkCudaError(cudaStreamSynchronize(stream2), "Stream2 sync failed");

    end = std::chrono::high_resolution_clock::now();
    auto durationParallel = std::chrono::duration<double>(end - start).count();
    std::cout << "Parallel H2D and D2H Transfer Time: " << durationParallel << " seconds" << std::endl;

    // compute bandwidth (GB/s, using 2^30 bytes per GB)
    double h2d_bandwidth = dataSize * sizeof(float) / durationH2D / (1 << 30);  // GB/s
    double d2h_bandwidth = dataSize * sizeof(float) / durationD2H / (1 << 30);  // GB/s
    double parallel_bandwidth = dataSize * sizeof(float) / durationParallel / (1 << 30);  // GB/s

    std::cout << "H2D Bandwidth: " << h2d_bandwidth << " GB/s" << std::endl;
    std::cout << "D2H Bandwidth: " << d2h_bandwidth << " GB/s" << std::endl;
    std::cout << "Parallel Bandwidth: " << parallel_bandwidth << " GB/s" << std::endl;

    // cleanup
    checkCudaError(cudaFree(d_data), "cudaFree failed");
    checkCudaError(cudaFreeHost(h_data), "cudaFreeHost failed");

    return 0;
}

Use separate buffers for the host->device and the device->host transfers. Your current code involves a conflict (or competition) in that the source of the transfer in one direction is simultaneously the destination of the transfer in the opposite direction. You could also use non-overlapping segments of the same host and device allocations.

checkCudaError(cudaMemcpyAsync(d_data_1, h_data_1, dataSize * sizeof(float), cudaMemcpyHostToDevice, stream1), "Async H2D failed");
checkCudaError(cudaMemcpyAsync(h_data_2, d_data_2, dataSize * sizeof(float), cudaMemcpyDeviceToHost, stream2), "Async D2H failed");
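
If you would rather keep a single allocation, the non-overlapping-segments option could look something like the sketch below. It reuses the buffers and streams from your code; the name `half` is just illustrative.

// split the existing allocations into two disjoint halves
size_t half = dataSize / 2;

// first half travels H2D on stream1
checkCudaError(cudaMemcpyAsync(d_data, h_data, half * sizeof(float), cudaMemcpyHostToDevice, stream1), "Async H2D failed");
// second half travels D2H on stream2 (no overlap with the range above)
checkCudaError(cudaMemcpyAsync(h_data + half, d_data + half, half * sizeof(float), cudaMemcpyDeviceToHost, stream2), "Async D2H failed");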

I believe the throughput actually gets a bit higher, although perhaps that is subject to how the words are interpreted.

Bandwidth usually has a definition of bytes transferred divided by the transfer time.

In the bidirectional case, you are actually getting a “bandwidth” of about 32GB/s, using a bidirectional definition of bytes transferred. So compared to the 25GB/s number, the aggregate throughput is actually somewhat higher, but the per-direction throughput is lower. Without knowing the specification of your system, a few other things to make sure of (in addition to the previous reply/comments):

  1. that the CPU memory bandwidth is high enough to sustain the desired concurrent transfers, so that the PCIe link can operate in the best light.
  2. that the server topology is being used in the best fashion. If this is a multi-CPU-socket server, use process pinning to make sure that the process driving the GPU runs on the CPU that is in “affinity” with (i.e., close to) that GPU. There are many questions on this forum on this topic, and the command nvidia-smi topo -m may provide useful data for your specific case/server.
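
Coming back to the bandwidth definition above, here is a sketch of the bookkeeping, using the variable names from your code: per-direction throughput counts only the bytes of one transfer, while aggregate (bidirectional) throughput counts both directions against the same overlapped time.

double bytesPerDirection = double(dataSize) * sizeof(float);

// per-direction throughput over the overlapped interval (what your code prints, ~16 GB/s here)
double perDirection = bytesPerDirection / durationParallel / (1 << 30);

// aggregate throughput: bytes moved in both directions count (~32 GB/s in your measurement)
double aggregate = 2.0 * bytesPerDirection / durationParallel / (1 << 30);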

Hi, I modified the code but the result did not change.

int main() {
    size_t dataSize = 500 * (1 << 20);  // data size: 50 MB, can be changed to another size

    // 1. Allocate host and device memory
    float *h_data = nullptr, *d_data = nullptr;
    float *h_data_1 = nullptr, *d_data_1 = nullptr;

    // allocate using pinned memory
    checkCudaError(cudaMallocHost(&h_data, dataSize * sizeof(float)), "cudaMallocHost failed");
    checkCudaError(cudaMalloc(&d_data, dataSize * sizeof(float)), "cudaMalloc failed");
    checkCudaError(cudaMallocHost(&h_data_1, dataSize * sizeof(float)), "cudaMallocHost failed");
    checkCudaError(cudaMalloc(&d_data_1, dataSize * sizeof(float)), "cudaMalloc failed");

    // initialize data
    for (size_t i = 0; i < dataSize; ++i) {
        h_data[i] = static_cast<float>(i);
    }
    for (size_t i = 0; i < dataSize; ++i) {
        h_data_1[i] = static_cast<float>(i);
    }

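    // untimed warm-up copies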
    checkCudaError(cudaMemcpy(d_data, h_data, dataSize * sizeof(float), cudaMemcpyHostToDevice), "H2D memcpy failed");
    checkCudaError(cudaMemcpy(h_data, d_data, dataSize * sizeof(float), cudaMemcpyDeviceToHost), "D2H memcpy failed");

    // 2. Measure H2D (host-to-device) transfer time
    auto start = std::chrono::high_resolution_clock::now();
    checkCudaError(cudaMemcpy(d_data, h_data, dataSize * sizeof(float), cudaMemcpyHostToDevice), "H2D memcpy failed");
    auto end = std::chrono::high_resolution_clock::now();
    auto durationH2D = std::chrono::duration<double>(end - start).count();
    std::cout << "H2D Transfer Time: " << durationH2D << " seconds" << std::endl;

    // 3. Measure D2H (device-to-host) transfer time
    start = std::chrono::high_resolution_clock::now();
    checkCudaError(cudaMemcpy(h_data, d_data, dataSize * sizeof(float), cudaMemcpyDeviceToHost), "D2H memcpy failed");
    end = std::chrono::high_resolution_clock::now();
    auto durationD2H = std::chrono::duration<double>(end - start).count();
    std::cout << "D2H Transfer Time: " << durationD2H << " seconds" << std::endl;

    // 4. Measure concurrent (bidirectional) transfer performance
    cudaStream_t stream1, stream2;
    checkCudaError(cudaStreamCreate(&stream1), "cudaStreamCreate failed");
    checkCudaError(cudaStreamCreate(&stream2), "cudaStreamCreate failed");
    cudaDeviceSynchronize();
    // H2D transfer on stream1
    start = std::chrono::high_resolution_clock::now();
    checkCudaError(cudaMemcpyAsync(d_data, h_data, dataSize * sizeof(float), cudaMemcpyHostToDevice, stream1), "Async H2D failed");

    // D2H transfer on stream2
    checkCudaError(cudaMemcpyAsync(h_data_1, d_data_1, dataSize * sizeof(float), cudaMemcpyDeviceToHost, stream2), "Async D2H failed");

    // wait for transfers to complete
    checkCudaError(cudaStreamSynchronize(stream1), "Stream1 sync failed");
    checkCudaError(cudaStreamSynchronize(stream2), "Stream2 sync failed");

    end = std::chrono::high_resolution_clock::now();
    auto durationParallel = std::chrono::duration<double>(end - start).count();
    std::cout << "Parallel H2D and D2H Transfer Time: " << durationParallel << " seconds" << std::endl;

    // output bandwidth (GB/s)
    double h2d_bandwidth = dataSize * sizeof(float) / durationH2D / (1 << 30);  // GB/s
    double d2h_bandwidth = dataSize * sizeof(float) / durationD2H / (1 << 30);  // GB/s
    double parallel_bandwidth = dataSize * sizeof(float) / durationParallel / (1 << 30);  // GB/s

    std::cout << "H2D Bandwidth: " << h2d_bandwidth << " GB/s" << std::endl;
    std::cout << "D2H Bandwidth: " << d2h_bandwidth << " GB/s" << std::endl;
    std::cout << "Parallel Bandwidth: " << parallel_bandwidth << " GB/s" << std::endl;

    // cleanup
    checkCudaError(cudaFree(d_data), "cudaFree failed");
    checkCudaError(cudaFreeHost(h_data), "cudaFreeHost failed");
    checkCudaError(cudaFree(d_data_1), "cudaFree failed");
    checkCudaError(cudaFreeHost(h_data_1), "cudaFreeHost failed");

    return 0;
}

Yes, it is in fact higher, but I expected it to be close to twice the single-direction number.

nvidia-smi topo -m shows this:

        GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      24-35   2               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

CPU memory info:

dmidecode --type 17 | grep -i "Speed"
        Speed: 3200 MT/s
        Configured Memory Speed: 3200 MT/s
        Speed: 3200 MT/s
        Configured Memory Speed: 3200 MT/s
        Speed: 3200 MT/s
        Configured Memory Speed: 3200 MT/s
        Speed: 3200 MT/s
        Configured Memory Speed: 3200 MT/s
        Speed: 3200 MT/s
        Configured Memory Speed: 3200 MT/s
        Speed: 3200 MT/s
        Configured Memory Speed: 3200 MT/s
        Speed: 3200 MT/s
        Configured Memory Speed: 3200 MT/s
        Speed: 3200 MT/s

Is that enough? Thank you!

I tried this; is CPU memory the limit?

mbw 1000
Long uses 8 bytes. Allocating 2*131072000 elements = 2097152000 bytes of memory.
Using 262144 bytes as blocks for memcpy block copy test.
Getting down to business... Doing 10 runs per test.
0       Method: MEMCPY  Elapsed: 0.09488        MiB: 1000.00000 Copy: 10539.296 MiB/s
1       Method: MEMCPY  Elapsed: 0.09400        MiB: 1000.00000 Copy: 10638.524 MiB/s
2       Method: MEMCPY  Elapsed: 0.09402        MiB: 1000.00000 Copy: 10636.374 MiB/s
3       Method: MEMCPY  Elapsed: 0.09407        MiB: 1000.00000 Copy: 10630.382 MiB/s
4       Method: MEMCPY  Elapsed: 0.09402        MiB: 1000.00000 Copy: 10636.487 MiB/s
5       Method: MEMCPY  Elapsed: 0.09420        MiB: 1000.00000 Copy: 10615.824 MiB/s
6       Method: MEMCPY  Elapsed: 0.09405        MiB: 1000.00000 Copy: 10632.077 MiB/s
7       Method: MEMCPY  Elapsed: 0.09397        MiB: 1000.00000 Copy: 10641.807 MiB/s
8       Method: MEMCPY  Elapsed: 0.09389        MiB: 1000.00000 Copy: 10650.308 MiB/s
9       Method: MEMCPY  Elapsed: 0.09411        MiB: 1000.00000 Copy: 10626.315 MiB/s
AVG     Method: MEMCPY  Elapsed: 0.09412        MiB: 1000.00000 Copy: 10624.655 MiB/s
0       Method: DUMB    Elapsed: 0.05595        MiB: 1000.00000 Copy: 17872.143 MiB/s
1       Method: DUMB    Elapsed: 0.05490        MiB: 1000.00000 Copy: 18215.600 MiB/s
2       Method: DUMB    Elapsed: 0.05484        MiB: 1000.00000 Copy: 18233.203 MiB/s
3       Method: DUMB    Elapsed: 0.05506        MiB: 1000.00000 Copy: 18163.325 MiB/s
4       Method: DUMB    Elapsed: 0.05491        MiB: 1000.00000 Copy: 18209.961 MiB/s
5       Method: DUMB    Elapsed: 0.05478        MiB: 1000.00000 Copy: 18255.837 MiB/s
6       Method: DUMB    Elapsed: 0.05491        MiB: 1000.00000 Copy: 18210.956 MiB/s
7       Method: DUMB    Elapsed: 0.05482        MiB: 1000.00000 Copy: 18240.852 MiB/s
8       Method: DUMB    Elapsed: 0.05484        MiB: 1000.00000 Copy: 18234.533 MiB/s
9       Method: DUMB    Elapsed: 0.05486        MiB: 1000.00000 Copy: 18227.885 MiB/s
AVG     Method: DUMB    Elapsed: 0.05499        MiB: 1000.00000 Copy: 18185.786 MiB/s
0       Method: MCBLOCK Elapsed: 0.07830        MiB: 1000.00000 Copy: 12771.881 MiB/s
1       Method: MCBLOCK Elapsed: 0.08693        MiB: 1000.00000 Copy: 11503.773 MiB/s
2       Method: MCBLOCK Elapsed: 0.07910        MiB: 1000.00000 Copy: 12642.545 MiB/s
3       Method: MCBLOCK Elapsed: 0.07892        MiB: 1000.00000 Copy: 12671.059 MiB/s
4       Method: MCBLOCK Elapsed: 0.08655        MiB: 1000.00000 Copy: 11553.615 MiB/s
5       Method: MCBLOCK Elapsed: 0.07911        MiB: 1000.00000 Copy: 12640.787 MiB/s
6       Method: MCBLOCK Elapsed: 0.07920        MiB: 1000.00000 Copy: 12626.900 MiB/s
7       Method: MCBLOCK Elapsed: 0.08642        MiB: 1000.00000 Copy: 11571.128 MiB/s
8       Method: MCBLOCK Elapsed: 0.07930        MiB: 1000.00000 Copy: 12610.340 MiB/s
9       Method: MCBLOCK Elapsed: 0.07900        MiB: 1000.00000 Copy: 12658.709 MiB/s
AVG     Method: MCBLOCK Elapsed: 0.08128        MiB: 1000.00000 Copy: 12302.847 MiB/s

Please also follow up on Robert Crovella’s suggestions. What are the specifications of the host system?

PCIe is a full-duplex interconnect, so given a PCIe4 x16 connection, you should be able to transfer ≈25 GB/sec in each direction simultaneously, provided the transfer size is sufficiently large (because PCIe uses packetized transport, there is additional overhead for small transfer sizes). The comments in your code suggest the size of each transfer is 50 MB, which should be sufficient to achieve the maximum PCIe throughput.

Any host system with an A100 GPU installed should have more than sufficient system memory bandwidth to sustain 50 GB/sec of data transfers over PCIe (typically the server systems hosting an A100 have system memory bandwidth in the hundreds of GB/sec, but let’s look at the specs of your host system). You might want to re-run with a larger transfer size, to see if that changes the throughput numbers at all. The expectation would be that the same throughput is observed when using 100 MB or 200 MB transfers.
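
A sketch of such a size sweep, reusing the existing pinned and device buffers from your code (this assumes the allocations are at least as large as the largest size tested; the particular sizes are arbitrary):

// sweep a few H2D transfer sizes and report throughput for each
for (size_t mb : {50, 100, 200, 400}) {
    size_t bytes = mb * (size_t(1) << 20);
    auto t0 = std::chrono::high_resolution_clock::now();
    checkCudaError(cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice), "H2D memcpy failed");
    auto t1 = std::chrono::high_resolution_clock::now();
    double s = std::chrono::duration<double>(t1 - t0).count();
    std::cout << mb << " MiB H2D: " << bytes / s / (1 << 30) << " GB/s" << std::endl;
}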

A sanity check: this system is otherwise idle when you are running your benchmark, correct? If so, we can exclude any interference from other applications running on the system (e.g. contention for system memory or PCIe resources).

Your test framework does not seem to be using a loop for the measurements. To avoid “cold start” effects on first-time execution of any task, it is a best practice to repeat the timed portion of a benchmark multiple times. A common approach for memory transfers in particular is to perform 10 iterations and report the fastest time.
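
A minimal sketch of such a loop around the bidirectional measurement, reusing the buffers, streams, and checkCudaError helper from the code above (the count of 10 is just the convention mentioned here; <algorithm> is needed for std::min):

double best = 1e30;  // smallest observed time wins
for (int iter = 0; iter < 10; ++iter) {
    auto t0 = std::chrono::high_resolution_clock::now();
    checkCudaError(cudaMemcpyAsync(d_data, h_data, dataSize * sizeof(float), cudaMemcpyHostToDevice, stream1), "Async H2D failed");
    checkCudaError(cudaMemcpyAsync(h_data_1, d_data_1, dataSize * sizeof(float), cudaMemcpyDeviceToHost, stream2), "Async D2H failed");
    checkCudaError(cudaStreamSynchronize(stream1), "Stream1 sync failed");
    checkCudaError(cudaStreamSynchronize(stream2), "Stream2 sync failed");
    auto t1 = std::chrono::high_resolution_clock::now();
    best = std::min(best, std::chrono::duration<double>(t1 - t0).count());
}
std::cout << "Best parallel time over 10 runs: " << best << " seconds" << std::endl;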

Another sanity check: You are running this program on “bare metal” and not from inside some VM, correct?


I don’t have a good idea what I am looking at. If this is a benchmark of system memory throughput, the maximum bandwidth seen across all of these tests seems to be 38.8 GB/sec, which seems incredibly low for any modern system, and in particular for a system with an A100 in it.

Not sure what to make of the dmidecode output. Is it indicating DDR4-3200 DRAM for the system memory? If so, two memory channels would provide 40 GB/sec (assuming 80% of theoretical peak), which seems quite close to the measured throughput. It would obviously not be enough to sustain 50 GB/sec of bidirectional PCIe traffic.

How many memory channels are provided by the CPU(s) in this system? Of what kind (e.g. DDR4, DDR5)? Are all channels populated?

How do you read or calculate that number? The rightmost column seems to show between 10 and 18 GB/s?

The highest number I see is 18255.837 MiB/s = 19.143 GB/sec. That’s the copy speed, meaning the memory bandwidth used is twice that (read+write) = 38.286 GB/sec. I must have had a typo somewhere to get to 38.8 GB/sec :-)

It certainly seems to be. Assuming this mbw test is representative, then as njuffa points out, the “aggregate” memory bandwidth would be double the reported copy bandwidth.

Since the single-direction PCIe throughput is 24 GB/s, you’re expecting 48 GB/s aggregate in the bidirectional case. But it looks like your memory bandwidth test only reports about 38 GB/s of available bandwidth, which is roughly consistent with your observed ~32 GB/s number.

I’m also not familiar with this CPU core affinity report:

I’m not sure why only cores 24-35 are reported, but there is probably some explanation for that. If this is a single-socket system, then CPU affinity probably is not an issue, although there might still be NUMA affinity effects if the socket contains multiple NUMA nodes, as on modern AMD processors.
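
For what it’s worth, one way to experiment with pinning is to restrict the process to the reported cores before any CUDA calls, either with numactl on the command line or programmatically. Below is a sketch only (Linux-specific; the 24-35 range is simply what your topo output reported, and the function name is made up for illustration). Note that this pins execution only; where the pinned host buffers are allocated (what numactl --membind controls) also matters.

#include <cstdio>    // perror
#include <sched.h>   // sched_setaffinity, CPU_ZERO, CPU_SET (glibc/Linux)

// call this at the very start of main(), before any CUDA API calls
void pinToReportedCores() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    for (int cpu = 24; cpu <= 35; ++cpu)   // cores from `nvidia-smi topo -m`
        CPU_SET(cpu, &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0)
        perror("sched_setaffinity");
}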

  1. I ran it with a larger memory size; the result did not change.
  2. I am running it in a Docker container.
  3. I ran the benchmark multiple times, and the result is still the same.

My questions are:

  1. I used pinned memory, so does the H2D cudaMemcpy use DMA?
  2. How can I check whether the low throughput is limited by CPU RAM?
  3. I tried testing the memory bandwidth on an H100 system, and it is not larger than the result on the A100 system.

But on the H100 the PCIe throughput for the bidirectional transfer is as expected:

H2D Bandwidth: 49.8411 GB/s
D2H Bandwidth: 50.0559 GB/s
Parallel Bandwidth: 43.7056 GB/s

So I wonder: does the PCIe throughput have anything to do with memory bandwidth?

I do not know how much hypervisor or VM overhead is involved in that. When measuring hardware performance, I would always advocate running tests on bare metal. The fewer components (HW/SW) are in play, the more easily one can interpret the results.

cudaMemcpyAsync() with pinned host memory results in a DMA transfer (which requires contiguous physical memory, thus the need for host memory to be pinned). That applies to transfers across PCIe in either direction.
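
If you want to double-check that a given host pointer really is pinned (and therefore eligible for DMA), one possibility is cudaPointerGetAttributes; a sketch, reusing h_data and checkCudaError from your code (the reported type for unregistered pointers varies a bit between CUDA versions):

cudaPointerAttributes attr;
checkCudaError(cudaPointerGetAttributes(&attr, h_data), "cudaPointerGetAttributes failed");
if (attr.type == cudaMemoryTypeHost) {
    std::cout << "h_data is pinned (page-locked) host memory" << std::endl;
} else if (attr.type == cudaMemoryTypeUnregistered) {
    std::cout << "h_data is ordinary pageable memory" << std::endl;
}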

By the process we have just gone through based on your data: your measured system memory throughput is reported as equivalent to 38.3 GB/sec. Bidirectional PCIe traffic across a PCIe4 x16 link requires 50 GB/sec (25 GB/sec per direction). Since 50 > 38.3, bidirectional PCIe throughput is negatively affected by low system memory bandwidth. This would be easy to sanity check if you stated the CPU/memory configuration of the host. Your measured system memory bandwidth looks so low that I think either (1) the host system is misconfigured in some way or (2) you are using an unsuitable tool to measure the system memory bandwidth.

Which bandwidth exactly are we talking about? If it is the system memory of the host, that bandwidth is not affected by the choice of GPU at all. It is a function of (1) the number of memory channels per CPU, which I would expect to be on the order of 8-12 for a system with an H100, (2) whether all those channels are populated, i.e. have DIMMs installed in the associated slots, and (3) the type and speed grade of the DRAM being used.

If this is data from the exact same machine as before, except with the A100 replaced by an H100, then the “low system memory throughput” hypothesis would be null and void. We can replace it with the working hypothesis that your system memory throughput benchmark app mbw is inadequate for the task at hand and therefore provides misleading data. Try other benchmark tests available to you, or compile the classical STREAM benchmark yourself. I used to be an avid user of STREAM but have not used it in over a decade.
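
If building STREAM is inconvenient, even a crude copy test gives a second data point. Below is a sketch only, not a substitute for STREAM: it is single-threaded, so it may understate the peak of a multi-channel memory system, and the buffer size and iteration count are arbitrary choices.

#include <algorithm>
#include <chrono>
#include <cstring>
#include <iostream>
#include <vector>

int main() {
    const size_t bytes = size_t(1) << 30;                // 1 GiB per buffer
    std::vector<char> src(bytes, 1), dst(bytes, 0);
    double best = 1e30;
    for (int i = 0; i < 10; ++i) {                       // best of 10 runs
        auto t0 = std::chrono::high_resolution_clock::now();
        std::memcpy(dst.data(), src.data(), bytes);
        auto t1 = std::chrono::high_resolution_clock::now();
        best = std::min(best, std::chrono::duration<double>(t1 - t0).count());
    }
    // the copy moves 'bytes', but the DRAM traffic is roughly 2x that (read + write)
    std::cout << "copy: " << bytes / best / (1 << 30) << " GiB/s, "
              << "DRAM traffic: " << 2.0 * bytes / best / (1 << 30) << " GiB/s" << std::endl;
    return 0;
}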

I think an adequate re-statement of your issue at this point is: the efficiency of bidirectional PCIe transfers is much higher with the H100 (87.4 / 100 = 0.874) than with the A100 (31.9 / 49 = 0.651). I do not have a working hypothesis for that. I would expect the ratios to be similar (and frankly even a bit higher than 87.4%), but I have never used an A100 or H100 myself, so I do not have an empirical frame of reference.

The A100 and H100 are the kind of GPU that is supposed to be sold only as part of pre-configured systems, with support provided by the system vendor. This could be NVIDIA itself, or more likely an NVIDIA-approved system integrator. My recommendation would be to bring this issue to the system vendor’s attention and have them assist you in resolving it.

RTX 5090 D, PCIe 5.0:

./pciebench
H2D Transfer Time: 0.0797079 seconds
D2H Transfer Time: 0.0969715 seconds
Parallel H2D and D2H Transfer Time: 0.133129 seconds
H2D Bandwidth: 24.5035 GB/s
D2H Bandwidth: 20.1412 GB/s
Parallel Bandwidth: 14.671 GB/s
(.venv) edison@u24:~/Downloads$ ./pciebench
H2D Transfer Time: 0.0747494 seconds
D2H Transfer Time: 0.0802698 seconds
Parallel H2D and D2H Transfer Time: 0.133048 seconds
H2D Bandwidth: 26.129 GB/s
D2H Bandwidth: 24.332 GB/s
Parallel Bandwidth: 14.6798 GB/s
(.venv) edison@u24:~/Downloads$ ./pciebench
H2D Transfer Time: 0.0758802 seconds
D2H Transfer Time: 0.0812425 seconds
Parallel H2D and D2H Transfer Time: 0.13304 seconds
H2D Bandwidth: 25.7396 GB/s
D2H Bandwidth: 24.0407 GB/s
Parallel Bandwidth: 14.6808 GB/s