Inconsistent concurrent transfer speed

I’m facing an issue where I get inconsistent aggregate transfer speeds when using concurrent transfers to 4 GPUs.

I originally observed this while running similar update/transfer work on 4 separate threads, with each thread updating its own GPU device in CUDA. I was doing OptiX BVH builds, which seemed to take much longer on 2 of the threads than on the other 2. I added timing around the async memcpys using CUDA events and observed that going from 1 device to 2 added only a small overhead, but going to 3 was a big jump, and going to 4 was another medium jump. The OptiX timings turned out to be a red herring, because the builds were waiting on the async transfers to complete.

I downloaded a bandwidth tester that measures each device non-concurrently, and it reports that the first two GPUs get around 16 GB/s across PCIe, while the last two GPUs get around 9 GB/s.

If I look at approximate aggregate transfer values from my original program, a single GPU gets decent performance (I'm away from my machine right now and don't want to guess at the numbers), but when I drive multiple GPUs concurrently the performance degrades drastically (with all 4 going at once, I see roughly 25 GB/s aggregate, or a bit over 6 GB/s per GPU).

For the standalone-ish measurements I was using this utility: GitHub - enfiskutensykkel/multi-gpu-bwtest (measures bandwidth of multiple simultaneously started cudaMemcpyAsync calls). For example, running it like:

$ ./bwtest --do=all:HtoD:100000000
Allocating buffers…DONE
Executing transfers…DONE
Synchronizing streams…DONE

=====================================================================================
ID Device name Transfer size Direction Time elapsed Bandwidth

0 NVIDIA GeForce RTX 3090 95.37 MiB HtoD 6020 µs 16611.03 MiB/s
1 NVIDIA GeForce RTX 3090 95.37 MiB HtoD 5981 µs 16719.10 MiB/s
2 NVIDIA GeForce RTX 3090 95.37 MiB HtoD 10636 µs 9401.78 MiB/s
3 NVIDIA GeForce RTX 3090 95.37 MiB HtoD 10631 µs 9406.87 MiB/s

Aggregated total time : 33268 µs
Aggregated total bandwidth : 12023.53 MiB/s
Estimated elapsed time : 10742 µs
Timed total bandwidth : 37237.84 MiB/s

I’m also attaching a program I wrote this morning to dig further. Interestingly if I run serially or even concurrently async in the same (main) thread I usually get reasonable bandwidth numbers, but if I run independently in separate threads, that is no longer the case.

Here are the results I get with the attached program:

$ test_transfer_speeds -- --pagelock=false --kilobytes=1000 --concurrency=0
Using 4 devices
Bandwidth for device 0 is 9.06156e+09
Bandwidth for device 1 is 9.23734e+09
Bandwidth for device 2 is 9.1473e+09
Bandwidth for device 3 is 9.0757e+09
$ test_transfer_speeds -- --pagelock=true --kilobytes=1000 --concurrency=0
Using 4 devices
Bandwidth for device 0 is 1.76659e+10
Bandwidth for device 1 is 1.78821e+10
Bandwidth for device 2 is 1.78811e+10
Bandwidth for device 3 is 1.79021e+10

$ test_transfer_speeds -- --pagelock=false --kilobytes=1000 --concurrency=1
Using 4 devices
Bandwidth for device 0 is 9.30287e+09
Bandwidth for device 1 is 1.42203e+10
Bandwidth for device 2 is 1.43691e+10
Bandwidth for device 3 is 1.53139e+10

$ test_transfer_speeds -- --pagelock=true --kilobytes=1000 --concurrency=1
Using 4 devices
Bandwidth for device 0 is 1.81344e+10
Bandwidth for device 1 is 1.80353e+10
Bandwidth for device 2 is 1.79322e+10
Bandwidth for device 3 is 1.79735e+10

$ test_transfer_speeds -- --pagelock=false --kilobytes=1000 --concurrency=2
Using 4 devices
Bandwidth for device 0 is 5.39684e+09
Bandwidth for device 1 is 5.71888e+09
Bandwidth for device 2 is 8.9671e+09
Bandwidth for device 3 is 6.3012e+09

$ test_transfer_speeds -- --pagelock=true --kilobytes=1000 --concurrency=2
Using 4 devices
Bandwidth for device 0 is 1.24669e+10
Bandwidth for device 1 is 1.68448e+10
Bandwidth for device 2 is 2.08564e+10
Bandwidth for device 3 is 1.73517e+10

Note that this is not without blips: "--pagelock=false --kilobytes=1000 --concurrency=1" shows one device getting much lower bandwidth, though it is conceivable the OS scheduled the main thread on another CPU for a while. If I run any combination often enough I will usually see some kind of blip where one device is slower than the others, but the serial and same-thread async modes are far more stable than the mode that runs concurrently in separate threads.

I have now added a NUMA mode that binds to cores on either node 0 or node 1. This doesn't appear to have much effect (on average, transfers might be slightly faster on node 0 than node 1 with no concurrency, but only by a factor of maybe 5%).

So my questions are: (a) should I observe drastically different bandwidths to different GPUs? And (b) should I observe this much slowdown when driving transfers to all GPUs concurrently?

[actually, how do I attach the program? Let me know if I can attach, or if I need to make a reply with code inline]

What is the motherboard (CPU and PCIe) topology?
What is the electrical link width for each PCIe slot?
What PCIe generation is in effect for each slot?

Please don't post code via attachments on this forum; post it inline. The forum has tools for formatting code in a post: paste your code in the edit window, select it, press the </> button in the toolbar at the top of the edit window, then save your changes.

#include <sched.h>
#include <unistd.h>

#include <deque>
#include <iostream>
#include <thread>

#include "gflags/gflags.h"

DEFINE_int32(trials, 10, "Number of trials to run");
DEFINE_int32(kilobytes, 1000, "Number of kilobytes per transfer");
DEFINE_bool(pagelock, true, "Pagelock host mem");
// hard coded cores for testing.
DEFINE_int32(
    numa_node,
    -1,
    "Alloc host memory and transfer while on a single CPU's cores.  -1 means no NUMA-awareness");
DEFINE_int32(
    concurrency,
    0,
    "Whether to run multi-device transfers concurrently.  0 is serial.  1 is overlapping async.  2 is parallel in threads");
DEFINE_bool(host_to_dev, true, "Whether to do transfers to or from device");

constexpr size_t kBytesInKB = 1024;

class DevTiming {
 public:
  DevTiming() {
    if (FLAGS_pagelock) {
      cudaMallocHost(&hostBuffer_, FLAGS_kilobytes * kBytesInKB);
    } else {
      hostBuffer_ = malloc(FLAGS_kilobytes * kBytesInKB);
    }
    cudaMalloc(&devBuffer_, FLAGS_kilobytes * kBytesInKB);
    cudaEventCreate(&start_);
    cudaEventCreate(&stop_);
    cudaStreamCreate(&stream_);
  }

  ~DevTiming() {
    cudaStreamDestroy(stream_);
    cudaEventDestroy(stop_);
    cudaEventDestroy(start_);
    cudaFree(devBuffer_);
    if (FLAGS_pagelock) {
      cudaFreeHost(hostBuffer_);
    } else {
      free(hostBuffer_);
    }
  }

  void startTransfer() {
    cudaEventRecord(start_, stream_);
    if (FLAGS_host_to_dev) {
      cudaMemcpyAsync(
          devBuffer_, hostBuffer_, FLAGS_kilobytes * kBytesInKB, cudaMemcpyHostToDevice, stream_);
    } else {
      cudaMemcpyAsync(
          hostBuffer_, devBuffer_, FLAGS_kilobytes * kBytesInKB, cudaMemcpyDeviceToHost, stream_);
    }
    cudaEventRecord(stop_, stream_);
  }

  void waitForTransfer() {
    cudaEventSynchronize(stop_);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start_, stop_);
    total_ += ms * 1e-3;
  }

  double totalTime() const {
    return total_;
  }

 private:
  double total_ = 0.0;
  cudaEvent_t start_, stop_;
  cudaStream_t stream_;
  void* devBuffer_;
  void* hostBuffer_;
};

void runDeviceSerial(std::deque<DevTiming>& timings) {
  for (auto trial = FLAGS_trials; trial--;) {
    int dev = 0;
    for (auto& timing : timings) {
      cudaSetDevice(dev++);
      timing.startTransfer();
      timing.waitForTransfer();
    }
  }
}

void runDeviceConcurrent(std::deque<DevTiming>& timings) {
  for (auto trial = FLAGS_trials; trial--;) {
    int dev = 0;
    for (auto& timing : timings) {
      cudaSetDevice(dev++);
      timing.startTransfer();
    }

    dev = 0;
    for (auto& timing : timings) {
      cudaSetDevice(dev++);
      timing.waitForTransfer();
    }
  }
}

int g_startNode = -1;

void runDeviceParallel(std::deque<DevTiming>& timings) {
  std::deque<std::thread> threads;

  int dev = 0;
  for (auto& timing : timings) {
    threads.emplace_back([dev, &timing]() {
      if (g_startNode >= 0) {
        int cur = sched_getcpu();

        if (cur < g_startNode || cur >= g_startNode + 16) {
          std::cerr << "Incorrect CPU!" << std::endl;
          std::abort();
        }
      }

      cudaSetDevice(dev);
      for (auto trial = FLAGS_trials; trial--;) {
        timing.startTransfer();
        timing.waitForTransfer();
      }
    });
    ++dev;
  }
  for (auto& t : threads) {
    t.join();
  }
}
constexpr int kSerial = 0;
constexpr int kConcurrent = 1;

int main(int argc, char** args) {
  gflags::ParseCommandLineFlags(&argc, &args, true);

  if (FLAGS_numa_node >= 0) {
    // On my machine, numactl --hardware shows that [0-63] and [128-191] are on node 0, and [64-127]
    // and [192-255] are node 1
    cpu_set_t set;

    g_startNode = FLAGS_numa_node == 0 ? 0 : 64;

    CPU_ZERO(&set);
    for (int i = 0; i < 16; ++i) {
      CPU_SET(i + g_startNode, &set);
    }
    sched_setaffinity(getpid(), sizeof(set), &set);

    int cur = sched_getcpu();

    if (cur < g_startNode || cur >= g_startNode + 16) {
      std::cerr << "Incorrect CPU!" << std::endl;
      std::abort();
    }
  }

  std::deque<DevTiming> timings;

  // warm up
  cudaFree(0);

  int numDevices = 0;
  cudaGetDeviceCount(&numDevices);

  std::cout << "Using " << numDevices << " devices" << std::endl;

  for (int i = 0; i < numDevices; ++i) {
    // Make device i current so that each DevTiming's buffers, events, and
    // stream are created on the device it is meant to measure.
    cudaSetDevice(i);
    timings.emplace_back();
  }

  if (FLAGS_concurrency == kConcurrent) {
    runDeviceConcurrent(timings);
  } else if (FLAGS_concurrency == kSerial) {
    runDeviceSerial(timings);
  } else {
    runDeviceParallel(timings);
  }

  int dev = 0;
  for (auto& timing : timings) {
    std::cout << "Bandwidth for device " << dev++ << " is "
              << (FLAGS_kilobytes * FLAGS_trials * kBytesInKB) / timing.totalTime() << std::endl;
  }
  return 0;
}
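In case it helps reproduce this: something along these lines should build it (assuming gflags is installed; the source file name here is just a placeholder for whatever you save it as):

nvcc -std=c++14 test_transfer_speeds.cu -o test_transfer_speeds -lgflags -lpthread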

Motherboard is H12DSG-O-CPU. Operating System is CentOS 8 Stream. Driver is 495.44.

Topology is:

> $ nvidia-smi topo -mp                                                                                         
        GPU0	GPU1	GPU2	GPU3	CPU Affinity	NUMA Affinity
GPU0	 X 	    NODE	SYS	    SYS	    0-63,128-191	0
GPU1	NODE	 X 	    SYS	    SYS	    0-63,128-191	0
GPU2	SYS	    SYS	     X 	    NODE	64-127,192-255	1
GPU3	SYS	    SYS	    NODE	 X 	    64-127,192-255	1

Also

> $ lspci | egrep RTX                                                                                           
25:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
41:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
81:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
e1:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)

These are all x16 PCIe 4.0 links, though this is in a server that has 20 x8 connections into the root complex, with a riser board that takes 10 x16 connections and routes them through the 20 x8 connections (at least that's my rough understanding). The system is a Supermicro 4124GS-TNR (4U A+ Server).

What is the output of nvidia-smi -i 0 -a ? It will be somewhat lengthy. Please post it like you would post code.

> $ nvidia-smi -i 0 -a                                                                                                                                  

==============NVSMI LOG==============

Timestamp                                 : Fri Apr 14 15:01:45 2023
Driver Version                            : 495.44
CUDA Version                              : 11.5

Attached GPUs                             : 4
GPU 00000000:25:00.0
    Product Name                          : NVIDIA GeForce RTX 3090
    Product Brand                         : GeForce
    Product Architecture                  : Ampere
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Disabled
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : N/A
    GPU UUID                              : GPU-3305cae6-5920-6f99-e7a2-824d8a463183
    Minor Number                          : 1
    VBIOS Version                         : 94.02.26.08.1C
    MultiGPU Board                        : No
    Board ID                              : 0x2500
    GPU Part Number                       : N/A
    Module ID                             : 0
    Inforom Version
        Image Version                     : N/A
        OEM Object                        : N/A
        ECC Object                        : N/A
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x25
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x220410DE
        Bus Id                            : 00000000:25:00.0
        Sub System Id                     : 0x403B1458
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 1
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
    Fan Speed                             : 30 %
    Performance State                     : P8
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 24268 MiB
        Used                              : 5 MiB
        Free                              : 24263 MiB
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 3 MiB
        Free                              : 253 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : N/A
        Pending                           : N/A
    ECC Errors
        Volatile
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
        Aggregate
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 29 C
        GPU Shutdown Temp                 : 98 C
        GPU Slowdown Temp                 : 95 C
        GPU Max Operating Temp            : 93 C
        GPU Target Temperature            : 83 C
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 25.18 W
        Power Limit                       : 350.00 W
        Default Power Limit               : 350.00 W
        Enforced Power Limit              : 350.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 350.00 W
    Clocks
        Graphics                          : 210 MHz
        SM                                : 210 MHz
        Memory                            : 405 MHz
        Video                             : 555 MHz
    Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Default Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Max Clocks
        Graphics                          : 2100 MHz
        SM                                : 2100 MHz
        Memory                            : 9751 MHz
        Video                             : 1950 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : 743.750 mV
    Processes
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 7445
            Type                          : G
            Name                          : /usr/libexec/Xorg
            Used GPU Memory               : 4 MiB

If you don’t have the bandwidthTest cuda sample code available on your system, please get it. Then run:

taskset -c 0  ./bandwidthTest
taskset -c 64 ./bandwidthTest

and report the result.

$ taskset -c 0 ./bandwidthTest
[CUDA Bandwidth Test] - Starting…
Running on…

Device 0: NVIDIA GeForce RTX 3090
Quick Mode

Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 26.2

Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 26.3

Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 785.3

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

bbudge@bbudge-centos-L1WCVH9NS7 ~/repos/cuda-samples/Samples/1_Utilities/bandwidthTest [15:21:25]

$ taskset -c 64 ./bandwidthTest
[CUDA Bandwidth Test] - Starting…
Running on…

Device 0: NVIDIA GeForce RTX 3090
Quick Mode

Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 24.8

Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 23.8

Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 787.9

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

As part of the process of eliminating factors that might play into the observed performance (even if not very likely), a few sanity checks:

How is the system RAM configured? The description says there are 16 DIMM slots in total, four per CPU. Are all DIMM slots populated? Are they all populated with identical DRAM sticks? What are the specifications of these DRAM sticks?

This system uses Epyc CPUs, correct? What kind specifically? If I perused the thread correctly, you already set up NUMA configuration for processor and memory affinity to account for the fact that each processor comprises multiple CPU complexes?

The system’s PCIe configuration, when combined with Epyc CPUs, looks (overly?) complex to me, not anything I have experience with. Have you brought your observations to the attention of SuperMicro to see what advice they can dispense? There might be some well-known configuration caveats.

Thanks Norbert. There are 4 sets of RAM slots, two for each CPU (EPYC 7742). Each set has four out of 8 DIMMs populated (as you noted, 16 DIMMs total). I've measured aggregate memory bandwidth on the system at well over 300 GB/s. According to (https://en.wikichip.org/wiki/amd/epyc/7742) the max expected bandwidth per CPU is 190 GB/s, so I think the DIMMs are populated correctly. The RAM is DDR4-3200.
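(If my arithmetic is right, that per-socket figure corresponds to 8 memory channels × 3200 MT/s × 8 bytes per transfer = 204.8 GB/s, i.e. roughly 190.7 GiB/s.)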

In my own test, I did set up the ability to pin to cores within NUMA 0 or NUMA 1. I find this makes a difference of about 10% in the serial tests, which feels reasonable to me.

The big thing I'm noticing is that I tend to get these slowdowns mostly when running concurrently from different threads. If I run concurrently from the same thread, I tend to get more stable and better performance (I still sometimes see one device transferring more slowly than the others, but only rarely by more than 10% or 15%).

I'm also talking to LambdaLabs, the vendor I went through for the machine, to see if they know of any issues. I recently updated the BIOS at their suggestion, which did seem to improve average and peak transfer speeds, but it didn't solve the stability/unequal-performance issue.

Dual EPYC 7742s make for a super-complex topology that may exhibit multiple levels of NUMA-ness. Each EPYC 7742 comprises 16 CCXs (core complexes), each of which comprises four CPU cores. The questions to which I cannot find answers quickly are how the 16 CCXs in a device are linked to the 128 PCIe lanes and the 8 memory controllers per device, and how the 16 CCXs in each device are coupled to each other under the hood. Add to that possible complications from the PCIe switch that adds flexibility to the system.

NUMA affinity should be configured such that each GPU communicates with the “closest” (minimal number of hops) CCXs and memory controllers, which possibly means that host software threads that communicate with a particular GPU should be limited to a smallish group of CPU cores.
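As a rough sketch of what that could look like in host code (untested, and the core lists are placeholders that would have to come from the actual topology, e.g. the CPU affinity column of nvidia-smi topo -m; it also assumes four GPUs):

#include <pthread.h>
#include <sched.h>

#include <cstddef>
#include <thread>
#include <vector>

#include <cuda_runtime.h>

// One worker thread per GPU.  Each thread pins itself to a small group of
// cores first, so that its host allocations and the transfers it issues tend
// to stay "close" to that GPU.
void gpuWorker(int device, const std::vector<int>& cores, size_t bytes) {
  cpu_set_t set;
  CPU_ZERO(&set);
  for (int c : cores) CPU_SET(c, &set);
  pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

  cudaSetDevice(device);
  void *h = nullptr, *d = nullptr;
  cudaMallocHost(&h, bytes);  // pinned host buffer, allocated after pinning
  cudaMalloc(&d, bytes);
  cudaStream_t stream;
  cudaStreamCreate(&stream);
  cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, stream);
  cudaStreamSynchronize(stream);
  cudaStreamDestroy(stream);
  cudaFree(d);
  cudaFreeHost(h);
}

int main() {
  const size_t bytes = 32 << 20;
  // Placeholder mapping: cores near GPUs 0/1 on socket 0, cores near
  // GPUs 2/3 on socket 1.  Adjust to the real topology.
  const std::vector<std::vector<int>> coresForGpu = {
      {0, 1, 2, 3}, {4, 5, 6, 7}, {64, 65, 66, 67}, {68, 69, 70, 71}};
  std::vector<std::thread> threads;
  for (int dev = 0; dev < 4; ++dev)
    threads.emplace_back(gpuWorker, dev, coresForGpu[dev], bytes);
  for (auto& t : threads) t.join();
  return 0;
}

The point being that each thread pins itself before allocating its host buffers or touching its GPU, so allocations and transfers stay local to that GPU's side of the machine.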

I have never dealt with a system of this complexity. I am under the impression that @Robert_Crovella has experience with large systems.

There’s a NUMA summary for this processor here.


Please compile this code:

#include <iostream>
#include <cstdlib>
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start=0){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

int main(int argc, char *argv[]){

  size_t ds = 1048576*32;
  if (argc > 1) ds = atoi(argv[1]);
  int nd;
  cudaGetDeviceCount(&nd);
  char **h = new char*[nd];
  char **d = new char*[nd];
  cudaStream_t *s = new cudaStream_t[nd];
  for (int i = 0; i < nd; i++){
    cudaSetDevice(i);
    cudaHostAlloc(h+i, ds, cudaHostAllocDefault);
    cudaMalloc(d+i, ds);
    cudaStreamCreate(s+i);}
  unsigned long long dt = dtime_usec(0);
  for (int i = 0; i < nd; i++){
    cudaSetDevice(i);
    cudaMemcpyAsync(d[i], h[i], ds, cudaMemcpyHostToDevice, s[i]);}
  for (int i = 0; i < nd; i++) {
    cudaSetDevice(i);
    cudaDeviceSynchronize();}
  dt = dtime_usec(dt);
  std::cout << "aggregate host to device bandwidth for " << nd << " devices: " << ds*nd/(float)dt << "MB/s" << std::endl;
  dt = dtime_usec(0);
  for (int i = 0; i < nd; i++){
    cudaSetDevice(i);
    cudaMemcpyAsync(h[i], d[i], ds, cudaMemcpyDeviceToHost, s[i]);}
  for (int i = 0; i < nd; i++) {
    cudaSetDevice(i);
    cudaDeviceSynchronize();}
  dt = dtime_usec(dt);
  std::cout << "aggregate device to host bandwidth for " << nd << " devices: " << ds*nd/(float)dt << "MB/s" << std::endl;
}

Let’s say the name of the compiled executable is test. Then run the following tests:

CUDA_VISIBLE_DEVICES="0,1" taskset -c 0   ./test 1000000000
CUDA_VISIBLE_DEVICES="0,1" taskset -c 64  ./test 1000000000

and report the results.

I ran multiple trials to make sure I was seeing consistent results, and I do (only copying two runs of each configuration below):

bbudge@bbudge-centos-L1WCVH9NS7 ~/tests/testbw                                                                  [12:42:20]
> $ CUDA_VISIBLE_DEVICES="0,1" taskset -c 0   ./test 1000000000                                                           

aggregate host to device bandwidth for 2 devices: 44471.1MB/s
aggregate device to host bandwidth for 2 devices: 44194MB/s
                                                                                                                           
bbudge@bbudge-centos-L1WCVH9NS7 ~/tests/testbw                                                                  [12:42:38]
> $ CUDA_VISIBLE_DEVICES="0,1" taskset -c 0   ./test 1000000000                                                           

aggregate host to device bandwidth for 2 devices: 44363.6MB/s
aggregate device to host bandwidth for 2 devices: 44048MB/s
                                                                                                                           
bbudge@bbudge-centos-L1WCVH9NS7 ~/tests/testbw                                                                  [12:42:40]
> $ CUDA_VISIBLE_DEVICES="0,1" taskset -c 64   ./test 1000000000                                                          

aggregate host to device bandwidth for 2 devices: 28782.8MB/s
aggregate device to host bandwidth for 2 devices: 43779MB/s
                                                                                                                           
bbudge@bbudge-centos-L1WCVH9NS7 ~/tests/testbw                                                                  [12:42:42]
> $ CUDA_VISIBLE_DEVICES="0,1" taskset -c 64   ./test 1000000000                                                          

aggregate host to device bandwidth for 2 devices: 28689.1MB/s
aggregate device to host bandwidth for 2 devices: 43780.9MB/s

Let’s also verify that if the process is placed on core 64, that we will have similar performance talking to GPUs 2,3:

CUDA_VISIBLE_DEVICES="2,3" taskset -c 64  ./test 1000000000

(The following assumes that gives similar performance to the 0,1 test on core 0.)

For a single GPU test, the inter-socket link doesn’t seem to matter much (there is a small reduction, perhaps). For a dual-GPU test, the inter-socket link has a noticeable impact.

I would say then, that in order to achieve the least variable bandwidth measurement with concurrent transfers across 4 GPUs in your system, one approach would be to construct a test like this, with GPUs 0 and 1 managed by a thread or process pinned to core 0, and GPUs 2 and 3 managed by a thread or process pinned to core 64.

Please compile this code:

#include <iostream>
#include <cstdlib>
#include <time.h>
#include <sys/time.h>
#include <omp.h>
#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start=0){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}
const int nt = 2;
const int dpt = 2;
int main(int argc, char *argv[]){

  size_t ds = 1048576*32;
  if (argc > 1) ds = atoi(argv[1]);
  int nd;
  cudaGetDeviceCount(&nd);
  if (nd < dpt*nt) {std::cout << "not enough devices" << std::endl; return 0;}
  char **h = new char*[nd];
  char **d = new char*[nd];
  float *p = new float[nd];
  cudaStream_t *s = new cudaStream_t[nd];
#pragma omp parallel
  {
  int t = omp_get_thread_num();
  if (t < nt){
    for (int i = 0; i < dpt; i++){
      cudaSetDevice(i+t*dpt);
      cudaHostAlloc(h+i+t*dpt, ds, cudaHostAllocDefault);
      cudaMalloc(d+i+t*dpt, ds);
      cudaStreamCreate(s+i+t*dpt);}
#pragma omp barrier
    unsigned long long dt = dtime_usec(0);
    for (int i = 0; i < dpt; i++){
      cudaSetDevice(i+t*dpt);
      cudaMemcpyAsync(d[i+t*dpt], h[i+t*dpt], ds, cudaMemcpyHostToDevice, s[i+t*dpt]);}
    for (int i = 0; i < dpt; i++) {
      cudaSetDevice(i+t*dpt);
      cudaDeviceSynchronize();}
    dt = dtime_usec(dt);
    p[t*dpt] = ds*dpt/(float)dt;
#pragma omp barrier
    dt = dtime_usec(0);
    for (int i = 0; i < dpt; i++){
      cudaSetDevice(i+t*dpt);
      cudaMemcpyAsync(h[i+t*dpt], d[i+t*dpt], ds, cudaMemcpyDeviceToHost, s[i+t*dpt]);}
    for (int i = 0; i < dpt; i++) {
      cudaSetDevice(i+t*dpt);   // sync this thread's own devices
      cudaDeviceSynchronize();}
    dt = dtime_usec(dt);
    p[t*dpt+1] = ds*dpt/(float)dt;
  }
  }
  for (int t = 0; t < nt; t++)
    std::cout << "aggregate host to device bandwidth for thread " << t << " devices: " << p[t*dpt] << "MB/s" << std::endl;
  for (int t = 0; t < nt; t++)
    std::cout << "aggregate device to host bandwidth for thread " << t << " devices: " << p[t*dpt+1] << "MB/s" << std::endl;
}

As follows:

nvcc -Xcompiler -fopenmp test.cu -o test -lgomp

And run it as follows:

OMP_NUM_THREADS=2 OMP_PROC_BIND=TRUE OMP_PLACES="{0},{64}" ./test 1000000000

and report.

The only thing I haven't accounted for is memory affinity. There may be some magic needed to get OpenMP to allocate memory local to the socket that the allocation request comes from.
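One possibility (just a sketch, untested on a system like yours; it assumes libnuma is installed and the build adds -lnuma to the usual nvcc line) is to place the host buffer explicitly with numa_alloc_onnode() and then pin it with cudaHostRegister(), instead of letting cudaHostAlloc() choose the pages:

#include <numa.h>          // libnuma; link with -lnuma
#include <cstring>
#include <iostream>

#include <cuda_runtime.h>

// Allocate a host buffer whose pages are bound to a specific NUMA node,
// then pin it for DMA with cudaHostRegister.
char* allocPinnedOnNode(size_t bytes, int node) {
  if (numa_available() < 0) return nullptr;      // no NUMA support
  char* p = static_cast<char*>(numa_alloc_onnode(bytes, node));
  if (!p) return nullptr;
  std::memset(p, 0, bytes);                      // fault the pages in on that node
  cudaError_t err = cudaHostRegister(p, bytes, cudaHostRegisterDefault);
  if (err != cudaSuccess) {
    std::cerr << "cudaHostRegister failed: " << cudaGetErrorString(err) << std::endl;
    numa_free(p, bytes);
    return nullptr;
  }
  return p;
}

void freePinnedOnNode(char* p, size_t bytes) {
  cudaHostUnregister(p);
  numa_free(p, bytes);
}

int main() {
  const size_t bytes = 32 << 20;
  // Node 0 is the one local to GPUs 0/1 on your box; node 1 for GPUs 2/3.
  char* buf = allocPinnedOnNode(bytes, 0);
  if (buf) freePinnedOnNode(buf, bytes);
  return 0;
}

Running the whole process under numactl --membind would be a cruder way to get a similar effect.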

bbudge@bbudge-centos-L1WCVH9NS7 ~/tests/testbw                                                                   [5:53:37]
> $ CUDA_VISIBLE_DEVICES="2,3" taskset -c 64  ./test 1000000000                                                           
aggregate host to device bandwidth for 2 devices: 41891.8MB/s
aggregate device to host bandwidth for 2 devices: 44022.8MB/s
                                                                                                                           
bbudge@bbudge-centos-L1WCVH9NS7 ~/tests/testbw                                                                   [5:53:46]
> $ CUDA_VISIBLE_DEVICES="2,3" taskset -c 0  ./test 1000000000                                                            
aggregate host to device bandwidth for 2 devices: 29053.3MB/s
aggregate device to host bandwidth for 2 devices: 41835.8MB/s

So yeah, that seems symmetric.

> $ OMP_NUM_THREADS=2 OMP_PROC_BIND=TRUE OMP_PLACES="{0},{64}" ./test 1000000000                                          
aggregate host to device bandwidth for thread 0 devices: 43797.2MB/s
aggregate host to device bandwidth for thread 1 devices: 42585.8MB/s
aggregate device to host bandwidth for thread 0 devices: 44340MB/s
aggregate device to host bandwidth for thread 1 devices: 44338MB/s

I modified your code to the following:

#include <iostream>
#include <cstdlib>
#include <time.h>
#include <sys/time.h>
#include <omp.h>
#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start=0){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

int main(int argc, char *argv[]){

  size_t ds = 1048576*32;

  int nt = 2;
  int dpt = 2;
  
  if (argc > 1) ds = atoi(argv[1]);
  if (argc > 2) {
    nt = atoi(argv[2]);
    dpt = atoi(argv[3]);
  }
  int nd;
  cudaGetDeviceCount(&nd);
  if (nd < dpt*nt) {std::cout << "not enough devices" << std::endl; return 0;}
  char **h = new char*[nd];
  char **d = new char*[nd];
  float *p = new float[2*nt];
  cudaStream_t *s = new cudaStream_t[nd];
#pragma omp parallel
  {
  int t = omp_get_thread_num();
  if (t < nt){
    for (int i = 0; i < dpt; i++){
      cudaSetDevice(i+t*dpt);
      cudaHostAlloc(h+i+t*dpt, ds, cudaHostAllocDefault);
      cudaMalloc(d+i+t*dpt, ds);
      cudaStreamCreate(s+i+t*dpt);}
#pragma omp barrier
    unsigned long long dt = dtime_usec(0);
    for (int i = 0; i < dpt; i++){
      cudaSetDevice(i+t*dpt);
      cudaMemcpyAsync(d[i+t*dpt], h[i+t*dpt], ds, cudaMemcpyHostToDevice, s[i+t*dpt]);}
    for (int i = 0; i < dpt; i++) {
      cudaSetDevice(i+t*dpt);
      cudaDeviceSynchronize();}
    dt = dtime_usec(dt);
    p[t] = ds*dpt/(float)dt;
#pragma omp barrier
    dt = dtime_usec(0);
    for (int i = 0; i < dpt; i++){
      cudaSetDevice(i+t*dpt);
      cudaMemcpyAsync(h[i+t*dpt], d[i+t*dpt], ds, cudaMemcpyDeviceToHost, s[i+t*dpt]);}
    for (int i = 0; i < dpt; i++) {
      cudaSetDevice(i+t*dpt);   // sync this thread's own devices
      cudaDeviceSynchronize();}
    dt = dtime_usec(dt);
    p[nt+t] = ds*dpt/(float)dt;
  }
  }
  for (int t = 0; t < nt; t++)
    std::cout << "aggregate host to device bandwidth for thread " << t << " devices: " << p[t] << "MB/s" << std::endl;
  for (int t = 0; t < nt; t++)
    std::cout << "aggregate device to host bandwidth for thread " << t << " devices: " << p[nt+t] << "MB/s" << std::endl;
}

Now we can see that if we initiate transfers from different CCXs we get good bandwidth, while if we initiate them in parallel from within the same CCX we get worse bandwidth:

> $ OMP_NUM_THREADS=4 OMP_PROC_BIND=TRUE OMP_PLACES="{0},{1},{64},{63}" ./test 1000000000 4 1                             
aggregate host to device bandwidth for thread 0 devices: 16782.2MB/s
aggregate host to device bandwidth for thread 1 devices: 20624.9MB/s
aggregate host to device bandwidth for thread 2 devices: 25682.5MB/s
aggregate host to device bandwidth for thread 3 devices: 13579.2MB/s
aggregate device to host bandwidth for thread 0 devices: 16660.3MB/s
aggregate device to host bandwidth for thread 1 devices: 16659.4MB/s
aggregate device to host bandwidth for thread 2 devices: 16659.4MB/s
aggregate device to host bandwidth for thread 3 devices: 7.14286e+07MB/s

> $ OMP_NUM_THREADS=4 OMP_PROC_BIND=TRUE OMP_PLACES="{0},{8},{64},{72}" ./test 1000000000 4 1                             
aggregate host to device bandwidth for thread 0 devices: 22609.6MB/s
aggregate host to device bandwidth for thread 1 devices: 22054.6MB/s
aggregate host to device bandwidth for thread 2 devices: 21748.1MB/s
aggregate host to device bandwidth for thread 3 devices: 23009.1MB/s
aggregate device to host bandwidth for thread 0 devices: 22182.8MB/s
aggregate device to host bandwidth for thread 1 devices: 22181.8MB/s
aggregate device to host bandwidth for thread 2 devices: 22183.3MB/s
aggregate device to host bandwidth for thread 3 devices: 22182.3MB/s
                                                                                                                           
> $ OMP_NUM_THREADS=4 OMP_PROC_BIND=TRUE OMP_PLACES="{0},{9},{64},{73}" ./test 1000000000 4 1                             
aggregate host to device bandwidth for thread 0 devices: 22600.4MB/s
aggregate host to device bandwidth for thread 1 devices: 22038.6MB/s
aggregate host to device bandwidth for thread 2 devices: 21797.4MB/s
aggregate host to device bandwidth for thread 3 devices: 22964.2MB/s
aggregate device to host bandwidth for thread 0 devices: 22177.9MB/s
aggregate device to host bandwidth for thread 1 devices: 22177.9MB/s
aggregate device to host bandwidth for thread 2 devices: 22177.9MB/s
aggregate device to host bandwidth for thread 3 devices: 22146.9MB/s
                                                                                                                           
> $ OMP_NUM_THREADS=4 OMP_PROC_BIND=TRUE OMP_PLACES="{0},{16},{64},{80}" ./test 1000000000 4 1                            
aggregate host to device bandwidth for thread 0 devices: 22749MB/s
aggregate host to device bandwidth for thread 1 devices: 21903.4MB/s
aggregate host to device bandwidth for thread 2 devices: 21258.5MB/s
aggregate host to device bandwidth for thread 3 devices: 23014.4MB/s
aggregate device to host bandwidth for thread 0 devices: 22166.6MB/s
aggregate device to host bandwidth for thread 1 devices: 22166.1MB/s
aggregate device to host bandwidth for thread 2 devices: 22168.5MB/s
aggregate device to host bandwidth for thread 3 devices: 22166.6MB/s

So I think this gives good insight for this particular machine. It probably also makes sense for AMD architectures in general (schedule at most one transfer per CCX at a time, or issue the concurrent async transfers from a single thread within a CCX).

Is there a good rule of thumb for other architectures? Xeon, PPC, ARM?

Also, is there a way to programmatically determine NUMA + PCI-E topology on different OSs? It is one thing to hard code functionality for the machine I’m testing on, but ideally this would work “well” on any machine I tried to run on.

I believe on Linux this could be done by querying CUDA for the PCIe connection (cudaDeviceGetPCIBusId) and then reading /sys/class/pci_bus/$BUSID/cpulistaffinity to get the CPU ranges directly connected. We'd still need to build in some voodoo for e.g. the CCX stuff discovered in this thread.
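Something along these lines is what I'm picturing (just a sketch, untested; it assumes those sysfs paths exist on the target machine, and that the bus-ID string from CUDA needs lower-casing and truncating to the domain:bus portion to match the /sys/class/pci_bus entry):

#include <algorithm>
#include <cctype>
#include <fstream>
#include <iostream>
#include <string>

#include <cuda_runtime.h>

// For each CUDA device, read the CPU list directly attached to the PCI bus
// the device sits on.  The sysfs path is the one mentioned above and may
// differ between distros/kernels.
int main() {
  int numDevices = 0;
  cudaGetDeviceCount(&numDevices);
  for (int dev = 0; dev < numDevices; ++dev) {
    char busId[32] = {0};
    cudaDeviceGetPCIBusId(busId, sizeof(busId), dev);  // e.g. "0000:25:00.0"

    std::string id(busId);
    std::transform(id.begin(), id.end(), id.begin(),
                   [](unsigned char c) { return std::tolower(c); });
    std::string bus = id.substr(0, 7);  // keep "domain:bus", e.g. "0000:25"

    std::ifstream f("/sys/class/pci_bus/" + bus + "/cpulistaffinity");
    std::string cpus;
    if (f && std::getline(f, cpus)) {
      std::cout << "device " << dev << " (" << id << ") local CPUs: " << cpus << std::endl;
    } else {
      std::cout << "device " << dev << " (" << id << "): no affinity info" << std::endl;
    }
  }
  return 0;
}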

Ideas for Windows?

I think the last test I wrote demonstrates that if you pay attention to CPU/GPU affinity by CPU socket, you can get good results. I think that is the general rule of thumb that I have followed.

There are ways to determine NUMA+PCIe topology programmatically on Linux. I don't know about Windows.

Well, CPU socket (NUMA node 0 vs. 1) is part of the story, but what I found very useful from this thread was being able to modify your code and test what happens when threads are distributed across CCXs. I have a case where a lot of setup is done per GPU (e.g. 000000, 111111, 222222, 333333), and it is not easy to reorganize the code so that the per-GPU setup work is interleaved like 0123012301230123, so it is pretty important for me to run the different devices' setup in parallel threads.

My question was around gotchas like this (not sharing CCX between multiple threads) on other platforms.

In any case, thanks very much Robert, your thoughts and code were super helpful!

Of the latest set of tests you ran, the only one that demonstrates inconsistent results is the first one:

And in that case you are putting 3 threads on one socket. That means one thread is talking to “its” GPU across the inter-socket link. The previous tests demonstrated that could be problematic. The other tests all have 2 threads talking to their GPUs on socket 0 and 2 threads talking to their GPUs on socket 1. So that is just following the pattern I suggested. The only thing “new” that I can see is that you don’t necessarily have to use just one thread per socket, and you don’t necessarily have to use the logical cores I picked.

Anyway, good luck! I think you have ironed out the inconsistent transfer speed, or at least you know how to avoid it.

I don’t know why you came to that conclusion. My code scheduled two simultaneous transfers per CCX. But do as you wish, of course.