PCIe A100s - Slow PCIe speed tests?

I have several servers with PCIE A100s installed and they all suffer the same odd behavior when testing PCIE link speed. Speeds are from the CUDA bandwidth test.

90% of the time speed will be around 18-19GB/s (PCIE 4.0 @ 16x), and the other 10% the speed will show the correct 22-24GB/s speed I would expect. Card memory speed is also reported lower, from around 1150GB/s to 1000GB/s.

This seems to have no real performance effect, it just bothers me. These are the only cards I have that have this issue.

I’ve tried enabling persistence mode but that has no effect.

The servers are all AMD EPYC 7002 based systems, but I have seen the same issue with Intel based systems as well (PCIE speed is not affected, as they run at PCIE 3.0, but the GPU memory speed has the same weird issue, reporting around 1000GB/s 90% of the time)

Any insight into what this could be caused by and/or what the solution may be would be greatly appreciated.

Thank you.

This CPU is made up of multiple dies connected by a cache-coherent interconnect. Each chiplet has its own PCIe root complex and DDR4 memory controller. So essentially this is a NUMA system in a single package. Each chiplet reaches other chiplets with one, or at most two, hops. Nonetheless this adds overhead when a GPU communicates with a “far” chiplet and accesses the “far” memory attached to it.

For best performance, use numactl memory and processor affinity settings. The benchmark process wants to run on the chiplet with the PCIe root complex to which the GPU is directly connected and also use the DRAM controller on that chiplet.

While to my knowledge, Intel processors do not use chiplets yet, your server may be a dual-socket machine, resulting in even more pronounced NUMA behavior than observed with the AMD server chips.

Treat this as a working hypothesis. I have no hands-on experience with large EPYC-based server configurations, but have occasionally run into these NUMA issues over the past two decades. In many circumstances, these modern server architectures can be treated as “sufficiently UMA” due to the use of (internal or external) high-speed links. But when you are looking for the highest, most consistent, performance you would want to use numactl to bind processes.

I can only speculate, not having used an A100. It could be an artifact of the measurement methodology, e.g. the benchmark runs into too short a time due to the high transfer rate, or does not use the best-of-ten runs methodology used by the gold standard of memory bandwidth tests, John McCalpin’s STREAM benchmark. The benchmark may also use memory blocks that are too small to achieve steady-state operation. GB-sized blocks probably are best for this kind of hardware.

To my knowledge, the interfaces of the high-performance memories on the A100 frequently renegotiate bit rates based on current operating conditions. If the memory is running hot, performance may be negatively affected. nvidia-smi lets you monitor the memory temperature on some GPUs.

For a second opinion on the device memory bandwidth, you could try the zcopy program below. With the two GPUs in my system, I need blocks in excess of 1 GB to observe the maximum bandwidth.

C:\Users\Norbert\My Programs>zcopy -d1 -n110000000
zcopy: running on device 1 (Quadro P2000)
zcopy: using vectors of 110000000 double2 elements (= 1.760e+09 bytes)
zcopy: using 128 threads per block, 859375 blocks
zcopy: mintime = 28.749 msec; eqv. memory bandwith = 122.44 GB/sec

C:\Users\Norbert\My Programs>zcopy -d0 -n110000000
zcopy: running on device 0 (Quadro RTX 4000)
zcopy: using vectors of 110000000 double2 elements (= 1.760e+09 bytes)
zcopy: using 128 threads per block, 859375 blocks
zcopy: mintime = 9.530 msec; eqv. memory bandwith = 369.35 GB/sec
#include <stdlib.h>
#include <stdio.h>
#include <math.h>

#define ZCOPY_THREADS  128
#define ZCOPY_DEFLEN   30000000
#define ZCOPY_DEFDEV   0
#define ZCOPY_ITER     10           // as in STREAM benchmark

// Macro to catch CUDA errors in CUDA runtime calls
#define CUDA_SAFE_CALL(call)                                          \
do {                                                                  \
    cudaError_t err = call;                                           \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)

// Macro to catch CUDA errors in kernel launches
#define CHECK_LAUNCH_ERROR()                                          \
do {                                                                  \
    /* Check synchronous errors, i.e. pre-launch */                   \
    cudaError_t err = cudaGetLastError();                             \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
    /* Check asynchronous errors, i.e. kernel failed (ULF) */         \
    err = cudaDeviceSynchronize();                                    \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString( err) );      \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)

#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
    LARGE_INTEGER t;
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;

    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    }
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() * 1.0e-3;
    }
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif

__global__ void zcopy (const double2 * __restrict__ src, 
                       double2 * __restrict__ dst, int len)
{
    int stride = gridDim.x * blockDim.x;
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    for (int i = tid; i < len; i += stride) {
        dst[i] = src[i];
    }
}    

struct zcopyOpts {
    int len;
    int dev;
};

static int processArgs (int argc, char *argv[], struct zcopyOpts *opts)
{
    int error = 0;
    memset (opts, 0, sizeof(*opts));
    while (argc) {
        if (*argv[0] == '-') {
            switch (*(argv[0]+1)) {
            case 'n':
                opts->len = atol(argv[0]+2);
                break;
            case 'd':
                opts->dev = atol(argv[0]+2);
                break;
            default:
                fprintf (stderr, "Unknown switch '%c%s'\n", '-', argv[0]+1);
                error++;
                break;
            }
        }
        argc--;
        argv++;
    }
    return error;
}

int main (int argc, char *argv[])
{
    double start, stop, elapsed, mintime;
    double2 *d_a, *d_b;
    int errors;
    struct zcopyOpts opts;

    errors = processArgs (argc, argv, &opts);
    if (errors) {
        return EXIT_FAILURE;
    }
    opts.len = (opts.len) ? opts.len : ZCOPY_DEFLEN;
    opts.dev = (opts.dev) ? opts.dev : ZCOPY_DEFDEV;

    struct cudaDeviceProp props;
    CUDA_SAFE_CALL (cudaSetDevice (opts.dev));
    CUDA_SAFE_CALL (cudaGetDeviceProperties (&props, opts.dev));
    printf ("zcopy: running on device %d (%s)\n", opts.dev, props.name);

    /* Allocate memory on device */
    CUDA_SAFE_CALL (cudaMalloc((void**)&d_a, sizeof(d_a[0]) * opts.len));
    CUDA_SAFE_CALL (cudaMalloc((void**)&d_b, sizeof(d_b[0]) * opts.len));
    
    /* Initialize device memory */
    CUDA_SAFE_CALL (cudaMemset(d_a, 0x00, sizeof(d_a[0]) * opts.len)); // zero
    CUDA_SAFE_CALL (cudaMemset(d_b, 0xff, sizeof(d_b[0]) * opts.len)); // NaN

    /* Compute execution configuration */
    dim3 dimBlock(ZCOPY_THREADS);
    int threadBlocks = (opts.len + (dimBlock.x - 1)) / dimBlock.x;
    dim3 dimGrid(threadBlocks);
    
    printf ("zcopy: using vectors of %d double2 elements (= %.3e bytes)\n", 
            opts.len, (double)sizeof(d_a[0]) * opts.len);
    printf ("zcopy: using %d threads per block, %d blocks\n", 
            dimBlock.x, dimGrid.x);

    mintime = fabs(log(0.0));
    for (int k = 0; k < ZCOPY_ITER; k++) {
        start = second();
        zcopy<<<dimGrid,dimBlock>>>(d_a, d_b, opts.len);
        CHECK_LAUNCH_ERROR();
        stop = second();
        elapsed = stop - start;
        if (elapsed < mintime) mintime = elapsed;
    }
    printf ("zcopy: mintime = %.3f msec; eqv. memory bandwith = %.2f GB/sec\n",
            1.0e3 * mintime, (2 * sizeof(d_a[0]) * opts.len) / (1e9 * mintime));

    CUDA_SAFE_CALL (cudaFree(d_a));
    CUDA_SAFE_CALL (cudaFree(d_b));

    return EXIT_SUCCESS;
}