NVIDIA Quadro RTX 8000 bandwidthTest Theoretical Max Results

Hi All. I would like to know what the max Host to Device Bandwidth and Device to Host Bandwidth for a NVIDIA Quatro RTX 8000 in passthrough mode to a VM running Ubuntu.

I noticed the PCIe Generation Current is 1 but max it 3. Does that mean the GPU is not in the max performance slot?

nvidia-bug-report.log.gz (604.3 KB)

root@xray:/usr/local/cuda-11.4/extras/demo_suite# lsb_release -a

No LSB modules are available.

Distributor ID: Ubuntu

Description: Ubuntu 18.04.5 LTS

Release: 18.04

Codename: bionic

root@xray:/usr/local/cuda-11.4/extras/demo_suite# /usr/local/cuda-11.4/extras/demo_suite/bandwidthTest --memory=pinned
[CUDA Bandwidth Test] - Starting…
Running on…

Device 0: Quadro RTX 8000
Quick Mode

Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 11709.8

Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 12716.4

Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 465962.9

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

root@xray:/usr/local/cuda-11.4/extras/demo_suite# nvidia-smi -q

==============NVSMI LOG==============

Timestamp : Mon Aug 9 14:08:39 2021
Driver Version : 470.42.01
CUDA Version : 11.4

Attached GPUs : 1
GPU 00000000:00:06.0
Product Name : Quadro RTX 8000
Product Brand : NVIDIA
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Enabled
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number :
GPU UUID :
Minor Number : 0
VBIOS Version : 90.02.4E.00.02
MultiGPU Board : No
Board ID : 0x6
GPU Part Number : 900-2G150-0050-000
Module ID : 0
Inforom Version
Image Version : G150.0231.00.02
OEM Object : 1.1
ECC Object : 5.0
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : Pass-Through
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x00
Device : 0x06
Domain : 0x0000
Device Id : 0x1E7810DE
Bus Id : 00000000:00:06.0
Sub System Id : 0x13D810DE
GPU Link Info
PCIe Generation
Max : 3
Current : 1
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : N/A
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 45556 MiB
Used : 0 MiB
Free : 45556 MiB
BAR1 Memory Usage
Total : 32768 MiB
Used : 2 MiB
Free : 32766 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Retired Pages
Single Bit ECC : 0
Double Bit ECC : 0
Pending Page Blacklist : No
Remapped Rows : N/A
Temperature
GPU Current Temp : 30 C
GPU Shutdown Temp : 87 C
GPU Slowdown Temp : 84 C
GPU Max Operating Temp : 82 C
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 19.16 W
Power Limit : 250.00 W
Default Power Limit : 250.00 W
Enforced Power Limit : 250.00 W
Min Power Limit : 150.00 W
Max Power Limit : 250.00 W
Clocks
Graphics : 300 MHz
SM : 300 MHz
Memory : 405 MHz
Video : 540 MHz
Applications Clocks
Graphics : 1230 MHz
Memory : 6501 MHz
Default Applications Clocks
Graphics : 1230 MHz
Memory : 6501 MHz
Max Clocks
Graphics : 1620 MHz
SM : 1620 MHz
Memory : 6501 MHz
Video : 1500 MHz
Max Customer Boost Clocks
Graphics : 1620 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : N/A
Processes : None

A PCIe gen3 x16 link offers around 12 GB/sec of usable bandwidth in either direction (it’s a full duplex link). Your numbers suggest your GPU is achieving that.

For power efficiency, modern GPUs will drop PCIe links to lower performance levels when not in use.

I have never used a Quadro RTX 8000 so don’t know what to expect for measured device bandwidth. The official specifications say theoretical bandwidth is 672 GB/s. Practically achievable bandwidth on raw metal is typically 80% of theoretical, which would mean 540 GB/sec. I cannot hazard a guess as to how virtualization interacts with the specific benchmark you chose. You may want to run the zcopy device memory benchmarking code below for a second opinion. For example, the output on my Quadro RTX 4000 (theoretical bandwidth per specification: 416 GB/sec) looks like this:

C:\Users\Norbert\My Programs>zcopy -d0 -n100000000
zcopy: running on device 0 (Quadro RTX 4000)
zcopy: using vectors of 100000000 double2 elements (= 1.600e+09 bytes)
zcopy: using 128 threads per block, 781250 blocks
zcopy: mintime = 8.419 msec; eqv. memory bandwith = 380.07 GB/sec
#include <stdlib.h>
#include <stdio.h>
#include <math.h>

#define ZCOPY_THREADS  128
#define ZCOPY_DEFLEN   30000000
#define ZCOPY_DEFDEV   0
#define ZCOPY_ITER     10           // as in STREAM benchmark

// Macro to catch CUDA errors in CUDA runtime calls
#define CUDA_SAFE_CALL(call)                                          \
do {                                                                  \
    cudaError_t err = call;                                           \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)

// Macro to catch CUDA errors in kernel launches
#define CHECK_LAUNCH_ERROR()                                          \
do {                                                                  \
    /* Check synchronous errors, i.e. pre-launch */                   \
    cudaError_t err = cudaGetLastError();                             \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
    /* Check asynchronous errors, i.e. kernel failed (ULF) */         \
    err = cudaDeviceSynchronize();                                    \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString( err) );      \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)

#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
    LARGE_INTEGER t;
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;

    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    }
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() * 1.0e-3;
    }
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif

__global__ void zcopy (const double2 * __restrict__ src, 
                       double2 * __restrict__ dst, int len)
{
    int stride = gridDim.x * blockDim.x;
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    for (int i = tid; i < len; i += stride) {
        dst[i] = src[i];
    }
}    

struct zcopyOpts {
    int len;
    int dev;
};

static int processArgs (int argc, char *argv[], struct zcopyOpts *opts)
{
    int error = 0;
    memset (opts, 0, sizeof(*opts));
    while (argc) {
        if (*argv[0] == '-') {
            switch (*(argv[0]+1)) {
            case 'n':
                opts->len = atol(argv[0]+2);
                break;
            case 'd':
                opts->dev = atol(argv[0]+2);
                break;
            default:
                fprintf (stderr, "Unknown switch '%c%s'\n", '-', argv[0]+1);
                error++;
                break;
            }
        }
        argc--;
        argv++;
    }
    return error;
}

int main (int argc, char *argv[])
{
    double start, stop, elapsed, mintime;
    double2 *d_a, *d_b;
    int errors;
    struct zcopyOpts opts;

    errors = processArgs (argc, argv, &opts);
    if (errors) {
        return EXIT_FAILURE;
    }
    opts.len = (opts.len) ? opts.len : ZCOPY_DEFLEN;
    opts.dev = (opts.dev) ? opts.dev : ZCOPY_DEFDEV;

    struct cudaDeviceProp props;
    CUDA_SAFE_CALL (cudaSetDevice (opts.dev));
    CUDA_SAFE_CALL (cudaGetDeviceProperties (&props, opts.dev));
    printf ("zcopy: running on device %d (%s)\n", opts.dev, props.name);

    /* Allocate memory on device */
    CUDA_SAFE_CALL (cudaMalloc((void**)&d_a, sizeof(d_a[0]) * opts.len));
    CUDA_SAFE_CALL (cudaMalloc((void**)&d_b, sizeof(d_b[0]) * opts.len));
    
    /* Initialize device memory */
    CUDA_SAFE_CALL (cudaMemset(d_a, 0x00, sizeof(d_a[0]) * opts.len)); // zero
    CUDA_SAFE_CALL (cudaMemset(d_b, 0xff, sizeof(d_b[0]) * opts.len)); // NaN

    /* Compute execution configuration */
    dim3 dimBlock(ZCOPY_THREADS);
    int threadBlocks = (opts.len + (dimBlock.x - 1)) / dimBlock.x;
    dim3 dimGrid(threadBlocks);
    
    printf ("zcopy: using vectors of %d double2 elements (= %.3e bytes)\n", 
            opts.len, (double)sizeof(d_a[0]) * opts.len);
    printf ("zcopy: using %d threads per block, %d blocks\n", 
            dimBlock.x, dimGrid.x);

    mintime = fabs(log(0.0));
    for (int k = 0; k < ZCOPY_ITER; k++) {
        start = second();
        zcopy<<<dimGrid,dimBlock>>>(d_a, d_b, opts.len);
        CHECK_LAUNCH_ERROR();
        stop = second();
        elapsed = stop - start;
        if (elapsed < mintime) mintime = elapsed;
    }
    printf ("zcopy: mintime = %.3f msec; eqv. memory bandwith = %.2f GB/sec\n",
            1.0e3 * mintime, (2 * sizeof(d_a[0]) * opts.len) / (1e9 * mintime));

    CUDA_SAFE_CALL (cudaFree(d_a));
    CUDA_SAFE_CALL (cudaFree(d_b));

    return EXIT_SUCCESS;
}
1 Like

I just confirmed the PCI Current went to 3 under load. Thank you for that information.

zcopy results:

nutanix@xray:/usr/local/cuda-11.4/extras/demo_suite$ ./zcopy -d0 -n100000000
zcopy: running on device 0 (Quadro RTX 8000)
zcopy: using vectors of 100000000 double2 elements (= 1.600e+09 bytes)
zcopy: using 128 threads per block, 781250 blocks
zcopy: mintime = 6.840 msec; eqv. memory bandwith = 467.84 GB/sec

At least the zcopy result is consistent with the result from bandwidthTest. Why both are lower than expected I cannot say. You may want to try running on bare metal instead of through a VM.

1 Like