Unexpected Performance Discrepancy Between RTX A6000 and RTX 3090

Hi all,

I’m encountering a significant performance discrepancy between our server with three NVIDIA RTX A6000 GPUs and a PC with a single NVIDIA GeForce RTX 3090 GPU. Although I would expect the benchmarks to be relatively close (each GPU was tested independently), the RTX A6000s underperform the RTX 3090 by a large margin. Below are the details of the tests and the results we’ve obtained.

Benchmark Test Details

We ran a Python benchmark on both systems, performing matrix multiplication on two large random matrices. The script is as follows:

import torch
import time

def benchmark(device):
    if device != 'cpu':
        torch.cuda.set_device(device)
        device_name = torch.cuda.get_device_name(device)
    else:
        device_name = 'CPU'
    print(f"Running benchmark on {device_name}")

    size = 10000
    iterations = 10

    tensor_a = torch.randn(size, size, device=device)
    tensor_b = torch.randn(size, size, device=device)

    # Warm-up iterations (not timed)
    for _ in range(10):
        _ = torch.mm(tensor_a, tensor_b)

    # Timed iterations
    start_time = time.time()
    for _ in range(iterations):
        _ = torch.mm(tensor_a, tensor_b)
    end_time = time.time()

    avg_time = (end_time - start_time) / iterations
    print(f"Average time for matrix multiplication of {size}x{size} tensors on {device_name}: {avg_time:.6f} seconds")

if __name__ == "__main__":
    if torch.cuda.is_available():
        num_gpus = torch.cuda.device_count()
        print(f"Number of GPUs available: {num_gpus}")
        for i in range(num_gpus):
            benchmark(i)
    else:
        print("CUDA is not available. Running benchmark on CPU only.")
        benchmark('cpu')

Benchmark Results

PC with one GPU (RTX 3090)

  • Number of GPUs available: 1
  • GPU (RTX 3090): Average time: 0.000008 seconds
  • CPU: Average time: 0.918097 seconds

Server with 3 GPUs (RTX A6000):

  • Number of GPUs available: 3
  • GPU 1 (RTX A6000): Average time: 0.000015 seconds
  • GPU 2 (RTX A6000): Average time: 0.000014 seconds
  • GPU 3 (RTX A6000): Average time: 0.000014 seconds
  • CPU: Average time: 0.693448 seconds

We also conducted similar MATLAB tests (multiplying two large random matrices), focusing on simple GPU operations:

MATLAB Results:

  • RTX 3090: Computation time: 0.48609 seconds
  • RTX A6000: Computation time: 1.0159 seconds

System Specifications

PC with one GPU (RTX 3090):

  • CPU: 12 cores @ 3.5GHz
  • GPU: NVIDIA GeForce RTX 3090 (24GB Memory)
  • Ubuntu 20.04.6 LTS
  • CUDA Version: 12.2
  • Driver Version: 535.183.01
  • Python version: 3.9.7
  • MATLAB version: R2021a

Server with 3 GPUs (RTX A6000)

  • CPU: 2 sockets, 16 cores each @ 2.4GHz
  • GPU: 3x NVIDIA RTX A6000 (48GB Memory each)
  • Ubuntu 22.04.4 LTS
  • CUDA Version: 12.2
  • Driver Version: 535.183.01
  • Python version: 3.9.7
  • MATLAB version: R2021a

The performance difference between the RTX A6000 and RTX 3090 is unexpected. We are seeking advice on:

  1. Potential bottlenecks or misconfigurations.
  2. Additional tests or diagnostics to perform.
  3. Any insights into why the RTX A6000 might underperform compared to the RTX 3090 in our use case.

Your expertise and suggestions would be immensely valuable in helping us resolve this issue. Thank you in advance for your assistance!

Best regards,
Zihan

Bunch of ideas:
How long are your tests actually running?
You want to make sure a significant portion of the benchmark is spent on GPU compute rather than on kernel-launch and loading overhead (see the timing sketch at the end of this reply)…
How busy is your CPU? Monitor it at the per-core level, not averaged across all cores; there may be a single-core bottleneck, since your gaming CPU clocks roughly 50% higher than the server CPU…
Have you monitored the GPUs while benchmarking: clocks, temperatures, power draw? Temperatures that are too high will keep the clocks down, and a slow PCIe link or other factors might also be what is slowing the run (see the monitor sketch below)…
I’d expect roughly similar performance between these two GPUs, with the gaming card perhaps clocking a little higher…
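On the timing point specifically: CUDA kernel launches in PyTorch are asynchronous, so timing torch.mm with time.time() alone mostly measures launch overhead rather than the multiplication itself, which would explain the microsecond averages on both machines. Here is a minimal sketch of a synchronized version (the benchmark_sync name and structure are just illustrative; sizes are assumed to match your script), using torch.cuda.synchronize plus CUDA events:

import torch
import time

def benchmark_sync(device, size=10000, iterations=10):
    torch.cuda.set_device(device)
    device_name = torch.cuda.get_device_name(device)

    tensor_a = torch.randn(size, size, device=device)
    tensor_b = torch.randn(size, size, device=device)

    # Warm-up, then wait for all queued kernels to finish
    for _ in range(10):
        torch.mm(tensor_a, tensor_b)
    torch.cuda.synchronize(device)

    # Wall-clock timing with an explicit synchronize at the end;
    # without it, the loop only measures kernel-launch overhead
    start_time = time.time()
    for _ in range(iterations):
        torch.mm(tensor_a, tensor_b)
    torch.cuda.synchronize(device)
    wall_avg = (time.time() - start_time) / iterations

    # CUDA events time the work on the GPU itself (in milliseconds)
    start_evt = torch.cuda.Event(enable_timing=True)
    end_evt = torch.cuda.Event(enable_timing=True)
    start_evt.record()
    for _ in range(iterations):
        torch.mm(tensor_a, tensor_b)
    end_evt.record()
    torch.cuda.synchronize(device)
    event_avg_ms = start_evt.elapsed_time(end_evt) / iterations

    print(f"{device_name}: {wall_avg:.4f} s wall, "
          f"{event_avg_ms:.1f} ms (CUDA events) per {size}x{size} matmul")

if __name__ == "__main__":
    for i in range(torch.cuda.device_count()):
        benchmark_sync(i)

With the multiplication itself included, something on the order of tens of milliseconds per 10000x10000 FP32 matmul would be a plausible result on either card, rather than microseconds, and the A6000 vs 3090 comparison becomes meaningful.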
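And for the monitoring point, here is a small sketch of a poller using the pynvml bindings (this assumes the nvidia-ml-py package is installed; plain nvidia-smi dmon gives a similar live view). Run it in a separate terminal while the benchmark is going to watch per-GPU utilization, temperature, power, and SM clocks:

import time
import pynvml

# Poll every GPU once per second; stop with Ctrl-C
pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

try:
    while True:
        for i, h in enumerate(handles):
            util = pynvml.nvmlDeviceGetUtilizationRates(h)                   # percent
            temp = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
            power_w = pynvml.nvmlDeviceGetPowerUsage(h) / 1000.0             # mW -> W
            sm_mhz = pynvml.nvmlDeviceGetClockInfo(h, pynvml.NVML_CLOCK_SM)  # MHz
            print(f"GPU{i}: util {util.gpu:3d}%  temp {temp}C  "
                  f"power {power_w:6.1f} W  SM clock {sm_mhz} MHz")
        time.sleep(1)
except KeyboardInterrupt:
    pass
finally:
    pynvml.nvmlShutdown()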