Unexpected Performance Discrepancy Between RTX A6000 and RTX 3090

Hi all,

I’m encountering a significant performance discrepancy between our server with three NVIDIA RTX A6000 GPUs and a PC with a single NVIDIA GeForce RTX 3090 GPU. Although I would expect the benchmarks to be relatively close (each GPU was tested independently), the RTX A6000s underperform the RTX 3090 by a large margin. Below are the details of the tests and the results we’ve obtained.

Benchmark Test Details

We ran a Python benchmark on both systems, performing matrix multiplication on two large random matrices. The script is as follows:

import torch
import time

def benchmark(device):
    if device != 'cpu':
        torch.cuda.set_device(device)
        device_name = torch.cuda.get_device_name(device)
    else:
        device_name = 'CPU'
    print(f"Running benchmark on {device_name}")

    size = 10000
    iterations = 10

    tensor_a = torch.randn(size, size, device=device)
    tensor_b = torch.randn(size, size, device=device)

    # Warm-up iterations (not timed)
    for _ in range(10):
        _ = torch.mm(tensor_a, tensor_b)

    # Timed iterations
    start_time = time.time()
    for _ in range(iterations):
        _ = torch.mm(tensor_a, tensor_b)
    end_time = time.time()

    avg_time = (end_time - start_time) / iterations
    print(f"Average time for matrix multiplication of {size}x{size} tensors on {device_name}: {avg_time:.6f} seconds")

if __name__ == "__main__":
    if torch.cuda.is_available():
        num_gpus = torch.cuda.device_count()
        print(f"Number of GPUs available: {num_gpus}")
        for i in range(num_gpus):
            benchmark(i)
    else:
        print("CUDA is not available. Running benchmark on CPU only.")
        benchmark('cpu')

Benchmark Results

PC with one GPU (RTX 3090)

  • Number of GPUs available: 1
  • GPU (RTX 3090): Average time: 0.000008 seconds
  • CPU: Average time: 0.918097 seconds

Server with 3 GPUs (RTX A6000):

  • Number of GPUs available: 3
  • GPU 1 (RTX A6000): Average time: 0.000015 seconds
  • GPU 2 (RTX A6000): Average time: 0.000014 seconds
  • GPU 3 (RTX A6000): Average time: 0.000014 seconds
  • CPU: Average time: 0.693448 seconds

We also conducted similar MATLAB tests (multiplying two large random matrices), focusing on simple GPU operations:

MATLAB Results:

  • RTX 3090: Computation time: 0.48609 seconds
  • RTX A6000: Computation time: 1.0159 seconds

System Specifications

PC with one GPU (RTX 3090):

  • CPU: 12 cores @ 3.5GHz
  • GPU: NVIDIA GeForce RTX 3090 (24GB Memory)
  • Ubuntu 20.04.6 LTS
  • CUDA Version: 12.2
  • Driver Version: 535.183.01
  • Python version: 3.9.7
  • MATLAB version: R2021a

Server with 3 GPUs (RTX A6000)

  • CPU: 2 sockets, 16 cores each @ 2.4GHz
  • GPU: 3x NVIDIA RTX A6000 (48GB Memory each)
  • Ubuntu 22.04.4 LTS
  • CUDA Version: 12.2
  • Driver Version: 535.183.01
  • Python version: 3.9.7
  • MATLAB version: R2021a

The performance difference between the RTX A6000 and RTX 3090 is unexpected. We are seeking advice on:

  1. Potential bottlenecks or misconfigurations.
  2. Additional tests or diagnostics to perform.
  3. Any insights into why the RTX A6000 might underperform compared to the RTX 3090 in our use case.

Your expertise and suggestions would be immensely valuable in helping us resolve this issue. Thank you in advance for your assistance!

Best regards,
Zihan

Bunch of ideas:
How long are your tests actually running?
You want to make sure a significant portion of the benchmark is spent on GPU compute rather than on kernel-launch and loading overhead (see the timing sketch at the end of this reply)…
How busy is your CPU? Monitor it at the per-core level, not averaged across all cores; there may be a single-core bottleneck, since your gaming CPU clocks roughly 50% higher than the server CPU…
Have you monitored the GPUs while benchmarking: clocks, temperatures, power draw? Temperatures that are too high will keep the clocks down, and a slow PCIe link or other factors might also be what is slowing the run (see the monitor sketch below)…
I’d expect roughly similar performance between these two GPUs, with the gaming card perhaps clocking a little higher…
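On the timing point specifically: CUDA kernel launches in PyTorch are asynchronous, so timing torch.mm with time.time() alone mostly measures launch overhead rather than the multiplication itself, which would explain the microsecond averages on both machines. Here is a minimal sketch of a synchronized version (the benchmark_sync name and structure are just illustrative; sizes are assumed to match your script), using torch.cuda.synchronize plus CUDA events:

import torch
import time

def benchmark_sync(device, size=10000, iterations=10):
    torch.cuda.set_device(device)
    device_name = torch.cuda.get_device_name(device)

    tensor_a = torch.randn(size, size, device=device)
    tensor_b = torch.randn(size, size, device=device)

    # Warm-up, then wait for all queued kernels to finish
    for _ in range(10):
        torch.mm(tensor_a, tensor_b)
    torch.cuda.synchronize(device)

    # Wall-clock timing with an explicit synchronize at the end;
    # without it, the loop only measures kernel-launch overhead
    start_time = time.time()
    for _ in range(iterations):
        torch.mm(tensor_a, tensor_b)
    torch.cuda.synchronize(device)
    wall_avg = (time.time() - start_time) / iterations

    # CUDA events time the work on the GPU itself (in milliseconds)
    start_evt = torch.cuda.Event(enable_timing=True)
    end_evt = torch.cuda.Event(enable_timing=True)
    start_evt.record()
    for _ in range(iterations):
        torch.mm(tensor_a, tensor_b)
    end_evt.record()
    torch.cuda.synchronize(device)
    event_avg_ms = start_evt.elapsed_time(end_evt) / iterations

    print(f"{device_name}: {wall_avg:.4f} s wall, "
          f"{event_avg_ms:.1f} ms (CUDA events) per {size}x{size} matmul")

if __name__ == "__main__":
    for i in range(torch.cuda.device_count()):
        benchmark_sync(i)

With the multiplication itself included, something on the order of tens of milliseconds per 10000x10000 FP32 matmul would be a plausible result on either card, rather than microseconds, and the A6000 vs 3090 comparison becomes meaningful.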
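And for the monitoring point, here is a small sketch of a poller using the pynvml bindings (this assumes the nvidia-ml-py package is installed; plain nvidia-smi dmon gives a similar live view). Run it in a separate terminal while the benchmark is going to watch per-GPU utilization, temperature, power, and SM clocks:

import time
import pynvml

# Poll every GPU once per second; stop with Ctrl-C
pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

try:
    while True:
        for i, h in enumerate(handles):
            util = pynvml.nvmlDeviceGetUtilizationRates(h)                   # percent
            temp = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
            power_w = pynvml.nvmlDeviceGetPowerUsage(h) / 1000.0             # mW -> W
            sm_mhz = pynvml.nvmlDeviceGetClockInfo(h, pynvml.NVML_CLOCK_SM)  # MHz
            print(f"GPU{i}: util {util.gpu:3d}%  temp {temp}C  "
                  f"power {power_w:6.1f} W  SM clock {sm_mhz} MHz")
        time.sleep(1)
except KeyboardInterrupt:
    pass
finally:
    pynvml.nvmlShutdown()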