GPU H100/L40S Performance

Hello,

We rented three physical (dedicated) servers:
Server 1 NVIDIA GeForce RTX 4090
Server 2 NVIDIA L40S
Server 3 NVIDIA H100 PCIe

We were expecting to improve performance with higher-end hardware:
with the H100 we should be at the level of 4x RTX 4090, but we don’t see this.
We saw that the L40S and H100 are slower than the RTX 4090, which doesn’t make any sense to us.
Our servers are running Debian 12 with Driver Version: 565.57.01.

My colleague wrote a Python script to compare performance:

  • H100
FP16 Time: 3.6497561931610107
FP32 Time: 3.947096824645996
  • L40S
FP16 Time: 2.104823350906372
FP32 Time: 4.553493022918701
  • RTX4090
FP16 Time: 1.7705624103546143
FP32 Time: 3.454969882965088

Any idea what could be the cause? Is some GPU configuration needed?

Thank you!

This really depends on the type of calculations you are doing. In my experience, the RTX 4090 is the fastest card for “ordinary” calculations. The L40S is the data center version of the 4090 with more memory but a smaller power budget, which reduces performance.

For FP32, the H100 data sheet reports 67 TFLOPS. The TechPowerUp database mentions 82 TFLOPS for the RTX 4090.

The H100 has greater memory bandwidth and better hardware features for matrix multiplication.
Its integer-based DPX instructions (both int and short2) can be faster than an equivalent algorithm using float and half2 on an RTX 4090.

(Your FP16 performance for the H100 seems odd compared to FP32.)

Is this for normal arithmetic or with Tensor Cores?
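
If it is plain FP32, PyTorch only uses the Tensor Cores when TF32 is allowed. A minimal sketch of the relevant switches (assuming a recent PyTorch release; the defaults have changed between versions):

import torch

# Allow FP32 matmuls / convolutions to run as TF32 on the Tensor Cores (Ampere and newer).
# Defaults differ between PyTorch versions, so set them explicitly.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Newer releases also expose a single switch:
# "highest" = strict FP32, "high"/"medium" = TF32 where supported
torch.set_float32_matmul_precision("high")

print("TF32 matmul:", torch.backends.cuda.matmul.allow_tf32)
print("TF32 cuDNN :", torch.backends.cudnn.allow_tf32)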

Thank you for your feedback.
Maybe our test isn’t good. We are looking into this performance issue because we saw the same thing with our home-made app: it is faster on the 4090, or at least we don’t see much (any?) performance difference between the models.

The code we used looks like:

import time
import torch

device = torch.device("cuda")

# Define model
model = torch.nn.Conv2d(3, 16, kernel_size=3).to(device)
data = torch.randn(32, 3, 224, 224).to(device)

# Measure FP16
model = model.half()
data_fp16 = data.half()
start = time.time()
for _ in range(10000):  # Dummy iterations
    _ = model(data_fp16)
torch.cuda.synchronize()  # CUDA launches are asynchronous; wait for the GPU to finish
print("FP16 Time:", time.time() - start)

# Measure FP32
model = model.float()  # Reset to FP32
data_fp32 = data.float()
start = time.time()
for _ in range(10000):
    _ = model(data_fp32)
torch.cuda.synchronize()
print("FP32 Time:", time.time() - start)
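
Maybe a more careful version of this timing would use CUDA events, which measure elapsed time on the GPU itself, plus some warm-up iterations so cuDNN kernel selection doesn’t end up in the measurement. A rough sketch for the FP16 case (assuming a CUDA device):

import torch

device = torch.device("cuda")
model = torch.nn.Conv2d(3, 16, kernel_size=3).to(device).half()
data = torch.randn(32, 3, 224, 224, device=device).half()

# Warm-up so kernel selection / caching is not part of the measurement
for _ in range(100):
    _ = model(data)
torch.cuda.synchronize()

# CUDA events measure elapsed time on the GPU, independent of launch overhead
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(1000):
    _ = model(data)
end.record()
torch.cuda.synchronize()
print("FP16 time (ms):", start.elapsed_time(end))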

I’ll have a look at TensorRT-LLM Benchmarking — tensorrt_llm documentation to get more accurate values.

Sometimes the newest hardware is not fully supported yet in frameworks.
Be sure to use the current PyTorch version.
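
A quick way to check what PyTorch actually sees on each server (just a sanity check, not a benchmark):

import torch

print("PyTorch      :", torch.__version__)
print("CUDA runtime :", torch.version.cuda)
print("cuDNN        :", torch.backends.cudnn.version())
print("GPU          :", torch.cuda.get_device_name(0))
print("Capability   :", torch.cuda.get_device_capability(0))  # e.g. (9, 0) on H100, (8, 9) on Ada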

With Nsight Compute you can check whether the Tensor Cores were used and look for other possible bottlenecks (e.g. PCIe transfer speed).