GPU H100/L40S Performance

Hello,

We rented three physical (dedicated) servers:
Server 1 NVIDIA GeForce RTX 4090
Server 2 NVIDIA L40S
Server 3 NVIDIA H100 PCIe

We were expecting to improve performance with higher-end hardware:
with the H100 we should be at the level of 4x RTX 4090, but we don’t see this.
We saw that the L40S and H100 are slower than the RTX 4090, which doesn’t make any sense to us.
Our servers are running Debian 12 with Driver Version: 565.57.01.

My colleague wrote a Python script to compare performance:

  • H100
FP16 Time: 3.6497561931610107
FP32 Time: 3.947096824645996
  • L40S
FP16 Time: 2.104823350906372
FP32 Time: 4.553493022918701
  • RTX4090
FP16 Time: 1.7705624103546143
FP32 Time: 3.454969882965088

Any idea what could be the cause? Is some GPU configuration needed?

Thank you!

This really depends on the type of calculations you are doing. In my experience, the RTX 4090 is the fastest card for “ordinary” calculations. The L40S is the data center version of the 4090 with more memory but a smaller power budget, which reduces performance.

For FP32, the H100 data sheet reports 67 TFLOPS. The TechPowerUp database mentions 82 TFLOPS for the RTX 4090.

The H100 has greater memory bandwidth and better hardware features for matrix multiplication.
Its integer-based DPX instructions (both int and short2) can be faster than an equivalent algorithm using float and half2 on an RTX 4090.

(Your FP16 performance for the H100 seems odd compared to FP32.)

Is this for normal arithmetic or with Tensor Cores?
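
If it is plain FP32, PyTorch only uses the Tensor Cores when TF32 is allowed. A minimal sketch of the relevant switches (assuming a recent PyTorch release; the defaults have changed between versions):

import torch

# Allow FP32 matmuls / convolutions to run as TF32 on the Tensor Cores (Ampere and newer).
# Defaults differ between PyTorch versions, so set them explicitly.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Newer releases also expose a single switch:
# "highest" = strict FP32, "high"/"medium" = TF32 where supported
torch.set_float32_matmul_precision("high")

print("TF32 matmul:", torch.backends.cuda.matmul.allow_tf32)
print("TF32 cuDNN :", torch.backends.cudnn.allow_tf32)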

Thank you for your feedback.
Maybe our test isn’t good. We are looking into this performance issue because we saw the same thing with our home-made app: it is faster on the 4090, or at least we don’t see much (any?) performance difference between the models.

The code we used looks like:

import time
import torch

device = torch.device("cuda")

# Define model
model = torch.nn.Conv2d(3, 16, kernel_size=3).to(device)
data = torch.randn(32, 3, 224, 224).to(device)

# Measure FP16
model = model.half()
data_fp16 = data.half()
start = time.time()
for _ in range(10000):  # Dummy iterations
    _ = model(data_fp16)
torch.cuda.synchronize()  # CUDA launches are asynchronous; wait for the GPU to finish
print("FP16 Time:", time.time() - start)

# Measure FP32
model = model.float()  # Reset to FP32
data_fp32 = data.float()
start = time.time()
for _ in range(10000):
    _ = model(data_fp32)
torch.cuda.synchronize()
print("FP32 Time:", time.time() - start)
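
Maybe a more careful version of this timing would use CUDA events, which measure elapsed time on the GPU itself, plus some warm-up iterations so cuDNN kernel selection doesn’t end up in the measurement. A rough sketch for the FP16 case (assuming a CUDA device):

import torch

device = torch.device("cuda")
model = torch.nn.Conv2d(3, 16, kernel_size=3).to(device).half()
data = torch.randn(32, 3, 224, 224, device=device).half()

# Warm-up so kernel selection / caching is not part of the measurement
for _ in range(100):
    _ = model(data)
torch.cuda.synchronize()

# CUDA events measure elapsed time on the GPU, independent of launch overhead
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(1000):
    _ = model(data)
end.record()
torch.cuda.synchronize()
print("FP16 time (ms):", start.elapsed_time(end))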

I’ll have a look at TensorRT-LLM Benchmarking — tensorrt_llm documentation to get more accurate values.

Sometimes the newest hardware is not fully supported yet in frameworks.
Be sure to use the current PyTorch version.
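
A quick way to check what PyTorch actually sees on each server (just a sanity check, not a benchmark):

import torch

print("PyTorch      :", torch.__version__)
print("CUDA runtime :", torch.version.cuda)
print("cuDNN        :", torch.backends.cudnn.version())
print("GPU          :", torch.cuda.get_device_name(0))
print("Capability   :", torch.cuda.get_device_capability(0))  # e.g. (9, 0) on H100, (8, 9) on Ada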

With Nsight Compute you can check whether the Tensor Cores were used and look for other possible bottlenecks (e.g. PCIe transfer speed).