We rented three physical (dedicated) servers:
Server 1: NVIDIA GeForce RTX 4090
Server 2: NVIDIA L40S
Server 3: NVIDIA H100 PCIe
We were expecting performance to improve with higher-end hardware: with the H100 we should be at roughly the level of 4x RTX 4090, but we don't see this.
Instead, we saw that the L40S and H100 are slower than the RTX 4090, which doesn't make any sense to us.
Our servers are running Debian 12 with driver version 565.57.01.
My colleague wrote a Python script to compare performance:
This really depends on the type of calculations you are doing. From my experience, the RTX 4090 is the fastest card for "ordinary" calculations. The L40S is the data-center version of the 4090 with more memory but a smaller power budget, which reduces performance.
The H100 has greater memory bandwidth and better hardware features for matrix multiplication.
Integer-based DPX instructions (both int and short2) can be faster than an equivalent algorithm using float and half2 on an RTX 4090.
(Your FP16 performance for the H100 seems odd compared to FP32.)
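For illustration, here is a minimal sketch (not from the original posts; the matrix size and loop count are arbitrary assumptions) of the kind of large FP16 matmul where the H100's tensor cores and memory bandwidth should show an advantage, whereas a small convolution layer will not:

import time
import torch

device = torch.device("cuda")
a = torch.randn(8192, 8192, dtype=torch.float16, device=device)
b = torch.randn(8192, 8192, dtype=torch.float16, device=device)

# Warm up so one-time kernel selection is not included in the timing
for _ in range(10):
    _ = a @ b
torch.cuda.synchronize()

start = time.time()
for _ in range(100):
    _ = a @ b
torch.cuda.synchronize()  # wait for the GPU before reading the clock
elapsed = time.time() - start
print(f"FP16 8192x8192 matmul: {elapsed / 100 * 1e3:.2f} ms/iter")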
Thank you for your feedback.
Maybe our test isn't good. We are looking into the perf issue because we saw the same behaviour with our home-made app: it is faster on the 4090, or at least we don't see much (any?) perf difference between the models.
The code we used looks like this:
import time
import torch

device = torch.device("cuda")

# Define model
model = torch.nn.Conv2d(3, 16, kernel_size=3).to(device)
data = torch.randn(32, 3, 224, 224).to(device)

# Measure FP16
model = model.half()
data_fp16 = data.half()
start = time.time()
for _ in range(10000):  # Dummy iterations
    _ = model(data_fp16)
print("FP16 Time:", time.time() - start)

# Measure FP32
model = model.float()  # Reset to FP32
data_fp32 = data.float()
start = time.time()
for _ in range(10000):
    _ = model(data_fp32)
print("FP32 Time:", time.time() - start)
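One thing that may be skewing the numbers above (a guess, not a confirmed diagnosis): CUDA kernel launches are asynchronous, so time.time() can return before the GPU has actually finished, and the first iterations include cuDNN algorithm selection. Below is a minimal sketch of the same FP16 loop with a warm-up phase and torch.cuda.synchronize(), reusing the model and shapes from the snippet above:

# Minimal sketch of a synchronized timing loop (an assumption about what might be
# skewing the numbers, not a confirmed diagnosis): warm up first, then make the CPU
# wait for the GPU before reading the clock.
import time
import torch

device = torch.device("cuda")
model = torch.nn.Conv2d(3, 16, kernel_size=3).to(device).half()
data = torch.randn(32, 3, 224, 224, device=device).half()

with torch.no_grad():
    # Warm-up: lets cuDNN pick its kernels outside the timed region
    for _ in range(100):
        _ = model(data)
    torch.cuda.synchronize()

    start = time.time()
    for _ in range(10000):
        _ = model(data)
    torch.cuda.synchronize()  # ensure all queued kernels have finished
    print("FP16 Time (synchronized):", time.time() - start)

The FP32 measurement can be synchronized the same way before comparing the cards.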