No speedup on L40s wrt RTX6000 Ada

Hi,
I recently purchased four L40s GPUs and was benchmarking them against two other GPUs I have (RTX4090 and RTX6000 Ada). According to available benchmarks (GPU Benchmarks for Deep Learning | Lambda; Comparing NVIDIA A100 and NVIDIA L40S: Which GPU is Ideal for AI and Graphics-Intensive Workloads?), the L40s should give close to a 1.5x speedup over the RTX4090 and RTX6000 Ada. But across multiple workloads I am observing the contrary, i.e., the L40s is running at roughly 80% of the speed of the RTX6000 Ada. I am not sure if I am missing something in this analysis.

Analysis: Training time for different workloads (in secs) (Dropbox link)

Machine specs of L40s:
Processor: Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz
CPU memory: 1007 GB
No. of GPUs: 4

Machine specs of RTX6000 Ada:
Processor: AMD Ryzen Threadripper PRO 5955WX 16-Cores
CPU memory: 504 GB
No. of GPUs: 3

Machine specs of RTX4090:
Processor: 13th Gen Intel(R) Core™ i9-13900KF
CPU memory: 31.1 GB
No. of GPUs: 1

Repositories used for testing: InstantNGP (GitHub - NVlabs/instant-ngp), 3DGS (GitHub - graphdeco-inria/gaussian-splatting), ResNet on CIFAR-10 (GitHub - kuangliu/pytorch-cifar)
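
For reference, a minimal sketch of the kind of timing loop that can reproduce this sort of training-throughput comparison (a sketch only, not my exact script; the torchvision model, batch size, and iteration counts are placeholders):

```python
# Minimal training-throughput sketch on synthetic CIFAR-sized data
# (placeholders throughout; not the exact benchmark script).
import time
import torch
import torchvision

def bench(batch=256, warmup=10, iters=50):
    model = torchvision.models.resnet18(num_classes=10).cuda()
    opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    loss_fn = torch.nn.CrossEntropyLoss()
    x = torch.randn(batch, 3, 32, 32, device="cuda")
    y = torch.randint(0, 10, (batch,), device="cuda")
    for i in range(warmup + iters):
        if i == warmup:                      # start timing only after warmup
            torch.cuda.synchronize()
            t0 = time.perf_counter()
        opt.zero_grad(set_to_none=True)
        loss_fn(model(x), y).backward()
        opt.step()
    torch.cuda.synchronize()
    dt = time.perf_counter() - t0
    print(f"{torch.cuda.get_device_name()}: {iters * batch / dt:.0f} images/sec")

if __name__ == "__main__":
    bench()
```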

The L40s gets into the ballpark (0.8x to 1.2x) of A100 performance only for specific workloads: in particular, LLM/Transformer training using FP8 (and perhaps other FP8-heavy workloads), compared to the same workload on the A100 using FP16, because FP8 is not available on the A100.
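
For completeness: exercising the FP8 path from PyTorch typically goes through NVIDIA Transformer Engine. A minimal sketch, assuming the transformer_engine package is installed (layer and batch sizes are arbitrary):

```python
# Minimal FP8 sketch via NVIDIA Transformer Engine (assumes the
# transformer_engine package; runs only on FP8-capable GPUs such as Ada/Hopper).
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

fp8_recipe = recipe.DelayedScaling()      # default delayed-scaling FP8 recipe
layer = te.Linear(4096, 4096).cuda()      # FP8-capable replacement for nn.Linear
x = torch.randn(8192, 4096, device="cuda", requires_grad=True)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)                          # GEMM executes on the FP8 tensor cores
y.sum().backward()                        # backward pass also uses FP8 GEMMs
```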

Comparing the datasheets of the RTX6000 Ada, L40s, and A100 on selected specifications:

                                   RTX6000 Ada    L40s    A100 (80GB SXM)
memory bandwidth (GB/s):               960         864        2039
peak FP64 (TFLOPS):                     <1          <1        19.5
peak FP16 Tensor (dense, TFLOPS):       NS         362         312
peak FP8 Tensor (sparse, TFLOPS):     1457        1466          NA

NA - the A100 does not have this compute path
NS - not specified in the datasheet, but we can deduce it would be similar to the L40s
<1 - also not specified, but it is a very low number

You cannot/should not conclude that A100 and L40s have equal performance across diverse workloads. Just to pick obvious examples, A100 will be considerably faster than L40s in any workload where FP64 matters most, or any workload where memory bandwidth matters most.

I don’t know of any reason that the L40s should give a 1.5x speedup compared to the RTX6000 Ada, and I don’t believe there is any support for that idea, benchmarking-wise or architecturally. Something closer to 1:1 would be expected; the cards are similar in various ways. It’s possible that the RTX6000 Ada is slightly faster (e.g., for workloads where memory bandwidth is an important factor): from the table above, the bandwidth ratio is 864/960 ≈ 0.9, which is in the neighborhood of the ~0.8x you measured.
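
If you want to test whether memory bandwidth is the limiter on your workloads, a crude device-to-device copy loop is enough to compare effective bandwidth between the two cards. A minimal sketch (buffer size and iteration count are arbitrary):

```python
# Crude effective-memory-bandwidth check (a sketch; sizes are arbitrary).
import torch

n = 1 << 28                                   # 2**28 float32 values = 1 GiB
src = torch.randn(n, device="cuda")
dst = torch.empty_like(src)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
for _ in range(3):                            # warmup
    dst.copy_(src)
torch.cuda.synchronize()

iters = 20
start.record()
for _ in range(iters):
    dst.copy_(src)                            # one read + one write per element
end.record()
torch.cuda.synchronize()

bytes_moved = iters * 2 * n * src.element_size()
gbps = bytes_moved / 1e9 / (start.elapsed_time(end) / 1e3)
print(f"{torch.cuda.get_device_name()}: ~{gbps:.0f} GB/s effective")
```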

Hi Robert,
Thanks for your response. Looking at the ResNet workload individually (training with single precision, i.e., FP32), this (GPU Benchmarks for Deep Learning | Lambda) shows the A100 being close to 1.5x faster than the RTX6000 Ada, while this (Comparing NVIDIA A100 and NVIDIA L40S: Which GPU is Ideal for AI and Graphics-Intensive Workloads?) shows the A100 and L40s at similar speed (though that comparison is for FP16, I am guessing it is similar for FP32 as well). This was my reason for the doubt.

Would you suggest any other workload I can use to benchmark these GPUs? My primary concern is to make sure I have set up the L40s GPUs properly (in terms of driver and CUDA) and to understand the best speedup possible with the L40s.
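
For context, one sanity check I was considering is timing a large half-precision matmul and comparing the achieved TFLOPS against the datasheet numbers. A rough sketch (matrix size and iteration count are arbitrary):

```python
# Rough FP16 matmul TFLOPS sanity check (a sketch; sizes are arbitrary).
import torch

n = 8192
a = torch.randn(n, n, device="cuda", dtype=torch.half)
b = torch.randn(n, n, device="cuda", dtype=torch.half)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
for _ in range(5):                        # warmup, also triggers cuBLAS autotuning
    a @ b
torch.cuda.synchronize()

iters = 50
start.record()
for _ in range(iters):
    a @ b
end.record()
torch.cuda.synchronize()

secs = start.elapsed_time(end) / 1e3
tflops = iters * 2 * n**3 / secs / 1e12   # a matmul does ~2*n^3 FLOPs
print(f"{torch.cuda.get_device_name()}: ~{tflops:.0f} TFLOPS (FP16 dense)")
```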