No speedup on L40s wrt RTX6000 Ada

Hi,
I recently purchased four L40s GPUs and was benchmarking them against two other GPUs I have (RTX4090 and RTX6000 Ada). According to available benchmarks (GPU Benchmarks for Deep Learning | Lambda; Comparing NVIDIA A100 and NVIDIA L40S: Which GPU is Ideal for AI and Graphics-Intensive Workloads?), the L40s should give close to a 1.5x speedup over the RTX4090 and RTX6000 Ada. But across multiple workloads I am observing the contrary, i.e., the L40s is running at roughly 80% of the speed of the RTX6000 Ada. I am not sure if I am missing something in this analysis.

Analysis: Training time for different workloads (in secs) (Dropbox link)

Machine specs of L40s:
Processor: Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz
CPU memory: 1007 GB
No. of GPUs: 4

Machine specs of RTX6000 Ada:
Processor: AMD Ryzen Threadripper PRO 5955WX 16-Cores
CPU memory: 504 GB
No. of GPUs: 3

Machine specs of RTX4090:
Processor: 13th Gen Intel(R) Core™ i9-13900KF
CPU memory: 31.1 GB
No. of GPUs: 1

Repositories used for testing: InstantNGP (GitHub - NVlabs/instant-ngp), 3DGS (GitHub - graphdeco-inria/gaussian-splatting), ResNet on CIFAR-10 (GitHub - kuangliu/pytorch-cifar)
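
For reference, a minimal sketch of the kind of timing loop that can reproduce this sort of training-throughput comparison (a sketch only, not my exact script; the torchvision model, batch size, and iteration counts are placeholders):

```python
# Minimal training-throughput sketch on synthetic CIFAR-sized data
# (placeholders throughout; not the exact benchmark script).
import time
import torch
import torchvision

def bench(batch=256, warmup=10, iters=50):
    model = torchvision.models.resnet18(num_classes=10).cuda()
    opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    loss_fn = torch.nn.CrossEntropyLoss()
    x = torch.randn(batch, 3, 32, 32, device="cuda")
    y = torch.randint(0, 10, (batch,), device="cuda")
    for i in range(warmup + iters):
        if i == warmup:                      # start timing only after warmup
            torch.cuda.synchronize()
            t0 = time.perf_counter()
        opt.zero_grad(set_to_none=True)
        loss_fn(model(x), y).backward()
        opt.step()
    torch.cuda.synchronize()
    dt = time.perf_counter() - t0
    print(f"{torch.cuda.get_device_name()}: {iters * batch / dt:.0f} images/sec")

if __name__ == "__main__":
    bench()
```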

The L40s gets into the ballpark (0.8x to 1.2x) of A100 performance only for specific workloads: in particular, LLM/Transformer training using FP8 (and perhaps other FP8-heavy workloads), compared to the same workload on the A100 using FP16, because FP8 is not available on the A100.
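
For completeness: exercising the FP8 path from PyTorch typically goes through NVIDIA Transformer Engine. A minimal sketch, assuming the transformer_engine package is installed (layer and batch sizes are arbitrary):

```python
# Minimal FP8 sketch via NVIDIA Transformer Engine (assumes the
# transformer_engine package; runs only on FP8-capable GPUs such as Ada/Hopper).
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

fp8_recipe = recipe.DelayedScaling()      # default delayed-scaling FP8 recipe
layer = te.Linear(4096, 4096).cuda()      # FP8-capable replacement for nn.Linear
x = torch.randn(8192, 4096, device="cuda", requires_grad=True)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)                          # GEMM executes on the FP8 tensor cores
y.sum().backward()                        # backward pass also uses FP8 GEMMs
```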

Comparing the datasheets of the RTX6000 Ada, L40s, and A100 on selected specifications:

                                   RTX6000 Ada    L40s    A100 (80GB SXM)
memory bandwidth (GB/s):               960         864        2039
peak FP64 (TFLOPS):                     <1          <1        19.5
peak FP16 Tensor (dense, TFLOPS):       NS         362         312
peak FP8 Tensor (sparse, TFLOPS):     1457        1466          NA

NA - the A100 does not have this compute path
NS - not specified in the datasheet, but we can deduce it would be similar to the L40s
<1 - also not specified, but it is a very low number

You cannot/should not conclude that A100 and L40s have equal performance across diverse workloads. Just to pick obvious examples, A100 will be considerably faster than L40s in any workload where FP64 matters most, or any workload where memory bandwidth matters most.

I don’t know of any reason that the L40s should give a 1.5x speedup compared to the RTX6000 Ada, and I don’t believe there is any support for that idea, benchmarking-wise or architecturally. Something closer to 1:1 would be expected; the cards are similar in various ways. It’s possible that the RTX6000 Ada is slightly faster (e.g., for workloads where memory bandwidth is an important factor): from the table above, the bandwidth ratio is 864/960 ≈ 0.9, which is in the neighborhood of the ~0.8x you measured.
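
If you want to test whether memory bandwidth is the limiter on your workloads, a crude device-to-device copy loop is enough to compare effective bandwidth between the two cards. A minimal sketch (buffer size and iteration count are arbitrary):

```python
# Crude effective-memory-bandwidth check (a sketch; sizes are arbitrary).
import torch

n = 1 << 28                                   # 2**28 float32 values = 1 GiB
src = torch.randn(n, device="cuda")
dst = torch.empty_like(src)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
for _ in range(3):                            # warmup
    dst.copy_(src)
torch.cuda.synchronize()

iters = 20
start.record()
for _ in range(iters):
    dst.copy_(src)                            # one read + one write per element
end.record()
torch.cuda.synchronize()

bytes_moved = iters * 2 * n * src.element_size()
gbps = bytes_moved / 1e9 / (start.elapsed_time(end) / 1e3)
print(f"{torch.cuda.get_device_name()}: ~{gbps:.0f} GB/s effective")
```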

Hi Robert,
Thanks for your response. Looking at the ResNet workload individually (training with single precision, i.e., FP32), this (GPU Benchmarks for Deep Learning | Lambda) shows the A100 being close to 1.5x faster than the RTX6000 Ada, while this (Comparing NVIDIA A100 and NVIDIA L40S: Which GPU is Ideal for AI and Graphics-Intensive Workloads?) shows the A100 and L40s at similar speed (though that comparison is for FP16, I am guessing it is similar for FP32 as well). This was my reason for the doubt.

Would you suggest any other workload I can use to benchmark these GPUs? My primary concern is to make sure I have set up the L40s GPUs properly (in terms of driver and CUDA) and to understand the best speedup possible with the L40s.
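
For context, one sanity check I was considering is timing a large half-precision matmul and comparing the achieved TFLOPS against the datasheet numbers. A rough sketch (matrix size and iteration count are arbitrary):

```python
# Rough FP16 matmul TFLOPS sanity check (a sketch; sizes are arbitrary).
import torch

n = 8192
a = torch.randn(n, n, device="cuda", dtype=torch.half)
b = torch.randn(n, n, device="cuda", dtype=torch.half)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
for _ in range(5):                        # warmup, also triggers cuBLAS autotuning
    a @ b
torch.cuda.synchronize()

iters = 50
start.record()
for _ in range(iters):
    a @ b
end.record()
torch.cuda.synchronize()

secs = start.elapsed_time(end) / 1e3
tflops = iters * 2 * n**3 / secs / 1e12   # a matmul does ~2*n^3 FLOPs
print(f"{torch.cuda.get_device_name()}: ~{tflops:.0f} TFLOPS (FP16 dense)")
```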