My question here is simple: is the GV100 really 10x faster than the RTX 3090 for floating-point workloads? I find this amazing! I have used Nsight Compute to characterize my code, which is pure computation in FP32 and FP64 with very little memory traffic, using only 36 registers. I see FP utilization at 86% and warp occupancy near 32. That is what I expect for this workload, but it runs more than 10x slower than the same code on a GV100. I'm using CUDA 11.1 on Windows 10 for the RTX 3090 and CUDA 11.0 on Windows 10 for the GV100, and I've seen a similar slowdown on Ubuntu 20.04. The Windows driver is 456.71.
Is that floating-point workload using a significant amount of
double computation by any chance? The double-precision compute throughput of all consumer GPUs is severely restricted compared to high-end professional compute hardware: an RTX 3090 delivers less than 1.0 DP TFLOPS, while a GV100 delivers around 7.5 DP TFLOPS.
Specifically, the RTX 3090's DP throughput is 1/64 of its SP throughput. The different performance levels are reflected in the price: an RTX 3090 is around $1,500, I think, while a GV100-based GPU will set you back $8,000+.
The follow-on to GV100 is the new A100 with 9.7 DP TFLOPS (potentially more if you use Tensor cores). I haven’t seen pricing for individual cards yet, and NVIDIA’s website currently lists no PCIe pluggable GPUs based on it.
Yes, it is mixed FP64/FP32. I finally compared the benchmarks, and indeed FP64 on the RTX 3090 is roughly 10x slower. Thanks for pointing it out.