GV100 10x FP performance over RTX 3090

My question here is simple: is the GV100 really 10x faster than the RTX 3090 for floating-point workloads? I find this amazing! I have used Nsight Compute to characterize my code, which is pure computation in FP32 and FP64 with very little memory traffic and uses only 36 registers. I see FP utilization at 86% and warp occupancy near 32. This is what I expect for this workload, but it runs more than 10x slower than the same code on the GV100. I'm using CUDA 11.1 on Windows 10 for the RTX 3090 and CUDA 11.0 on Windows 10 for the GV100, with Windows driver 456.71. I've seen a similar slowdown on Ubuntu 20.04.

Is that floating-point workload using a significant amount of double-precision computation by any chance? The double-precision compute throughput of all consumer GPUs is severely restricted compared to high-end professional compute hardware: an RTX 3090 delivers less than 1.0 DP TFLOPS, while a GV100 delivers around 7.5 DP TFLOPS.

Specifically, the RTX 3090's DP throughput is 1/64 of its SP throughput. The different performance levels are reflected in the price: an RTX 3090 is around $1,500, I think, while a GV100-based GPU will set you back $8,000+.
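If you want to see the gap on your own hardware rather than take the spec-sheet numbers on faith, a rough sketch like the one below (not code from this thread; the kernel name, iteration count, and launch dimensions are arbitrary choices) times the same FMA-heavy loop once in float and once in double. On a GA102 card the double run should take dozens of times longer; on GV100-class hardware the two should be much closer.

```cpp
// Toy FP32-vs-FP64 throughput comparison. Error checking omitted for brevity.
#include <cstdio>
#include <cuda_runtime.h>

template <typename T>
__global__ void fma_loop(T *out, T seed, int iters)
{
    T x = seed + static_cast<T>(threadIdx.x);
    T y = seed;
    // Long dependent FMA chain keeps the kernel compute-bound.
    for (int i = 0; i < iters; ++i) {
        x = x * y + y;
        y = y * x + x;
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = x + y;
}

template <typename T>
float time_kernel(const char *label, int iters)
{
    const int blocks = 1024, threads = 256;
    T *out;
    cudaMalloc(&out, blocks * threads * sizeof(T));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch, then a timed launch.
    fma_loop<T><<<blocks, threads>>>(out, static_cast<T>(0.999), iters);
    cudaEventRecord(start);
    fma_loop<T><<<blocks, threads>>>(out, static_cast<T>(0.999), iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%s: %.2f ms\n", label, ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(out);
    return ms;
}

int main()
{
    const int iters = 1 << 16;
    float ms32 = time_kernel<float>("FP32 FMA loop", iters);
    float ms64 = time_kernel<double>("FP64 FMA loop", iters);
    printf("FP64 / FP32 time ratio: %.1fx\n", ms64 / ms32);
    return 0;
}
```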

The follow-on to GV100 is the new A100 with 9.7 DP TFLOPS (potentially more if you use Tensor cores). I haven’t seen pricing for individual cards yet, and NVIDIA’s website currently lists no PCIe pluggable GPUs based on it.

Yes, it is mixed 64/32-bit floating point. I finally compared the benchmarks, and indeed 64-bit FP on the RTX 3090 is roughly 10x slower. Thanks for pointing it out.
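For what it's worth, and purely as a generic illustration rather than anything taken from the code in this thread, the usual way FP64 sneaks into nominally single-precision CUDA code is through unsuffixed literals and double-precision math functions:

```cpp
__global__ void scale(float *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // 0.5 is a double literal and sqrt() here resolves to the double
        // overload, so this expression is promoted to FP64 and runs on the
        // (slow on consumer cards) double-precision units.
        v[i] = sqrt(v[i] * 0.5);

        // Suffixing the literal with 'f' and using the single-precision
        // math function keeps the whole expression in FP32.
        v[i] = sqrtf(v[i] * 0.5f);
    }
}
```

If the double-precision results are not actually needed, keeping literals suffixed and using the single-precision math functions lets a consumer card run at its full FP32 rate.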