2080ti vs Titan V

LukeCuda · October 16, 2018, 7:15pm

Does anyone know why RTX is not using Tensor cores properly? See the "Compute & Synthetics" page of the AnandTech review, The NVIDIA GeForce RTX 2070 Founders Edition Review: Mid-Range Turing, High-End Price.

HGEMM performance 2080ti: 48K
HGEMM performance Titan V: 97K

2080ti should be similar or a bit slower than a Titan V given the number of tensor cores.

Is it drivers? It's actually nearly exactly half, which is suspicious! Did Nvidia disable them quietly? People should know if that's the case.

tera · October 16, 2018, 8:22pm

Has Nvidia specified anywhere that Turing tensor cores have the same throughput as Volta ones?
I wouldn’t be surprised if Nvidia spent a bit less silicon on tensor cores in consumer graphics cards than in specialised (AI) compute cards - that would make a lot of sense actually.

LukeCuda · October 16, 2018, 8:37pm

What evidence do you have one way or the other? Why should we have to speculate when Nvidia could just be transparent and document it SOMEWHERE? Since they have not, at face value a Turing tensor core should be comparable to a Volta tensor core, given the same feature name.

LukeCuda · October 16, 2018, 8:37pm

.

njuffa · October 16, 2018, 10:17pm

Historically, NVIDIA has been secretive with regard to details of their GPUs’ microarchitecture. I see nothing that would incentivize them to be more transparent at this time.

In practical terms, it would be best to file a performance bug, as it is possible that the software simply has not been sufficiently optimized for the new architecture. Experience indicates that NVIDIA operates the compute business driven by customer demand. So the more bugs are filed for a particular performance issue, the more likely a fix will materialize.

NVIDIA’s business is selling hardware; providing lots of performance software is just a means to that end. If new expensive parts lack application level performance, it will be in NVIDIA’s best interest to address the underlying issues so hardware sales remain brisk.

For what it’s worth, at least one review has made similar observations:

At reference specifications, peak theoretical tensor throughput is around 107.6 TFLOPS for the RTX 2080 Ti, 80.5 TFLOPS for the RTX 2080, and 59.7 TFLOPS for the RTX 2070. Unlike the 89% efficiency with the Titan V’s 97.5 TFLOPS, the RTX cards are essentially at half that level, with around 47%, 48%, and 45% efficiency for the RTX 2080 Ti, 2080, and 2070 respectively. A Turing-optimized binary should bring that up, though it is possible that the GeForce RTX cards may not be designed for efficient tensor FP16 operations as opposed to the INT dot-product acceleration. After all, the GeForce RTX cards are for consumers and ostensibly intended for inferencing rather than training, which is the reasoning for the new INT support in Turing tensor cores.
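To put the quoted percentages side by side, here is a minimal sketch that converts them into implied achieved throughput (peak times efficiency). The numbers are the ones quoted from the review; the names `REVIEW_FIGURES` and `achieved_tflops` are made up for illustration.

```python
# (peak tensor TFLOPS, measured efficiency) as quoted from the review above.
REVIEW_FIGURES = {
    "Titan V": (97.5, 0.89),
    "RTX 2080 Ti": (107.6, 0.47),
    "RTX 2080": (80.5, 0.48),
    "RTX 2070": (59.7, 0.45),
}


def achieved_tflops(peak_tflops, efficiency):
    """Implied achieved throughput: theoretical peak scaled by efficiency."""
    return peak_tflops * efficiency


for card, (peak, eff) in REVIEW_FIGURES.items():
    print(f"{card}: about {achieved_tflops(peak, eff):.1f} TFLOPS achieved")
```

On these figures the RTX 2080 Ti lands at roughly 50.6 TFLOPS against roughly 86.8 TFLOPS for the Titan V, even though its quoted peak is higher.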

LukeCuda · October 17, 2018, 10:30am

you can benchmark CUDA 10 cublasTensorCore examples, and 2080ti is half the Titan V.

so Nvidia has really screwed up.

they should be called TenCores not TensorCores because they are half missing.

cbuchner1 · October 17, 2018, 10:46am

@LukeCuda, you missed the most obvious pun. Call them “Sores” for what they are.

As far as I know the AnandTech benchmarks have so far been made with code built for sm_70 (Volta). Can anyone confirm the same bad performance with sm_75 optimized code?

SPWorley · October 17, 2018, 11:37am

The throughput of Turing’s tensor cores is described much more explicitly than most other GPU architecture throughput details usually are.

tera · October 17, 2018, 7:36pm

Table 4 on page 59 of the Turing GPU whitepaper specifies the tensor core peak FP16 throughput of the RTX 2080 as 80.5 TFLOPS with FP16 accumulate or 40.2 TFLOPS with FP32 accumulate.
If I remember correctly, Volta tensor core performance numbers were always given for accumulation in FP32. So it seems like you want to look out for benchmarks run with sm_75 code.

LukeCuda · October 17, 2018, 7:43pm

NVIDIA has advertised FP16 at 113.8 TFLOPS, which is comparable to a Titan V. That's all that matters to make the 2080 Ti as fast as a Titan V when doing HGEMM.

Above, tera said the Turing Tesla T4 is advertised at 60 TFLOPS, which would match what a 2080 Ti is currently doing. But then Nvidia would have advertised 60 TFLOPS, not 113.8 TFLOPS.

Could the CUDA drivers be misidentifying the 2080 Ti as a T4?

or did Nvidia do a swifty and base 2080ti off an inference chip, and not tell anyone?!!

tera · October 17, 2018, 7:58pm

Apologies, I edited my post after you cited it, as I noted the more relevant RTX 2080 specs in the whitepaper.
And now I note the even more relevant RTX 2080 Ti specs in table 1 on page 9: 107.6 TFLOPS with FP16 accumulate and 53.8 TFLOPS accumulating in FP32. So you really want benchmarks for sm_75.
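As a rough cross-check, the review's efficiency figures quoted earlier in the thread can be recomputed against the FP32-accumulate peaks instead of the FP16-accumulate ones. This sketch assumes the benchmark in question accumulated in FP32, which nothing in the thread confirms outright; `WHITEPAPER_PEAKS`, `REVIEW_EFFICIENCY` and the function name are illustrative only.

```python
# (FP16-accumulate peak, FP32-accumulate peak) in TFLOPS, from the whitepaper
# tables cited in this thread.
WHITEPAPER_PEAKS = {
    "RTX 2080 Ti": (107.6, 53.8),
    "RTX 2080": (80.5, 40.2),
}

# Efficiency relative to the FP16-accumulate peak, from the review quoted earlier.
REVIEW_EFFICIENCY = {"RTX 2080 Ti": 0.47, "RTX 2080": 0.48}


def efficiency_vs_fp32_accumulate(card):
    """Re-express the review's efficiency against the FP32-accumulate peak.

    Assumption: the measured run accumulated in FP32.
    """
    fp16_peak, fp32_peak = WHITEPAPER_PEAKS[card]
    achieved = fp16_peak * REVIEW_EFFICIENCY[card]
    return achieved / fp32_peak


for card in WHITEPAPER_PEAKS:
    print(f"{card}: {efficiency_vs_fp32_accumulate(card):.0%} of FP32-accumulate peak")
```

Under that assumption both cards come out at roughly 94% to 96% of the FP32-accumulate peak, so the "half speed" result would be close to what the whitepaper specifies rather than a shortfall.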

njuffa · October 17, 2018, 8:00pm

I don’t see how idle speculation provides benefits to anyone (other than helping pass the time for retired folks like me :-)

From long experience I can say that marketing people will latch on to the highest number they see. That is usually some theoretical throughput number, or some “up to” peak performance number, neither of which is sustained in real-life scenarios. That doesn’t mean these numbers are wrong, just not useful for practical decision making. Decisions are best based on benchmarking one’s actual use case(s).

I repeat: The best course of action for perceived CUDA-related performance shortfalls is to notify NVIDIA in the form of bug reports (after performing due diligence), accompanied by sufficient amounts of supporting data. This course of action does not guarantee positive change, but it gives the best odds of such change.

Huffing and puffing and jumping up and down in forums (these or others) is unlikely to have any effect.
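For anyone collecting supporting data from their own GEMM runs, a minimal sketch of turning a measured time into a throughput figure, using the usual count of 2·M·N·K floating-point operations per matrix multiply. The function name and the timing in the usage line are hypothetical, not measurements from this thread.

```python
def gemm_tflops(m, n, k, seconds):
    """Throughput in TFLOPS of one (m x k) @ (k x n) multiply taking `seconds`."""
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    return flops / seconds / 1e12


# Hypothetical example: an 8192 x 8192 x 8192 HGEMM timed at 20 ms.
print(f"{gemm_tflops(8192, 8192, 8192, 0.020):.1f} TFLOPS")
```

Comparing that number against both the FP16-accumulate and FP32-accumulate peaks for the card shows which spec the run is actually approaching.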

LukeCuda · October 25, 2018, 12:21am

There is some discussion that a 2080Ti RTX Tensor Core is not the same as a Quadro RTX Tensor Core, and that is why 2080Ti is not performing as advertised in CUDA.

Anyone have information in this regard?

cbuchner1 · October 25, 2018, 9:13am

@LukeCuda: I doubt this as I think the Quadro and RTX line are based off the same die.

(EDIT: notable exceptions being the most expensive Quadro GP100/GV100 models with HBM2 memory which are using the P100 and V100 chips)

LukeCuda · October 25, 2018, 9:53am

I was trying very hard to find out why 2080Ti tensor cores were half as fast as Titan V tensor cores.

The reason is that they can only do FP32 accumulate at half speed. Titan V tensor cores, and in fact Quadro RTX tensor cores(!!), run it at full speed.

So they did gimp the tensor cores for the consumer models of RTX.

tera · October 25, 2018, 10:13am

If you check the reference in posts #9 / #11 above, you will find this is documented behaviour.

LukeCuda · October 25, 2018, 10:22am

Yes, you are absolutely correct. I did not see that earlier. I think this is the end of the mystery. It is documented, so I shouldn’t be too hard on Nvidia.
