Has Nvidia specified anywhere that Turing tensor cores have the same throughput as Volta ones?
I wouldn’t be surprised if Nvidia spent a bit less silicon on tensor cores in consumer graphics cards than in specialised (AI) compute cards; that would actually make a lot of sense.
What evidence do you have one way or the other? Why should we have to speculate when Nvidia could just be transparent and document it SOMEWHERE? Since they have not, at face value a Turing tensor core should be comparable to a Volta tensor core, given the same feature name.
Historically, NVIDIA has been secretive with regard to details of their GPUs’ microarchitecture. I see nothing that would incentivize them to be more transparent at this time.
In practical terms, it would be best to file a performance bug, as it is possible that the software simply has not been sufficiently optimized for the new architecture. Experience indicates that NVIDIA operates the compute business driven by customer demand. So the more bugs are filed for a particular performance issue, the more likely a fix will materialize.
NVIDIA’s business is selling hardware; providing lots of performance software is just a means to that end. If new expensive parts lack application level performance, it will be in NVIDIA’s best interest to address the underlying issues so hardware sales remain brisk.
For what it’s worth, at least one review has made similar observations:
At reference specifications, peak theoretical tensor throughput is around 107.6 TFLOPS for the RTX 2080 Ti, 80.5 TFLOPS for the RTX 2080, and 59.7 TFLOPS for the RTX 2070. Unlike the 89% efficiency with the Titan V’s 97.5 TFLOPS, the RTX cards are essentially at half that level, with around 47%, 48%, and 45% efficiency for the RTX 2080 Ti, 2080, and 2070 respectively. A Turing-optimized binary should bring that up, though it is possible that the GeForce RTX cards may not be designed for efficient tensor FP16 operations as opposed to the INT dot-product acceleration. After all, the GeForce RTX cards are for consumers and ostensibly intended for inferencing rather than training, which is the reasoning for the new INT support in Turing tensor cores.
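For what it’s worth, the quoted peak numbers can be reproduced by hand. A minimal sketch, assuming the usual formula (tensor cores × 64 FMA/clock/core × 2 FLOPs per FMA × boost clock); the tensor core counts and reference boost clocks below are assumptions taken from public spec sheets, not from this thread:

```python
# Peak FP16 tensor throughput = tensor cores * 64 FMA/clock * 2 ops * boost clock.
# Per-card figures are assumed reference specs (8 tensor cores per SM).
cards = {
    "RTX 2080 Ti": (544, 1.545e9),  # 68 SMs * 8, 1545 MHz reference boost
    "RTX 2080":    (368, 1.710e9),  # 46 SMs * 8, 1710 MHz reference boost
    "RTX 2070":    (288, 1.620e9),  # 36 SMs * 8, 1620 MHz reference boost
}
for name, (tensor_cores, clock_hz) in cards.items():
    tflops = tensor_cores * 64 * 2 * clock_hz / 1e12
    print(f"{name}: {tflops:.1f} TFLOPS peak")
# RTX 2080 Ti: 107.6, RTX 2080: 80.5, RTX 2070: 59.7 -- matching the review.
```

If these assumed clocks and core counts are right, the ~45–48% efficiency figures simply mean the measured numbers were roughly half of these theoretical peaks.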
Table 4 on page 59 of the Turing GPU whitepaper specifies the tensor core peak FP16 throughput of the RTX 2080 as 80.5 TFLOPS with FP16 accumulate or 40.2 TFLOPS with FP32 accumulate.
If I remember correctly, Volta tensor core performance numbers were always given for accumulation in FP32. So it seems like you want to look out for benchmarks run with sm_75 code.
Apologies, I edited my post after you cited it, as I had noted the more relevant RTX 2080 specs in the whitepaper.
And now I note the even more relevant RTX 2080 Ti specs in table 1 on page 9: 107.6 TFLOPS with FP16 accumulate and 53.8 TFLOPS accumulating in FP32. So you really want benchmarks for sm_75.
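To be concrete, “benchmarks for sm_75” just means binaries built with Turing-native code generation rather than Volta (sm_70) code run through JIT. An illustrative nvcc invocation; the file names are placeholders, not from this thread:

```shell
# Generate Turing-native (sm_75) SASS, plus PTX for forward compatibility.
# gemm_bench.cu is a placeholder source file name.
nvcc -gencode arch=compute_75,code=sm_75 \
     -gencode arch=compute_75,code=compute_75 \
     -o gemm_bench gemm_bench.cu
```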
I don’t see how idle speculation benefits anyone (other than helping pass the time for retired folks like me :-)
From long experience I can say that marketing people will latch on to the highest number they see. That is usually some theoretical throughput number, or some “up to” peak performance number, neither of which is sustained in real-life scenarios. That doesn’t mean these numbers are wrong, just that they are not useful for practical decision making. Decisions are best based on benchmarking one’s actual use case(s).
I repeat: The best course of action for perceived CUDA-related performance shortfalls is to notify NVIDIA in the form of bug reports (after performing due diligence), accompanied by sufficient amounts of supporting data. This course of action does not guarantee positive change, but it gives the best odds of such change.
Huffing and puffing and jumping up and down in forums (these or others) is unlikely to have any effect.