Tesla V100 is slower than RTX 2080 Ti


I tested the performance of a V100 and a 2080 Ti using TensorRT and PyCUDA. The models tested were ResNet50 and Inception_v1.
However, in my tests the V100was slower than the 2080 Ti, whereas many references report that the V100 always has higher throughput than the 2080 Ti.
I suspect that I am not using TensorRT and CUDA properly. How can I use them correctly?

If you want any further information like my code or frozen graph, please let me know.


How big is the performance difference? By raw specs, the V100 and the RTX 2080 Ti would appear to offer roughly equal performance for non-double-precision computation. An exact comparison is difficult because it is not known what kind of clock boost is applied on a specific GPU for a given workload and specific operating conditions.
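Since the applied clock boost is the unknown here, one way to observe it is to sample the clocks while the benchmark is running. The following is only a sketch: it assumes `nvidia-smi` is on the PATH and uses its `--query-gpu` fields `name`, `clocks.sm`, `clocks.max.sm`, `temperature.gpu` and `power.draw`; the helper names are made up for illustration.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi; current vs. maximum SM clock shows
# how much boost is actually applied under load.
QUERY_FIELDS = ["name", "clocks.sm", "clocks.max.sm", "temperature.gpu", "power.draw"]


def build_command(fields=QUERY_FIELDS):
    """Return the nvidia-smi command line as an argument list."""
    return [
        "nvidia-smi",
        "--query-gpu=" + ",".join(fields),
        "--format=csv,noheader,nounits",
    ]


def parse_output(text, fields=QUERY_FIELDS):
    """Turn nvidia-smi CSV output into one dict per GPU."""
    rows = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if row:  # skip blank lines
            rows.append(dict(zip(fields, (value.strip() for value in row))))
    return rows


def query_gpus():
    """Run nvidia-smi once and return the parsed rows (needs an NVIDIA driver)."""
    result = subprocess.run(build_command(), capture_output=True, text=True, check=True)
    return parse_output(result.stdout)
```

Calling `query_gpus()` every second or so from a separate terminal while inference is running should show whether either card is sitting well below its maximum SM clock, for example because of power or thermal limits.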

I checked latency and throughput in my application and got the results below:

  1. V100
  • (INT8 / Batch Size=1)
    Latency: 0.68 ms / Throughput: 1442.46 fps
  • (INT8 / Batch Size=128)
    Latency: 16.15 ms / Throughput: 7909.67 fps
  2. 2080 Ti
  • (INT8 / Batch Size=1)
    Latency: 0.52 ms / Throughput: 1980.23 fps
  • (INT8 / Batch Size=128)
    Latency: 13.69 ms / Throughput: 9390.88 fps
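For reference, latency and throughput figures of this kind are usually obtained with a loop like the sketch below. This is not the code used for the numbers above, just a minimal illustration: `run_batch` stands for whatever launches one inference on a batch, and `sync` for a call that blocks until the GPU has finished (with PyCUDA this could be a stream's `synchronize` method). Both names are placeholders.

```python
import statistics
import time


def benchmark(run_batch, batch_size, warmup=10, iters=50, sync=None):
    """Time run_batch() and report mean latency (ms) and throughput (fps).

    run_batch: hypothetical callable that executes one batch of inference.
    sync:      optional callable that waits for the GPU to finish; without
               it, asynchronous launches would be timed instead of the work.
    """
    # Warm-up iterations are excluded so that one-time costs (lazy
    # initialization, clock ramp-up) do not distort the measurement.
    for _ in range(warmup):
        run_batch()
    if sync is not None:
        sync()

    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        run_batch()
        if sync is not None:
            sync()
        samples.append(time.perf_counter() - start)

    mean_seconds = statistics.mean(samples)
    return {
        "latency_ms": mean_seconds * 1e3,
        "throughput_fps": batch_size / mean_seconds,
    }
```

Skipping the warm-up or the synchronization, or including host-to-device copies in one measurement but not the other, can easily shift results by more than the gap between the two cards.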

The model is Inception_v1.
I read the document “NVIDIA AI INFERENCE PLATFORM” (https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/tesla-product-literature/t4-inference-print-update-inference-tech-overview-final.pdf).
In this document, the Tesla V100 achieves 11,280 fps at batch size 128. How can I achieve this performance?
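To put the numbers in this thread side by side, here is a small worked calculation. It only uses the figures posted above plus the 11,280 fps quoted from the NVIDIA document, and assumes throughput is simply batch size divided by latency.

```python
# Measured values from the posts above: (latency in ms, reported fps).
measured = {
    ("V100", 1): (0.68, 1442.46),
    ("V100", 128): (16.15, 7909.67),
    ("2080 Ti", 1): (0.52, 1980.23),
    ("2080 Ti", 128): (13.69, 9390.88),
}
PUBLISHED_V100_FPS = 11280.0  # batch size 128, from the NVIDIA document


def implied_fps(batch_size, latency_ms):
    """Throughput implied by a latency figure: items per second."""
    return batch_size / (latency_ms / 1000.0)


def ratio(a, b):
    """Return a / b, used to compare two throughput figures."""
    return a / b


for (gpu, batch), (latency_ms, fps) in measured.items():
    print(f"{gpu:8s} batch={batch:3d} reported={fps:9.2f} fps "
          f"implied={implied_fps(batch, latency_ms):9.2f} fps")

v100_fps = measured[("V100", 128)][1]
ti_fps = measured[("2080 Ti", 128)][1]
print(f"V100 / 2080 Ti at batch 128: {ratio(v100_fps, ti_fps):.3f}")
print(f"V100 / published figure:     {ratio(v100_fps, PUBLISHED_V100_FPS):.3f}")
```

The reported throughputs agree with the latencies to within a couple of percent, so the measurements are at least internally consistent. At batch size 128 the V100 result is about 84% of the 2080 Ti result and about 70% of the published 11,280 fps.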

The numbers will be useful for anybody who has relevant experience with these benchmarks.

I have never run these benchmarks. My first thought would be that you are using a different hardware and / or software configuration than what was used to generate the benchmark report. Have you looked into this aspect?

Very generally speaking, a high-frequency CPU (I would suggest >= 3.5 GHz base frequency), large amounts of low-latency high-bandwidth DDR4 system memory, and NVMe solid-state storage should help machine-learning application performance. But I have zero insight as to whether they have any impact on these particular benchmarks and if so, how much.

Thank you, njuffa.

Your reply helps me a lot.
I’ll try to find additional information.


Have you considered turning off the ECC mode for the V100 memory? This might result in a slight speed boost.
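A sketch of how the ECC state could be checked and changed from Python follows. It assumes `nvidia-smi` is available and relies on its `ecc.mode.current` / `ecc.mode.pending` query fields and the `-e 0` option; the helper names are invented for this example. Changing the ECC mode normally needs administrator rights and only takes effect after a reboot or GPU reset.

```python
import subprocess


def ecc_query_command():
    """Command that reports the current and pending ECC mode per GPU."""
    return [
        "nvidia-smi",
        "--query-gpu=name,ecc.mode.current,ecc.mode.pending",
        "--format=csv,noheader",
    ]


def ecc_disable_command(gpu_index):
    """Command that requests ECC off for one GPU (applies after reboot/reset)."""
    return ["nvidia-smi", "-i", str(gpu_index), "-e", "0"]


def ecc_is_enabled(csv_line):
    """Interpret one line of ecc_query_command() output.

    GPUs without ECC support report something other than "Enabled"
    (typically "[N/A]"), which is treated as not enabled here.
    """
    fields = [field.strip() for field in csv_line.split(",")]
    return len(fields) >= 2 and fields[1] == "Enabled"


def report_ecc():
    """Run the query and print the ECC state of each GPU (needs a driver)."""
    result = subprocess.run(ecc_query_command(), capture_output=True, text=True, check=True)
    for line in result.stdout.splitlines():
        if line.strip():
            print(line.strip(), "->", "ECC on" if ecc_is_enabled(line) else "ECC off or unsupported")
```

If `report_ecc()` shows ECC enabled on the V100, running the command from `ecc_disable_command(0)` and repeating the benchmark after a reboot would show how much of the gap, if any, ECC accounts for.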