How can I get 65 TFLOPS performance with NVIDIA T4?


Hi, the NVIDIA T4 datasheet states that mixed precision can reach 65 TFLOPS. I have run YOLOv3 on a P100 and a T4, and both run at almost the same speed. How do I get the 65 TFLOPS performance mentioned in the datasheet?

T4 Datasheet link:

P100 - ~160fps (FP16)
T4 - ~170 fps (FP16)

Also, regarding network speed at different precisions, my assumption was “single > mixed > double”.

How can I reproduce the 65 TFLOPS performance on the NVIDIA T4?


Any information regarding this issue?


Hi @surya.22091994,
TFLOPS and fps are two different metrics. Could you please elaborate on the test run and on how you are calculating TFLOPS from fps?
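
As a rough illustration of how the two metrics relate: fps only turns into TFLOPS once you know how many floating-point operations one frame costs. The sketch below uses the fps values reported above and a placeholder cost of about 65.86 billion operations per frame, a figure commonly quoted for YOLOv3 at 416x416. That per-frame cost is an assumption, since the thread does not state the input size or the exact model variant, so substitute your own number. The helper name `achieved_tflops` is made up for this example.

```python
def achieved_tflops(fps, flops_per_frame):
    """Convert a measured frame rate into sustained TFLOPS."""
    return fps * flops_per_frame / 1e12

# ASSUMPTION: per-frame cost of the network. Not given in the thread;
# replace with the value for your model and input resolution.
ASSUMED_FLOPS_PER_FRAME = 65.86e9

# Frame rates reported in the original post (FP16).
p100_fps = 160.0
t4_fps = 170.0

p100_achieved = achieved_tflops(p100_fps, ASSUMED_FLOPS_PER_FRAME)
t4_achieved = achieved_tflops(t4_fps, ASSUMED_FLOPS_PER_FRAME)

# Datasheet peak for T4 mixed precision, as quoted in the original post.
T4_PEAK_TFLOPS = 65.0
t4_fraction_of_peak = t4_achieved / T4_PEAK_TFLOPS

print(f"P100 sustained: {p100_achieved:.1f} TFLOPS")
print(f"T4 sustained:   {t4_achieved:.1f} TFLOPS")
print(f"T4 fraction of datasheet peak: {t4_fraction_of_peak:.0%}")
```

Under that assumed per-frame cost, the sustained rate on the T4 would be well below the datasheet peak, which is a theoretical maximum rather than a figure any particular network is guaranteed to reach.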


Hi @AakankshaS

Thanks for the reply. Here is how I estimated the performance of YOLO on the T4:


  1. I am running an FP16 model.
  2. Half-precision (FP16) performance is greater than mixed-precision (FP16+FP32) performance.
  3. TFLOPS is directly proportional to fps (unless there are other bottlenecks; please mention anything else you think might be a bottleneck).
  4. All the fps values mentioned are at 100% GPU utilization.
  5. We might not be utilizing the Tensor Cores of the T4. (Could you provide some documentation on how to leverage Tensor Core computation?)
  6. The T4 datasheet lists mixed-precision (FP16+FP32) performance as 65 TFLOPS.
  7. P100 half-precision performance is listed as 18.7 TFLOPS.

Speed calculation:
On a GPU that provides 18.7 TFLOPS, YOLO runs at 160 fps with 100% GPU utilization.
So on a GPU that provides 65 TFLOPS, YOLO should run at about 555 fps with 100% GPU utilization (with no bottlenecks).
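
The estimate above is a plain linear scaling of fps by the ratio of peak TFLOPS. A minimal sketch of that arithmetic, using only the numbers from this thread, is shown below. The function name `scaled_fps` is made up for this example, and the proportionality itself is the assumption from point 3, not an established rule. The exact result comes out near 556, which I rounded to 555.

```python
def scaled_fps(baseline_fps, baseline_tflops, target_tflops):
    """Scale a measured frame rate linearly with peak TFLOPS.

    ASSUMPTION: fps is directly proportional to peak TFLOPS, i.e. the
    workload is purely compute-bound and reaches the same fraction of
    peak on both GPUs.
    """
    return baseline_fps * target_tflops / baseline_tflops

# Values quoted in this thread.
p100_fps = 160.0      # measured on P100 (FP16)
p100_tflops = 18.7    # P100 half-precision peak
t4_tflops = 65.0      # T4 mixed-precision peak from the datasheet
t4_measured_fps = 170.0

t4_expected_fps = scaled_fps(p100_fps, p100_tflops, t4_tflops)
ratio = t4_measured_fps / t4_expected_fps

print(f"Expected T4 fps under linear scaling: {t4_expected_fps:.0f}")
print(f"Measured / expected: {ratio:.0%}")
```

The measured 170 fps is roughly a third of the linearly scaled figure, which is the gap I am asking about.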

So I was asking how I can get 555 fps on the T4. Could you please shed some light on where I went wrong, in case it is wrong to expect 555 fps on the T4?

Hi @surya.22091994 ,
Apologies for the delayed response.
Are you still facing the issue?