Hi, the NVIDIA T4 datasheet shows that mixed precision can achieve 65 TFLOPS. I have run YOLOv3 on a P100 and a T4, and both run at almost the same speed. How do I get the 65 TFLOPS performance mentioned in the datasheet?

T4 Datasheet link:

P100 - ~160 fps (FP16)
T4 - ~170 fps (FP16)

Also, regarding the speed of a network at different precisions, my assumption was “Single > Mixed > Double”.

TFLOPS and fps are two different metrics. Can you please elaborate on the test run and on how you are calculating TFLOPS from fps?


Thanks for the reply. The following is how I estimated the performance of YOLO on the T4:


  1. I am running an FP16 model
  2. Half-precision (FP16) performance is greater than mixed-precision (FP16+FP32) performance
  3. TFLOPS is directly proportional to fps (unless there are other bottlenecks; please mention anything else you think might be a bottleneck)
  4. All the fps values mentioned are with 100% GPU utilization
  5. We might not be utilizing the Tensor Cores of the T4 (could you provide some documentation on how to leverage Tensor Core computation?)
  6. The T4 datasheet lists mixed-precision (FP16+FP32) performance as 65 TFLOPS
  7. P100 half-precision performance is listed here as 18.7 TFLOPS

Speed calculation:
On a GPU that can provide 18.7 TFLOPS, YOLO runs at 160 fps with 100% GPU utilization.
So on a GPU that can provide 65 TFLOPS, YOLO should run at roughly 555 fps with 100% GPU utilization (with no bottlenecks).
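
The same estimate can be written as a short Python sketch. It only restates the arithmetic above and rests on the assumption that fps scales linearly with the datasheet's peak TFLOPS, which may not hold in practice. The function name `scaled_fps` is made up for illustration; the numbers are the ones quoted in this thread.

```python
def scaled_fps(measured_fps, measured_tflops, target_tflops):
    """Scale a measured frame rate linearly with peak TFLOPS.

    Assumption: throughput is limited only by peak arithmetic rate,
    with no memory, I/O, or kernel-launch bottlenecks.
    """
    return measured_fps * target_tflops / measured_tflops


# Figures quoted in this thread
p100_fps = 160.0       # measured on P100, FP16
p100_tflops = 18.7     # P100 half-precision peak
t4_tflops = 65.0       # T4 mixed-precision peak from the datasheet
observed_t4_fps = 170.0

expected_t4_fps = scaled_fps(p100_fps, p100_tflops, t4_tflops)

print(f"expected T4 fps under linear scaling: {expected_t4_fps:.0f}")
print(f"observed / expected: {observed_t4_fps / expected_t4_fps:.0%}")
```

Under this assumption the expected figure comes out at about 556 fps, so the observed 170 fps is roughly 30% of what linear scaling would predict.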

So I was asking how I can get 555 fps on the T4. If it is wrong to expect 555 fps on the T4, could you please shed some light on where my reasoning went wrong?

