How can I get 65Tflops performance with NVIDIA T4

surya.22091994 · November 7, 2020, 5:32am

Description

Hi, the NVIDIA T4 datasheet states that mixed precision can achieve 65 TFLOPS. I have run YOLOv3 on both a P100 and a T4, and both run at almost the same speed. How do I get the 65 TFLOPS performance mentioned in the datasheet?

T4 Datasheet link: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/tesla-t4/t4-tensor-core-datasheet-951643.pdf

P100 - ~160 fps (FP16)
T4 - ~170 fps (FP16)

Also, regarding the speed of the network at different precisions, my assumption was that “Single > Mixed > Double”.

How can I reproduce the 65 TFLOPS performance on the NVIDIA T4?

surya.22091994 · November 20, 2020, 2:25am

Hi,

Any information regarding this issue?

Thanks

AakankshaS · November 20, 2020, 5:31am

Hi @surya.22091994,
TFLOPS and fps are two different metrics. Can you please elaborate on the test run and on how you are calculating TFLOPS from fps?
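
To make the distinction concrete, here is a minimal sketch of how an fps figure could be converted into an achieved-TFLOPS figure. It assumes the number of floating-point operations in one forward pass is known; the function name `achieved_tflops` and the 60 GFLOP-per-frame value are placeholders for illustration, not measurements of YOLOv3.

```python
def achieved_tflops(fps, flops_per_frame):
    """Convert frames per second into achieved TFLOPS.

    fps             -- measured frames per second
    flops_per_frame -- floating-point operations for one forward pass
    """
    return fps * flops_per_frame / 1e12


# Placeholder workload size (assumption): depends on the model and input resolution.
flops_per_frame = 60e9

# ~170 fps is the T4 figure reported in this thread.
t4_achieved = achieved_tflops(170, flops_per_frame)
print(f"Achieved throughput: {t4_achieved:.1f} TFLOPS")
```

The result can then be compared with the datasheet peak to see how much of the theoretical throughput the workload actually uses.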

Thanks!

surya.22091994 · November 22, 2020, 5:43am

Hi @AakankshaS

Thanks for the reply. The following is how I estimated the expected performance of YOLO on the T4.

Assumptions:

I am running an FP16 model
Half-precision (FP16) performance is greater than mixed-precision (FP16+FP32) performance
TFLOPS is directly proportional to fps (unless there are other bottlenecks; please mention anything else you think might be a bottleneck)
All the fps values mentioned are with 100% GPU utilization
We might not be utilizing the Tensor Cores of the T4 (could you provide some documentation on how to leverage Tensor Core computation?)
The T4 datasheet lists mixed-precision (FP16+FP32) performance as 65 TFLOPS (https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/tesla-t4/t4-tensor-core-datasheet-951643.pdf)
The P100 datasheet lists half-precision performance as 18.7 TFLOPS (https://images.nvidia.com/content/tesla/pdf/nvidia-tesla-p100-PCIe-datasheet.pdf)

Speed calculation:
If a GPU that can provide 18.7 TFLOPS runs YOLO at 160 fps with 100% GPU utilization,
then a GPU that can provide 65 TFLOPS should run YOLO at about 555 fps with 100% GPU utilization (with no bottlenecks).
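
The proportional estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, under the stated assumption that fps scales linearly with the peak TFLOPS figure; the function name `scaled_fps` is hypothetical.

```python
def scaled_fps(measured_fps, measured_tflops, target_tflops):
    """Scale a measured fps linearly by the ratio of two peak TFLOPS figures."""
    return measured_fps * target_tflops / measured_tflops


p100_fps = 160.0     # measured on the P100 (FP16), from this thread
p100_tflops = 18.7   # P100 half-precision peak, from the datasheet
t4_tflops = 65.0     # T4 mixed-precision peak, from the datasheet

# Assumption: no bottleneck other than raw compute on either GPU.
expected_t4_fps = scaled_fps(p100_fps, p100_tflops, t4_tflops)
print(f"Expected T4 fps under linear scaling: {expected_t4_fps:.0f}")
```

The ratio 65 / 18.7 is roughly 3.5, so linear scaling predicts about 556 fps, in line with the figure of roughly 555 fps quoted here.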

So my question is how I can get 555 fps on the T4. If it is wrong to expect 555 fps on the T4, could you please shed some light on where my reasoning went wrong?

AakankshaS · January 28, 2021, 4:53pm

Hi @surya.22091994 ,
Apologies for the delayed response.
Are you still facing the issue?
