Built FP16 and INT8 TensorRT engines from a YOLO ONNX model, and ran inference with nvdsparse_yolo.cpp and nvdsparse_yolo.cu

Hardware:
GPU: V100
CUDA: 11.7
TensorRT: 8.4
OS: Ubuntu

I ran some tests:
I used a YOLO ONNX model to build FP16 and INT8 TensorRT engines, and used the nvdsparse_yolo.cpp and nvdsparse_yolo.cu code to run inference.
I got some results, but I don't understand them. Could you give me some ideas or an explanation? Thank you so much.
The results look like this:
test 1: FP16 engine + nvdsparse_yolo.cpp; usage: CPU 43%, GPU 11%, SM 34%
test 2: FP16 engine + nvdsparse_yolo.cu; usage: CPU 46%, GPU 11%, SM 34%
test 3: INT8 engine + nvdsparse_yolo.cpp; usage: CPU 19%, GPU 68%, SM 77%
test 4: INT8 engine + nvdsparse_yolo.cu; usage: CPU 20%, GPU 63%, SM 66%
I don't understand why INT8 uses so much more GPU and SM while using less CPU, or why the .cpp and .cu code show almost the same usage.
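
For reference, building such an engine with the TensorRT 8.4 C++ API typically looks roughly like the sketch below. This is a minimal sketch, not my exact build code: the file names and workspace size are placeholders, and an INT8 build would additionally need a calibrator, which is omitted here.

```cpp
// Minimal sketch, assuming the standard TensorRT 8.4 C++ API.
// "yolo.onnx" / "yolo_fp16.engine" are placeholder file names.
#include <NvInfer.h>
#include <NvOnnxParser.h>
#include <fstream>
#include <iostream>
#include <memory>

using namespace nvinfer1;

class Logger : public ILogger {
    void log(Severity severity, const char* msg) noexcept override {
        if (severity <= Severity::kWARNING) std::cout << msg << std::endl;
    }
};

int main() {
    Logger logger;
    auto builder = std::unique_ptr<IBuilder>(createInferBuilder(logger));
    // ONNX models require an explicit-batch network.
    auto network = std::unique_ptr<INetworkDefinition>(builder->createNetworkV2(
        1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH)));
    auto parser = std::unique_ptr<nvonnxparser::IParser>(
        nvonnxparser::createParser(*network, logger));
    if (!parser->parseFromFile("yolo.onnx",
                               static_cast<int>(ILogger::Severity::kWARNING))) {
        std::cerr << "failed to parse ONNX model" << std::endl;
        return 1;
    }

    auto config = std::unique_ptr<IBuilderConfig>(builder->createBuilderConfig());
    config->setMemoryPoolLimit(MemoryPoolType::kWORKSPACE, 1ULL << 30); // 1 GiB
    // FP16 build; an INT8 build would set BuilderFlag::kINT8 and attach a calibrator.
    config->setFlag(BuilderFlag::kFP16);

    auto serialized = std::unique_ptr<IHostMemory>(
        builder->buildSerializedNetwork(*network, *config));
    std::ofstream out("yolo_fp16.engine", std::ios::binary);
    out.write(static_cast<const char*>(serialized->data()), serialized->size());
    return 0;
}
```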

Hi @15900963082,
Could you provide the source code of nvdsparse_yolo.cpp and nvdsparse_yolo.cu? Were they created by the TRT team, or did you write them yourself?
Thanks

Also, GPU utilization usually isn't a good indicator of performance, because low GPU utilization may simply be the result of poorly optimized kernels; generally, higher GPU utilization means the kernels are better optimized.
Therefore, what you should really measure is performance in terms of latency or throughput.
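
For example, here is a minimal sketch of how you could time enqueueV2 with CUDA events to get average latency and throughput. It assumes a serialized engine file (called yolo_fp16.engine here as a placeholder) with static shapes and float-typed bindings:

```cpp
// Minimal sketch, assuming TensorRT 8.4; "yolo_fp16.engine" is a placeholder,
// and all bindings are assumed to be float tensors with static shapes.
#include <NvInfer.h>
#include <cuda_runtime.h>
#include <fstream>
#include <iostream>
#include <iterator>
#include <memory>
#include <vector>

using namespace nvinfer1;

class Logger : public ILogger {
    void log(Severity severity, const char* msg) noexcept override {
        if (severity <= Severity::kWARNING) std::cout << msg << std::endl;
    }
};

int main() {
    Logger logger;
    std::ifstream file("yolo_fp16.engine", std::ios::binary);
    std::vector<char> blob((std::istreambuf_iterator<char>(file)),
                           std::istreambuf_iterator<char>());

    auto runtime = std::unique_ptr<IRuntime>(createInferRuntime(logger));
    auto engine = std::unique_ptr<ICudaEngine>(
        runtime->deserializeCudaEngine(blob.data(), blob.size()));
    auto context = std::unique_ptr<IExecutionContext>(engine->createExecutionContext());

    // Allocate one device buffer per binding (inputs and outputs).
    std::vector<void*> bindings(engine->getNbBindings());
    for (int i = 0; i < engine->getNbBindings(); ++i) {
        Dims d = engine->getBindingDimensions(i);
        size_t count = 1;
        for (int j = 0; j < d.nbDims; ++j) count *= d.d[j];
        cudaMalloc(&bindings[i], count * sizeof(float));
    }

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm up, then time a fixed number of iterations.
    for (int i = 0; i < 10; ++i) context->enqueueV2(bindings.data(), stream, nullptr);
    const int iters = 100;
    cudaEventRecord(start, stream);
    for (int i = 0; i < iters; ++i) context->enqueueV2(bindings.data(), stream, nullptr);
    cudaEventRecord(stop, stream);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    std::cout << "avg latency: " << ms / iters << " ms, throughput: "
              << iters * 1000.f / ms << " infer/s" << std::endl;
    return 0;
}
```

If you'd rather not write timing code, trtexec --loadEngine=yolo_fp16.engine reports latency and throughput numbers out of the box.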