Yolov4 TensorRT slower than Yolov4 darknet


I’ve converted a yolov4 darknet model to ONNX using a conversion script I found online (https://github.com/Tianxiaomo/pytorch-YOLOv4). I then run the generated ONNX model with TensorRT. There were no errors during the conversion, and all outputs from the TensorRT inference of yolov4 are as expected, so I don’t think I’m doing anything wrong there.

However, the speed is not what I expected. I’m working under the assumption that a TensorRT engine should run faster than the corresponding darknet model.

Here are some numbers:

yolov4_416_416: 40ms (using darknet)
yolov4_416_416: 45ms (using onnx -> converted to tensorRT engine)

The yolov4 tensorRT engine seems to be running slower than the yolov4 darknet. Any reason why?

For comparison, I used a yolov3 onnx (converted using a different script). The tensorRT engine runs faster than the darknet in this case. I use the same inference script as I did for the yolov4.

yolov3_416_416: 38ms (using darknet)
yolov3_416_416: 30ms (using onnx -> converted to tensorRT engine)

*I used an input size of 416 x 416 x 3 and a batch size of 4 for my TensorRT results.
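
Since the TensorRT results were measured with a batch size of 4, it may help to normalize each figure by its batch size before comparing. The following is only a sketch of that arithmetic: the helper name `per_image_ms` is hypothetical, and it assumes the quoted TensorRT times cover the whole batch, which the post does not confirm. The post also does not say which batch size the darknet timings used, so those are left out.

```python
def per_image_ms(batch_latency_ms, batch_size):
    """Return the average latency per image for one batched inference call."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return batch_latency_ms / batch_size


# TensorRT timings quoted in the post, assumed to be per batch of 4 images.
yolov4_trt = per_image_ms(45.0, 4)
yolov3_trt = per_image_ms(30.0, 4)
print(yolov4_trt, yolov3_trt)
```

If the darknet timings were taken with a different batch size, the same helper can be applied to them so that both sides are compared per image.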


TensorRT Version:
GPU Type: GTX 1060
Nvidia Driver Version:
CUDA Version: 10.2
CUDNN Version: 7.6.5
Operating System + Version: Windows 10
Python Version (if applicable): Python 3.6.6
TensorFlow Version (if applicable): 1.5.0
PyTorch Version (if applicable): 1.4.0
Baremetal or Container (if container which image + tag):

Hi @ColinPs26kt,
Could you please share both ONNX models and your scripts?
Meanwhile, you can check the best practices for improving the performance of your engine.


Hi @AakankshaS,

Thanks for the link and for following up. I’m familiar with the idea of batching for optimization in TensorRT. I’m using a batch of 4 images for now. I am attaching my script here: onnxExpt.cpp (11.0 KB)

The script is a simplified implementation of what I have so far. I load an image “TestImage.jpg”, do some pre-processing using OpenCV functions, and copy the image over to the GPU for inference. I’ve removed the post-processing steps to keep things simpler. I measure the time elapsed from when the enqueue function is called to when the stream is synchronized and the output is copied out to the host.
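
A single timed call can be skewed by one-off start-up costs, so latency is often averaged over many runs after a few untimed warm-up runs. The sketch below illustrates that pattern and is not the attached script: `time_inference` and `fake_inference` are hypothetical names, and the callable passed in is assumed to block until the GPU work has finished (enqueue, stream synchronize, copy to host), matching the measurement window described in the post.

```python
import time


def time_inference(run_once, warmup=10, iterations=100):
    """Return the mean wall-clock latency of run_once() in milliseconds."""
    for _ in range(warmup):
        run_once()  # warm-up runs are executed but not timed
    start = time.perf_counter()
    for _ in range(iterations):
        run_once()  # must return only after the output is back on the host
    elapsed_s = time.perf_counter() - start
    return elapsed_s * 1000.0 / iterations


def fake_inference():
    """Stand-in workload for illustration only; replace with the real call."""
    time.sleep(0.002)


print(time_inference(fake_inference, warmup=2, iterations=5))
```

Timing the darknet and TensorRT paths with the same harness, batch size and input keeps the two numbers comparable.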

I have also attached a model (yolov4_4_3_416_416.onnx). My script converts this onnx model to a .trt file.

My question is why the yolov4 tensorRT runs slower than the yolov4 darknet. I saw a similar open issue regarding yolov4 in another discussion thread (TensorRT model inference is slower than normal model).

Hi @ColinPs26kt,
It looks like you forgot to attach your ONNX model.

Hi @AakankshaS,

Sorry about that. The file is too large. I’ll send a OneDrive link.



Because the pre- and post-processing steps depend so strongly on the particular application, we mostly consider the latency and throughput of the network inference itself, excluding the data pre- and post-processing overhead.
We suggest you check the profiling data to find the bottleneck, and share that data along with the test image so we can better help.
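
One way to read profiling data is to total the time per layer and rank the layers by their share of the whole. TensorRT can report per-layer timings through its IProfiler interface; the sketch below only covers the ranking step and assumes the timings are already available as (layer name, milliseconds) pairs. The function name `rank_layers` and the sample values are placeholders, not measurements from this thread.

```python
from collections import defaultdict


def rank_layers(records, top=5):
    """Sum time per layer and return the slowest layers with their share."""
    totals = defaultdict(float)
    for name, ms in records:
        totals[name] += ms  # accumulate across repeated inference runs
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [
        (name, ms, ms / grand_total if grand_total else 0.0)
        for name, ms in ranked[:top]
    ]


# Placeholder records for illustration only (made-up names and values).
sample = [("layer_a", 1.0), ("layer_b", 3.0), ("layer_a", 1.0)]
for name, ms, share in rank_layers(sample):
    print(name, ms, round(share, 2))
```

Running the same ranking on the yolov3 and yolov4 engines would show whether a few layers account for most of the difference.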

@ColinPs26kt, could you check this?