Difference in mAP between tlt evaluate and tlt inference

Hi there,

We have been using the Darknet framework for a while and are thinking of migrating to TLT 3.0 (now TAO). We are running an accuracy comparison against our previous YOLOv4 models trained on our person dataset.

To compare trainings, frameworks, etc., we have a Python script which computes the PASCAL 2010 mAP. This script has been well tested and gives exactly the same mAP as the official Darknet repo (darknet detector map command) with the same weights.

We wanted to compute the TLT accuracy using our script as well (to make a fair comparison with our previous results, we need to use the same tool).
To do so, we first run tlt inference, then take the bounding box coordinates generated in the labels folder (KITTI format) and convert them to the Darknet format. Finally, we compute the PASCAL 2010 mAP using our script.
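For readers who want to reproduce the conversion step, here is a minimal sketch of turning one KITTI label line into Darknet-style normalized values. It is not the script discussed in this thread. The function name `kitti_to_darknet`, the `class_ids` mapping, and the output tuple layout are hypothetical, and it assumes that a 16th field, when present, holds the detection confidence.

```python
def kitti_to_darknet(line, img_w, img_h, class_ids):
    """Convert one KITTI label line to (class_id, confidence, xc, yc, w, h).

    KITTI stores the class name in field 0 and the box corners
    (xmin, ymin, xmax, ymax) in pixels in fields 4..7.
    Darknet expects the box centre and size normalized to [0, 1].
    """
    fields = line.split()
    name = fields[0]
    xmin, ymin, xmax, ymax = map(float, fields[4:8])
    # Assumption: an optional 16th field carries the detection confidence.
    confidence = float(fields[15]) if len(fields) > 15 else 1.0
    xc = (xmin + xmax) / 2.0 / img_w
    yc = (ymin + ymax) / 2.0 / img_h
    w = (xmax - xmin) / img_w
    h = (ymax - ymin) / img_h
    return class_ids[name], confidence, xc, yc, w, h


# Made-up example line for an 800x500 image.
example = ("person 0.00 0 0.00 100.00 50.00 300.00 450.00 "
           "0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.87")
print(kitti_to_darknet(example, img_w=800, img_h=500, class_ids={"person": 0}))
```

The image size has to come from the image itself, since KITTI label files do not store it.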

In addition, we have also computed the mAP using TRT. We converted the TLT file into a TRT engine, ran inference, scaled and converted the bounding boxes, and finally computed the mAP using our script.
With this pipeline we get the same score as tlt evaluate.

To summarise, the problem is that tlt inference + script does not give the same mAP as tlt evaluate or trt inference + script.


  • tlt evaluate => mAP@0.5 = 84.5%
  • trt inference + script => mAP@0.5 = 84.5%
  • tlt inference + script => mAP@0.5 = 77.8%

To remove possible errors, we have so far confirmed:

  • mAP computed by our script is the same as the one from the official Darknet repo
  • Predicted bounding box conversion from KITTI to Darknet format is correct; we have checked it visually.
  • tlt evaluate and trt inference + script give equal accuracy

Do you have any insights on why we would get such a difference? Is there any post-processing done in tlt inference that is not present in tlt evaluate and trt inference, or vice versa?

Thank you for your help.

Can you run trt evaluate too?
See YOLOv4 — TAO Toolkit 3.0 documentation

  • -m, --model : The path to the model file to use for evaluation. The model can be either a .tlt model file or TensorRT engine.

Can you run trt evaluate too?

You mean tao evaluate, I guess? Yes, I did run tao evaluate, which gives the same result as tlt evaluate → mAP@0.5 = 84.5%.
But I don’t really understand the point of trying that. As I mentioned in the post, we can confirm that tao evaluate gives us the same accuracy as trt inference + mAP script. It is tlt inference that does not, and that is our concern.

I mean running “tlt evaluate” or “tao evaluate” against the TensorRT engine.

Ok I have run:
tao yolo_v4 evaluate -e /workspace/tlt-experiments/trainings/conf/yolov4_config.yml -m /workspace/tlt-experiments/trainings/yolov4_cspdarknet53_fp32.engine -k key

And I get mAP@0.5 = 84.3%, pretty close to the tao evaluate result with the .tlt model (84.5%).

Thanks, so your results are as below.

  • tlt evaluate => mAP@0.5 = 84.5%
  • trt evaluate => mAP@0.5 = 84.3%
  • trt inference + script => mAP@0.5 = 84.5%
  • tlt inference + script => mAP@0.5 = 77.8%

Is it possible to share the script?

Can you try changing matching_iou_threshold and run the experiments below again?

tlt evaluate
tlt inference + script

Not really, because it is company code. But the script has been tested against the official Darknet repo and it gives the same mAP on the same test set…

I have also re-run the tests on just 10 images with 339 labelled boxes, to rule out some possible errors:

  • tao evaluate .tlt → mAP@0.5 = 66.3%
  • tao evaluate .engine → mAP@0.5 = 59.2%
  • trt inference + script → mAP@0.5 = 58.1%
  • tao inference + script → mAP@0.5 = 54.8%

The two TensorRT accuracies are pretty close, but there is still a big difference between tao evaluate and tao inference + script.

Can you try changing matching_iou_threshold and run the experiments below again?

tlt evaluate
tlt inference + script

I don’t really understand the point of doing that. If I change matching_iou_threshold to 0.7, for example, I’m no longer computing mAP@0.5 but mAP@0.7, so there will still be a difference between tlt evaluate and tlt inference + script. What is your thinking behind that?

Also, does tao evaluate directly call the same inference function as tao inference? If so, and because we are using the same config for both, is there any processing done in tao evaluate which is not applied in tao inference?
Thank you

I just want to know how large the difference between tlt evaluate and tlt inference + script is at different thresholds.
tao evaluate and tao inference are different applications. When they run inference against a .tlt model, there are some differences in image preprocessing and in the postprocessing of predicted labels.

Here are a few results with different matching_iou_threshold values (tao inference means tao inference + script):

  • tao evaluate → mAP@0.25 = 89.6%
  • tao inference → mAP@0.25 = 82.0%

  • tao evaluate → mAP@0.4 = 88.0%
  • tao inference → mAP@0.4 = 80.7%

  • tao evaluate → mAP@0.5 = 84.5%
  • tao inference → mAP@0.5 = 77.8%

  • tao evaluate → mAP@0.6 = 74.0%
  • tao inference → mAP@0.6 = 68.4%

  • tao evaluate → mAP@0.75 = 30.0%
  • tao inference → mAP@0.75 = 27.0%

  • tao evaluate → mAP@0.95 = 0.0%
  • tao inference → mAP@0.95 = 0.0%

It looks like tao inference + script always gives lower accuracy. Is it possible to get the predictions from tao evaluate somehow, for example in a log file?

Thanks for the info.
tlt evaluate does not print the predicted labels for end users.

We will check further.
Which dataset did you train on? Is it possible to share your tlt model and training spec file?

We have trained on our custom person dataset which is a private dataset (we can’t share it).
Yes I can share the training spec file. Do you still want the tlt file even if I won’t share the test set?

Currently not. We will use a tlt model trained on the KITTI dataset to check what is happening.

Can you change the threshold below and re-run “tao inference” + your script at mAP@0.5?
For example, try 0.2, 0.1, 0.05, 0.01.

  • -t, --draw_conf_thres : Threshold for drawing a bbox. default: 0.3.

OK, I did what you asked. At first it didn’t seem useful to me, because it is a drawing parameter which should only affect the output images and not the predictions saved in the text files. But here are the results:

  • draw_conf_thres = 0.3
    • tao inference + our script → mAP@0.5 = 77.8%
  • draw_conf_thres = 0.2
    • tao inference + our script → mAP@0.5 = 79.1%
  • draw_conf_thres = 0.1
    • tao inference + our script → mAP@0.5 = 80.7%
  • draw_conf_thres = 0.05
    • tao inference + our script → mAP@0.5 = 81.8%
  • draw_conf_thres = 0.01
    • tao inference + our script → mAP@0.5 = 83.5%
  • draw_conf_thres = 0.001
    • tao inference + our script → mAP@0.5 = 83.9%

It is definitely increasing, and I was really surprised! With a low threshold we are getting close to the tao evaluate result, mAP@0.5 = 84.5%.

I’ve also compared the number of predictions in the output .txt files (labels folder). With the threshold at 0.3 we get 27712 predictions, and with the threshold at 0.001 we get 174423 predictions.
In the end the increase in accuracy makes sense: when we compute mAP from the precision/recall curve, every prediction becomes a point on the curve, and no confidence threshold is applied. The PASCAL 2010 mAP in particular uses every point on the curve. If we give it ~6-7 times fewer predictions (points), the curve is definitely different.
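The effect described above can be reproduced on toy data. The sketch below is not the script from this thread or TAO's implementation; it is a minimal PASCAL VOC 2010 style AP (area under the interpolated precision/recall curve), and the detections, the ground-truth count, and the function name `average_precision` are made up for illustration.

```python
def average_precision(detections, num_gt):
    """Area under the interpolated precision/recall curve.

    detections: list of (confidence, is_true_positive) pairs.
    num_gt: number of ground-truth boxes.
    """
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    recalls, precisions = [], []
    tp = 0
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        tp += int(is_tp)
        recalls.append(tp / num_gt)
        precisions.append(tp / rank)
    # Make precision monotonically non-increasing, scanning from the right.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum the rectangles under the curve at each recall step.
    ap, prev_recall = 0.0, 0.0
    for recall, precision in zip(recalls, precisions):
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


# Made-up detections for 4 ground-truth boxes: (confidence, matched a GT box?)
toy = [(0.9, True), (0.8, True), (0.6, False),
       (0.4, True), (0.2, False), (0.05, True)]

ap_all = average_precision(toy, num_gt=4)
# Dropping everything below 0.3 removes the low-confidence true positive.
ap_thresholded = average_precision([d for d in toy if d[0] >= 0.3], num_gt=4)
print(f"all detections: {ap_all:.4f}, confidence >= 0.3: {ap_thresholded:.4f}")
```

On this toy set the thresholded run scores lower, because the true positive at confidence 0.05 never gets the chance to extend the recall axis.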


  • Why does a supposedly rendering-only parameter actually affect the number of saved predictions? It doesn’t make sense to me at all. It should be called a confidence threshold, not a drawing threshold.
  • Was this threshold added to tao during the change from tlt?
  • What is the confidence threshold used by tao evaluate? (This one can’t be configured in the config file.)


Yes, “-t” is the confidence threshold used to draw/label a bbox. It affects the number of predictions.
No, it is not newly added by tao.
In tao evaluate, it is defined in your training spec file. It is 0.01 by default. See YOLOv4 — TAO Toolkit 3.0 documentation

I’ve updated the confidence_threshold in the NMS config to 0.005, to match the one in our evaluation script.
Here are the final results:

  • Pascal 2010-2012 (average_precision_mode = INTEGRATE)

    • tao evaluate → mAP@0.5 = 84.1%
    • tao inference + our script → mAP@0.5 = 83.9%
  • Pascal 2008 (average_precision_mode = SAMPLE)

    • tao evaluate → mAP@0.5 = 81.8%
    • tao inference + our script → mAP@0.5 = 81.5%
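As a rough illustration of why the two modes report different numbers for the same predictions, the sketch below compares an 11-point sampled AP (the older PASCAL VOC definition) with the integrated AP (PASCAL VOC 2010 and later). Mapping these to SAMPLE and INTEGRATE follows the labels in the list above. The function names and the precision/recall points are made up, and this is not TAO's code.

```python
def ap_integrate(recalls, precisions):
    """Area under the interpolated precision/recall curve."""
    envelope = list(precisions)
    # Interpolate: precision at a recall is the best precision to its right.
    for i in range(len(envelope) - 2, -1, -1):
        envelope[i] = max(envelope[i], envelope[i + 1])
    ap, prev_recall = 0.0, 0.0
    for recall, precision in zip(recalls, envelope):
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


def ap_sample(recalls, precisions, num_points=11):
    """Mean of the interpolated precision at evenly spaced recall levels."""
    total = 0.0
    for k in range(num_points):
        level = k / (num_points - 1)
        reachable = [p for r, p in zip(recalls, precisions) if r >= level]
        total += max(reachable, default=0.0)
    return total / num_points


# Made-up curve: one (recall, precision) point per ranked detection.
recalls = [0.25, 0.5, 0.5, 0.75, 0.75, 1.0]
precisions = [1.0, 1.0, 2 / 3, 0.75, 0.6, 2 / 3]

print(f"integrate: {ap_integrate(recalls, precisions):.4f}")
print(f"sample:    {ap_sample(recalls, precisions):.4f}")
```

The two definitions give different values on the same curve, so an evaluation script and the toolkit have to use the same mode before their mAP figures can be compared.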

We now get essentially the same accuracy with both tao evaluate and tao inference + our script, so the problem is solved. I still find it pretty confusing to call a confidence threshold draw_conf_thres, but anyway.
Thanks for your help, Morgan!