No problem. Actually, I cannot reproduce this kind of issue consistently, but some end users do run into it.
So, could you share more info about the reproduction steps and training environment?
- The CUDA/TensorRT version on your local PC
- The Jupyter notebook. You can upload the .ipynb file here.
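To answer the CUDA/TensorRT question in one go, a sketch like the following can gather the relevant versions. It assumes an Ubuntu host with packages from NVIDIA's apt repos. The function name report_env is hypothetical, and the libnvinfer/libcudnn package-name patterns are an assumption about how those repos name their packages.

```shell
#!/bin/sh
# Sketch: print the GPU driver, CUDA, TensorRT and cuDNN versions on an
# Ubuntu host. Every probe is skipped if its tool is missing.
report_env() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    # GPU model and driver version, one line per GPU.
    nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
  else
    echo "nvidia-smi: not found"
  fi

  if command -v nvcc >/dev/null 2>&1; then
    # The "release X.Y" line of the CUDA compiler that is first on PATH.
    nvcc --version | grep release
  else
    echo "nvcc: not found"
  fi

  # Resolve the /usr/local/cuda symlink when several toolkits are installed.
  if [ -e /usr/local/cuda ]; then
    echo "/usr/local/cuda -> $(readlink -f /usr/local/cuda)"
  else
    echo "/usr/local/cuda: not present"
  fi

  # TensorRT and cuDNN packages (assumed names: libnvinfer*, libcudnn*).
  if command -v dpkg >/dev/null 2>&1; then
    dpkg -l | awk '$2 ~ /^(libnvinfer|libcudnn)/ {print $2, $3}'
  fi
}

report_env
```

Note that nvcc reports whichever toolkit is first on PATH, which may differ from where /usr/local/cuda points when both 10.2 and 11.x are installed, so both lines are printed.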
- Jupyter notebook
Attached. It is yolo_v4.ipynb in cv_samples_vv1.2.0, which I downloaded from NGC.
I did not follow the notebook literally. My network is not stable, so instead of retrying the 12 GB image download for days, I downloaded and extracted the images from a mirror inside my country, outside of the notebook.
I also SSH into the machine that runs the notebook, and X11 problems bothered me. So I changed !tao yolo_v4 train to !echo tao … and executed the printed command in a terminal.
yolo_v4.ipynb (188.0 KB)
Command-line history:
annotated-cmd-history.txt (6.2 KB)
- cuda: both cuda-10-2 and cuda-11-1 are installed; /usr/local/cuda ultimately links to 11-1.
- trt 7.2.2-1+cuda11.1
Because I tried to use darknet_xxx.weights with DeepStream 4.5.1, I replaced libnvinfer_plugin.so.7.2.2 with the version built from TensorRT OSS 7.2.2.
- OS: Ubuntu 18.04; NVIDIA-related packages installed from the NVIDIA CUDA/machine-learning repos.
- GPU: RTX 2080 Ti
- driver: 495.44
- cudnn 18.104.22.168-1+cuda11.1
As I am using TAO Toolkit, the docker image may also be relevant. It is nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.08-py3, which has:
- trt 7.2.3
- cudnn 8.1.1
To narrow this down, could you refer to my comment in TLT YOLOv4 (CSPDakrnet53) - TensorRT INT8 model gives wrong predictions (0 mAP) - #23 by Morganh, use “yolo_v4 export” to generate the TensorRT engine directly, and retry?
yolo_v4 export -k nvidia_tlt -m epoch_010.tlt -e spec.txt --engine_file 384_1248.engine --data_type int8 --batch_size 8 --batches 10 --cal_cache_file export/cal.bin --cal_data_file export/cal.tensorfile --cal_image_dir /kitti_path/training/image_2 -o 384_1248.etlt
I missed your point about generating the engine directly in my last reply, so I deleted that reply.
But there are still no boxes. By the way, the command sometimes fails (maybe 1 run in 5?) with an illegal memory access. I used the following command:
tao yolo_v4 export -k nvidia_tlt -m /workspace/tao-experiments/yolo_v4/experiment_dir_retrain/weights/yolov4_resnet18_epoch_060.tlt -e /workspace/tao-experiments/yolo_v4/specs/yolo_v4_retrain_resnet18_kitti.txt --engine_file /workspace/tao-experiments/yolo_v4/export/trt.engine --data_type int8 --batch_size 8 --batches 10 --cal_image_dir /workspace/tao-experiments/yolo_v4/data/training/image_2 --cal_cache_file /workspace/tao-experiments/yolo_v4/export/cal.bin --cal_data_file /workspace/tao-experiments/yolo_v4/export/cal.tensorfile -o /workspace/tao-experiments/yolo_v4/export/yolov4_resnet18_epoch_060.etlt
And the error is:
[TensorRT] ERROR: engine.cpp (984) - Cuda Error in executeInternal: 700 (an illegal memory access was encountered)
[TensorRT] ERROR: FAILED_EXECUTION: std::exception
[TensorRT] INTERNAL ERROR: Assertion failed: context->executeV2(&bindings)
Can you log in to the docker directly and run it again?
$ tao yolo_v4 run /bin/bash
Then, inside the docker:
# yolo_v4 export xxx
Inside the docker, there are still no boxes.
root@f88186999295:/workspace/tao-experiments/yolo_v4# yolo_v4 export --engine_file ...
root@f88186999295:/workspace/tao-experiments/yolo_v4# yolo_v4 inference ...
Please try running inside the docker again with my cal.bin. Thanks.
cal.bin.txt (8.4 KB)
Your cal.bin does the trick. Now I get mAP 0.90074. Thank you!
My goal is to train the model on a custom dataset. Can I use your cal.bin once my .tlt model is ready?
Compared with the locally generated cal.bin, this time there are lots of warnings like:
[WARNING] Missing dynamic range for tensor (Unnamed Layer* 306) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[WARNING] Missing dynamic range for tensor activation_2/Relu:0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
So the issue comes from the cal.bin file.
No, you should generate a different cal.bin for each dataset.
OK, thanks. I hope the bug is fixed in the next release.
No, this is still not a bug. I generate cal.bin the same way.
So it should be related to the environment.
Can you attach your cal.bin?
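To compare the two caches once both are available, a minimal sketch like the one below lists the tensors that have a scale in only one of them; tensors missing from a cache are what trigger the "Missing dynamic range for tensor" warnings above. It assumes the usual plain-text calibration cache layout (one header line such as TRT-7203-EntropyCalibration2, then one "<tensor name>: <hex scale>" line per tensor). The helper names cal_tensor_names and cal_diff are hypothetical.

```shell
#!/bin/sh
# Sketch: list tensors that have a scale in only one of two TensorRT
# calibration caches. Assumes the text cache layout: one header line
# (e.g. TRT-7203-EntropyCalibration2), then "<tensor name>: <hex scale>".

# Print the sorted tensor names of one cache file.
cal_tensor_names() {
  # Skip the header, strip the trailing ": <hex>" value, sort bytewise.
  tail -n +2 "$1" | sed -E 's/: [0-9a-fA-F]+$//' | LC_ALL=C sort
}

# Column 1: names only in $1; column 2 (tab-indented): names only in $2.
cal_diff() {
  local a b
  a=$(mktemp) && b=$(mktemp) || return 1
  cal_tensor_names "$1" > "$a"
  cal_tensor_names "$2" > "$b"
  LC_ALL=C comm -3 "$a" "$b"
  rm -f "$a" "$b"
}
```

For example, cal_diff /workspace/tao-experiments/yolo_v4/export/cal.bin cal.bin.txt compares the local cache with the attached one. The header line (head -n 1 on each file) typically encodes the TensorRT version that wrote the cache, which may help when the suspicion is an environment difference such as TensorRT 7.2.2 on the host versus 7.2.3 in the docker image.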