Please provide the following information when requesting support.
• Hardware (T4)
• Network Type (resnet18)
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
• Training spec file(If have, please share here)
• How to reproduce the issue? (!tao model yolo_v4 dataset_convert -d $SPECS_DIR/yolo_v4_tfrecords_kitti_train.txt -o $DATA_DIR/train/tfrecords/train)
Hello,
This is my first time using TAO Toolkit. My goal is to do transfer learning with YOLOv4, so I opened the link for the YOLOv4 object detection notebook. I was able to run it until step 2.1 (set up the Python environment). I was at step #3 when executing the following code resulted in the error ModuleNotFoundError: No module named ‘uff’: Yolo_V4_Colab_Error…txt (2.7 KB)
!tao model yolo_v4 dataset_convert -d $SPECS_DIR/yolo_v4_tfrecords_kitti_train.txt -o $DATA_DIR/train/tfrecords/train
After the error occurred I needed to close the Colab notebook to conserve my session usage, so the logs are already gone. Running step 2.0 took quite a long time, since TensorRT needs to be installed, plus TensorFlow and its dependencies. I was able to complete the install, but the log was so long that it got clipped in Colab, so I could not see which dependencies/modules were not installed properly. I only got the module error.
OK, I’m re-running the Colab notebook. But I’m a bit confused about why TensorRT needs to be installed at all. There is a YouTube video by NVIDIA about running TAO Toolkit on Colab, and I did not see TensorRT being installed there. Also, my purpose is to do transfer learning, not inference; I plan to do the inference on my Jetson Orin NX. Anyway, I’ll send you a copy of the notebook with the captured logs.
I removed all the files/folders generated in step #2 of the notebook and tried a fresh install. The error “No module named ‘uff’” did not appear this time. But when executing the code that converts the KITTI-formatted annotation files to TFRecords, a new set of warnings and errors appeared:
Using TensorFlow backend.
2024-03-10 19:20:19,389 [TAO Toolkit] [WARNING] tensorflow 40: Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
2024-03-10 19:20:20,360 [TAO Toolkit] [WARNING] root 329: Limited tf.compat.v2.summary API due to missing TensorBoard installation.
2024-03-10 19:20:20,920 [TAO Toolkit] [WARNING] root 329: Limited tf.compat.v2.summary API due to missing TensorBoard installation.
2024-03-10 19:20:22,837 [TAO Toolkit] [WARNING] nvidia_tao_tf1.cv.common.export.trt_utils 36: Failed to import TensorRT package, exporting TLT to a TensorRT engine will not be available.
2024-03-10 19:20:22,838 [TAO Toolkit] [WARNING] nvidia_tao_tf1.cv.common.export.base_exporter 44: Failed to import TensorRT package, exporting TLT to a TensorRT engine will not be available.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
WARNING:root:Limited tf.compat.v2.summary API due to missing TensorBoard installation.
Using TensorFlow backend.
WARNING:root:Limited tf.compat.v2.summary API due to missing TensorBoard installation.
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/dataio/dataset_converter_lib.py:181: The name tf.python_io.TFRecordWriter is deprecated. Please use tf.io.TFRecordWriter instead.
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/dataio/dataset_converter_lib.py:181: The name tf.python_io.TFRecordWriter is deprecated. Please use tf.io.TFRecordWriter instead.
For the dataset_config in the experiment_spec, please use labels in the tfrecords file, while writing the classmap.
Telemetry data couldn’t be sent, but the command ran successfully.
[WARNING]: ‘str’ object has no attribute ‘decode’
Execution status: PASS
The warning messages highlight a missing TensorBoard install; I’m not sure whether that affected the error. There is another warning about “‘str’ object has no attribute ‘decode’”; I assume this attribute should be in the generated TFRecords.
The notebook describes another method that uses the KITTI annotation files directly instead of TFRecords:
The default YOLOv4 data format requires generation of TFRecords. Currently, the old sequence data format (image folders and label txt folders) is still supported and if you prefer to use the sequence data format, you can skip this section. To use sequence data format, please use spec file yolo_v4_train_resnet18_kitti_seq.txt and yolo_v4_retrain_resnet18_kitti_seq.txt
The only thing is that there are no yolo_v4_train_resnet18_kitti_seq.txt or yolo_v4_retrain_resnet18_kitti_seq.txt files in the SPECS_DIR (a quick listing like the sketch below could confirm what is actually there).
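For reference, this is the kind of listing I have in mind (just a sketch in Python; it assumes the SPECS_DIR environment variable was set earlier in the notebook, otherwise adjust the fallback path to your own layout):

import glob
import os

# SPECS_DIR is set earlier in the notebook; the fallback path is only a placeholder.
specs_dir = os.environ.get("SPECS_DIR", "/content/drive/MyDrive/specs")

# List the spec files that are actually present, to see whether the *_seq.txt variants exist.
print(sorted(os.path.basename(p) for p in glob.glob(os.path.join(specs_dir, "*.txt"))))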
Another, more concerning warning is: nvidia_tao_tf1.cv.common.export.base_exporter 44: Failed to import TensorRT package, exporting TLT to a TensorRT engine will not be available. As I plan to do the inference on my Jetson Orin NX, I will need to convert the model’s TLT file to a TensorRT engine file, so this warning would completely defeat my purpose. I need help to find out why it failed to import TensorRT. Copy of yolo_v4.zip (57.4 KB)
Yes, I can see the output tfrecords in my data folder. But I have some doubts about them, because some have a size of 0 KB and there is no file extension. I’m trying to figure out how to view these files to confirm whether something got corrupted.
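For anyone checking the same thing, this is the kind of peek I have in mind (only a sketch; it assumes a TensorFlow 2.x runtime in the Colab session and uses my DATA_DIR path, so adjust the pattern to yours):

import glob
import os
import tensorflow as tf

# Shards written by dataset_convert (path from my DATA_DIR; adjust to yours).
shard_pattern = "/content/drive/MyDrive/kitti_data/DATA_DIR/train/tfrecords/train*"
shards = sorted(glob.glob(shard_pattern))

# 1. Report the size of every shard so the 0 KB ones stand out.
for path in shards:
    print(os.path.basename(path), os.path.getsize(path), "bytes")

# 2. Parse one record from the first non-empty shard to confirm it is readable.
for path in shards:
    if os.path.getsize(path) == 0:
        continue
    for raw_record in tf.data.TFRecordDataset(path).take(1):
        example = tf.train.Example()
        example.ParseFromString(raw_record.numpy())
        print(sorted(example.features.feature.keys()))
    break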
I’ll try to re-run the training again, but I’m also worried about the TensorRT warning about not being able to convert the TLT model to an engine.
I have 210 training images and 90 validation images. I think I found out why some of the shard files are 0 KB: some of my images don’t have matching extensions, for example *.jpg versus *.JPG. Because the spec file is looking for *.jpg, it ignored most of the images named with *.JPG. I have fixed this and my images now have standard extensions. I then re-ran the TFRecords conversion; it took about 5 minutes to finish but ended with the same warnings and errors as before. I also tried to import tensorrt but got the error “No module named ‘tensorrt’”. I’m not sure why it won’t import, as I checked that TensorRT was extracted to this path: !tar -xzf $trt_tar_path -C /content/trt_untar.
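To narrow down the import problem, I’m thinking of a quick check along these lines (a sketch only; my assumption is that extracting the tar by itself does not put the Python bindings on the Python path):

import glob
import sys

# What did the tar extraction actually produce? (the path used in my notebook)
print(glob.glob("/content/trt_untar/*"))

# Is anything TensorRT-related already on the Python path?
print([p for p in sys.path if "tensorrt" in p.lower() or "trt" in p.lower()])

# Can the Python bindings be imported at all?
try:
    import tensorrt
    print("tensorrt version:", tensorrt.__version__)
except ImportError as err:
    print("tensorrt not importable:", err)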
Warnings and errors:
Using TensorFlow backend.
2024-03-10 18:09:09.265904: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcudart.so.12
2024-03-10 18:09:09,319 [TAO Toolkit] [WARNING] tensorflow 40: Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
2024-03-10 18:09:10,325 [TAO Toolkit] [WARNING] root 329: Limited tf.compat.v2.summary API due to missing TensorBoard installation.
2024-03-10 18:09:10,903 [TAO Toolkit] [WARNING] root 329: Limited tf.compat.v2.summary API due to missing TensorBoard installation.
2024-03-10 18:09:12,898 [TAO Toolkit] [WARNING] nvidia_tao_tf1.cv.common.export.trt_utils 36: Failed to import TensorRT package, exporting TLT to a TensorRT engine will not be available.
2024-03-10 18:09:12,898 [TAO Toolkit] [WARNING] nvidia_tao_tf1.cv.common.export.base_exporter 44: Failed to import TensorRT package, exporting TLT to a TensorRT engine will not be available.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
WARNING:root:Limited tf.compat.v2.summary API due to missing TensorBoard installation.
Using TensorFlow backend.
WARNING:root:Limited tf.compat.v2.summary API due to missing TensorBoard installation.
For the dataset_config in the experiment_spec, please use labels in the tfrecords file, while writing the classmap.
Telemetry data couldn’t be sent, but the command ran successfully.
[WARNING]: ‘str’ object has no attribute ‘decode’
Execution status: PASS
I was able to proceed with the training, although I did not expect to see such a high loss, starting around 26,000 and going down to 482 after 100 epochs. The main issue I’m facing is that only 1 out of my 2 classes is getting predictions, and I’m not sure why. I followed the recommended dataset formatting and folder layout. My training images are slightly mismatched: ‘awake’ has 103 images and ‘drowsy’ has 107. My validation images are 50 each. I combined all the training images into one folder named images, using the following naming format: awake_N.jpg, drowsy_N.jpg.
I’m checking the settings in yolo_v4_train_resnet18_kitti.txt but could not find anything that would cause this issue. I have uploaded the training config file. A quick tally of the label files, as sketched below, should also confirm whether both classes are present.
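Something like this is what I mean (a sketch; the labels path is an assumption based on my folder layout, so adjust it to yours):

import glob
from collections import Counter

# Assumed location of the KITTI label .txt files -- adjust to your own layout.
label_files = glob.glob("/content/drive/MyDrive/kitti_data/DATA_DIR/train/labels/*.txt")

counts = Counter()
for label_file in label_files:
    with open(label_file) as f:
        for line in f:
            fields = line.split()
            if fields:  # the first field in a KITTI label line is the class name
                counts[fields[0].lower()] += 1

print(counts)  # both 'awake' and 'drowsy' should show up here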
This is a new question. Could you create a new forum topic, since the original issue is gone now?
For this latest question, do you have the log from when you generated the tfrecords files /content/drive/MyDrive/kitti_data/DATA_DIR/train/tfrecords/train*?
In that log, we can see how many “awake” objects and how many “drowsy” objects were written.
Since you set validation_fold: 0, the evaluation will use the tfrecords files that have -000-of in the filename.
Thanks for your input on validation_fold: 0. I replaced this setting with paths to my validation TFRecords and images instead, and it worked: I’m now getting predictions for both of my classes. The loss still seems a bit high, but based on the forum discussions about YOLOv4, a high loss value appears to be normal. I’m not sure what the impact on inference will be, but I’ll try the model.