ImportError: No module named nvml

I receive the error below (no module named nvml) when I run the tlt-dataset-convert command in the yolo example. Deepstream 5 works fine on this box outside the container.

Creating a new directory for the output tfrecords dump.

!mkdir -p $USER_EXPERIMENT_DIR/tfrecords
#KITTI trainval
!tlt-dataset-convert -d $SPECS_DIR/yolo_tfrecords_kitti_trainval.txt \
                     -o $DATA_DOWNLOAD_DIR/tfrecords/kitti_trainval/kitti_trainval

************Output *********************
Using TensorFlow backend.
Traceback (most recent call last):
  File "/usr/local/bin/tlt-dataset-convert", line 5, in <module>
    from iva.detectnet_v2.scripts.dataset_convert import main
  File "./detectnet_v2/scripts/dataset_convert.py", line 14, in <module>
  File "./detectnet_v2/dataio/build_converter.py", line 13, in <module>
  File "./detectnet_v2/dataio/kitti_converter_lib.py", line 22, in <module>
  File "./detectnet_v2/dataio/dataset_converter_lib.py", line 20, in <module>
  File "./detectnet_v2/dataloader/utilities.py", line 15, in <module>
  File "/usr/local/lib/python2.7/dist-packages/modulus/__init__.py", line 8, in <module>
    from modulus import blocks
  File "/usr/local/lib/python2.7/dist-packages/modulus/blocks/__init__.py", line 22, in <module>
    from modulus.blocks import data_loaders
  File "/usr/local/lib/python2.7/dist-packages/modulus/blocks/data_loaders/__init__.py", line 9, in <module>
    from modulus.blocks.data_loaders.sqlite_dataloader import SQLiteDataLoader
  File "./modulus/blocks/data_loaders/sqlite_dataloader.py", line 10, in <module>
  File "/usr/local/lib/python2.7/dist-packages/modulus/dataloader/__init__.py", line 8, in <module>
    from modulus.dataloader import humanloop
  File "./modulus/dataloader/humanloop.py", line 16, in <module>
  File "/usr/local/lib/python2.7/dist-packages/modulus/processors/__init__.py", line 26, in <module>
    from modulus.processors.buffers import NamedTupleStagingArea
  File "./modulus/processors/buffers.py", line 10, in <module>
  File "/usr/local/lib/python2.7/dist-packages/modulus/hooks/__init__.py", line 9, in <module>
    from modulus.hooks.hooks import KerasCheckpointListener
  File "./modulus/hooks/hooks.py", line 26, in <module>
ImportError: No module named nvml

Which tlt docker did you run?
Please refer to the existing topic "No module named nvml using tlt-dataset-convert".

When I run this command now, I get a different error.

Creating a new directory for the output tfrecords dump.

!mkdir -p $USER_EXPERIMENT_DIR/tfrecords
!cat $SPECS_DIR/yolo_tfrecords_kitti_trainval.txt
#KITTI trainval
!tlt-dataset-convert -d $SPECS_DIR/yolo_tfrecords_kitti_trainval.txt \
                     -o $DATA_DOWNLOAD_DIR/tfrecords/kitti_trainval/kitti_trainval

kitti_config {
  root_directory_path: "/workspace/tlt-experiments/data/training"
  image_dir_name: "image_2"
  label_dir_name: "label_2"
  image_extension: ".jpg"
  partition_mode: "random"
  num_partitions: 2
  val_split: 14
  num_shards: 10
}
image_directory_path: "/workspace/tlt-experiments/data/training"
Using TensorFlow backend.
2020-10-15 14:44:02,906 - iva.detectnet_v2.dataio.build_converter - INFO - Instantiating a kitti converter
2020-10-15 14:44:02,915 - iva.detectnet_v2.dataio.kitti_converter_lib - INFO - Num images in
Train: 2368 Val: 385
2020-10-15 14:44:02,915 - iva.detectnet_v2.dataio.kitti_converter_lib - INFO - Validation data in partition 0. Hence, while choosing the validationset during training choose validation_fold 0.
2020-10-15 14:44:02,916 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 0
Traceback (most recent call last):
  File "/usr/local/bin/tlt-dataset-convert", line 8, in <module>
    sys.exit(main())
  File "./detectnet_v2/scripts/dataset_convert.py", line 64, in main
  File "./detectnet_v2/dataio/dataset_converter_lib.py", line 74, in convert
  File "./detectnet_v2/dataio/dataset_converter_lib.py", line 108, in _write_partitions
  File "./detectnet_v2/dataio/dataset_converter_lib.py", line 149, in _write_shard
  File "./detectnet_v2/dataio/kitti_converter_lib.py", line 169, in _create_example_proto
  File "./detectnet_v2/dataio/kitti_converter_lib.py", line 290, in _add_targets
AttributeError: 'int' object has no attribute 'lower'

The lines in my label files look like this:
0 0.00 0 0.00 1168.0 121.0 1751.0 403.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00

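The first field of each line in a label file must be the class name as a string, not an integer. In your label, the leading 0 needs to be replaced with the object's class name.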
See https://docs.nvidia.com/metropolis/TLT/tlt-getting-started-guide/index.html#label_file
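
If many label files use numeric class IDs, rewriting them by hand is tedious. Below is a minimal sketch of a script that replaces a numeric first field with a class name. The `CLASS_NAMES` mapping, the function names, and the example directory are hypothetical and not part of TLT; adjust them to your own dataset and back up the labels first, since the files are rewritten in place.

```python
import os

# Hypothetical mapping from numeric class IDs to class names (an assumption,
# not taken from the thread); replace with the classes of your own dataset.
CLASS_NAMES = {"0": "car", "1": "pedestrian"}


def fix_label_line(line, class_names):
    """Replace a numeric first field with its class name.

    Lines whose first field is not in class_names are returned unchanged
    apart from whitespace normalization.
    """
    fields = line.split()
    if fields and fields[0] in class_names:
        fields[0] = class_names[fields[0]]
    return " ".join(fields)


def fix_label_dir(label_dir, class_names):
    """Rewrite every .txt label file in label_dir in place."""
    for name in sorted(os.listdir(label_dir)):
        if not name.endswith(".txt"):
            continue
        path = os.path.join(label_dir, name)
        with open(path) as f:
            lines = [fix_label_line(l, class_names) for l in f.read().splitlines()]
        with open(path, "w") as f:
            f.write("".join(l + "\n" for l in lines))
```

For example, `fix_label_dir("/workspace/tlt-experiments/data/training/label_2", CLASS_NAMES)` would process the label directory named in the spec above, assuming that is where the label files live.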

If training is interrupted, can I resume it from where it stopped? For example, training is set to run for 200 iterations and suddenly stops at iteration 45. I want to continue training from iteration 45.

Yes, you can.
From the TLT user guide:

Note: DetectNet_v2 now supports resuming training from intermediate checkpoints. In case a previously running training experiment is stopped prematurely, one may restart the training from the last checkpoint by simply re-running the detectnet_v2 training command with the same command line arguments as before. The trainer for detectnet_v2 finds the last saved checkpoint in the results directory and resumes the training from there.

For example, if training stopped at the 22nd epoch, resume it with the command below.

!tlt-train detectnet_v2 -e spec.txt \
                        -r experiment_dir_unpruned \
                        -k $KEY \
                        -m 022.tlt \
                        --gpus 1 \
                        --initial_epoch 23
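
To avoid looking up the checkpoint and the next epoch number by hand, a small helper can pick them out of the results directory. This is only a sketch: it assumes checkpoint file names end in the epoch number followed by `.tlt` (as in `022.tlt` above), and both the `latest_checkpoint` function and that naming pattern are assumptions rather than documented TLT behavior.

```python
import os
import re


def latest_checkpoint(weights_dir):
    """Return (path, epoch) for the highest-numbered .tlt file, or None.

    Assumes file names end with the epoch number followed by ".tlt",
    e.g. "022.tlt" (an assumption about the naming scheme).
    """
    best = None
    for name in os.listdir(weights_dir):
        match = re.search(r"(\d+)\.tlt$", name)
        if not match:
            continue
        epoch = int(match.group(1))
        if best is None or epoch > best[1]:
            best = (os.path.join(weights_dir, name), epoch)
    return best
```

With the result `(path, epoch)`, the model path goes to `-m` and `epoch + 1` is the value for the initial epoch, matching the 22 and 23 in the example above.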

Thank you. I want to use darknet53 as the backbone network of yolov3. I changed the arch in yolo_train_resnet18_kitti.txt to darknet53, which resulted in a dimension mismatch error.

Please create a new topic and attach your full training spec there.

The txt file upload failed, so I uploaded a screenshot instead.

For your latest issue, let us track it in the topic "Use darknet53 as backbone network for yolov3".