ImportError: No module named nvml

I receive the error below (no module named nvml) when I run the tlt-dataset-convert command in the yolo example. Deepstream 5 works fine on this box outside the container.

Creating a new directory for the output tfrecords dump.

!mkdir -p $USER_EXPERIMENT_DIR/tfrecords
#KITTI trainval
!tlt-dataset-convert -d $SPECS_DIR/yolo_tfrecords_kitti_trainval.txt \
                     -o $DATA_DOWNLOAD_DIR/tfrecords/kitti_trainval/kitti_trainval

************Output *********************
Using TensorFlow backend.
Traceback (most recent call last):
File "/usr/local/bin/tlt-dataset-convert", line 5, in
from iva.detectnet_v2.scripts.dataset_convert import main
File "./detectnet_v2/scripts/", line 14, in
File "./detectnet_v2/dataio/", line 13, in
File "./detectnet_v2/dataio/", line 22, in
File "./detectnet_v2/dataio/", line 20, in
File "./detectnet_v2/dataloader/", line 15, in
File "/usr/local/lib/python2.7/dist-packages/modulus/", line 8, in
from modulus import blocks
File "/usr/local/lib/python2.7/dist-packages/modulus/blocks/", line 22, in
from modulus.blocks import data_loaders
File "/usr/local/lib/python2.7/dist-packages/modulus/blocks/data_loaders/", line 9, in
from modulus.blocks.data_loaders.sqlite_dataloader import SQLiteDataLoader
File "./modulus/blocks/data_loaders/", line 10, in
File "/usr/local/lib/python2.7/dist-packages/modulus/dataloader/", line 8, in
from modulus.dataloader import humanloop
File "./modulus/dataloader/", line 16, in
File "/usr/local/lib/python2.7/dist-packages/modulus/processors/", line 26, in
from modulus.processors.buffers import NamedTupleStagingArea
File "./modulus/processors/", line 10, in
File "/usr/local/lib/python2.7/dist-packages/modulus/hooks/", line 9, in
from modulus.hooks.hooks import KerasCheckpointListener
File "./modulus/hooks/", line 26, in
ImportError: No module named nvml

Which tlt docker did you run?
Please refer to the existing topic No module named nvml using tlt-dataset-convert
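Before digging into the container version, it can help to confirm which NVML Python bindings are actually importable inside the container. The sketch below is a generic diagnostic; the exact module name the TLT tools expect (`nvml` vs. the public `pynvml` package) is an assumption here.

```python
import importlib.util

def module_available(name):
    """Return True if a module can be imported under this interpreter."""
    return importlib.util.find_spec(name) is not None

# Check the NVML binding candidates the converter may depend on; run this
# inside the TLT container to see which one is missing.
for candidate in ("nvml", "pynvml"):
    print(candidate, "available:", module_available(candidate))
```

If neither is available, the container image is likely the wrong one for these tools, which is why the docker tag matters.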

When I execute this command now, I get another error.

Creating a new directory for the output tfrecords dump.

!mkdir -p $USER_EXPERIMENT_DIR/tfrecords
!cat $SPECS_DIR/yolo_tfrecords_kitti_trainval.txt
#KITTI trainval
!tlt-dataset-convert -d $SPECS_DIR/yolo_tfrecords_kitti_trainval.txt \
                     -o $DATA_DOWNLOAD_DIR/tfrecords/kitti_trainval/kitti_trainval

kitti_config {
  root_directory_path: "/workspace/tlt-experiments/data/training"
  image_dir_name: "image_2"
  label_dir_name: "label_2"
  image_extension: ".jpg"
  partition_mode: "random"
  num_partitions: 2
  val_split: 14
  num_shards: 10
}
image_directory_path: "/workspace/tlt-experiments/data/training"
Using TensorFlow backend.
2020-10-15 14:44:02,906 - iva.detectnet_v2.dataio.build_converter - INFO - Instantiating a kitti converter
2020-10-15 14:44:02,915 - iva.detectnet_v2.dataio.kitti_converter_lib - INFO - Num images in
Train: 2368 Val: 385
2020-10-15 14:44:02,915 - iva.detectnet_v2.dataio.kitti_converter_lib - INFO - Validation data in partition 0. Hence, while choosing the validationset during training choose validation_fold 0.
2020-10-15 14:44:02,916 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 0
Traceback (most recent call last):
File "/usr/local/bin/tlt-dataset-convert", line 8, in
File "./detectnet_v2/scripts/", line 64, in main
File "./detectnet_v2/dataio/", line 74, in convert
File "./detectnet_v2/dataio/", line 108, in _write_partitions
File "./detectnet_v2/dataio/", line 149, in _write_shard
File "./detectnet_v2/dataio/", line 169, in _create_example_proto
File "./detectnet_v2/dataio/", line 290, in _add_targets
AttributeError: 'int' object has no attribute 'lower'

The format of my label data set is as follows:
0 0.00 0 0.00 1168.0 121.0 1751.0 403.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00

The first field of a KITTI label file should be a string (the object class name) instead of an int.
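If the labels were exported with integer class ids, a small preprocessing pass can rewrite them before conversion. This is a minimal sketch; the id-to-name mapping below is hypothetical and must be adjusted to match how your labels were generated.

```python
# Hypothetical mapping from integer class ids to class-name strings;
# replace with the classes your dataset actually uses.
CLASS_NAMES = {0: "car", 1: "pedestrian", 2: "cyclist"}

def fix_kitti_line(line):
    """Replace an integer class id in the first KITTI label field
    with its string class name; leave other fields untouched."""
    fields = line.split()
    if fields and fields[0].isdigit():
        fields[0] = CLASS_NAMES[int(fields[0])]
    return " ".join(fields)

# The label line from the post becomes:
print(fix_kitti_line(
    "0 0.00 0 0.00 1168.0 121.0 1751.0 403.0 "
    "0.00 0.00 0.00 0.00 0.00 0.00 0.00"))
```

Running this over every line of every label file (writing the result back out) should clear the `'int' object has no attribute 'lower'` error, since the converter lowercases the class-name field.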

If training is interrupted, can I continue training where I left off? For example, training is set to run for 200 epochs but stops unexpectedly at epoch 45; I want to resume from epoch 45.

Yes, you can.
From tlt user guide,

Note: DetectNet_v2 now supports resuming training from intermediate checkpoints. In case a previously running training experiment is stopped prematurely, one may restart the training from the last checkpoint by simply re-running the detectnet_v2 training command with the same command line arguments as before. The trainer for detectnet_v2 finds the last saved checkpoint in the results directory and resumes the training from there.

For example, if you stop at the 22nd epoch, resume training as below.

!tlt-train detectnet_v2 -e spec.txt \
           -r experiment_dir_unpruned \
           -k $KEY \
           -m 022.tlt \
           --gpus 1 \
           --initial_epoch 23
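The "finds the last saved checkpoint in the results directory" behavior described in the guide can be sketched as below. The checkpoint filename pattern is an assumption for illustration; the trainer's actual naming scheme may differ.

```python
import os
import re

def latest_checkpoint(results_dir):
    """Return the path of the checkpoint with the highest step number.

    Assumes files named like 'model.step-<N>.tlt' (a hypothetical
    pattern); returns None if no checkpoint matches.
    """
    best, best_step = None, -1
    for name in os.listdir(results_dir):
        m = re.match(r"model\.step-(\d+)\.tlt$", name)
        if m and int(m.group(1)) > best_step:
            best_step = int(m.group(1))
            best = os.path.join(results_dir, name)
    return best
```

The key point is that re-running the same training command with the same `-r` results directory is enough; you do not have to pick the checkpoint by hand unless you want a specific one via `-m`.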

Thank you. I want to use darknet53 as the backbone network for YOLOv3. I changed the arch in yolo_train_resnet18_kitti.txt to darknet53, which resulted in a dimension mismatch error.

Please create a new topic and attach your full training spec there.

The txt file upload failed, so I uploaded a screenshot instead.

For your latest issue, let us track it in the topic Use darknet53 as backbone network for yolov3.