Issue: training detectnet_v2 in TLT 3.0 inside a Docker container uses all GPUs

Please provide the following information when requesting support.

• Hardware : server with V100 GPUs
• Network Type : Detectnet_v2
• TLT Version : v3.0-py3

Hi, I have a problem: tlt train uses all GPUs to train a model even though I only specified two of them.

Steps to reproduce:
I am running TLT 3.0 inside a Docker container using only devices 13 and 14, started with this command:

docker run -d --name tlt-leo_detectnet_v2 -it --rm --gpus '"device=13,14"' -p 4444:4444 -v "/home.nfs/leonardo.vera/tlt-train":"/home.nfs/" -v /var/run/docker.sock:/var/run/docker.sock nvcr.io/nvidia/tlt-streamanalytics:v3.0-py3 bash

docker exec -it tlt-leo_detectnet_v2 bash

I run all the steps described in the detectnet_v2 notebook.
When I get to the training step, I use this command from the notebook (the {} placeholders are filled in by .format()):

"tlt detectnet_v2 train -e {}/detectnet_v2_train_resnet18_kitti.txt -r {}/experiment_dir_unpruned -k tlt_encode -n resnet18_detector_unpruned --gpus 2 --gpu_index 13 14".format(specs_tlt_path, user_exp_dir_path)

The problem starts here:

We get two processes training the model (3261216 and 3261217), which is expected, but I don't know why there is another one (3260670) that is using all the GPUs, including 307 MiB on the first one.


A related question: since I only pass devices 13 and 14, in theory tlt train should run with --gpu_index 0,1, but I don't know why it works with indices 13 and 14 instead.
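As a side note on the index question, one way to see which GPU indices a container actually exposes is to list the visible devices from inside it. A small diagnostic sketch (not part of the original thread; guarded so it degrades gracefully on a machine without nvidia-smi):

```shell
#!/bin/sh
# List the GPUs visible in the current environment with their in-context
# indices. Inside a container started with --gpus '"device=13,14"', this
# shows exactly which devices (and which index numbering) the container sees.
if command -v nvidia-smi >/dev/null 2>&1; then
    gpu_list=$(nvidia-smi -L)
else
    gpu_list="nvidia-smi not available on this machine"
fi
echo "$gpu_list"
```

Comparing this output with the --gpu_index values that actually work would show whether the devices are being renumbered inside the container or passed through with their host indices.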

Maybe all of these problems are because of the -v /var/run/docker.sock:/var/run/docker.sock mount that I pass at the beginning, I don't know. I'd like to know if you can suggest an alternative, because if I don't pass the docker socket mount, another problem like this one appears.

Thanks in advance.

When you log in to the docker via "docker exec -it tlt-leo_detectnet_v2 bash", please run commands like "detectnet_v2 train" instead of "tlt detectnet_v2 train".
In the notebook, we assume the end user is running via the tlt launcher from the host PC instead of inside the docker.
So if the end user installs the tlt launcher, they can run "tlt detectnet_v2 train" directly to trigger any task.

But currently you have already logged in to the docker via "docker run xxx" and "docker exec xxx", so please run commands like "detectnet_v2 train xxx" etc.

It looks like it's working, but I have trouble finding the real paths of the spec files and the other files.
When I ran with tlt detectnet_v2, I used "tlt detectnet_v2 run bash" to find the real path, but "detectnet_v2 run bash" is not working. What do you suggest?

On your host PC, yes, you can run "tlt detectnet_v2 run bash". Why would you run "detectnet_v2 run bash"?

Because when I am running

detectnet_v2 train -e /workspace/tlt-experiments/detectnet_v2/specs/detectnet_v2_train_resnet18_kitti.txt  \
                              -r /workspace/tlt-experiments/detectnet_v2/experiment_dir_unpruned \
                              -k tlt_encode -n resnet18_detector_unpruned \
                              --gpus 2 \
                              --gpu_index 0 1

there is an error saying detectnet_v2_train_resnet18_kitti.txt was not found. I tried many paths and none of them work. With "tlt detectnet_v2 run bash" I found it easily in the past.
Is there a way to find the correct path?

Can you run the command below to check if the txt file is available?
$ tlt detectnet_v2 run ls /workspace/tlt-experiments/detectnet_v2/specs/detectnet_v2_train_resnet18_kitti.txt

Actually, I used the local paths, for example /home.nfs/detectnet_v2/specs_data/detectnet_v2_train_resnet18_kitti.txt, and now it works perfectly. I think that was the issue. I trained the model successfully.
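For anyone hitting the same path confusion: the -v flag in the original "docker run" maps the host directory onto a path inside the container, so a host path translates to the in-container path by swapping the mount prefix. A minimal sketch using the mount and spec-file paths quoted in this thread:

```shell
#!/bin/sh
# The original docker run mounted the host directory
#   /home.nfs/leonardo.vera/tlt-train
# at /home.nfs inside the container. Translating a host path to its
# in-container path is just a prefix swap.
HOST_ROOT="/home.nfs/leonardo.vera/tlt-train"
CONTAINER_ROOT="/home.nfs"
host_path="$HOST_ROOT/detectnet_v2/specs_data/detectnet_v2_train_resnet18_kitti.txt"
# Strip the host prefix, then prepend the container mount point.
container_path="$CONTAINER_ROOT${host_path#$HOST_ROOT}"
echo "$container_path"
```

This prints the path that worked above, /home.nfs/detectnet_v2/specs_data/detectnet_v2_train_resnet18_kitti.txt, which is the path to pass to "detectnet_v2 train -e" when running inside the container.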

Let me ask you something: so now it's not necessary to create mount points and define other variables if you use a Docker container directly, right? The same goes for TLT 2.0, I suppose.

Yes, you can try this; it is the same as in TLT 2.0.
But for TLT 3.0, it is recommended to use the tlt launcher and create ~/.tlt_mounts.json for the path mapping.
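For the launcher route, ~/.tlt_mounts.json lists host-to-container path mappings under a "Mounts" key. A minimal sketch (the source path below is the host directory from this thread and the destination is illustrative; substitute your own directories):

```json
{
    "Mounts": [
        {
            "source": "/home.nfs/leonardo.vera/tlt-train",
            "destination": "/workspace/tlt-experiments"
        }
    ]
}
```

With this in place, "tlt detectnet_v2 train" on the host resolves spec and output paths under the destination directory inside the container it launches.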

Last question, about the environment (virtualenv and virtualenvwrapper): is it necessary if I am the only user on my PC?

It is not a must, but it is recommended.