Issue: training detectnet_v2 in TLT 3.0 inside a Docker container uses all GPUs

Please provide the following information when requesting support.

• Hardware : server with V100 GPUs
• Network Type : Detectnet_v2
• TLT Version : v3.0-py3

Hi, I have a problem: tlt train uses all GPUs to train a model even though I only specified two of them.

Steps to reproduce:
I am running TLT 3.0 inside a Docker container using only devices 13 and 14, started with this command:

docker run -d --name tlt-leo_detectnet_v2 -it --rm --gpus '"device=13,14"' -p 4444:4444 -v "/home.nfs/leonardo.vera/tlt-train":"/home.nfs/" -v /var/run/docker.sock:/var/run/docker.sock nvcr.io/nvidia/tlt-streamanalytics:v3.0-py3 bash

docker exec -it tlt-leo_detectnet_v2 bash

I run all the steps described in the detectnet_v2 notebook.
When I get to the training step, I use this command from the notebook (the {} placeholders are filled in by .format()):

"tlt detectnet_v2 train -e {}/detectnet_v2_train_resnet18_kitti.txt -r {}/experiment_dir_unpruned -k tlt_encode -n resnet18_detector_unpruned --gpus 2 --gpu_index 13 14".format(specs_tlt_path, user_exp_dir_path)

The problem starts here:

We get two processes training the model (3261216 and 3261217), which is expected, but I don't know why there is another one (3260670) that is using all the GPUs, including 307 MiB on the first one.


A related question: since I only pass devices 13 and 14, in theory tlt train should run with --gpu_index 0,1, but I don't know why it works with indices 13 and 14 instead.
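As a side note on the index question, one way to see which GPU indices a container actually exposes is to list the visible devices from inside it. A small diagnostic sketch (not part of the original thread; guarded so it degrades gracefully on a machine without nvidia-smi):

```shell
#!/bin/sh
# List the GPUs visible in the current environment with their in-context
# indices. Inside a container started with --gpus '"device=13,14"', this
# shows exactly which devices (and which index numbering) the container sees.
if command -v nvidia-smi >/dev/null 2>&1; then
    gpu_list=$(nvidia-smi -L)
else
    gpu_list="nvidia-smi not available on this machine"
fi
echo "$gpu_list"
```

Comparing this output with the --gpu_index values that actually work would show whether the devices are being renumbered inside the container or passed through with their host indices.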

Maybe all of these problems are because of the -v /var/run/docker.sock:/var/run/docker.sock mount that I pass at the beginning, I don't know. I'd like to know if you can suggest an alternative, because if I don't pass the docker socket mount, another problem like this one appears.

Thanks in advance.

When you log in to the docker via "docker exec -it tlt-leo_detectnet_v2 bash", please run commands like "detectnet_v2 train" instead of "tlt detectnet_v2 train".
In the notebook, we assume the end user is running via the tlt launcher from the host PC instead of inside the docker.
So if the end user installs the tlt launcher, they can run "tlt detectnet_v2 train" directly to trigger any task.

But currently you have already logged in to the docker via "docker run xxx" and "docker exec xxx", so please run commands like "detectnet_v2 train xxx" etc.

It looks like it's working, but I have trouble finding the real paths of the spec files and the other files.
When I ran with tlt detectnet_v2, I used "tlt detectnet_v2 run bash" to find the real path, but "detectnet_v2 run bash" is not working. What do you suggest?

On your host PC, yes, you can run "tlt detectnet_v2 run bash". Why would you run "detectnet_v2 run bash"?

Because when I am running

detectnet_v2 train -e /workspace/tlt-experiments/detectnet_v2/specs/detectnet_v2_train_resnet18_kitti.txt  \
                              -r /workspace/tlt-experiments/detectnet_v2/experiment_dir_unpruned \
                              -k tlt_encode -n resnet18_detector_unpruned \
                              --gpus 2 \
                              --gpu_index 0 1

there is an error saying detectnet_v2_train_resnet18_kitti.txt was not found. I tried many paths and none of them work. With "tlt detectnet_v2 run bash" I found it easily in the past.
Is there a way to find the correct path?

Can you run the command below to check if the txt file is available?
$ tlt detectnet_v2 run ls /workspace/tlt-experiments/detectnet_v2/specs/detectnet_v2_train_resnet18_kitti.txt

Actually, I used the local paths, for example /home.nfs/detectnet_v2/specs_data/detectnet_v2_train_resnet18_kitti.txt, and now it works perfectly. I think that was the issue. I trained the model successfully.
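For anyone hitting the same path confusion: the -v flag in the original "docker run" maps the host directory onto a path inside the container, so a host path translates to the in-container path by swapping the mount prefix. A minimal sketch using the mount and spec-file paths quoted in this thread:

```shell
#!/bin/sh
# The original docker run mounted the host directory
#   /home.nfs/leonardo.vera/tlt-train
# at /home.nfs inside the container. Translating a host path to its
# in-container path is just a prefix swap.
HOST_ROOT="/home.nfs/leonardo.vera/tlt-train"
CONTAINER_ROOT="/home.nfs"
host_path="$HOST_ROOT/detectnet_v2/specs_data/detectnet_v2_train_resnet18_kitti.txt"
# Strip the host prefix, then prepend the container mount point.
container_path="$CONTAINER_ROOT${host_path#$HOST_ROOT}"
echo "$container_path"
```

This prints the path that worked above, /home.nfs/detectnet_v2/specs_data/detectnet_v2_train_resnet18_kitti.txt, which is the path to pass to "detectnet_v2 train -e" when running inside the container.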

Let me ask you something: so now it's not necessary to create mount points and define other variables if you use a Docker container directly, right? The same goes for TLT 2.0, I suppose.

Yes, you can try this; it is the same as in TLT 2.0.
But for TLT 3.0, it is recommended to use the tlt launcher and create ~/.tlt_mounts.json for the path mapping.
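For the launcher route, ~/.tlt_mounts.json lists host-to-container path mappings under a "Mounts" key. A minimal sketch (the source path below is the host directory from this thread and the destination is illustrative; substitute your own directories):

```json
{
    "Mounts": [
        {
            "source": "/home.nfs/leonardo.vera/tlt-train",
            "destination": "/workspace/tlt-experiments"
        }
    ]
}
```

With this in place, "tlt detectnet_v2 train" on the host resolves spec and output paths under the destination directory inside the container it launches.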

Last question, about the environment (virtualenv and virtualenvwrapper): is it necessary if I am the only user on my PC?

It is not a must, but it is recommended.