TLT3.0 Container setup

With TLT 1.0 and TLT 2.0, the way to call tlt-train, tlt-prune, tlt-export, etc. was to start the TLT container and then run these commands in a Jupyter notebook or from the command line.
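For context, that TLT 2.0 workflow looks roughly like the sketch below (the image tag, mount paths, spec file, and key are placeholders, not exact values from this thread):

```
# Start the TLT 2.0 container once and keep it running (image tag is an example)
docker run --runtime=nvidia -it --rm \
    -v /data/tlt-experiments:/workspace/tlt-experiments \
    nvcr.io/nvidia/tlt-streamanalytics:v2.0_py3 /bin/bash

# Inside that single container, run the whole pipeline from the same shell
tlt-train detectnet_v2 -e /workspace/tlt-experiments/specs/train.txt \
                       -r /workspace/tlt-experiments/results -k $NGC_KEY
tlt-prune ...    # then prune, retrain, export without leaving the container
tlt-export ...
```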

I haven’t personally tested TLT 3.0 yet, but from the documentation it seems that TLT 3.0 is now distributed as a Python package (the TLT Launcher), and any call to tlt train, tlt prune, or tlt export automatically starts a container with the necessary dependencies and a specific entrypoint.

From: https://docs.nvidia.com/metropolis/TLT/tlt-user-guide/text/tlt_launcher.html:
The tasks are broadly divided into computer vision and conversational AI. For example, DetectNet_v2 is a computer vision task for object detection in TLT which supports subtasks such as train, prune, evaluate, export, etc. When the user executes a command, for example tlt detectnet_v2 train --help, the TLT launcher does the following:

  1. Pulls the required TLT container with the entrypoint for DetectNet_v2
  2. Creates an instance of the container
  3. Runs the detectnet_v2 entrypoint with the train sub-task
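
In practice, the launcher-based workflow on the host looks something like this sketch, based on my reading of the docs (not yet tested); the spec path, results path, and key are placeholders:

```
# The launcher itself is a Python package installed on the host
pip3 install nvidia-pyindex
pip3 install nvidia-tlt

# Each command below pulls/starts the matching TLT container behind the scenes
tlt detectnet_v2 train --help
tlt detectnet_v2 train -e /workspace/tlt-experiments/specs/train.txt \
                       -r /workspace/tlt-experiments/results -k $NGC_KEY
```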

My questions relate to executing TLT tasks from within the same container. To give you an example, if we want to train a model with TLT 2.0 in a cloud GPU environment, we can just start the TLT 2.0 container on the VM and execute all steps of the training pipeline from that same container. With 3.0, however, since each command seems to instantiate and run a new container, we either have to run a container inside a container or handle persistence of large datasets across multiple short-lived containers.

Would it be possible to run the TLT launcher from within a custom container? This would mean we’d be launching the individual TLT containers inside the custom container.

OR, is it possible for us to manually start the TLT container (with no entrypoint) rather than launching it from the TLT launcher? In this setup, we could then run commands within this TLT container as normal, just like we currently do with TLT 2.0 and TLT 1.0.

Thanks and I look forward to continued developments with the TLT/DS workflow!

Yes, you can. For docker-in-docker, you can find some tips in https://docs.nvidia.com/datacenter/cloud-native/playground/dind.html#docker-in-docker
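One common way to set this up (a sketch only, not prescribed by the linked page) is to mount the host's Docker socket into the custom container so the TLT launcher inside it starts sibling containers on the host; the image name and data path here are placeholders:

```
# Start the custom container with access to the host Docker daemon
# (the image needs the docker CLI available; the launcher is installed below)
docker run -it --rm \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v /data/tlt-experiments:/data/tlt-experiments \
    my-custom-image:latest /bin/bash

# Inside the custom container: install the launcher and drive the pipeline as usual
pip3 install nvidia-pyindex nvidia-tlt
tlt detectnet_v2 train --help
```

Note that containers started through the host socket are siblings of the custom container rather than children, so any paths referenced in the launcher's mounts file must exist on the host, not only inside the custom container.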

We strongly recommend you follow the steps described in the TLT Launcher documentation: TLT Launcher — Transfer Learning Toolkit 3.0.
If you really want to log in to the container to debug something, you can, for example, run tlt ssd run /bin/bash to open a shell and have a look.
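
A minimal sketch of that kind of interactive session, assuming the ssd task entrypoint behaves inside the container the way it does through the launcher (spec path, results path, and key are placeholders):

```
# Open an interactive shell in the TLT container used for the ssd task
tlt ssd run /bin/bash

# Inside the container, the task entrypoint can be called directly,
# similar to the TLT 1.0/2.0 workflow
ssd train -e /workspace/tlt-experiments/specs/ssd_train.txt \
          -r /workspace/tlt-experiments/results -k $NGC_KEY
```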

Ok, I’ll take a look at both options. The reason I’m asking is that if we want to train/prune/retrain/export entirely in a serverless cloud environment and don’t have one persistent container, the training data would need to be downloaded twice, into both the train and retrain containers, which is inefficient. If we can run all commands from the same container, the data pipeline becomes more efficient.

All the training data can be put into one path. Then create ~/.tlt_mounts.json to map it to a path inside the docker.
See TLT Launcher — Transfer Learning Toolkit 3.0 documentation.
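
A minimal sketch of such a mounts file, assuming the dataset lives under /data/tlt-experiments on the host (both paths are placeholders):

```
# Write ~/.tlt_mounts.json so every launcher-started container sees the same data
cat > ~/.tlt_mounts.json <<'EOF'
{
    "Mounts": [
        {
            "source": "/data/tlt-experiments",
            "destination": "/workspace/tlt-experiments"
        }
    ]
}
EOF
```

With this in place, each tlt task invocation mounts the same host directory into its container, so a large dataset only needs to be downloaded to the host once, even though the train and retrain steps run in separate short-lived containers.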