TLT3.0 Container setup

With TLT 1.0 and TLT 2.0, the way to call tlt-train, tlt-prune, tlt-export, etc. was to start the TLT container and then run these commands in a Jupyter notebook or from the command line.
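For context, that TLT 2.0 workflow looks roughly like the sketch below (the image tag, mount paths, spec file, and key are placeholders, not exact values from this thread):

```
# Start the TLT 2.0 container once and keep it running (image tag is an example)
docker run --runtime=nvidia -it --rm \
    -v /data/tlt-experiments:/workspace/tlt-experiments \
    nvcr.io/nvidia/tlt-streamanalytics:v2.0_py3 /bin/bash

# Inside that single container, run the whole pipeline from the same shell
tlt-train detectnet_v2 -e /workspace/tlt-experiments/specs/train.txt \
                       -r /workspace/tlt-experiments/results -k $NGC_KEY
tlt-prune ...    # then prune, retrain, export without leaving the container
tlt-export ...
```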

I haven’t personally tested TLT 3.0 yet, but from the documentation it seems that TLT 3.0 is now distributed as a Python package (the TLT Launcher), and any call to tlt train, tlt prune, or tlt export automatically starts a container with the necessary dependencies and a specific entrypoint.

From: https://docs.nvidia.com/metropolis/TLT/tlt-user-guide/text/tlt_launcher.html:
The tasks are broadly divided into computer vision and conversational AI. For example, DetectNet_v2 is a computer vision task for object detection in TLT which supports subtasks such as train, prune, evaluate, export, etc. When the user executes a command, for example tlt detectnet_v2 train --help, the TLT launcher does the following:

  1. Pulls the required TLT container with the entrypoint for DetectNet_v2
  2. Creates an instance of the container
  3. Runs the detectnet_v2 entrypoint with the train sub-task
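
In practice, the launcher-based workflow on the host looks something like this sketch, based on my reading of the docs (not yet tested); the spec path, results path, and key are placeholders:

```
# The launcher itself is a Python package installed on the host
pip3 install nvidia-pyindex
pip3 install nvidia-tlt

# Each command below pulls/starts the matching TLT container behind the scenes
tlt detectnet_v2 train --help
tlt detectnet_v2 train -e /workspace/tlt-experiments/specs/train.txt \
                       -r /workspace/tlt-experiments/results -k $NGC_KEY
```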

My questions relate to executing TLT tasks from within the same container. To give you an example, if we want to train a model with TLT 2.0 in a cloud GPU environment, we can just start the TLT 2.0 container on the VM and execute all steps of the training pipeline from that same container. With 3.0, however, since each command seems to instantiate and run a new container, we either have to run a container inside a container or handle persistence of large datasets across multiple short-lived containers.

Would it be possible to run the TLT launcher from within a custom container? This would mean we’d be launching the individual TLT containers inside the custom container.

OR, is it possible for us to manually start the TLT container (with no entrypoint) rather than launching it from the TLT launcher? In this setup, we could then run commands within this TLT container as normal, just like we currently do with TLT 2.0 and TLT 1.0.

Thanks and I look forward to continued developments with the TLT/DS workflow!

Yes, you can. For docker-in-docker, you can find some tips in https://docs.nvidia.com/datacenter/cloud-native/playground/dind.html#docker-in-docker
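One common way to set this up (a sketch only, not prescribed by the linked page) is to mount the host's Docker socket into the custom container so the TLT launcher inside it starts sibling containers on the host; the image name and data path here are placeholders:

```
# Start the custom container with access to the host Docker daemon
# (the image needs the docker CLI available; the launcher is installed below)
docker run -it --rm \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v /data/tlt-experiments:/data/tlt-experiments \
    my-custom-image:latest /bin/bash

# Inside the custom container: install the launcher and drive the pipeline as usual
pip3 install nvidia-pyindex nvidia-tlt
tlt detectnet_v2 train --help
```

Note that containers started through the host socket are siblings of the custom container rather than children, so any paths referenced in the launcher's mounts file must exist on the host, not only inside the custom container.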

We strongly recommend you follow the steps described in the TLT Launcher documentation: TLT Launcher — Transfer Learning Toolkit 3.0.
If you really want to log in to the container to debug something, you can, for example, run tlt ssd run /bin/bash to open a shell and have a look.
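
A minimal sketch of that kind of interactive session, assuming the ssd task entrypoint behaves inside the container the way it does through the launcher (spec path, results path, and key are placeholders):

```
# Open an interactive shell in the TLT container used for the ssd task
tlt ssd run /bin/bash

# Inside the container, the task entrypoint can be called directly,
# similar to the TLT 1.0/2.0 workflow
ssd train -e /workspace/tlt-experiments/specs/ssd_train.txt \
          -r /workspace/tlt-experiments/results -k $NGC_KEY
```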

Ok, I’ll take a look at both options. The reason I’m asking is that if we want to train/prune/retrain/export entirely in a serverless cloud environment and don’t have one persistent container, the training data would need to be downloaded twice, into both the train and retrain containers, which is inefficient. If we can run all commands from the same container, the data pipeline becomes more efficient.

All the training data can be put into one path. Then create ~/.tlt_mounts.json to map it to a path inside the docker.
See TLT Launcher — Transfer Learning Toolkit 3.0 documentation.
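
A minimal sketch of such a mounts file, assuming the dataset lives under /data/tlt-experiments on the host (both paths are placeholders):

```
# Write ~/.tlt_mounts.json so every launcher-started container sees the same data
cat > ~/.tlt_mounts.json <<'EOF'
{
    "Mounts": [
        {
            "source": "/data/tlt-experiments",
            "destination": "/workspace/tlt-experiments"
        }
    ]
}
EOF
```

With this in place, each tlt task invocation mounts the same host directory into its container, so a large dataset only needs to be downloaded to the host once, even though the train and retrain steps run in separate short-lived containers.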