TLT for Jetson Nano with JetPack 4.5 classification notebook

I’m trying to run the example TLT classification notebook from the TLT CV samples package (downloaded with `wget https://api.ngc.nvidia.com/v2/resources/nvidia/tlt_cv_samples/versions/v1.0.2/zip -O tlt_cv_samples_v1.0.2.zip`).
Trying to run:
!ngc registry model list nvidia/tlt_pretrained_classification:*
I get the following error:
/bin/bash: /home/nvidia/tlt_cv_samples_v1.0.2/ngccli/ngc: cannot execute binary file: Exec format error

Can anyone help me with this? What information do you need to solve it?

Okay, so the notebook doesn’t seem to be meant for Jetson, because it installs the wrong ngccli. I replaced it with the correct one, and now I’m trying to train the model with:
!tlt classification train -e $SPECS_DIR/classification_spec.cfg -r $USER_EXPERIMENT_DIR/output -k $KEY
but it gives me errors:
2021-04-30 14:08:23,431 [INFO] root: Registry: ['nvcr.io']
2021-04-30 14:08:23,733 [INFO] tlt.components.docker_handler.docker_handler: The required docker doesn't exist locally/the manifest has changed. Pulling a new docker.
2021-04-30 14:08:23,734 [INFO] tlt.components.docker_handler.docker_handler: Pulling the required container. This may take several minutes if you're doing this for the first time. Please wait here.

Repository name: nvcr.io/nvidia/tlt-streamanalytics
Docker pull failed. 404 Client Error: Not Found ("manifest for nvcr.io/nvidia/tlt-streamanalytics:v3.0-py3 not found: manifest unknown: manifest unknown")

What I don’t understand is: why is it now trying to pull a docker image? Could the problem be in the config file?

Please trigger the Jupyter notebook on your host PC instead of the Nano.
Also, please see TLT Launcher — Transfer Learning Toolkit 3.0 documentation

Log in to the NGC docker registry (nvcr.io) with an API key that you can generate from ngc.nvidia.com.

docker login nvcr.io
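In case it helps, a non-interactive login sketch (assuming your API key is stored in an `NGC_API_KEY` environment variable, which is my naming choice; the username for nvcr.io is the literal string `$oauthtoken` per NGC’s convention):

```shell
# Assumption: NGC_API_KEY holds the API key generated at ngc.nvidia.com.
# The username for nvcr.io is always the literal string '$oauthtoken'
# (single-quoted so the shell does not expand it as a variable).
echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin
```

Piping the key via `--password-stdin` avoids leaving it in your shell history.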

Hi @Morganh,
Thanks for your prompt reply.
I am also facing the same problem here. When I simply try to run tlt detectnet_v2 --help, it shows the same error.

Problem
The output is as follows:
(launcher) redwan@Desktop:~$ tlt detectnet_v2 --help
2021-05-01 00:08:49,806 [INFO] root: Registry: ['nvcr.io']
2021-05-01 00:08:49,889 [INFO] tlt.components.docker_handler.docker_handler: The required docker doesn't exist locally/the manifest has changed. Pulling a new docker.
2021-05-01 00:08:49,889 [INFO] tlt.components.docker_handler.docker_handler: Pulling the required container. This may take several minutes if you're doing this for the first time. Please wait here.

Repository name: nvcr.io/nvidia/tlt-streamanalytics

Docker pull failed. 404 Client Error: Not Found ("manifest for nvcr.io/nvidia/tlt-streamanalytics:v3.0-py3 not found: manifest unknown: manifest unknown")
Root cause:

  • The launcher is trying to pull a docker image named nvcr.io/nvidia/tlt-streamanalytics:v3.0-py3, which doesn’t actually exist. The closest tag that does exist is nvcr.io/nvidia/tlt-streamanalytics:v3.0-dp-py3 in the official repo https://ngc.nvidia.com/catalog/containers/nvidia:tlt-streamanalytics/tags
  • I don’t know how to tell my machine to look for the tag v3.0-dp-py3 instead of v3.0-py3.

Thanks in advance…

I had the same problem. Here is a temporary workaround:

$ nvidia-docker tag nvcr.io/nvidia/tlt-streamanalytics:v3.0-dp-py3 nvcr.io/nvidia/tlt-streamanalytics:v3.0-py3
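To sketch the full workaround in context (assuming plain `docker` works the same as `nvidia-docker` for tagging, which it should, since `nvidia-docker` is a wrapper): pull the tag that exists, re-tag it locally under the name the launcher expects, then confirm both tags point at the same image ID:

```shell
# Pull the tag that actually exists on NGC
docker pull nvcr.io/nvidia/tlt-streamanalytics:v3.0-dp-py3

# Re-tag it locally under the name the TLT launcher looks for
docker tag nvcr.io/nvidia/tlt-streamanalytics:v3.0-dp-py3 \
           nvcr.io/nvidia/tlt-streamanalytics:v3.0-py3

# Both tags should now show the same IMAGE ID
docker images nvcr.io/nvidia/tlt-streamanalytics
```

`docker tag` only creates a local alias; nothing is re-downloaded, and the launcher finds the image locally instead of attempting the failing pull.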

Can you run

tlt info

and paste its result?

Thank you very much for your reply. I did the login to nvcr.io anyway, and I also followed the documentation in the link.
Now, why do I need to trigger the Jupyter notebook on my host PC instead of the Nano?
Oh and tlt info gives me:
"Configuration of the TLT Instance
dockers: ['nvidia/tlt-streamanalytics', 'nvidia/tlt-pytorch']
format_version: 1.0
tlt_version: 3.0
published_date: 04/16/2021"

@fpruess

For your above issue, “Exec format error”: it results from the platform you are running on. Currently you are running on the Nano instead of a host PC. For your case, please download the arm64 version of ngc. See Requirements and Installation — Transfer Learning Toolkit 3.0 documentation

wget -O ngccli_cat_arm64.zip https://ngc.nvidia.com/downloads/ngccli_cat_arm64.zip && unzip -o ngccli_cat_arm64.zip && chmod u+x ngc
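As a quick sanity check, `file` reports the architecture a binary was built for; on the Nano the ngc binary should report ARM aarch64, not x86-64 (an x86-64 binary on the Nano’s aarch64 CPU is exactly what produces “Exec format error”):

```shell
# Inspect the binary's target architecture. On a correct arm64 install,
# the output should mention something like "ELF 64-bit LSB executable, ARM aarch64".
# If it says "x86-64" instead, you downloaded the wrong ngccli package.
file ngc
```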

Usually, end users run TLT training and trigger the notebook on a host PC instead of a Jetson platform. See the requirements at Requirements and Installation — Transfer Learning Toolkit 3.0 documentation or Requirements and Installation — Transfer Learning Toolkit 3.0 documentation

@redwankarimsony1455 @frederikschoeller
Thanks for your finding. There seems to be a regression issue with the tag. Currently, only v3.0-dp-py3 is released. If you run tlt info --verbose and find that the tag is not v3.0-dp-py3, please follow @frederikschoeller 's workaround to fix it. I will sync with the internal team too.

@Morganh

Thank you again! I’m sorry, I’m a total beginner and that’s probably the problem here… So my stupid question: how do I run it on my host PC? Do I do everything on my normal laptop, or do I connect the Nano to my laptop? And then?
Besides that, I pulled the correct image (v3.0-dp-py3), but when I run @frederikschoeller 's command, it succeeds, yet the tag in tlt info --verbose doesn’t change. I’m also now getting a completely different error when trying to run the training with "!tlt classification train -e $SPECS_DIR/classification_spec.cfg -r $USER_EXPERIMENT_DIR/output -k $KEY", probably caused by pulling the image. The error message is:
2021-05-03 10:05:28,079 [INFO] root: Registry: ['nvcr.io']
Docker instantiation failed with error: 500 Server Error: Internal Server Error ("OCI runtime create failed: container_linux.go:367: starting container process caused: process_linux.go:495: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver error: failed to process request: unknown")

Thank you very much in advance and I’m really sorry about all these stupid questions.

@fpruess
See Requirements and Installation — Transfer Learning Toolkit 3.0 documentation

The TLT is designed to run on x86 systems with an NVIDIA GPU (e.g., GPU-powered workstation, DGX system) or can be run in any cloud with an NVIDIA GPU. For inference, models can be deployed on any edge device such as an embedded Jetson platform or in a data center with GPUs like T4 or A100.

So, when you run TLT training, it is recommended to run on your host PC or a cloud instance.
After training, please copy the etlt model to your host PC or your Nano to run inference, or generate the TensorRT engine directly on your host PC or your Nano to run inference.
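For example, copying a trained .etlt model from the training machine over to the Nano might look like this (the paths and the Nano’s address are hypothetical placeholders; adjust them to your setup):

```shell
# Hypothetical paths: replace the experiment output directory and the
# Nano's hostname/IP with your own values.
scp ~/tlt_experiments/classification/output/weights/final_model.etlt \
    nvidia@jetson-nano.local:/home/nvidia/models/
```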

For the above error: where did you run it, on your Nano?

For your latest error, please refer to Error while running TLT Docker for help.

Yes, I have seen the requirements, and I don’t have any host PC with Ubuntu installed. Trying to make it work on Windows seems even more complicated. I am a total beginner in software development and unable to find a tutorial that I can get to work. All I actually wanted to do was train a network for classification using transfer learning and then run it on the Nano using DeepStream. But I can’t seem to get anything to work; I’m just stumbling from error to error and outdated documentation.
If someone can recommend a tutorial that includes all the steps, I would be so thankful. It can’t possibly be that hard. I have been trying for weeks now.

If you don’t have any host PC with Ubuntu installed, I’m afraid you will have to train on the cloud, such as GCP or AWS.
For how to set up GCP or AWS, please search for similar topics in the TLT forum, such as: Train instance segmentation model in google cloud instance with custom data and run inference on Jetson AGX Xavier with TensorRT

TLT 3.0 user guide describes the steps at
https://docs.nvidia.com/metropolis/TLT/tlt-user-guide/text/requirements_and_installation.html
https://docs.nvidia.com/metropolis/TLT/tlt-user-guide/text/tlt_launcher.html
https://docs.nvidia.com/metropolis/TLT/tlt-user-guide/text/tlt_cv_inf_pipeline/requirements_and_installation.html