TAO container fails on Google Vertex AI

Bro, are you serious? You haven't once addressed the actual topic of this thread.

Can you follow Running TAO Toolkit on Google Cloud Platform — TAO Toolkit 3.22.05 documentation in Google Vertex AI?

Can you explain the breaking change from TLT 2 to TAO 3?

I already answered that in a previous post. Maybe you should read through what we’ve discussed already??

Please see Migrating to TAO Toolkit — TAO Toolkit 3.22.05 documentation.
According to the comments we discussed above, the error may result from nvidia-docker, so it is not related to the migration from TLT 2 to TAO 3.

“The error may result from nvidia-docker”: what does that mean?

Your comment above gives the hint. Please install nvidia-docker2.
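
In case it helps, a minimal install sketch for Ubuntu 18.04 (assuming docker-ce is already installed; the repository setup and package name follow NVIDIA's published nvidia-docker2 instructions):

# Add the nvidia-docker apt repository (Ubuntu 18.04 assumed)
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
    sudo tee /etc/apt/sources.list.d/nvidia-docker.list

# Install nvidia-docker2 and restart the docker daemon so the nvidia runtime is registered
sudo apt-get update && sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker

# Sanity check: a container should be able to see the GPU
sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi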

So you can’t explain the breaking change??

For the changes between TLT 2 and TAO 3, please see Migrating to TAO Toolkit — TAO Toolkit 3.0 documentation.

Was nvidia-docker2 not required for TLT 2? Is that the breaking change? Your link is not helpful to my inquiry.

See Requirements and Installation — Transfer Learning Toolkit 2.0 documentation; it is needed there as well.
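
For context, this is roughly how the 2.0 documentation has you launch the container locally, and it already relies on the nvidia runtime that nvidia-docker2 provides (the image tag below is my best guess at the 2.0 GA tag on NGC; use whatever tag you actually deploy):

# Pull the TLT 2.0 container and start it with the nvidia runtime
docker pull nvcr.io/nvidia/tlt-streamanalytics:v2.0_py3
docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tlt-streamanalytics:v2.0_py3 /bin/bash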

I am asking for your help and guidance and you are just giving me links to documents I’ve already read.

Can you help me with this?

I need more info from your side.
TLT 2.0 container + Google Vertex AI: Was it successful? Can you share more detail on how you triggered the 2.0 container?

TAO 3.0 container + Google Vertex AI: Was it successful? Can you share more detail on how you triggered the TAO 3.0 container?

And did you ever try the TLT 3.0 dp container + Google Vertex AI? See Transfer Learning Toolkit for Video Streaming Analytics | NVIDIA NGC.
The TLT 3.0 dp version was released on 2/10/2021. Pull it with: docker pull nvcr.io/nvidia/tlt-streamanalytics:v3.0-dp-py3

To run a training job in Vertex AI I follow the documentation here: Create a model using custom training | Vertex AI | Google Cloud.

Basically I upload the container to the cloud, then launch a job with Vertex AI telling it which container to use and what command to run inside the container.
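
For concreteness, a rough sketch of that flow with the gcloud CLI; the project, registry path, machine type, spec file, and key below are placeholders, the TAO image tag is an assumption, and the job spec fields follow Vertex AI's CustomJob API:

# Push the TAO container to a registry that Vertex AI can pull from
docker tag nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.08-py3 gcr.io/MY_PROJECT/tao-toolkit:v3.21.08
docker push gcr.io/MY_PROJECT/tao-toolkit:v3.21.08

# Describe the worker pool: which container to run and what command to run inside it
cat > job.yaml <<'EOF'
workerPoolSpecs:
  - machineSpec:
      machineType: n1-standard-8
      acceleratorType: NVIDIA_TESLA_T4
      acceleratorCount: 1
    replicaCount: 1
    containerSpec:
      imageUri: gcr.io/MY_PROJECT/tao-toolkit:v3.21.08
      command: ["classification", "train"]
      args: ["-e", "/workspace/specs/classification_spec.cfg", "-r", "/workspace/results", "-k", "MY_KEY"]
EOF

# Launch the custom training job
gcloud ai custom-jobs create --region=us-central1 --display-name=tao-classification --config=job.yaml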

I trigger the container the same way for both 2.0 and 3.0; the only things that change are the name of the container and the commands to run, i.e. classification train instead of tlt-train classification.
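
Spelled out (spec file, results directory, and key are placeholders), the command run inside the container looks like this in each version:

# TLT 2.0 container
tlt-train classification -e /workspace/specs/classification_spec.cfg -r /workspace/results -k MY_KEY

# TAO 3.0 container: same task, new task-first command layout
classification train -e /workspace/specs/classification_spec.cfg -r /workspace/results -k MY_KEY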

Again, everything in my process is the same for 2.0 and 3.0 except for the container. But only 2.0 works and 3.0 does not work.

@Morganh can you think of what would cause this breaking change between 2.0 and 3.0?

What is TLT 3.0 dp? Developer preview? Why would I use that instead of the latest version of TAO?

Yes, dp means developer preview. I asked just to narrow down the issue.
For the breaking changes between 2.0 and 3.0, besides Migrating to TAO Toolkit — TAO Toolkit 3.22.05 documentation, please note that the requirements have also changed.
For 2.0, see Requirements and Installation — Transfer Learning Toolkit 2.0 documentation: install NVIDIA GPU driver v410.xx or above.

For the latest TAO, see TAO Toolkit Quick Start Guide — TAO Toolkit 3.22.05 documentation. A quick way to check these requirements on the host is sketched after the table.

Software                   Version
Ubuntu 18.04 LTS           18.04
python                     >=3.6.9
docker-ce                  >19.03.5
docker-API                 1.40
nvidia-container-toolkit   >1.3.0-1
nvidia-container-runtime   3.4.0-1
nvidia-docker2             2.5.0-1
nvidia-driver              >455
python-pip                 >21.06
nvidia-pyindex
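
A quick way to check most of these on the machine that actually runs the container (package names as in the table above; this assumes a Debian/Ubuntu host):

# Driver version and GPU visibility
nvidia-smi

# Docker engine and API versions
docker version | grep -iE 'version|api'

# NVIDIA container stack packages
dpkg -l | grep -E 'nvidia-docker2|nvidia-container-toolkit|nvidia-container-runtime'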

Same issue with the 3.0 dp container.

@Morganh Are those dependencies causing the ImportError: cannot import name 'ONNXEngineBuilder' exception?

I am afraid that is the culprit.
See some results from my side after triggering the TAO v3.21.08 docker.

morganh@dl:~$ docker exec -it cf0924e65ced /bin/bash
root@cf0924e65ced:/workspace#
root@cf0924e65ced:/workspace# python
Python 3.6.9 (default, Jan 26 2021, 15:33:00)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from modulus.export._tensorrt import ONNXEngineBuilder
2021-10-11 16:36:35.719922: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
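
If it helps to rule out the image itself, the same import check can be run non-interactively against the pulled image; the repository path and tag below are my assumption for the 3.21.08 TF container, so substitute whatever image you actually deploy to Vertex AI:

docker run --rm --gpus all nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.08-py3 \
    python -c "from modulus.export._tensorrt import ONNXEngineBuilder; print('ONNXEngineBuilder import OK')"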

@Morganh BTW that migration guide does not mention the breaking changes in the spec file formats.