TAO container fails on Google Vertex AI

Bro, are you serious? You haven't once addressed the actual topic of this thread.

Can you follow Running TAO Toolkit on Google Cloud Platform — TAO Toolkit 3.22.05 documentation in Google Vertex AI?

Can you explain the breaking change from TLT 2 to TAO 3?

I already answered that in a previous post. Maybe you should read through what we’ve discussed already??

Please see Migrating to TAO Toolkit — TAO Toolkit 3.22.05 documentation.
According to the comments we discussed above, the error may result from nvidia-docker, so it is not related to the migration from TLT 2 to TAO 3.

“The error may result from nvidia-docker”: what does that mean?

Your comment above gives the hint. Please install nvidia-docker2.
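
In case it helps, a minimal install sketch for Ubuntu 18.04 (assuming docker-ce is already installed; the repository setup and package name follow NVIDIA's published nvidia-docker2 instructions):

# Add the nvidia-docker apt repository (Ubuntu 18.04 assumed)
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
    sudo tee /etc/apt/sources.list.d/nvidia-docker.list

# Install nvidia-docker2 and restart the docker daemon so the nvidia runtime is registered
sudo apt-get update && sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker

# Sanity check: a container should be able to see the GPU
sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi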

So you can’t explain the breaking change??

For the changes between TLT 2 and TAO 3, please see Migrating to TAO Toolkit — TAO Toolkit 3.0 documentation.

Was nvidia-docker2 not required for TLT 2? Is that the breaking change? Your link is not helpful to my inquiry.

See Requirements and Installation — Transfer Learning Toolkit 2.0 documentation; it is needed there as well.
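
For context, this is roughly how the 2.0 documentation has you launch the container locally, and it already relies on the nvidia runtime that nvidia-docker2 provides (the image tag below is my best guess at the 2.0 GA tag on NGC; use whatever tag you actually deploy):

# Pull the TLT 2.0 container and start it with the nvidia runtime
docker pull nvcr.io/nvidia/tlt-streamanalytics:v2.0_py3
docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tlt-streamanalytics:v2.0_py3 /bin/bash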

I am asking for your help and guidance and you are just giving me links to documents I’ve already read.

Can you help me with this?

I need more info from your side.
TLT 2.0 container + Google Vertex AI: Was it successful? Can you share more detail on how you triggered the 2.0 container?

TAO 3.0 container + Google Vertex AI: Was it successful? Can you share more detail on how you triggered the TAO 3.0 container?

And did you ever try the TLT 3.0 dp container + Google Vertex AI? See Transfer Learning Toolkit for Video Streaming Analytics | NVIDIA NGC.
The TLT 3.0 dp version was released on 2/10/2021. Pull it with: docker pull nvcr.io/nvidia/tlt-streamanalytics:v3.0-dp-py3

To run a training job in Vertex AI I follow the documentation here: Create a model using custom training | Vertex AI | Google Cloud.

Basically I upload the container to the cloud, then launch a job with Vertex AI telling it which container to use and what command to run inside the container.
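
For concreteness, a rough sketch of that flow with the gcloud CLI; the project, registry path, machine type, spec file, and key below are placeholders, the TAO image tag is an assumption, and the job spec fields follow Vertex AI's CustomJob API:

# Push the TAO container to a registry that Vertex AI can pull from
docker tag nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.08-py3 gcr.io/MY_PROJECT/tao-toolkit:v3.21.08
docker push gcr.io/MY_PROJECT/tao-toolkit:v3.21.08

# Describe the worker pool: which container to run and what command to run inside it
cat > job.yaml <<'EOF'
workerPoolSpecs:
  - machineSpec:
      machineType: n1-standard-8
      acceleratorType: NVIDIA_TESLA_T4
      acceleratorCount: 1
    replicaCount: 1
    containerSpec:
      imageUri: gcr.io/MY_PROJECT/tao-toolkit:v3.21.08
      command: ["classification", "train"]
      args: ["-e", "/workspace/specs/classification_spec.cfg", "-r", "/workspace/results", "-k", "MY_KEY"]
EOF

# Launch the custom training job
gcloud ai custom-jobs create --region=us-central1 --display-name=tao-classification --config=job.yaml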

I trigger the container the same way for both 2.0 and 3.0; the only things that change are the name of the container and the commands to run, i.e. classification train instead of tlt-train classification.
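
Spelled out (spec file, results directory, and key are placeholders), the command run inside the container looks like this in each version:

# TLT 2.0 container
tlt-train classification -e /workspace/specs/classification_spec.cfg -r /workspace/results -k MY_KEY

# TAO 3.0 container: same task, new task-first command layout
classification train -e /workspace/specs/classification_spec.cfg -r /workspace/results -k MY_KEY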

Again, everything in my process is the same for 2.0 and 3.0 except for the container. But only 2.0 works and 3.0 does not work.

@Morganh can you think of what would cause this breaking change between 2.0 and 3.0?

What is TLT 3.0 dp? Developer preview? Why would I use that instead of the latest version of TAO?

Yes, dp means developer preview. I asked just to narrow down the issue.
For the breaking changes between 2.0 and 3.0, besides Migrating to TAO Toolkit — TAO Toolkit 3.22.05 documentation, please note that the requirements have also changed.
For 2.0, see Requirements and Installation — Transfer Learning Toolkit 2.0 documentation: install NVIDIA GPU driver v410.xx or above.

For the latest TAO, see TAO Toolkit Quick Start Guide — TAO Toolkit 3.22.05 documentation. A quick way to check these requirements on the host is sketched after the table.

Software                   Version
Ubuntu 18.04 LTS           18.04
python                     >=3.6.9
docker-ce                  >19.03.5
docker-API                 1.40
nvidia-container-toolkit   >1.3.0-1
nvidia-container-runtime   3.4.0-1
nvidia-docker2             2.5.0-1
nvidia-driver              >455
python-pip                 >21.06
nvidia-pyindex
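
A quick way to check most of these on the machine that actually runs the container (package names as in the table above; this assumes a Debian/Ubuntu host):

# Driver version and GPU visibility
nvidia-smi

# Docker engine and API versions
docker version | grep -iE 'version|api'

# NVIDIA container stack packages
dpkg -l | grep -E 'nvidia-docker2|nvidia-container-toolkit|nvidia-container-runtime'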

Same issue with the 3.0 dp container.

@Morganh Are those dependencies causing the ImportError: cannot import name 'ONNXEngineBuilder' exception?

I am afraid that is the culprit.
See some results from my side after triggering the TAO v3.21.08 docker.

morganh@dl:~$ docker exec -it cf0924e65ced /bin/bash
root@cf0924e65ced:/workspace#
root@cf0924e65ced:/workspace# python
Python 3.6.9 (default, Jan 26 2021, 15:33:00)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from modulus.export._tensorrt import ONNXEngineBuilder
2021-10-11 16:36:35.719922: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
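
If it helps to rule out the image itself, the same import check can be run non-interactively against the pulled image; the repository path and tag below are my assumption for the 3.21.08 TF container, so substitute whatever image you actually deploy to Vertex AI:

docker run --rm --gpus all nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.08-py3 \
    python -c "from modulus.export._tensorrt import ONNXEngineBuilder; print('ONNXEngineBuilder import OK')"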

@Morganh BTW that migration guide does not mention the breaking changes in the spec file formats.