TAO container fails on Google Vertex AI

mattcarp88 · October 5, 2021, 2:12pm

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc) : Google Vertex AI
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc) : Classification
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here): TAO 3.0

When attempting to run a custom job using the TAO container on the Google Vertex AI platform, I receive the following error:
ImportError: cannot import name 'ONNXEngineBuilder'

Note that I get the same error when I run the container locally and forget to pass --gpus all.
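Since the error also shows up when the container is started without --gpus all, one quick check is whether any GPU is visible from inside the running container. The following is a minimal sketch, not something confirmed in this thread: the function name gpu_visibility_report and the device glob are assumptions made for illustration, and a missing GPU is only a suspected cause of the ImportError.

```python
import glob
import shutil
import subprocess


def gpu_visibility_report(dev_glob="/dev/nvidia[0-9]*"):
    """Summarise whether NVIDIA GPUs appear to be visible in this environment.

    Hypothetical helper for illustration; it only inspects the environment
    and does not import or touch any TAO code.
    """
    # Device nodes such as /dev/nvidia0 are typically present only when a
    # GPU has been exposed to the container.
    device_nodes = sorted(glob.glob(dev_glob))

    smi_path = shutil.which("nvidia-smi")
    smi_lists_gpu = False
    if smi_path is not None:
        try:
            # "nvidia-smi -L" prints one line per GPU it can see.
            result = subprocess.run(
                [smi_path, "-L"],
                stdout=subprocess.PIPE,
                stderr=subprocess.PIPE,
                universal_newlines=True,
                timeout=30,
            )
            smi_lists_gpu = result.returncode == 0 and "GPU" in result.stdout
        except (OSError, subprocess.SubprocessError):
            smi_lists_gpu = False

    return {
        "device_nodes": device_nodes,
        "nvidia_smi_found": smi_path is not None,
        "nvidia_smi_lists_gpu": smi_lists_gpu,
    }


if __name__ == "__main__":
    for key, value in gpu_visibility_report().items():
        print("{}: {}".format(key, value))
```

If the report shows no device nodes and no GPU listed, the container is most likely running without GPU access, which matches the local case where --gpus all was omitted.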

I was previously able to run training jobs on the Vertex AI platform using the TLT 2.0 container. I am trying to upgrade to TAO 3.0 but ran into this issue.

Any guidance is much appreciated.

Morganh · October 5, 2021, 4:45pm

Please check TAO Toolkit Quick Start Guide — TAO Toolkit 3.22.05 documentation

mattcarp88 · October 5, 2021, 7:01pm

Thank you @Morganh for your response. I have read the Quick Start guide. Is there a specific topic you could point me to that relates to my question?

Morganh · October 6, 2021, 3:23pm

This is the first time I have seen this error reported by a TAO user.
Did you update to TAO?
Can you share the output of tao info --verbose?

mattcarp88 · October 6, 2021, 5:00pm

I am using the TAO Docker container, nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.08-py3. I am not using the TAO Python package.

When I run classification train inside the Docker container after starting it without the --gpus all flag, I get the above error. Can you reproduce this, @Morganh?

When I attempt to run a custom job in Google Vertex AI, I get the same error. We were previously using the nvcr.io/nvidia/tlt-streamanalytics:v2.0_py3 Docker container with Vertex AI successfully in production.

Why would Vertex AI jobs break from TLT 2 to TAO 3?
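Because the same error appears locally when no GPU is attached, one thing worth checking on the Vertex AI side is whether the custom job actually requests an accelerator. The sketch below builds a worker pool spec as a plain dictionary in the shape Vertex AI custom jobs accept (machine_spec, replica_count, container_spec). The helper name build_worker_pool_spec and the default machine type and accelerator are assumptions for illustration, and nothing in this thread confirms that a missing accelerator is the cause.

```python
def build_worker_pool_spec(
    image_uri,
    command,
    args=(),
    machine_type="n1-standard-8",        # assumed default, pick your own
    accelerator_type="NVIDIA_TESLA_T4",  # assumed default, pick your own
    accelerator_count=1,
):
    """Return one Vertex AI worker pool spec that requests a GPU.

    Hypothetical helper: it only assembles a dictionary and does not call
    any Google Cloud API.
    """
    if accelerator_count < 1:
        raise ValueError("accelerator_count must be at least 1 to attach a GPU")

    return {
        "machine_spec": {
            "machine_type": machine_type,
            "accelerator_type": accelerator_type,
            "accelerator_count": accelerator_count,
        },
        "replica_count": 1,
        "container_spec": {
            "image_uri": image_uri,
            "command": list(command),
            "args": list(args),
        },
    }


if __name__ == "__main__":
    spec = build_worker_pool_spec(
        image_uri="nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.08-py3",
        command=["classification", "train"],
        args=[],  # add your own training arguments here
    )
    print(spec["machine_spec"])
```

A list containing this dictionary is what would be passed as the worker pool specs of the custom job. If the job was defined without accelerator_type and accelerator_count, the container runs on a CPU-only machine, which would resemble the local run without --gpus all.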

mattcarp88 · October 6, 2021, 5:01pm

I created a Stack Overflow post for this topic as well:

https://stackoverflow.com/questions/69464723/nvidia-tao-3-0-docker-container-doesnt-run-on-google-vertex-ai

mattcarp88 · October 6, 2021, 5:07pm

The guide "Running TAO Toolkit on Google Cloud Platform" (TAO Toolkit 3.22.05 documentation) explains how to run a Docker container in a Google Cloud VM.

I am attempting to use the TAO Docker container with the Google Vertex AI platform. I was (and still am) successfully using TLT 2.0 in this setup, but would like to upgrade to TAO 3.0.

It is much more convenient to use the serverless Vertex AI platform to run ephemeral jobs than to start a VM every time I need to train a model.

@Morganh, do you know if NVIDIA tests these Docker containers on Google Cloud?

Morganh · October 7, 2021, 3:34pm

No, I cannot reproduce it. Can you do a quick run on your host PC instead of Google Vertex AI?

Morganh · October 7, 2021, 3:35pm

Yes, the guide is also shared in Running TAO Toolkit on Google Cloud Platform - NVIDIA Docs

mattcarp88 · October 7, 2021, 4:07pm

I have done that already as described in my previous posts.

mattcarp88 · October 7, 2021, 4:08pm

This is not relevant to the topic.

Morganh · October 7, 2021, 4:09pm

Did you ever run TAO 3.0 on your host PC?

mattcarp88 · October 7, 2021, 4:10pm

Yes I have, as described in my previous posts.

Morganh · October 7, 2021, 4:10pm

Which link?

mattcarp88 · October 7, 2021, 4:10pm

I don’t understand your question

mattcarp88 · October 7, 2021, 4:11pm

I really appreciate your helping me with this issue @Morganh

Morganh · October 7, 2021, 4:13pm

I read your post above again. So, on your local host PC instead of Google Vertex AI, you still get the error, right?

mattcarp88 · October 7, 2021, 4:13pm

Yes

Morganh · October 7, 2021, 4:14pm

OK, so let’s focus on your local host PC.
How did you launch the TLT/TAO container? Do you remember the command you used?

mattcarp88 · October 7, 2021, 4:15pm

docker run nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.08-py3
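The command above does not pass --gpus all, which is the condition described earlier in the thread as reproducing the ImportError locally. As a minimal sketch, the helper below assembles the same docker run invocation with the GPU flag included. The function names build_docker_run and to_shell are made up for illustration, and the --rm and -it flags are added as a convenience rather than taken from the thread.

```python
import shlex


def build_docker_run(image, container_args=(), gpus="all", interactive=True):
    """Assemble a docker run command as an argument list (nothing is executed).

    Hypothetical helper for illustration only.
    """
    cmd = ["docker", "run", "--rm"]
    if interactive:
        cmd.append("-it")
    if gpus:
        # --gpus exposes the host's NVIDIA GPUs to the container.
        cmd.extend(["--gpus", gpus])
    cmd.append(image)
    cmd.extend(container_args)
    return cmd


def to_shell(cmd):
    """Render an argument list as a single shell-safe string."""
    return " ".join(shlex.quote(part) for part in cmd)


if __name__ == "__main__":
    image = "nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.08-py3"
    print(to_shell(build_docker_run(image)))
    # prints: docker run --rm -it --gpus all nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.08-py3
```

Passing gpus=None yields the GPU-less form, which is useful for comparing the two runs side by side.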