TAO container fails on Google Vertex AI

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc) : Google Vertex AI
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc) : Classification
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here): TAO 3.0

When I attempt to run a custom job using the TAO container on the Google Vertex AI platform, I receive the following error:
ImportError: cannot import name 'ONNXEngineBuilder'

Note that I get the same error when I run the container locally and forget to pass --gpus all.

I was previously able to run training jobs on the Vertex AI platform using the TLT 2.0 container. I am trying to upgrade to TAO 3.0 but ran into this issue.

Any guidance is much appreciated.

Please check TAO Toolkit Quick Start Guide — TAO Toolkit 3.0 documentation

Thank you @Morganh for your response. I have read the Quick Start guide. Is there a specific topic you could point me to that relates to my question?

This is the first time I have seen this error reported by a TAO user.
Did you update to TAO?
Can you share the result of tao info --verbose?

I am using the TAO Docker container, nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.08-py3. I am not using the TAO Python package.

When I start the Docker container without the --gpus all flag and then run classification train inside it, I get the above error. Can you reproduce this, @Morganh?
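One way to check whether the container can see a GPU at all, before running classification train, is a small probe like the sketch below. It only looks for the usual signs that the NVIDIA runtime exposed a device: nvidia-smi on the PATH and the /dev/nvidiactl and /dev/nvidia0 device nodes. The function name gpu_diagnostics is made up for this illustration, and the idea that the ImportError is tied to a missing GPU is based only on the observation above, not on anything confirmed by NVIDIA.

```python
import os
import shutil


def gpu_diagnostics(which=shutil.which, exists=os.path.exists):
    """Collect simple hints about whether an NVIDIA GPU is exposed to this container.

    `which` and `exists` are injectable so the check can be exercised without a GPU.
    """
    report = {
        # nvidia-smi is typically made available by the NVIDIA container runtime
        "nvidia_smi_on_path": which("nvidia-smi") is not None,
        # control device node that is present when the driver is exposed
        "nvidiactl_device": exists("/dev/nvidiactl"),
        # device node for the first GPU
        "first_gpu_device": exists("/dev/nvidia0"),
    }
    # Heuristic only: all three signs present suggests a GPU is visible
    report["gpu_probably_visible"] = all(report.values())
    return report


if __name__ == "__main__":
    for key, value in gpu_diagnostics().items():
        print(f"{key}: {value}")
```

If every entry prints False inside the container, the container was most likely started without GPU access, which matches the local case where --gpus all was left out.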

When I attempt to run a custom job in Google Vertex AI, I get the same error. We were previously using nvcr.io/nvidia/tlt-streamanalytics:v2.0_py3 docker container with Vertex AI successfully in production.
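Since the local failure only shows up when no GPU is attached, one thing worth ruling out on Vertex AI is a worker pool that has no accelerator configured. The sketch below builds the worker pool part of a custom job spec as a plain dictionary, using the field names from the Vertex AI CustomJob REST resource (workerPoolSpecs, machineSpec, containerSpec). The machine type, accelerator type, image URI, and command shown are placeholders chosen for illustration, not values taken from the actual job, and this is only a guess at the cause.

```python
import json


def build_worker_pool_spec(image_uri, command,
                           machine_type="n1-standard-8",
                           accelerator_type="NVIDIA_TESLA_T4",
                           accelerator_count=1):
    """Return a job spec dict with one worker pool that requests a GPU.

    All default values are illustrative assumptions.
    """
    return {
        "workerPoolSpecs": [
            {
                "machineSpec": {
                    "machineType": machine_type,
                    # Without these two fields no GPU is attached to the worker
                    "acceleratorType": accelerator_type,
                    "acceleratorCount": accelerator_count,
                },
                "replicaCount": 1,
                "containerSpec": {
                    "imageUri": image_uri,
                    "command": list(command),
                },
            }
        ]
    }


if __name__ == "__main__":
    spec = build_worker_pool_spec(
        # Hypothetical registry path; replace with wherever the image is hosted
        image_uri="gcr.io/my-project/tao-toolkit-tf:v3.21.08-py3",
        # Remaining training arguments are omitted here
        command=["classification", "train"],
    )
    print(json.dumps(spec, indent=2))
```

Comparing the printed machineSpec against the spec used for the failing job would show whether the accelerator fields are actually being set.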

Why would Vertex AI jobs break from TLT 2 to TAO 3?

I created a Stack Overflow post for this topic as well.

The "Running TAO Toolkit on Google Cloud Platform" guide in the TAO Toolkit 3.0 documentation explains how to run a Docker container in a Google Cloud VM.

I am attempting to use the TAO Docker container with the Google Vertex AI platform. I was (and still am) successfully using TLT 2.0 in this setup, but would like to upgrade to TAO 3.0.

It is much more convenient to use the serverless Vertex AI platform to run ephemeral jobs than to start a VM every time I need to train a model.

@Morganh, do you know if NVIDIA tests these Docker containers on Google Cloud?

No, I cannot reproduce it. Can you try a quick run on your host PC instead of Google Vertex AI?

Yes, the guide is also shared in Running TAO Toolkit on Google Cloud Platform — TAO Toolkit 3.0 documentation

I have done that already as described in my previous posts.

This is not relevant to the topic.

Have you ever run TAO 3.0 on your host PC?

Yes I have, as described in my previous posts.

Which link?

I don’t understand your question.

I really appreciate your helping me with this issue @Morganh

I read your post above again. So, on your local host PC instead of Google Vertex AI, you still hit the error, right?

Yes

OK, so let’s focus on your local host PC.
How did you launch the TLT/TAO container? Do you remember the command?

docker run nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.08-py3
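For comparison, the command above starts the container with no GPU flag, which is the situation that produced the ImportError locally. The sketch below builds both variants of the command as argument lists so the difference is explicit. The helper name docker_run_argv is made up for this illustration, and --gpus all is the only flag taken from the discussion; any volume mounts or other options a real run needs are left out.

```python
import shlex

IMAGE = "nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.08-py3"


def docker_run_argv(image, gpus=None):
    """Build a `docker run` argument list, optionally requesting GPUs."""
    argv = ["docker", "run"]
    if gpus is not None:
        # e.g. gpus="all" becomes `--gpus all`
        argv += ["--gpus", gpus]
    argv.append(image)
    return argv


if __name__ == "__main__":
    # Command as posted: no GPU access inside the container
    print(shlex.join(docker_run_argv(IMAGE)))
    # Same command with all host GPUs exposed to the container
    print(shlex.join(docker_run_argv(IMAGE, gpus="all")))
```

The resulting list can be passed to subprocess.run on a machine with Docker and the NVIDIA container runtime installed.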