TAO container fails on Google Vertex AI

OK, so you did not use the tao-launcher and instead triggered the container directly, similar to the TLT 2.0 workflow.
See the TLT 2.0 guide: Requirements and Installation — Transfer Learning Toolkit 2.0 documentation

Could you add --runtime=nvidia?

I am not having trouble running the container locally. I am having trouble running it in Google Vertex AI.

I was previously able to run the TLT 2 container successfully in this environment, but now, with the same setup and the same scripts, I am receiving the above error with the TAO 3 container.

Why would it break from TLT 2 to TAO 3?

What do you mean by "trigger"? What am I supposed to add --runtime=nvidia to, and what is it supposed to do?

But you just told me that you have issues when running on your local host PC.

As mentioned multiple times above, the issue only appears locally when I omit --gpus all:

docker run --gpus all nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.08-py3 allows me to run classification train successfully.

docker run nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.08-py3 gives the above error when running classification train.

This is only relevant because I thought it would be a useful clue for debugging the actual issue I am having with Google Vertex AI.
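In case it helps, a quick way to confirm whether the container can see the GPU at all (a generic check, not specific to TAO) is to run nvidia-smi as the container command with and without the flag:

docker run --gpus all nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.08-py3 nvidia-smi
docker run nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.08-py3 nvidia-smi

The first prints the GPU table; the second should fail or report no devices, since the driver libraries and nvidia-smi are only injected into the container by the NVIDIA runtime. That matches the classification train behaviour above.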

How many GPUs are in your local host PC?
Please share the result of nvidia-smi as well.

One GPU.

Thu Oct  7 12:25:45 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 105...  Off  | 00000000:08:00.0  On |                  N/A |
| 20%   41C    P0    N/A /  75W |    683MiB /  4036MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Please double-check the software requirements.
https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_quick_start_guide.html#software-requirements
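A quick way to check the pieces that requirements page cares about on the host (a sketch; the package names assume an Ubuntu/Debian install with the NVIDIA container packages):

docker --version
nvidia-smi --query-gpu=driver_version --format=csv,noheader
dpkg -l | grep -E 'nvidia-docker2|nvidia-container-toolkit'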

And, as mentioned above, please add --runtime=nvidia:
docker run --runtime=nvidia -it nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.08-py3 /bin/bash
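If the Docker daemon on the host can be configured, an alternative to passing --runtime=nvidia on every run is to make nvidia the default runtime (a sketch, assuming the standard nvidia-container-runtime package is installed). In /etc/docker/daemon.json:

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

Then restart the daemon (e.g. sudo systemctl restart docker), and containers started without --gpus or --runtime flags will still get GPU access.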

Thank you for your response, @Morganh.

I cannot add --runtime=nvidia to any config for Google Vertex AI.

Are you able to give me any guidance on how I can get TAO working in this environment?

Is there any reason you can think of for why this would break between TLT 2 and TAO 3?
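For context, a Vertex AI custom training job requests GPUs through the job spec rather than through docker run flags; roughly something like the following (a sketch with placeholder region, display name, and machine/accelerator values, not our exact configuration):

gcloud ai custom-jobs create \
  --region=us-central1 \
  --display-name=tao-classification-train \
  --worker-pool-spec=machine-type=n1-standard-8,replica-count=1,accelerator-type=NVIDIA_TESLA_T4,accelerator-count=1,container-image-uri=nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.08-py3

There is nowhere in that spec to pass --runtime=nvidia or --gpus all; the platform is responsible for exposing the GPU to the container.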

I suggest we narrow this down step by step. Since you already failed to run on your local host PC, we should address that first.

I did not fail running locally.

But you mentioned that
docker run nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.08-py3 gives the above error when running classification train.

Adding --gpus all fixes the error.

@Morganh I really appreciate your help with this issue. This issue is blocking us from upgrading to TAO 3.

It does not make sense that it only works when --gpus all is added.
So we need to figure out why it fails without --gpus all.

Thank you, I understand.

Deleting --gpus all and adding --runtime=nvidia works successfully.

@Morganh, do you have any updates on my inquiry?

According to your latest comment, you can run the training successfully on your local host PC.