TAO container fails on Google Vertex AI

FYI, gcloud beta ai custom-jobs local-run fails as well with the same exception as above.

Can you share the full command you are running? Full log is also appreciated.

Also, could you please run the commands below in the TAO 3.0 and Google Vertex AI environment?

  1. $ nvidia-smi
  2. $ python
    >>> from modulus.export._tensorrt import ONNXEngineBuilder

Please share the result. Thanks.

Tue Oct 12 12:12:48 2021       
 +-----------------------------------------------------------------------------+
 | NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
 |-------------------------------+----------------------+----------------------+
 | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
 | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
 |                               |                      |               MIG M. |
 |===============================+======================+======================|
 |   0  Tesla P100-PCIE...  On   | 00000000:00:04.0 Off |                    0 |
 | N/A   33C    P0    32W / 250W |      0MiB / 16280MiB |      0%      Default |
 |                               |                      |                  N/A |
 +-------------------------------+----------------------+----------------------+
                                                                                
 +-----------------------------------------------------------------------------+
 | Processes:                                                                  |
 |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
 |        ID   ID                                                   Usage      |
 |=============================================================================|
 |  No running processes found                                                 |
 +-----------------------------------------------------------------------------+

It appears the nvidia-driver is not new enough. Just curious, why did this break between TLT 2.0 and TAO 3.0?
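As a quick sanity check, the driver version can be parsed out of the nvidia-smi banner above and compared against a minimum. The 455.23 floor used here is an assumption for illustration; check the TAO 3.0 release notes for the actual requirement.

```python
import re

# Assumed minimum driver for TAO 3.0 (illustrative only -- confirm in the
# TAO release notes; the container is built against CUDA 11.x).
MIN_DRIVER = (455, 23)

def parse_driver_version(smi_banner):
    """Extract (major, minor) from the 'Driver Version: X.Y.Z' field."""
    m = re.search(r"Driver Version:\s*([\d.]+)", smi_banner)
    if not m:
        raise ValueError("no driver version found in nvidia-smi output")
    return tuple(int(p) for p in m.group(1).split(".")[:2])

def driver_ok(smi_banner, minimum=MIN_DRIVER):
    """True if the host driver meets the assumed minimum."""
    return parse_driver_version(smi_banner) >= minimum

banner = "| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0 |"
print(driver_ok(banner))  # prints False: 450.51 is below the assumed 455.23 floor
```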

@Morganh does the nvidia-driver version get inherited from the host operating system?

FYI here is the machine spec for my training job:

{
  "workerPoolSpecs": [
    {
      "machineSpec": {
        "machineType": "n1-standard-4",
        "acceleratorType": "NVIDIA_TESLA_P100",
        "acceleratorCount": 1
      },
      "replicaCount": "1",
      "diskSpec": {
        "bootDiskType": "pd-ssd",
        "bootDiskSizeGb": 100
      },
      "containerSpec": {
        "imageUri": "<my custom container based on TAO>",
        "command": [
          "nvidia-smi"
        ]
      }
    }
  ]
}

I will try the other machineType and acceleratorType options to see if they use the newer nvidia driver.
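To sweep those options, it may help to generate the job spec programmatically rather than hand-editing the JSON. This is a minimal sketch that rebuilds the workerPoolSpecs structure above for each candidate pair; the machine/accelerator combinations listed are examples, and which pairings are actually valid is defined by the Vertex AI docs.

```python
import json

# Example (machineType, acceleratorType) pairs to try; valid combinations
# are documented by Google Vertex AI.
CANDIDATES = [
    ("n1-standard-4", "NVIDIA_TESLA_P100"),
    ("n1-standard-4", "NVIDIA_TESLA_T4"),
    ("n1-standard-8", "NVIDIA_TESLA_V100"),
]

def make_worker_pool_spec(machine_type, accelerator_type, image_uri):
    """Build a workerPoolSpecs payload matching the structure above."""
    return {
        "workerPoolSpecs": [{
            "machineSpec": {
                "machineType": machine_type,
                "acceleratorType": accelerator_type,
                "acceleratorCount": 1,
            },
            "replicaCount": "1",
            "diskSpec": {"bootDiskType": "pd-ssd", "bootDiskSizeGb": 100},
            "containerSpec": {
                "imageUri": image_uri,
                "command": ["nvidia-smi"],
            },
        }]
    }

for machine, accel in CANDIDATES:
    spec = make_worker_pool_spec(machine, accel, "<my custom container based on TAO>")
    print(json.dumps(spec, indent=2))
```

Each emitted spec can then be submitted as a separate custom job to see which host images carry a newer driver.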

Yes, the nvidia-driver version is inherited from the host operating system. For TAO 3.0, please meet the software requirements to avoid unexpected errors. See, for example, Nvidia TAO cuda version error - #3 by anil.vuppala

Until Google updates the version of the NVIDIA driver used on the servers that run Vertex AI containers, the newest release of TAO does not work on that platform.

We cannot upgrade until this is supported.

Thanks for the info.