TAO container fails on Google Vertex AI

FYI, gcloud beta ai custom-jobs local-run fails as well with the same exception as above.

Can you share the full command you are running? Full log is also appreciated.

Also, could you please run the commands below in the TAO 3.0 and Google Vertex AI environment?

  1. $ nvidia-smi
  2. $ python
    >>> from modulus.export._tensorrt import ONNXEngineBuilder

Please share the result. Thanks.

Tue Oct 12 12:12:48 2021       
 +-----------------------------------------------------------------------------+
 | NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
 |-------------------------------+----------------------+----------------------+
 | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
 | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
 |                               |                      |               MIG M. |
 |===============================+======================+======================|
 |   0  Tesla P100-PCIE...  On   | 00000000:00:04.0 Off |                    0 |
 | N/A   33C    P0    32W / 250W |      0MiB / 16280MiB |      0%      Default |
 |                               |                      |                  N/A |
 +-------------------------------+----------------------+----------------------+
                                                                                
 +-----------------------------------------------------------------------------+
 | Processes:                                                                  |
 |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
 |        ID   ID                                                   Usage      |
 |=============================================================================|
 |  No running processes found                                                 |
 +-----------------------------------------------------------------------------+

It appears the nvidia-driver is not new enough. Just curious, why did this break between TLT 2.0 and TAO 3.0?
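As a quick sanity check, the driver version can be parsed out of the nvidia-smi banner above and compared against a minimum. The 455.23 floor used here is an assumption for illustration; check the TAO 3.0 release notes for the actual requirement.

```python
import re

# Assumed minimum driver for TAO 3.0 (illustrative only -- confirm in the
# TAO release notes; the container is built against CUDA 11.x).
MIN_DRIVER = (455, 23)

def parse_driver_version(smi_banner):
    """Extract (major, minor) from the 'Driver Version: X.Y.Z' field."""
    m = re.search(r"Driver Version:\s*([\d.]+)", smi_banner)
    if not m:
        raise ValueError("no driver version found in nvidia-smi output")
    return tuple(int(p) for p in m.group(1).split(".")[:2])

def driver_ok(smi_banner, minimum=MIN_DRIVER):
    """True if the host driver meets the assumed minimum."""
    return parse_driver_version(smi_banner) >= minimum

banner = "| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0 |"
print(driver_ok(banner))  # prints False: 450.51 is below the assumed 455.23 floor
```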

@Morganh does the nvidia-driver version get inherited from the host operating system?

FYI here is the machine spec for my training job:

{
  "workerPoolSpecs": [
    {
      "machineSpec": {
        "machineType": "n1-standard-4",
        "acceleratorType": "NVIDIA_TESLA_P100",
        "acceleratorCount": 1
      },
      "replicaCount": "1",
      "diskSpec": {
        "bootDiskType": "pd-ssd",
        "bootDiskSizeGb": 100
      },
      "containerSpec": {
        "imageUri": "<my custom container based on TAO>",
        "command": [
          "nvidia-smi"
        ]
      }
    }
  ]
}

I will try the other machineType and acceleratorType options to see if they use the newer nvidia driver.
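To sweep those options, it may help to generate the job spec programmatically rather than hand-editing the JSON. This is a minimal sketch that rebuilds the workerPoolSpecs structure above for each candidate pair; the machine/accelerator combinations listed are examples, and which pairings are actually valid is defined by the Vertex AI docs.

```python
import json

# Example (machineType, acceleratorType) pairs to try; valid combinations
# are documented by Google Vertex AI.
CANDIDATES = [
    ("n1-standard-4", "NVIDIA_TESLA_P100"),
    ("n1-standard-4", "NVIDIA_TESLA_T4"),
    ("n1-standard-8", "NVIDIA_TESLA_V100"),
]

def make_worker_pool_spec(machine_type, accelerator_type, image_uri):
    """Build a workerPoolSpecs payload matching the structure above."""
    return {
        "workerPoolSpecs": [{
            "machineSpec": {
                "machineType": machine_type,
                "acceleratorType": accelerator_type,
                "acceleratorCount": 1,
            },
            "replicaCount": "1",
            "diskSpec": {"bootDiskType": "pd-ssd", "bootDiskSizeGb": 100},
            "containerSpec": {
                "imageUri": image_uri,
                "command": ["nvidia-smi"],
            },
        }]
    }

for machine, accel in CANDIDATES:
    spec = make_worker_pool_spec(machine, accel, "<my custom container based on TAO>")
    print(json.dumps(spec, indent=2))
```

Each emitted spec can then be submitted as a separate custom job to see which host images carry a newer driver.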

Yes, the nvidia-driver version is inherited from the host operating system. For TAO 3.0, please meet the software requirements to avoid unexpected errors. See, for example, Nvidia TAO cuda version error - #3 by anil.vuppala

Until Google updates the version of the NVIDIA driver used on the servers that run Vertex AI containers, the newest release of TAO does not work on that platform.

We cannot upgrade until this is supported.

Thanks for the info.