Please provide the following information when requesting support.
• Hardware (T4/V100/Xavier/Nano/etc)
AMD64 (AMD 5950) computer
(2) RTX 3080 TIs
Ubuntu 20.04
TAO 4.0.2 bare metal API installation
using automl/object_detection.ipynb
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc)
Detectnet_v2
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
using TAO 4.0.2
tao-getting-started_v4.0.2/notebooks/tao_api_starter_kit/api/automl
• Training spec file(If have, please share here)
JMD_object_detection-Copy1.ipynb (106.2 KB)
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)
My notebook is attached; logically, it is no different from the default
automl/object_detection.ipynb notebook.
The training data was generated with DeepStream (6.2) transfer_learning_app using a detectnet_v2 model. The notebook runs the training job with no errors returned; it just keeps monitoring, and no compute resources are consumed.
kubectl get services
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
ingress-nginx-controller NodePort 10.107.222.63 <none> 80:32080/TCP,443:32443/TCP 8d
kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 8d
tao-toolkit-api-service NodePort 10.103.203.163 <none> 8000:31951/TCP 8d
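As a related check, the TAO pods themselves can be listed alongside the services. This is only a sketch; the grep pattern assumes the default pod naming of a bare-metal TAO API install, and it degrades gracefully if kubectl is not on the PATH:

```shell
# List pods across all namespaces and surface any TAO Toolkit API pods.
# (Assumes default pod naming from a bare-metal TAO API install.)
if command -v kubectl >/dev/null 2>&1; then
  PODS=$(kubectl get pods --all-namespaces 2>/dev/null | grep -i "tao-toolkit" || true)
  echo "${PODS:-no tao-toolkit pods found}"
else
  PODS="kubectl not available"
  echo "$PODS"
fi
```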
ubuntu@5950X:/home/jay$ kubectl logs tao-toolkit-api-workflow-pod-55b9bfc948-dndxz
nvidia driver modules are not yet loaded, invoking runc directly
NGC CLI 3.19.0
detectnet_v2 dataset_convert --results_dir /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/datasets/734ad012-f111-49ee-ac24-962c444c4e0e/87e78224-15b0-4930-9871-187e1c0b3501 --output_filename /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/datasets/734ad012-f111-49ee-ac24-962c444c4e0e/tfrecords/tfrecords --verbose --dataset_export_spec /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/datasets/734ad012-f111-49ee-ac24-962c444c4e0e/specs/87e78224-15b0-4930-9871-187e1c0b3501.yaml > /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/datasets/734ad012-f111-49ee-ac24-962c444c4e0e/logs/87e78224-15b0-4930-9871-187e1c0b3501.txt 2>&1 >> /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/datasets/734ad012-f111-49ee-ac24-962c444c4e0e/logs/87e78224-15b0-4930-9871-187e1c0b3501.txt; find /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/datasets/734ad012-f111-49ee-ac24-962c444c4e0e/87e78224-15b0-4930-9871-187e1c0b3501 -type d | xargs chmod 777; find /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/datasets/734ad012-f111-49ee-ac24-962c444c4e0e/87e78224-15b0-4930-9871-187e1c0b3501 -type f | xargs chmod 666 /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/datasets/734ad012-f111-49ee-ac24-962c444c4e0e/87e78224-15b0-4930-9871-187e1c0b3501/status.json
nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf1.15.5
Job created 87e78224-15b0-4930-9871-187e1c0b3501
Post running
Job Done: 87e78224-15b0-4930-9871-187e1c0b3501 Final status: Done
detectnet_v2 dataset_convert --results_dir /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/datasets/ac6d700d-8e4b-4fa6-ad47-3ab3dc4202c2/1a902bb2-fdc6-4efd-a5eb-7756cb1709f9 --output_filename /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/datasets/ac6d700d-8e4b-4fa6-ad47-3ab3dc4202c2/tfrecords/tfrecords --verbose --dataset_export_spec /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/datasets/ac6d700d-8e4b-4fa6-ad47-3ab3dc4202c2/specs/1a902bb2-fdc6-4efd-a5eb-7756cb1709f9.yaml > /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/datasets/ac6d700d-8e4b-4fa6-ad47-3ab3dc4202c2/logs/1a902bb2-fdc6-4efd-a5eb-7756cb1709f9.txt 2>&1 >> /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/datasets/ac6d700d-8e4b-4fa6-ad47-3ab3dc4202c2/logs/1a902bb2-fdc6-4efd-a5eb-7756cb1709f9.txt; find /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/datasets/ac6d700d-8e4b-4fa6-ad47-3ab3dc4202c2/1a902bb2-fdc6-4efd-a5eb-7756cb1709f9 -type d | xargs chmod 777; find /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/datasets/ac6d700d-8e4b-4fa6-ad47-3ab3dc4202c2/1a902bb2-fdc6-4efd-a5eb-7756cb1709f9 -type f | xargs chmod 666 /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/datasets/ac6d700d-8e4b-4fa6-ad47-3ab3dc4202c2/1a902bb2-fdc6-4efd-a5eb-7756cb1709f9/status.json
nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf1.15.5
Job created 1a902bb2-fdc6-4efd-a5eb-7756cb1709f9
Post running
Job Done: 1a902bb2-fdc6-4efd-a5eb-7756cb1709f9 Final status: Done
AutoML pipeline
detectnet_v2 train --gpus $NUM_GPUS -e /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/models/eb77f09a-4201-49ac-a2d2-2e4bf53fb175/3568b7c5-f389-4e1a-931d-ebe5cc6ffb92/recommendation_0.kitti -r /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/models/eb77f09a-4201-49ac-a2d2-2e4bf53fb175/3568b7c5-f389-4e1a-931d-ebe5cc6ffb92/experiment_0 -k tlt_encode > /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/models/eb77f09a-4201-49ac-a2d2-2e4bf53fb175/3568b7c5-f389-4e1a-931d-ebe5cc6ffb92/experiment_0/log.txt 2>&1 >> /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/models/eb77f09a-4201-49ac-a2d2-2e4bf53fb175/3568b7c5-f389-4e1a-931d-ebe5cc6ffb92/experiment_0/log.txt
AutoML pipeline done
So this log says the job is done?
Looking for the experiment_{n}/log.txt
sudo find / -name log.txt
[sudo] password for jay:
/mnt/nfs_share/default-tao-toolkit-api-pvc-pvc-e337edb2-9dda-47ad-968f-076f83e13937/users/62de88a1-5e1b-4828-a254-c308517344d9/models/2e50ac8f-c7de-4fc5-a51d-be0407cdf696/82cf1e60-85ae-4c90-956c-bde04e885303/experiment_0/log.txt
/mnt/nfs_share/default-tao-toolkit-api-pvc-pvc-e337edb2-9dda-47ad-968f-076f83e13937/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/models/eb77f09a-4201-49ac-a2d2-2e4bf53fb175/3568b7c5-f389-4e1a-931d-ebe5cc6ffb92/experiment_0/log.txt
sudo tail -f /mnt/nfs_share/default-tao-toolkit-api-pvc-pvc-e337edb2-9dda-47ad-968f-076f83e13937/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/models/eb77f09a-4201-49ac-a2d2-2e4bf53fb175/3568b7c5-f389-4e1a-931d-ebe5cc6ffb92/experiment_0/log.txt
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/session_manager.py", line 194, in _restore_checkpoint
sess = session.Session(self._target, graph=self._graph, config=config)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1585, in __init__
super(Session, self).__init__(target, graph, config=config)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 699, in __init__
self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InternalError: CUDA runtime implicit initialization on GPU:0 failed. Status: the provided PTX was compiled with an unsupported toolchain.
Telemetry data couldn't be sent, but the command ran successfully.
[WARNING]: __init__() missing 4 required positional arguments: 'code', 'msg', 'hdrs', and 'fp'
Execution status: FAIL
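For what it's worth, this PTX error usually means the CUDA toolkit the container's kernels were built with is newer than what the host driver supports. A hedged way to compare the two sides (the docker invocation is an assumption; it requires the image tag from the workflow log above to be pullable, and nvcc to be present in the image):

```shell
# Report the highest CUDA version the installed host driver supports.
if command -v nvidia-smi >/dev/null 2>&1; then
  MAX_CUDA=$(nvidia-smi | grep -o "CUDA Version: [0-9.]*" | head -n1)
else
  MAX_CUDA="nvidia-smi not available"
fi
echo "Host driver reports: ${MAX_CUDA}"

# To see the CUDA toolkit inside the TAO container (if nvcc is present):
#   docker run --rm nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf1.15.5 nvcc --version
# If the container's CUDA is newer than the driver's maximum supported
# version, this PTX error is the expected symptom.
```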
Setup validation:
bash setup.sh validate
# a portion of the output (no errors reported)
TASK [Report Versions] *************************************************
ok: [127.0.0.2] => {
"msg": [
"===========================================================================================",
" Components Matrix Version || Installed Version ",
"===========================================================================================",
"GPU Operator Version v1.10.1 || v1.10.1",
"Nvidia Container Driver Version 510.47.03 || 510.47.03",
"GPU Operator NV Toolkit Driver v1.9.0 || 4.0.2",
"K8sDevice Plugin Version v0.11.0 || v0.11.0",
"Data Center GPU Manager(DCGM) Version 2.3.4-2.6.4 || 2.3.4-2.6.4",
"Node Feature Discovery Version v0.10.1 || v0.10.1",
"GPU Feature Discovery Version v0.5.0 || v0.5.0",
"Nvidia validator version v1.10.1 || v1.10.1",
"Nvidia MIG Manager version 0.3.0 || ",
"",
"Note: NVIDIA Mig Manager is valid for only Amphere GPU's like A100, A30",
"",
"Please validate between Matrix Version and Installed Version listed above"
]
}
Again, I’m running with (2) NVIDIA RTX 3080 Tis.
I’m assuming this is related to GPU drivers and architecture; the API install reported no errors during installation or validation.
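If it helps narrow things down, the RTX 3080 Ti is an Ampere part (compute capability 8.6), so a driver/toolkit version mismatch is a more likely culprit than the architecture itself. A hedged check (the compute_cap query field requires a reasonably recent driver, so this falls back to name-only if it's unsupported):

```shell
# Query GPU name and compute capability from the host driver.
# (compute_cap is supported on recent nvidia-smi versions only.)
if command -v nvidia-smi >/dev/null 2>&1; then
  CAPS=$(nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader 2>/dev/null \
         || nvidia-smi --query-gpu=name --format=csv,noheader)
else
  CAPS="nvidia-smi not available"
fi
echo "$CAPS"
```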