TAO 4.0 AutoML - the provided PTX was compiled with an unsupported toolchain

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc)

AMD64 (AMD 5950) computer
(2) RTX 3080 Ti GPUs
Ubuntu 20.04
TAO 4.0.2 bare metal API installation
using automl/object_detection.ipynb

• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc)
Detectnet_v2

• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
using TAO 4.0.2
tao-getting-started_v4.0.2/notebooks/tao_api_starter_kit/api/automl

• Training spec file(If have, please share here)
JMD_object_detection-Copy1.ipynb (106.2 KB)

• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

My notebook is attached; logically it is no different from the default automl/object_detection.ipynb notebook.

Training data was generated with the DeepStream (6.2) transfer_learning_app using a detectnet_v2 model. The notebook runs the training job with no errors returned; it just keeps monitoring, and no compute resources are consumed.

kubectl get services
NAME                       TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
ingress-nginx-controller   NodePort    10.107.222.63    <none>        80:32080/TCP,443:32443/TCP   8d
kubernetes                 ClusterIP   10.96.0.1        <none>        443/TCP                      8d
tao-toolkit-api-service    NodePort    10.103.203.163   <none>        8000:31951/TCP               8d

ubuntu@5950X:/home/jay$ kubectl logs tao-toolkit-api-workflow-pod-55b9bfc948-dndxz

nvidia driver modules are not yet loaded, invoking runc directly
NGC CLI 3.19.0
detectnet_v2 dataset_convert --results_dir /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/datasets/734ad012-f111-49ee-ac24-962c444c4e0e/87e78224-15b0-4930-9871-187e1c0b3501 --output_filename /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/datasets/734ad012-f111-49ee-ac24-962c444c4e0e/tfrecords/tfrecords --verbose --dataset_export_spec /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/datasets/734ad012-f111-49ee-ac24-962c444c4e0e/specs/87e78224-15b0-4930-9871-187e1c0b3501.yaml  > /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/datasets/734ad012-f111-49ee-ac24-962c444c4e0e/logs/87e78224-15b0-4930-9871-187e1c0b3501.txt 2>&1 >> /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/datasets/734ad012-f111-49ee-ac24-962c444c4e0e/logs/87e78224-15b0-4930-9871-187e1c0b3501.txt; find /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/datasets/734ad012-f111-49ee-ac24-962c444c4e0e/87e78224-15b0-4930-9871-187e1c0b3501 -type d | xargs chmod 777; find /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/datasets/734ad012-f111-49ee-ac24-962c444c4e0e/87e78224-15b0-4930-9871-187e1c0b3501 -type f | xargs chmod 666 /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/datasets/734ad012-f111-49ee-ac24-962c444c4e0e/87e78224-15b0-4930-9871-187e1c0b3501/status.json
nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf1.15.5
Job created 87e78224-15b0-4930-9871-187e1c0b3501
Post running
Job Done: 87e78224-15b0-4930-9871-187e1c0b3501 Final status: Done
detectnet_v2 dataset_convert --results_dir /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/datasets/ac6d700d-8e4b-4fa6-ad47-3ab3dc4202c2/1a902bb2-fdc6-4efd-a5eb-7756cb1709f9 --output_filename /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/datasets/ac6d700d-8e4b-4fa6-ad47-3ab3dc4202c2/tfrecords/tfrecords --verbose --dataset_export_spec /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/datasets/ac6d700d-8e4b-4fa6-ad47-3ab3dc4202c2/specs/1a902bb2-fdc6-4efd-a5eb-7756cb1709f9.yaml  > /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/datasets/ac6d700d-8e4b-4fa6-ad47-3ab3dc4202c2/logs/1a902bb2-fdc6-4efd-a5eb-7756cb1709f9.txt 2>&1 >> /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/datasets/ac6d700d-8e4b-4fa6-ad47-3ab3dc4202c2/logs/1a902bb2-fdc6-4efd-a5eb-7756cb1709f9.txt; find /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/datasets/ac6d700d-8e4b-4fa6-ad47-3ab3dc4202c2/1a902bb2-fdc6-4efd-a5eb-7756cb1709f9 -type d | xargs chmod 777; find /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/datasets/ac6d700d-8e4b-4fa6-ad47-3ab3dc4202c2/1a902bb2-fdc6-4efd-a5eb-7756cb1709f9 -type f | xargs chmod 666 /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/datasets/ac6d700d-8e4b-4fa6-ad47-3ab3dc4202c2/1a902bb2-fdc6-4efd-a5eb-7756cb1709f9/status.json
nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf1.15.5
Job created 1a902bb2-fdc6-4efd-a5eb-7756cb1709f9
Post running
Job Done: 1a902bb2-fdc6-4efd-a5eb-7756cb1709f9 Final status: Done
AutoML pipeline
detectnet_v2 train --gpus $NUM_GPUS  -e /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/models/eb77f09a-4201-49ac-a2d2-2e4bf53fb175/3568b7c5-f389-4e1a-931d-ebe5cc6ffb92/recommendation_0.kitti -r /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/models/eb77f09a-4201-49ac-a2d2-2e4bf53fb175/3568b7c5-f389-4e1a-931d-ebe5cc6ffb92/experiment_0 -k tlt_encode > /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/models/eb77f09a-4201-49ac-a2d2-2e4bf53fb175/3568b7c5-f389-4e1a-931d-ebe5cc6ffb92/experiment_0/log.txt 2>&1 >> /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/models/eb77f09a-4201-49ac-a2d2-2e4bf53fb175/3568b7c5-f389-4e1a-931d-ebe5cc6ffb92/experiment_0/log.txt
AutoML pipeline done

So this log says the job is done?

Looking for the experiment_{n}/log.txt

sudo find / -name log.txt
[sudo] password for jay: 
/mnt/nfs_share/default-tao-toolkit-api-pvc-pvc-e337edb2-9dda-47ad-968f-076f83e13937/users/62de88a1-5e1b-4828-a254-c308517344d9/models/2e50ac8f-c7de-4fc5-a51d-be0407cdf696/82cf1e60-85ae-4c90-956c-bde04e885303/experiment_0/log.txt
/mnt/nfs_share/default-tao-toolkit-api-pvc-pvc-e337edb2-9dda-47ad-968f-076f83e13937/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/models/eb77f09a-4201-49ac-a2d2-2e4bf53fb175/3568b7c5-f389-4e1a-931d-ebe5cc6ffb92/experiment_0/log.txt
sudo tail -ff /mnt/nfs_share/default-tao-toolkit-api-pvc-pvc-e337edb2-9dda-47ad-968f-076f83e13937/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/models/eb77f09a-4201-49ac-a2d2-2e4bf53fb175/3568b7c5-f389-4e1a-931d-ebe5cc6ffb92/experiment_0/log.txt
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/session_manager.py", line 194, in _restore_checkpoint
    sess = session.Session(self._target, graph=self._graph, config=config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1585, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 699, in __init__
    self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InternalError: CUDA runtime implicit initialization on GPU:0 failed. Status: the provided PTX was compiled with an unsupported toolchain.
Telemetry data couldn't be sent, but the command ran successfully.
[WARNING]: __init__() missing 4 required positional arguments: 'code', 'msg', 'hdrs', and 'fp'
Execution status: FAIL

setup validation

bash setup.sh validate

# a portion of the output (but no errors)
TASK [Report Versions] *********************************************************
ok: [127.0.0.2] => {
    "msg": [
        "===========================================================================================",
        "  Components                          Matrix Version          ||     Installed Version    ",
        "===========================================================================================",
        "GPU Operator Version                  v1.10.1                 ||     v1.10.1",
        "Nvidia Container Driver Version       510.47.03               ||     510.47.03",
        "GPU Operator NV Toolkit Driver        v1.9.0                  ||     4.0.2",
        "K8sDevice Plugin Version              v0.11.0                 ||     v0.11.0",
        "Data Center GPU Manager(DCGM) Version 2.3.4-2.6.4             ||     2.3.4-2.6.4",
        "Node Feature Discovery Version        v0.10.1                 ||     v0.10.1",
        "GPU Feature Discovery Version         v0.5.0                  ||     v0.5.0",
        "Nvidia validator version              v1.10.1                 ||     v1.10.1",
        "Nvidia MIG Manager version            0.3.0                   ||     ",
        "",
        "Note: NVIDIA Mig Manager is valid for only Amphere GPU's like A100, A30",
        "",
        "Please validate between Matrix Version and Installed Version listed above"
    ]
}

Again, I’m running with (2) NVIDIA RTX 3080 Ti GPUs.
I’m assuming this is related to GPU drivers and architecture; the API install had no errors during installation or validation.

According to the forum search results for 'ptx was compiled with an unsupported toolchain' in the #intelligent-video-analytics:tao-toolkit category on the NVIDIA Developer Forums, this is usually related to the NVIDIA driver.
But those are all cases that did not come from the TAO API.

Could you set the driver to the version below in ./setup/quickstart_api_bare_metal/gpu-operator-values.yml?
525.60.13
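
For reference, it is just the driver version value in that file. A minimal sketch, assuming your copy follows the GPU Operator chart layout of a driver block with a version key (the exact key name in the quickstart file may differ):

# ./setup/quickstart_api_bare_metal/gpu-operator-values.yml (excerpt, assumed layout)
driver:
  # NVIDIA driver the GPU Operator installs; replaces the 510.47.03 shown in the validation output above
  version: "525.60.13"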

Bingo! That fixed it.

Here is how to use a newer driver (525.60.13):

  1. Go to your directory with setup.sh. For me, it was:
    /project/tao/tao-getting-started_v4.0.2/setup/quickstart_api_bare_metal

  2. Run $ bash setup.sh install
    a. then supply 525.60.13 at the prompt.
    b. I entered 525.60.13, but it ignored my prompt and installed with 510.
    c. On the second try, I edited gpu-operator-values.yml, putting 525.60.13 in the obvious location.
    d. Maybe quoting it as “525.60.13” at the prompt would have worked; I didn’t try that.

  3. The install was fine.

  4. You have to re-fix the connection problem:
    a. Tao Toolkit API cannot login and got 401 unauthorized - #3 by Morganh

  5. Rerun the notebook.

Because you don’t have an NVIDIA driver installed on the host, you can’t see much.
On my AMD 5950 there is modest CPU utilization (similar to TAO jobs in the past: not a lot, roughly 10-20%).
You can monitor progress via the experiment log.

sudo find / -name log.txt

In my case, since I’ve done this a few times, there were 3 hits. The current log was NOT the last one in the list; check the dates by going back through the hits with $ ls -l
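
A shortcut that should work here (assuming all of the hits live under /mnt/nfs_share, as mine did) is to let find hand the results straight to ls, so the newest log is listed first:

sudo find /mnt/nfs_share -name log.txt -exec ls -lt {} +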

Then you can tail the log - this is better than watching the notebook:

sudo tail -ff /mnt/nfs_share/default-tao-toolkit-api-pvc-pvc-b22a191d-fda5-453c-8a3d-c0655dd852ba/users/7232e3cb-6aee-4b75-a485-ee65803508bc/models/360601a1-2f80-41c2-9161-c32e303a24b4/9598d22d-9ef7-4222-a82e-e88a1e9ef944/experiment_0/log.txt

LAST QUESTION (then you can close this case):

How can I verify it is using both GPUs? I have two water-cooled 3080 Tis, so there are no fans to tell me they’re under load.

  • There is no NVIDIA driver on the host.
  • In the kubectl pod, there is no nvidia-smi.
  • I checked the pods and can’t find a way to check there.
  • I checked the logs; NVIDIA logs usually show GPU connections when training starts, but I could find nothing.
kubectl get pods
NAME                                               READY   STATUS    RESTARTS   AGE
9598d22d-9ef7-4222-a82e-e88a1e9ef944-mqvr5         1/1     Running   0          15m
f336f97d-bfc8-457b-96a7-b3210876b95e-xd48t         1/1     Running   0          15m
ingress-nginx-controller-5ff6555d5d-95npf          1/1     Running   0          25m
nfs-subdir-external-provisioner-5f9cbb4554-2k4mc   1/1     Running   0          25m
nvidia-smi-5950x                                   1/1     Running   0          25m
tao-toolkit-api-app-pod-54c9c75fbc-rlvqx           1/1     Running   0          25m
tao-toolkit-api-workflow-pod-55b9bfc948-qx9vq      1/1     Running   0          25m

ubuntu@5950X:/home/jay$ kubectl exec --stdin --tty tao-toolkit-api-app-pod-54c9c75fbc-rlvqx -- /bin/bash
root@tao-toolkit-api-app-pod-54c9c75fbc-rlvqx:/opt/api# nvidia-smi
bash: nvidia-smi: command not found

You can try the command below.
$ kubectl exec nvidia-smi-5950x -- nvidia-smi
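
If you want to keep an eye on utilization while training runs (assuming watch is available on the host), you can also poll that pod:

$ watch -n 5 kubectl exec nvidia-smi-5950x -- nvidia-smi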

Perfect. Thanks for all of your help. The job is running perfectly.
My training job is only utilizing 1 of the 2 GPUs; further research led me to some helpful notes:

single vs multi-GPU

https://docs.nvidia.com/tao/tao-toolkit/text/faqs.html
[See: Training, multi-GPU vs single GPU]

How to use multi GPU training in tao toolkit

https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_api/api_deployment.html

I will try that when this job finishes.
My detectnet_v2 training set is 5000+ images. Training on a single RTX 3080 Ti is taking 56 hours, but so far it is generating results significantly better than the standard object_detection process in TAO 3.0.

There has been no update from you for a period, so we are assuming this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.

For running multiple GPUs, you can set numGpu in tao-toolkit-api/values.yaml.
numGpu is the number of GPUs assigned to each job. Note that multi-node training is not yet supported, so you are limited to the number of GPUs within a single cluster node for now.
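
For example, with two GPUs on the node it is a one-line change (excerpt only; the rest of values.yaml stays as installed). The chart then has to be re-deployed, for example by rerunning bash setup.sh install, before new jobs pick it up:

# tao-toolkit-api/values.yaml (excerpt)
numGpu: 2   # GPUs assigned to each job; capped by the GPUs on a single node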

Refer to Deployment - NVIDIA Docs

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.