Please provide the following information when requesting support.
• Hardware (T4/V100/Xavier/Nano/etc)
AMD64 (AMD 5950) computer
(2) RTX 3080 TIs
Ubuntu 20.04
TAO 4.0.2 bare metal API installation
using automl/object_detection.ipynb
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc)
Detectnet_v2
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
using TAO 4.0.2
tao-getting-started_v4.0.2/notebooks/tao_api_starter_kit/api/automl
• Training spec file(If have, please share here)
JMD_object_detection-Copy1.ipynb (106.2 KB)
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)
My notebook is attached; logically, it is no different from the default
automl/object_detection.ipynb notebook.
The training data was generated with DeepStream (6.2) transfer_learning_app using a detectnet_v2 model. The notebook runs the training job with no errors returned; it just keeps monitoring, and no compute resources are consumed.
kubectl get services
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
ingress-nginx-controller NodePort 10.107.222.63 <none> 80:32080/TCP,443:32443/TCP 8d
kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 8d
tao-toolkit-api-service NodePort 10.103.203.163 <none> 8000:31951/TCP 8d
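As a related check, the TAO pods themselves can be listed alongside the services. This is only a sketch; the grep pattern assumes the default pod naming of a bare-metal TAO API install, and it degrades gracefully if kubectl is not on the PATH:

```shell
# List pods across all namespaces and surface any TAO Toolkit API pods.
# (Assumes default pod naming from a bare-metal TAO API install.)
if command -v kubectl >/dev/null 2>&1; then
  PODS=$(kubectl get pods --all-namespaces 2>/dev/null | grep -i "tao-toolkit" || true)
  echo "${PODS:-no tao-toolkit pods found}"
else
  PODS="kubectl not available"
  echo "$PODS"
fi
```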
ubuntu@5950X:/home/jay$ kubectl logs tao-toolkit-api-workflow-pod-55b9bfc948-dndxz
nvidia driver modules are not yet loaded, invoking runc directly
NGC CLI 3.19.0
detectnet_v2 dataset_convert --results_dir /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/datasets/734ad012-f111-49ee-ac24-962c444c4e0e/87e78224-15b0-4930-9871-187e1c0b3501 --output_filename /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/datasets/734ad012-f111-49ee-ac24-962c444c4e0e/tfrecords/tfrecords --verbose --dataset_export_spec /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/datasets/734ad012-f111-49ee-ac24-962c444c4e0e/specs/87e78224-15b0-4930-9871-187e1c0b3501.yaml > /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/datasets/734ad012-f111-49ee-ac24-962c444c4e0e/logs/87e78224-15b0-4930-9871-187e1c0b3501.txt 2>&1 >> /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/datasets/734ad012-f111-49ee-ac24-962c444c4e0e/logs/87e78224-15b0-4930-9871-187e1c0b3501.txt; find /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/datasets/734ad012-f111-49ee-ac24-962c444c4e0e/87e78224-15b0-4930-9871-187e1c0b3501 -type d | xargs chmod 777; find /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/datasets/734ad012-f111-49ee-ac24-962c444c4e0e/87e78224-15b0-4930-9871-187e1c0b3501 -type f | xargs chmod 666 /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/datasets/734ad012-f111-49ee-ac24-962c444c4e0e/87e78224-15b0-4930-9871-187e1c0b3501/status.json
nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf1.15.5
Job created 87e78224-15b0-4930-9871-187e1c0b3501
Post running
Job Done: 87e78224-15b0-4930-9871-187e1c0b3501 Final status: Done
detectnet_v2 dataset_convert --results_dir /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/datasets/ac6d700d-8e4b-4fa6-ad47-3ab3dc4202c2/1a902bb2-fdc6-4efd-a5eb-7756cb1709f9 --output_filename /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/datasets/ac6d700d-8e4b-4fa6-ad47-3ab3dc4202c2/tfrecords/tfrecords --verbose --dataset_export_spec /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/datasets/ac6d700d-8e4b-4fa6-ad47-3ab3dc4202c2/specs/1a902bb2-fdc6-4efd-a5eb-7756cb1709f9.yaml > /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/datasets/ac6d700d-8e4b-4fa6-ad47-3ab3dc4202c2/logs/1a902bb2-fdc6-4efd-a5eb-7756cb1709f9.txt 2>&1 >> /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/datasets/ac6d700d-8e4b-4fa6-ad47-3ab3dc4202c2/logs/1a902bb2-fdc6-4efd-a5eb-7756cb1709f9.txt; find /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/datasets/ac6d700d-8e4b-4fa6-ad47-3ab3dc4202c2/1a902bb2-fdc6-4efd-a5eb-7756cb1709f9 -type d | xargs chmod 777; find /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/datasets/ac6d700d-8e4b-4fa6-ad47-3ab3dc4202c2/1a902bb2-fdc6-4efd-a5eb-7756cb1709f9 -type f | xargs chmod 666 /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/datasets/ac6d700d-8e4b-4fa6-ad47-3ab3dc4202c2/1a902bb2-fdc6-4efd-a5eb-7756cb1709f9/status.json
nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf1.15.5
Job created 1a902bb2-fdc6-4efd-a5eb-7756cb1709f9
Post running
Job Done: 1a902bb2-fdc6-4efd-a5eb-7756cb1709f9 Final status: Done
AutoML pipeline
detectnet_v2 train --gpus $NUM_GPUS -e /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/models/eb77f09a-4201-49ac-a2d2-2e4bf53fb175/3568b7c5-f389-4e1a-931d-ebe5cc6ffb92/recommendation_0.kitti -r /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/models/eb77f09a-4201-49ac-a2d2-2e4bf53fb175/3568b7c5-f389-4e1a-931d-ebe5cc6ffb92/experiment_0 -k tlt_encode > /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/models/eb77f09a-4201-49ac-a2d2-2e4bf53fb175/3568b7c5-f389-4e1a-931d-ebe5cc6ffb92/experiment_0/log.txt 2>&1 >> /shared/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/models/eb77f09a-4201-49ac-a2d2-2e4bf53fb175/3568b7c5-f389-4e1a-931d-ebe5cc6ffb92/experiment_0/log.txt
AutoML pipeline done
So this log says the job is done?
Looking for the experiment_{n}/log.txt
sudo find / -name log.txt
[sudo] password for jay:
/mnt/nfs_share/default-tao-toolkit-api-pvc-pvc-e337edb2-9dda-47ad-968f-076f83e13937/users/62de88a1-5e1b-4828-a254-c308517344d9/models/2e50ac8f-c7de-4fc5-a51d-be0407cdf696/82cf1e60-85ae-4c90-956c-bde04e885303/experiment_0/log.txt
/mnt/nfs_share/default-tao-toolkit-api-pvc-pvc-e337edb2-9dda-47ad-968f-076f83e13937/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/models/eb77f09a-4201-49ac-a2d2-2e4bf53fb175/3568b7c5-f389-4e1a-931d-ebe5cc6ffb92/experiment_0/log.txt
sudo tail -f /mnt/nfs_share/default-tao-toolkit-api-pvc-pvc-e337edb2-9dda-47ad-968f-076f83e13937/users/2c90fef2-6f4f-415a-b1b6-0655ac9a1b5b/models/eb77f09a-4201-49ac-a2d2-2e4bf53fb175/3568b7c5-f389-4e1a-931d-ebe5cc6ffb92/experiment_0/log.txt
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/session_manager.py", line 194, in _restore_checkpoint
sess = session.Session(self._target, graph=self._graph, config=config)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1585, in __init__
super(Session, self).__init__(target, graph, config=config)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 699, in __init__
self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InternalError: CUDA runtime implicit initialization on GPU:0 failed. Status: the provided PTX was compiled with an unsupported toolchain.
Telemetry data couldn't be sent, but the command ran successfully.
[WARNING]: __init__() missing 4 required positional arguments: 'code', 'msg', 'hdrs', and 'fp'
Execution status: FAIL
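For what it's worth, this PTX error usually means the CUDA toolkit the container's kernels were built with is newer than what the host driver supports. A hedged way to compare the two sides (the docker invocation is an assumption; it requires the image tag from the workflow log above to be pullable, and nvcc to be present in the image):

```shell
# Report the highest CUDA version the installed host driver supports.
if command -v nvidia-smi >/dev/null 2>&1; then
  MAX_CUDA=$(nvidia-smi | grep -o "CUDA Version: [0-9.]*" | head -n1)
else
  MAX_CUDA="nvidia-smi not available"
fi
echo "Host driver reports: ${MAX_CUDA}"

# To see the CUDA toolkit inside the TAO container (if nvcc is present):
#   docker run --rm nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf1.15.5 nvcc --version
# If the container's CUDA is newer than the driver's maximum supported
# version, this PTX error is the expected symptom.
```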
Setup validation:
bash setup.sh validate
# a portion of the output (no errors reported)
TASK [Report Versions] *************************************************
ok: [127.0.0.2] => {
"msg": [
"===========================================================================================",
" Components Matrix Version || Installed Version ",
"===========================================================================================",
"GPU Operator Version v1.10.1 || v1.10.1",
"Nvidia Container Driver Version 510.47.03 || 510.47.03",
"GPU Operator NV Toolkit Driver v1.9.0 || 4.0.2",
"K8sDevice Plugin Version v0.11.0 || v0.11.0",
"Data Center GPU Manager(DCGM) Version 2.3.4-2.6.4 || 2.3.4-2.6.4",
"Node Feature Discovery Version v0.10.1 || v0.10.1",
"GPU Feature Discovery Version v0.5.0 || v0.5.0",
"Nvidia validator version v1.10.1 || v1.10.1",
"Nvidia MIG Manager version 0.3.0 || ",
"",
"Note: NVIDIA Mig Manager is valid for only Amphere GPU's like A100, A30",
"",
"Please validate between Matrix Version and Installed Version listed above"
]
}
Again, I’m running with (2) NVIDIA RTX 3080 Tis.
I’m assuming this is related to GPU drivers and architecture; the API install reported no errors during installation or validation.
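If it helps narrow things down, the RTX 3080 Ti is an Ampere part (compute capability 8.6), so a driver/toolkit version mismatch is a more likely culprit than the architecture itself. A hedged check (the compute_cap query field requires a reasonably recent driver, so this falls back to name-only if it's unsupported):

```shell
# Query GPU name and compute capability from the host driver.
# (compute_cap is supported on recent nvidia-smi versions only.)
if command -v nvidia-smi >/dev/null 2>&1; then
  CAPS=$(nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader 2>/dev/null \
         || nvidia-smi --query-gpu=name --format=csv,noheader)
else
  CAPS="nvidia-smi not available"
fi
echo "$CAPS"
```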