I’m trying to train my first model with the TAO Toolkit, following the Jupyter notebook “yolo_v4_tiny.ipynb” step by step.
I managed to follow the steps without any problems until I tried to start the training process with the following cell:
print("To run with multigpu, please change --gpus based on the number of available GPUs in your machine.")
!tao yolo_v4_tiny train -e $SPECS_DIR/yolo_v4_tiny_train_kitti.txt \
-r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
-k $KEY \
--gpus 1
This cell always terminates with an error; here’s the log:
To run with multigpu, please change --gpus based on the number of available GPUs in your machine.
2023-01-27 12:32:33,264 [INFO] root: Registry: ['nvcr.io']
2023-01-27 12:32:33,318 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf1.15.5
2023-01-27 12:32:33,328 [WARNING] tlt.components.docker_handler.docker_handler:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/techboard/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
Using TensorFlow backend.
2023-01-27 11:32:39.189462: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn't match a supported version!
RequestsDependencyWarning)
2023-01-27 11:32:41,337 [WARNING] modulus.export._tensorrt: Failed to import TRT and/or CUDA. TensorRT optimization and inference will not be available.
2023-01-27 11:32:41,337 [WARNING] iva.common.export.keras_exporter: Failed to import TensorRT package, exporting TLT to a TensorRT engine will not be available.
2023-01-27 11:32:41,341 [WARNING] iva.common.export.base_exporter: Failed to import TensorRT package, exporting TLT to a TensorRT engine will not be available.
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn't match a supported version!
RequestsDependencyWarning)
--------------------------------------------------------------------------
An error occurred while trying to map in the address of a function.
Function Name: cuIpcOpenMemHandle_v2
Error string: /usr/lib/x86_64-linux-gnu/libcuda.so.1: undefined symbol: cuIpcOpenMemHandle_v2
CUDA-aware support is disabled.
--------------------------------------------------------------------------
[4fbf0fac2a90:154 :0:209] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid: 209) ====
0 0x0000000000043090 killpg() ???:0
=================================
[4fbf0fac2a90:00154] *** Process received signal ***
[4fbf0fac2a90:00154] Signal: Segmentation fault (11)
[4fbf0fac2a90:00154] Signal code: (-6)
[4fbf0fac2a90:00154] Failing at address: 0x9a
[4fbf0fac2a90:00154] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7f5f6a152090]
[4fbf0fac2a90:00154] *** End of error message ***
Telemetry data couldn't be sent, but the command ran successfully.
[WARNING]: <urlopen error [Errno -2] Name or service not known>
Execution status: FAIL
2023-01-27 12:32:45,583 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
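For reference, the warning near the top of the log suggests adding a "user" entry to the DockerOptions section of /home/techboard/.tao_mounts.json. I haven’t tried that yet; below is a minimal sketch of what I believe the file would look like with that entry added (the mount paths are placeholders, not my actual ones, and "1000:1000" stands in for the output of `id -u` and `id -g`):

```json
{
    "Mounts": [
        {
            "source": "/home/techboard/tao-experiments",
            "destination": "/workspace/tao-experiments"
        }
    ],
    "DockerOptions": {
        "user": "1000:1000"
    }
}
```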
I’m running TAO on the following machine:
• OS: Ubuntu 22.04.1 LTS
• GPU: Quadro 4000/PCIe/SSE2
• CPU: Intel® Core™ i7-10700 CPU @ 2.90GHz × 16
I followed every previous step without any changes and downloaded the required dataset to the right location, so I really don’t know what the problem could be.
Thanks in advance