TAO yolo_v4_tiny training fails with a segmentation fault (undefined symbol: cuIpcOpenMemHandle_v2)

I’m trying to train my first model with the TAO Toolkit, following the Jupyter notebook “yolo_v4_tiny.ipynb” step by step.
Every step worked until I tried to start training with the following cell:

print("To run with multigpu, please change --gpus based on the number of available GPUs in your machine.")
!tao yolo_v4_tiny train -e $SPECS_DIR/yolo_v4_tiny_train_kitti.txt \
                   -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
                   -k $KEY \
                   --gpus 1

This command always terminates with an error. Here’s the log:

To run with multigpu, please change --gpus based on the number of available GPUs in your machine.
2023-01-27 12:32:33,264 [INFO] root: Registry: ['nvcr.io']
2023-01-27 12:32:33,318 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf1.15.5
2023-01-27 12:32:33,328 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/techboard/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
Using TensorFlow backend.
2023-01-27 11:32:39.189462: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
2023-01-27 11:32:41,337 [WARNING] modulus.export._tensorrt: Failed to import TRT and/or CUDA. TensorRT optimization and inference will not be available.
2023-01-27 11:32:41,337 [WARNING] iva.common.export.keras_exporter: Failed to import TensorRT package, exporting TLT to a TensorRT engine will not be available.
2023-01-27 11:32:41,341 [WARNING] iva.common.export.base_exporter: Failed to import TensorRT package, exporting TLT to a TensorRT engine will not be available.
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
--------------------------------------------------------------------------
An error occurred while trying to map in the address of a function.
  Function Name: cuIpcOpenMemHandle_v2
  Error string:  /usr/lib/x86_64-linux-gnu/libcuda.so.1: undefined symbol: cuIpcOpenMemHandle_v2
CUDA-aware support is disabled.
--------------------------------------------------------------------------
[4fbf0fac2a90:154  :0:209] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid:    209) ====
 0 0x0000000000043090 killpg()  ???:0
=================================
[4fbf0fac2a90:00154] *** Process received signal ***
[4fbf0fac2a90:00154] Signal: Segmentation fault (11)
[4fbf0fac2a90:00154] Signal code:  (-6)
[4fbf0fac2a90:00154] Failing at address: 0x9a
[4fbf0fac2a90:00154] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7f5f6a152090]
[4fbf0fac2a90:00154] *** End of error message ***
Telemetry data couldn't be sent, but the command ran successfully.
[WARNING]: <urlopen error [Errno -2] Name or service not known>
Execution status: FAIL
2023-01-27 12:32:45,583 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

I’m running TAO on the following machine:
• OS: Ubuntu 22.04.1 LTS
• GPU: Quadro 4000/PCIe/SSE2
• CPU: Intel® Core™ i7-10700 CPU @ 2.90GHz × 16

I followed every previous step without any changes and downloaded the required dataset to the right location, so I really don’t know what is causing the problem.

Thanks in advance

Can you share the output of nvidia-smi?

Thu Feb  2 08:38:51 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.157                Driver Version: 390.157                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro 4000         Off  | 00000000:01:00.0  On |                  N/A |
| 36%   52C    P1    N/A /  N/A |    276MiB /  1982MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1111      G   /usr/lib/xorg/Xorg                            99MiB |
|    0      1871      G   /usr/bin/gnome-shell                         143MiB |
|    0      3033      G   /snap/firefox/1635/usr/lib/firefox/firefox     1MiB |
|    0     38952      G   ...features=SpareRendererForSitePerProcess    22MiB |
|    0     88924      G   /usr/bin/nvidia-settings                       3MiB |
|    0    129559      G   gnome-control-center                           1MiB |
+-----------------------------------------------------------------------------+

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks

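Please update the NVIDIA driver. Version 390.157 is too old for the TAO 4.0.0 container, which loads CUDA 11 libraries (see libcudart.so.11.0 in your log).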

sudo apt purge nvidia-driver-390
sudo apt autoremove
sudo apt autoclean
sudo apt install nvidia-driver-520
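
After reinstalling the driver (a reboot is usually needed before the new kernel module loads), it can help to confirm which version is active before rerunning the notebook. The following is a minimal sketch, not part of the TAO notebook. The helper names are hypothetical, and the 450.80.02 threshold is an assumption based on the minimum Linux driver NVIDIA lists for CUDA 11.x minor-version compatibility; the TAO release notes may require something newer.

```python
import subprocess

# Assumption: 450.80.02 is the minimum Linux driver for CUDA 11.x
# minor-version compatibility. Check the TAO release notes for the
# version your container actually requires.
MIN_DRIVER = (450, 80, 2)


def parse_version(text):
    """Turn a driver string such as '390.157' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_is_new_enough(version, minimum=MIN_DRIVER):
    """Return True if the driver version string is at least `minimum`."""
    return parse_version(version) >= minimum


def query_driver_version():
    """Ask nvidia-smi for the driver version; return None if that fails."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True, text=True, check=True, timeout=10,
        )
    except (OSError, subprocess.SubprocessError):
        return None
    lines = result.stdout.strip().splitlines()
    return lines[0].strip() if lines else None


if __name__ == "__main__":
    version = query_driver_version()
    try:
        if version is None:
            print("Could not query nvidia-smi; is the driver installed and loaded?")
        elif driver_is_new_enough(version):
            print(f"Driver {version} meets the assumed minimum.")
        else:
            print(f"Driver {version} is older than the assumed minimum.")
    except ValueError:
        print(f"Unexpected driver version string: {version!r}")
```

With the driver reported earlier in this thread, driver_is_new_enough("390.157") returns False. Whether a newer driver branch can be installed at all depends on the GPU generation, so it is worth checking NVIDIA's supported-products list for the target driver before purging the old one.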
