TAO yolov4_tiny training fails with error

d.lugli · February 1, 2023, 8:24am

I’m trying to train my first model using the TAO Toolkit, following the Jupyter notebook “yolo_v4_tiny.ipynb” step by step.
I followed the steps without any problems until I tried to start the training process with the following code:

print("To run with multigpu, please change --gpus based on the number of available GPUs in your machine.")
!tao yolo_v4_tiny train -e $SPECS_DIR/yolo_v4_tiny_train_kitti.txt \
                   -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
                   -k $KEY \
                   --gpus 1

This code always terminates with an error. Here’s the log:

To run with multigpu, please change --gpus based on the number of available GPUs in your machine.
2023-01-27 12:32:33,264 [INFO] root: Registry: ['nvcr.io']
2023-01-27 12:32:33,318 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf1.15.5
2023-01-27 12:32:33,328 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/techboard/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
Using TensorFlow backend.
2023-01-27 11:32:39.189462: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
2023-01-27 11:32:41,337 [WARNING] modulus.export._tensorrt: Failed to import TRT and/or CUDA. TensorRT optimization and inference will not be available.
2023-01-27 11:32:41,337 [WARNING] iva.common.export.keras_exporter: Failed to import TensorRT package, exporting TLT to a TensorRT engine will not be available.
2023-01-27 11:32:41,341 [WARNING] iva.common.export.base_exporter: Failed to import TensorRT package, exporting TLT to a TensorRT engine will not be available.
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
--------------------------------------------------------------------------
An error occurred while trying to map in the address of a function.
  Function Name: cuIpcOpenMemHandle_v2
  Error string:  /usr/lib/x86_64-linux-gnu/libcuda.so.1: undefined symbol: cuIpcOpenMemHandle_v2
CUDA-aware support is disabled.
--------------------------------------------------------------------------
[4fbf0fac2a90:154  :0:209] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid:    209) ====
 0 0x0000000000043090 killpg()  ???:0
=================================
[4fbf0fac2a90:00154] *** Process received signal ***
[4fbf0fac2a90:00154] Signal: Segmentation fault (11)
[4fbf0fac2a90:00154] Signal code:  (-6)
[4fbf0fac2a90:00154] Failing at address: 0x9a
[4fbf0fac2a90:00154] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7f5f6a152090]
[4fbf0fac2a90:00154] *** End of error message ***
Telemetry data couldn't be sent, but the command ran successfully.
[WARNING]: <urlopen error [Errno -2] Name or service not known>
Execution status: FAIL
2023-01-27 12:32:45,583 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

I’m running TAO on the following machine:
• OS: Ubuntu 22.04.1 LTS
• GPU: Quadro 4000/PCIe/SSE2
• CPU: Intel® Core™ i7-10700 CPU @ 2.90GHz × 16

I followed all the previous steps without any changes and downloaded the required dataset to the right location, so I really don’t know what could be causing the problem.

Thanks in advance

Morganh · February 1, 2023, 5:24pm

Can you share the output of nvidia-smi?

d.lugli · February 2, 2023, 7:39am

Thu Feb  2 08:38:51 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.157                Driver Version: 390.157                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro 4000         Off  | 00000000:01:00.0  On |                  N/A |
| 36%   52C    P1    N/A /  N/A |    276MiB /  1982MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1111      G   /usr/lib/xorg/Xorg                            99MiB |
|    0      1871      G   /usr/bin/gnome-shell                         143MiB |
|    0      3033      G   /snap/firefox/1635/usr/lib/firefox/firefox     1MiB |
|    0     38952      G   ...features=SpareRendererForSitePerProcess    22MiB |
|    0     88924      G   /usr/bin/nvidia-settings                       3MiB |
|    0    129559      G   gnome-control-center                           1MiB |
+-----------------------------------------------------------------------------+

Morganh · February 2, 2023, 3:43pm

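Please update the driver. Version 390.157 is older than the CUDA 11 libraries used inside the container (the log loads libcudart.so.11.0), which is why libcuda.so.1 is missing the cuIpcOpenMemHandle_v2 symbol.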

sudo apt purge nvidia-driver-390
sudo apt autoremove
sudo apt autoclean
sudo apt install nvidia-driver-520
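
Before and after changing the driver, it may help to check the reported version from the notebook itself. The following is only a minimal sketch: it parses the "Driver Version" field from nvidia-smi header text like the one posted above and compares the major number against a threshold. The helper names parse_driver_version and driver_is_at_least are hypothetical, and the threshold of 520 is simply taken from the driver package named in this reply, not from an official requirement table.

```python
import re


def parse_driver_version(smi_text):
    """Return the driver version found in nvidia-smi header text as a tuple of ints.

    Returns None when no "Driver Version:" field is present.
    """
    match = re.search(r"Driver Version:\s*([0-9]+(?:\.[0-9]+)*)", smi_text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def driver_is_at_least(smi_text, minimum_major):
    """Check whether the driver's major version meets a caller-chosen minimum.

    minimum_major is an assumption supplied by the caller, not an official value.
    """
    version = parse_driver_version(smi_text)
    return version is not None and version[0] >= minimum_major


# Header line copied from the nvidia-smi output posted in this thread.
header = "| NVIDIA-SMI 390.157                Driver Version: 390.157                   |"

print(parse_driver_version(header))     # (390, 157)
print(driver_is_at_least(header, 520))  # False
```

On a live machine the same text could be captured with subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout and passed to these helpers.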

