Jupyter Notebook SSD error

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc)
Windows 10 + WSL + DOCKER + GPU
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc)
Object detection SSD
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
5.0.0-tf2
• Training spec file(If have, please share here)
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

Hi,

I’m facing some issues with the Jupyter Notebooks, and I’m starting to think that maybe I am doing something wrong:

  • I started with the notebooks included in the getting started package, but they seemed to be out of date. Then I found a post where someone pointed to a direct download of the TAO 5.0 updated notebooks, but they have the same errors.

  • I am trying to run the SSD example, and the first issue I encounter is that the TAO installation step (!pip3 install nvidia-tao) installs version 4.0. This is what tao info reports:
    Configuration of the TAO Toolkit Instance
    dockers: ['nvidia/tao/tao-toolkit']
    format_version: 2.0
    toolkit_version: 4.0.1
    published_date: 03/06/2023

This seemed strange to me, but I tried to continue with the example.
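
For anyone hitting the same thing, a quick way to confirm which launcher version pip actually installed, using only standard pip and the launcher itself, is:

!pip3 show nvidia-tao     # prints the installed launcher package version (4.0.1 in my case)
!tao info --verbose       # toolkit_version should be 5.x when using the 5.0 notebooks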

  • The next error I encountered is in the dataset conversion step; the command does not work:

!tao model ssd dataset_convert \
-d $SPECS_DIR/ssd_tfrecords_kitti_train.txt \
-o $DATA_DOWNLOAD_DIR/ssd/tfrecords/kitti_train \
-r $USER_EXPERIMENT_DIR/

It works if I remove “model” from it, but is that how it is supposed to work, or are there other changes I need to be aware of?
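
For comparison, my understanding (not verified against the official docs) is that the 4.x launcher did not use the model keyword, while the 5.0 launcher expects it between tao and the task name, so the two forms look like this:

# 4.x-style launcher command (what my installed 4.0.1 launcher accepted)
!tao ssd dataset_convert \
-d $SPECS_DIR/ssd_tfrecords_kitti_train.txt \
-o $DATA_DOWNLOAD_DIR/ssd/tfrecords/kitti_train \
-r $USER_EXPERIMENT_DIR/

# 5.0-style launcher command (what the 5.0 notebook uses)
!tao model ssd dataset_convert \
-d $SPECS_DIR/ssd_tfrecords_kitti_train.txt \
-o $DATA_DOWNLOAD_DIR/ssd/tfrecords/kitti_train \
-r $USER_EXPERIMENT_DIR/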

I am asking because if I keep going and try to train the model, it downloads the 4.0-tf1 Docker image and I get some segmentation errors, so I am not sure whether it is supposed to work like this or I am doing something wrong.

One last thing: is the key needed to run the training nvidia_tlt or my NGC key?

Hi,

Someone pointed out in another post that TAO 5.0 gets installed if you create the environment with Python 3.7 instead of 3.6. I started from scratch, and now the commands in the notebook work as expected, but could you please update the software requirements in the quick start guide?

python     >=3.6.9<3.7      Not needed if you use TAO toolkit API
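
For anyone else hitting this, here is a minimal sketch of the environment that ended up working for me (assuming conda is used; a plain virtualenv with a Python 3.7 interpreter should behave the same way):

$ conda create -n tao python=3.7 -y    # Python 3.7 instead of 3.6
$ conda activate tao
$ pip3 install nvidia-tao              # now resolves to the 5.x launcher
$ tao info --verbose                   # confirm toolkit_version is 5.x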

The bad news: I get the exact same error while trying to train the model:

DALI daliCreatePipeline(&pipe_handle_, serialized_pipeline.c_str(), serialized_pipeline.length(), max_batch_size, num_threads, device_id, exec_separated, prefetch_queue_depth_, cpu_prefetch_queue_depth, prefetch_queue_depth_, enable_memory_stats_) failed: Critical error when building pipeline:
Error when constructing operator: decoders__Image encountered:
Error in thread 0: nvml error (3): The nvml requested operation is not available on target device
Current pipeline object is no longer valid.
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 0 on node af56a70c5a4c exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Execution status: FAIL

Thanks for the info. Will check further and update.

Could you share the spec file? Also, did you ever run the official 5.0 SSD notebook successfully?

Hi,

This is the first time I am trying to train a model with TAO. The spec file is this one (!cat $LOCAL_SPECS_DIR/ssd_tfrecords_kitti_train.txt):

kitti_config {
  root_directory_path: "/workspace/tao-experiments/data/kitti_split/training"
  image_dir_name: "image"
  label_dir_name: "label"
  image_extension: ".png"
  partition_mode: "random"
  num_partitions: 2
  val_split: 0
  num_shards: 10
}
image_directory_path: "/workspace/tao-experiments/data/kitti_split/training"
target_class_mapping {
    key: "car"
    value: "car"
}
target_class_mapping {
    key: "pedestrian"
    value: "pedestrian"
}
target_class_mapping {
    key: "cyclist"
    value: "cyclist"
}
target_class_mapping {
    key: "van"
    value: "car"
}
target_class_mapping {
    key: "person_sitting"
    value: "pedestrian"
}

Sorry if this is not what you asked for, I’m a bit lost here…

The 5.0 notebook is https://github.com/NVIDIA/tao_tutorials/blob/main/notebooks/tao_launcher_starter_kit/ssd/ssd.ipynb.
You can run commands similar to:
!tao model ssd dataset_convert \
-d $SPECS_DIR/ssd_tfrecords_kitti_train.txt \
-o $DATA_DOWNLOAD_DIR/ssd/tfrecords/kitti_train \
-r $USER_EXPERIMENT_DIR/

The example spec file is: https://github.com/NVIDIA/tao_tutorials/blob/main/notebooks/tao_launcher_starter_kit/ssd/specs/ssd_tfrecords_kitti_train.txt

I am afraid your images path is not available inside the docker.
Please double-check the ~/.tao_mounts.json file for the mapping.
You can also run this command to check:
$ tao model ssd run ls /workspace/tao-experiments/data/kitti_split/training

Hi Morganh,

The spec file is the same, and if I execute the command

!tao model ssd run ls /workspace/tao-experiments/data/kitti_split/training

It shows the folders image and label; if I create a folder inside the training folder on the host PC, it shows up inside the Docker container too, and the image and label folders are full of images and txt files. So the paths are apparently OK, right?

I'm going to check whether the notebook is exactly the same. Could it be something related to the user/login in the Docker image? This is the tao_mounts.json (see the permission check right after it):

{
    "Mounts": [
        {
            "source": "/home/eines/getting_started_v5.0.0/notebooks/tao_launcher_starter_kit/ssd/workspace",
            "destination": "/workspace/tao-experiments"
        },
        {
            "source": "/home/eines/getting_started_v5.0.0/notebooks/tao_launcher_starter_kit/ssd/specs",
            "destination": "/workspace/tao-experiments/ssd/specs"
        }
    ],
    "DockerOptions": {
        "user": "1000:1001"
    }
}
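
Regarding my user/login suspicion, a quick check I can think of (just standard shell commands, nothing TAO-specific) is to confirm that the uid:gid in DockerOptions matches the owner of the mounted data on the host, since a mismatch could show up as permission errors inside the container:

$ id -u && id -g    # should match "user": "1000:1001" above
$ ls -ln /home/eines/getting_started_v5.0.0/notebooks/tao_launcher_starter_kit/ssd/workspace
# the numeric owner/group shown here should be readable by that uid:gid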

Another question: I'm using WSL with Ubuntu 20.04, but the Docker image runs in Docker Desktop for Windows with WSL integration. It seems to work, but is this the correct setup, or do I need to install Docker CE inside WSL?
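
I am not sure which of the two setups is recommended, but a sanity check for whichever Docker daemon is actually in use (the CUDA image tag below is only an example; any CUDA base image should do) is to confirm that it can see the GPUs:

$ docker info | grep -i runtime    # nvidia may or may not be listed, depending on Docker Desktop vs docker-ce
$ docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu20.04 nvidia-smi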

Hi,

This is the error (Invalid permissions?):

--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
DALI daliCreatePipeline(&pipe_handle_, serialized_pipeline.c_str(), serialized_pipeline.length(), max_batch_size, num_threads, device_id, exec_separated, prefetch_queue_depth_, cpu_prefetch_queue_depth, prefetch_queue_depth_, enable_memory_stats_) failed: Critical error when building pipeline:
Error when constructing operator: decoders__Image encountered:
Error in thread 0: nvml error (3): The nvml requested operation is not available on target device
Current pipeline object is no longer valid.
[12c7ad3a43b8:00249] *** Process received signal ***
[12c7ad3a43b8:00249] Signal: Segmentation fault (11)
[12c7ad3a43b8:00249] Signal code: Invalid permissions (2)
[12c7ad3a43b8:00249] Failing at address: 0x800000019
[12c7ad3a43b8:00249] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7f821db17090]
[12c7ad3a43b8:00249] [ 1] /usr/lib/wsl/drivers/nv_dispig.inf_amd64_7e5fd280efaa5445/libcuda.so.1.1(+0x24c3d0)[0x7f804e47d3d0]
[12c7ad3a43b8:00249] [ 2] /usr/lib/wsl/drivers/nv_dispig.inf_amd64_7e5fd280efaa5445/libcuda.so.1.1(+0x2c768f)[0x7f804e4f868f]
[12c7ad3a43b8:00249] [ 3] /usr/local/lib/python3.8/dist-packages/nvidia/dali/libdali.so(+0x5fa2f0)[0x7f817eecd2f0]
[12c7ad3a43b8:00249] [ 4] /usr/local/lib/python3.8/dist-packages/nvidia/dali/libdali.so(+0x658158)[0x7f817ef2b158]
[12c7ad3a43b8:00249] [ 5] /usr/local/lib/python3.8/dist-packages/nvidia/dali/libdali.so(daliDeletePipeline+0x3c)[0x7f817eab426c]
[12c7ad3a43b8:00249] [ 6] /usr/local/lib/python3.8/dist-packages/nvidia/dali_tf_plugin/libdali_tf_1_15.so(_ZN12dali_tf_impl6DaliOpD0Ev+0x54)[0x7f8129811ea4]
[12c7ad3a43b8:00249] [ 7] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZN10tensorflow14CreateOpKernelENS_10DeviceTypeEPNS_10DeviceBaseEPNS_9AllocatorEPNS_22FunctionLibraryRuntimeERKNS_7NodeDefEiPPNS_8OpKernelE+0x98d)[0x7f819aab11cd]
[12c7ad3a43b8:00249] [ 8] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZN10tensorflow21CreateNonCachedKernelEPNS_6DeviceEPNS_22FunctionLibraryRuntimeERKNS_7NodeDefEiPPNS_8OpKernelE+0xf2)[0x7f819ad58f52]
[12c7ad3a43b8:00249] [ 9] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZN10tensorflow26FunctionLibraryRuntimeImpl12CreateKernelERKNS_7NodeDefEPNS_22FunctionLibraryRuntimeEPPNS_8OpKernelE+0x9a3)[0x7f819ad79003]
[12c7ad3a43b8:00249] [10] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZN10tensorflow26FunctionLibraryRuntimeImpl12CreateKernelERKNS_7NodeDefEPPNS_8OpKernelE+0x18)[0x7f819ad793f8]
[12c7ad3a43b8:00249] [11] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/../libtensorflow_cc.so.1(+0x60356a3)[0x7f81a1c416a3]
[12c7ad3a43b8:00249] [12] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZN10tensorflow9OpSegment12FindOrCreateERKSsS2_PPNS_8OpKernelESt8functionIFNS_6StatusES5_EE+0x1ba)[0x7f819aab23ba]
[12c7ad3a43b8:00249] [13] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/../libtensorflow_cc.so.1(+0x6035c82)[0x7f81a1c41c82]
[12c7ad3a43b8:00249] [14] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(+0x1141f58)[0x7f819ad67f58]
[12c7ad3a43b8:00249] [15] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZN10tensorflow16NewLocalExecutorERKNS_19LocalExecutorParamsESt10unique_ptrIKNS_5GraphESt14default_deleteIS5_EEPPNS_8ExecutorE+0x6b)[0x7f819ad6957b]
[12c7ad3a43b8:00249] [16] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(+0x114360d)[0x7f819ad6960d]
[12c7ad3a43b8:00249] [17] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZN10tensorflow11NewExecutorERKSsRKNS_19LocalExecutorParamsESt10unique_ptrIKNS_5GraphESt14default_deleteIS7_EEPS5_INS_8ExecutorES8_ISB_EE+0x66)[0x7f819ad69e56]
[12c7ad3a43b8:00249] [18] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/../libtensorflow_cc.so.1(_ZN10tensorflow13DirectSession15CreateExecutorsERKNS_15CallableOptionsEPSt10unique_ptrINS0_16ExecutorsAndKeysESt14default_deleteIS5_EEPS4_INS0_12FunctionInfoES6_ISA_EEPNS0_12RunStateArgsE+0xd31)[0x7f81a1c53cd1]
[12c7ad3a43b8:00249] [19] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/../libtensorflow_cc.so.1(_ZN10tensorflow13DirectSession12MakeCallableERKNS_15CallableOptionsEPx+0x129)[0x7f81a1c565a9]
[12c7ad3a43b8:00249] [20] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow10SessionRef12MakeCallableERKNS_15CallableOptionsEPx+0x31d)[0x7f821899cfed]
[12c7ad3a43b8:00249] [21] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(+0xec3a2)[0x7f82189963a2]
[12c7ad3a43b8:00249] [22] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(+0x8db9a)[0x7f8218937b9a]
[12c7ad3a43b8:00249] [23] python(PyCFunction_Call+0xfa)[0x5f5bda]
[12c7ad3a43b8:00249] [24] python(_PyObject_MakeTpCall+0x296)[0x5f6706]
[12c7ad3a43b8:00249] [25] python(_PyEval_EvalFrameDefault+0x5db3)[0x571143]
[12c7ad3a43b8:00249] [26] python(_PyFunction_Vectorcall+0x1b6)[0x5f5ee6]
[12c7ad3a43b8:00249] [27] python[0x59c39d]
[12c7ad3a43b8:00249] [28] python(_PyObject_MakeTpCall+0x1ff)[0x5f666f]
[12c7ad3a43b8:00249] [29] python(_PyEval_EvalFrameDefault+0x5db3)[0x571143]
[12c7ad3a43b8:00249] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node 12c7ad3a43b8 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Execution status: FAIL
2023-10-04 11:14:03,443 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 337: Stopping container.

Hi,

I started from scratch several times; I used Docker inside WSL as well as Docker Desktop for Windows, and nothing changed.

Then I tried a different model, YOLOv4, with the same strategy and steps, and that model is training, so it seems to be something related to the SSD model itself. Could it be something about my GPUs?

Wed Oct  4 16:08:20 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 536.23                 Driver Version: 536.23       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2080 Ti   WDDM  | 00000000:3B:00.0 Off |                  N/A |
| 30%   36C    P8              20W / 250W |      0MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 2080 Ti   WDDM  | 00000000:AF:00.0 Off |                  N/A |
| 30%   35C    P8               9W / 250W |    414MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A2000 12GB        WDDM  | 00000000:D8:00.0 Off |                  Off |
| 30%   35C    P8               4W /  70W |     51MiB / 12282MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

Update: the YOLOv4 model training stopped just after the first epoch finished:

INFO: Starting Training Loop.
Epoch 1/80
842/842 [==============================] - 513s 610ms/step - loss: 17319.5907
INFO: 2 root error(s) found.
  (0) Unknown: ncclCommInitRank failed: unhandled cuda error
	 [[node MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_loss_0 (defined at /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
  (1) Unknown: ncclCommInitRank failed: unhandled cuda error
	 [[node MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_loss_0 (defined at /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
	 [[MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_loss_0/_16717]]
0 successful operations.
0 derived errors ignored.

Hi,

Please, any help?

Edit: I disabled two of the three GPUs and it worked! It seems that having multiple GPUs in the host PC, even when they are not used in the training command (--gpus 1), affects the process in some way. I also tried with just the two GeForce RTX 2080 Ti cards alone, but with the same results.

So, is there something else I need to do if I want to use several GPUs on the same PC?
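
In case it helps: an alternative to disabling devices in Windows might be pinning the training to specific GPUs from the launcher. I have not verified these flags on 5.0, and the spec/results paths below are only placeholders, so treat this as a sketch:

!tao model ssd train --gpus 1 --gpu_index 0 \
-e $SPECS_DIR/<train_spec>.txt \
-r $USER_EXPERIMENT_DIR/<results_dir> \
-k $KEY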

Glad to know 1 GPU works on WSL.
For multi-GPU on WSL, please check whether the NCCL test works:

$ git clone https://github.com/NVIDIA/nccl-tests.git
$ cd nccl-tests/
$ make
$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3
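
If the multi-GPU runs of the test fail on WSL2, one workaround that is often suggested (not TAO-specific, and not verified for this exact setup) is to disable NCCL peer-to-peer transfers before re-running the test or the training:

$ export NCCL_P2P_DISABLE=1    # fall back to non-P2P transfers
$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3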

For more info on WSL, you can search the TAO forum. For example:
TLT 3.0 & WSL2 issues - #11 by Morganh.

From the nvidia-smi output, your machine has 3 GPUs. You installed Windows on this machine, and currently you have WSL installed on top of Windows, right?

That’s right.

I conducted more tests over the weekend. I installed Ubuntu 20.04, and training with two GPUs proceeded without any issues. This leads me to believe that the problem lies with WSL, the Nvidia drivers, or CUDA.

Now that I have TAO running smoothly on an Ubuntu server, several test ideas come to mind. I appreciate your patience in advance – I’m sure I’ll have many questions!

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.