Jupyter Notebook SSD error

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc)
Windows 10 + WSL + DOCKER + GPU
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc)
Object detection SSD
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
5.0.0-tf2
• Training spec file(If have, please share here)
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

Hi,

I’m facing some issues with the Jupyter Notebooks, and I’m starting to think that maybe I am doing something wrong:

  • I started with the notebooks included in the getting started package, but they seemed to be out of date. Then I found a post where someone pointed to a direct download of the TAO 5.0 updated notebooks, but they have the same errors.

  • I am trying to run the SSD example, and the first issue I encounter is that the TAO installation step (!pip3 install nvidia-tao) installs version 4.0. This is what tao info reports:
    Configuration of the TAO Toolkit Instance
    dockers: ['nvidia/tao/tao-toolkit']
    format_version: 2.0
    toolkit_version: 4.0.1
    published_date: 03/06/2023

This seemed strange to me, but I tried to continue with the example.
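
For anyone hitting the same thing, a quick way to confirm which launcher version pip actually installed, using only standard pip and the launcher itself, is:

!pip3 show nvidia-tao     # prints the installed launcher package version (4.0.1 in my case)
!tao info --verbose       # toolkit_version should be 5.x when using the 5.0 notebooks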

  • The next error I encountered is in the dataset conversion step; the command does not work:

!tao model ssd dataset_convert \
-d $SPECS_DIR/ssd_tfrecords_kitti_train.txt \
-o $DATA_DOWNLOAD_DIR/ssd/tfrecords/kitti_train \
-r $USER_EXPERIMENT_DIR/

It works if I remove “model” from it, but is that how it is supposed to work, or are there other changes I need to be aware of?
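
For comparison, my understanding (not verified against the official docs) is that the 4.x launcher did not use the model keyword, while the 5.0 launcher expects it between tao and the task name, so the two forms look like this:

# 4.x-style launcher command (what my installed 4.0.1 launcher accepted)
!tao ssd dataset_convert \
-d $SPECS_DIR/ssd_tfrecords_kitti_train.txt \
-o $DATA_DOWNLOAD_DIR/ssd/tfrecords/kitti_train \
-r $USER_EXPERIMENT_DIR/

# 5.0-style launcher command (what the 5.0 notebook uses)
!tao model ssd dataset_convert \
-d $SPECS_DIR/ssd_tfrecords_kitti_train.txt \
-o $DATA_DOWNLOAD_DIR/ssd/tfrecords/kitti_train \
-r $USER_EXPERIMENT_DIR/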

I am asking because if I keep going and try to train the model, it downloads the 4.0-tf1 Docker image and I get some segmentation errors, so I am not sure whether it is supposed to work like this or I am doing something wrong.

One last thing: is the key needed to run the training nvidia_tlt or my NGC key?

Hi,

Someone pointed out in another post that TAO 5.0 gets installed if you create the environment with Python 3.7 instead of 3.6. I started from scratch, and now the commands in the notebook work as expected, but could you please update the software requirements in the quick start guide?

python     >=3.6.9<3.7      Not needed if you use TAO toolkit API
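
For anyone else hitting this, here is a minimal sketch of the environment that ended up working for me (assuming conda is used; a plain virtualenv with a Python 3.7 interpreter should behave the same way):

$ conda create -n tao python=3.7 -y    # Python 3.7 instead of 3.6
$ conda activate tao
$ pip3 install nvidia-tao              # now resolves to the 5.x launcher
$ tao info --verbose                   # confirm toolkit_version is 5.x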

The bad news: I get the exact same error while trying to train the model:

DALI daliCreatePipeline(&pipe_handle_, serialized_pipeline.c_str(), serialized_pipeline.length(), max_batch_size, num_threads, device_id, exec_separated, prefetch_queue_depth_, cpu_prefetch_queue_depth, prefetch_queue_depth_, enable_memory_stats_) failed: Critical error when building pipeline:
Error when constructing operator: decoders__Image encountered:
Error in thread 0: nvml error (3): The nvml requested operation is not available on target device
Current pipeline object is no longer valid.
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 0 on node af56a70c5a4c exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Execution status: FAIL

Thanks for the info. Will check further and update.

Could you share the spec file? Also, did you ever run the official 5.0 SSD notebook successfully?

Hi,

This is the first time I am trying to train a model with TAO. The spec file is this one (!cat $LOCAL_SPECS_DIR/ssd_tfrecords_kitti_train.txt):

kitti_config {
  root_directory_path: "/workspace/tao-experiments/data/kitti_split/training"
  image_dir_name: "image"
  label_dir_name: "label"
  image_extension: ".png"
  partition_mode: "random"
  num_partitions: 2
  val_split: 0
  num_shards: 10
}
image_directory_path: "/workspace/tao-experiments/data/kitti_split/training"
target_class_mapping {
    key: "car"
    value: "car"
}
target_class_mapping {
    key: "pedestrian"
    value: "pedestrian"
}
target_class_mapping {
    key: "cyclist"
    value: "cyclist"
}
target_class_mapping {
    key: "van"
    value: "car"
}
target_class_mapping {
    key: "person_sitting"
    value: "pedestrian"
}

Sorry if this is not what you asked for, I’m a bit lost here…

The 5.0 notebook is https://github.com/NVIDIA/tao_tutorials/blob/main/notebooks/tao_launcher_starter_kit/ssd/ssd.ipynb.
You can run commands similar to:
!tao model ssd dataset_convert \
-d $SPECS_DIR/ssd_tfrecords_kitti_train.txt \
-o $DATA_DOWNLOAD_DIR/ssd/tfrecords/kitti_train \
-r $USER_EXPERIMENT_DIR/

The example spec file is: https://github.com/NVIDIA/tao_tutorials/blob/main/notebooks/tao_launcher_starter_kit/ssd/specs/ssd_tfrecords_kitti_train.txt

I am afraid your images path is not available inside the docker.
Please double-check the ~/.tao_mounts.json file for the mapping.
You can also run this command to check:
$ tao model ssd run ls /workspace/tao-experiments/data/kitti_split/training

Hi Morganh,

The spec file is the same, and if I execute the command

!tao model ssd run ls /workspace/tao-experiments/data/kitti_split/training

It shows the folders image and label; if I create a folder inside the training folder on the host PC, it shows up inside the Docker container too, and the image and label folders are full of images and txt files. So the paths are apparently OK, right?

I'm going to check whether the notebook is exactly the same. Could it be something related to the user/login in the Docker image? This is the tao_mounts.json (see the permission check right after it):

{
    "Mounts": [
        {
            "source": "/home/eines/getting_started_v5.0.0/notebooks/tao_launcher_starter_kit/ssd/workspace",
            "destination": "/workspace/tao-experiments"
        },
        {
            "source": "/home/eines/getting_started_v5.0.0/notebooks/tao_launcher_starter_kit/ssd/specs",
            "destination": "/workspace/tao-experiments/ssd/specs"
        }
    ],
    "DockerOptions": {
        "user": "1000:1001"
    }
}
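
Regarding my user/login suspicion, a quick check I can think of (just standard shell commands, nothing TAO-specific) is to confirm that the uid:gid in DockerOptions matches the owner of the mounted data on the host, since a mismatch could show up as permission errors inside the container:

$ id -u && id -g    # should match "user": "1000:1001" above
$ ls -ln /home/eines/getting_started_v5.0.0/notebooks/tao_launcher_starter_kit/ssd/workspace
# the numeric owner/group shown here should be readable by that uid:gid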

Another question: I'm using WSL with Ubuntu 20.04, but the Docker image runs in Docker Desktop for Windows with WSL integration. It seems to work, but is this the correct setup, or do I need to install Docker CE inside WSL?
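
I am not sure which of the two setups is recommended, but a sanity check for whichever Docker daemon is actually in use (the CUDA image tag below is only an example; any CUDA base image should do) is to confirm that it can see the GPUs:

$ docker info | grep -i runtime    # nvidia may or may not be listed, depending on Docker Desktop vs docker-ce
$ docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu20.04 nvidia-smi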

Hi,

This is the error (Invalid permissions?):

--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
DALI daliCreatePipeline(&pipe_handle_, serialized_pipeline.c_str(), serialized_pipeline.length(), max_batch_size, num_threads, device_id, exec_separated, prefetch_queue_depth_, cpu_prefetch_queue_depth, prefetch_queue_depth_, enable_memory_stats_) failed: Critical error when building pipeline:
Error when constructing operator: decoders__Image encountered:
Error in thread 0: nvml error (3): The nvml requested operation is not available on target device
Current pipeline object is no longer valid.
[12c7ad3a43b8:00249] *** Process received signal ***
[12c7ad3a43b8:00249] Signal: Segmentation fault (11)
[12c7ad3a43b8:00249] Signal code: Invalid permissions (2)
[12c7ad3a43b8:00249] Failing at address: 0x800000019
[12c7ad3a43b8:00249] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7f821db17090]
[12c7ad3a43b8:00249] [ 1] /usr/lib/wsl/drivers/nv_dispig.inf_amd64_7e5fd280efaa5445/libcuda.so.1.1(+0x24c3d0)[0x7f804e47d3d0]
[12c7ad3a43b8:00249] [ 2] /usr/lib/wsl/drivers/nv_dispig.inf_amd64_7e5fd280efaa5445/libcuda.so.1.1(+0x2c768f)[0x7f804e4f868f]
[12c7ad3a43b8:00249] [ 3] /usr/local/lib/python3.8/dist-packages/nvidia/dali/libdali.so(+0x5fa2f0)[0x7f817eecd2f0]
[12c7ad3a43b8:00249] [ 4] /usr/local/lib/python3.8/dist-packages/nvidia/dali/libdali.so(+0x658158)[0x7f817ef2b158]
[12c7ad3a43b8:00249] [ 5] /usr/local/lib/python3.8/dist-packages/nvidia/dali/libdali.so(daliDeletePipeline+0x3c)[0x7f817eab426c]
[12c7ad3a43b8:00249] [ 6] /usr/local/lib/python3.8/dist-packages/nvidia/dali_tf_plugin/libdali_tf_1_15.so(_ZN12dali_tf_impl6DaliOpD0Ev+0x54)[0x7f8129811ea4]
[12c7ad3a43b8:00249] [ 7] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZN10tensorflow14CreateOpKernelENS_10DeviceTypeEPNS_10DeviceBaseEPNS_9AllocatorEPNS_22FunctionLibraryRuntimeERKNS_7NodeDefEiPPNS_8OpKernelE+0x98d)[0x7f819aab11cd]
[12c7ad3a43b8:00249] [ 8] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZN10tensorflow21CreateNonCachedKernelEPNS_6DeviceEPNS_22FunctionLibraryRuntimeERKNS_7NodeDefEiPPNS_8OpKernelE+0xf2)[0x7f819ad58f52]
[12c7ad3a43b8:00249] [ 9] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZN10tensorflow26FunctionLibraryRuntimeImpl12CreateKernelERKNS_7NodeDefEPNS_22FunctionLibraryRuntimeEPPNS_8OpKernelE+0x9a3)[0x7f819ad79003]
[12c7ad3a43b8:00249] [10] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZN10tensorflow26FunctionLibraryRuntimeImpl12CreateKernelERKNS_7NodeDefEPPNS_8OpKernelE+0x18)[0x7f819ad793f8]
[12c7ad3a43b8:00249] [11] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/../libtensorflow_cc.so.1(+0x60356a3)[0x7f81a1c416a3]
[12c7ad3a43b8:00249] [12] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZN10tensorflow9OpSegment12FindOrCreateERKSsS2_PPNS_8OpKernelESt8functionIFNS_6StatusES5_EE+0x1ba)[0x7f819aab23ba]
[12c7ad3a43b8:00249] [13] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/../libtensorflow_cc.so.1(+0x6035c82)[0x7f81a1c41c82]
[12c7ad3a43b8:00249] [14] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(+0x1141f58)[0x7f819ad67f58]
[12c7ad3a43b8:00249] [15] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZN10tensorflow16NewLocalExecutorERKNS_19LocalExecutorParamsESt10unique_ptrIKNS_5GraphESt14default_deleteIS5_EEPPNS_8ExecutorE+0x6b)[0x7f819ad6957b]
[12c7ad3a43b8:00249] [16] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(+0x114360d)[0x7f819ad6960d]
[12c7ad3a43b8:00249] [17] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZN10tensorflow11NewExecutorERKSsRKNS_19LocalExecutorParamsESt10unique_ptrIKNS_5GraphESt14default_deleteIS7_EEPS5_INS_8ExecutorES8_ISB_EE+0x66)[0x7f819ad69e56]
[12c7ad3a43b8:00249] [18] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/../libtensorflow_cc.so.1(_ZN10tensorflow13DirectSession15CreateExecutorsERKNS_15CallableOptionsEPSt10unique_ptrINS0_16ExecutorsAndKeysESt14default_deleteIS5_EEPS4_INS0_12FunctionInfoES6_ISA_EEPNS0_12RunStateArgsE+0xd31)[0x7f81a1c53cd1]
[12c7ad3a43b8:00249] [19] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/../libtensorflow_cc.so.1(_ZN10tensorflow13DirectSession12MakeCallableERKNS_15CallableOptionsEPx+0x129)[0x7f81a1c565a9]
[12c7ad3a43b8:00249] [20] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow10SessionRef12MakeCallableERKNS_15CallableOptionsEPx+0x31d)[0x7f821899cfed]
[12c7ad3a43b8:00249] [21] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(+0xec3a2)[0x7f82189963a2]
[12c7ad3a43b8:00249] [22] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(+0x8db9a)[0x7f8218937b9a]
[12c7ad3a43b8:00249] [23] python(PyCFunction_Call+0xfa)[0x5f5bda]
[12c7ad3a43b8:00249] [24] python(_PyObject_MakeTpCall+0x296)[0x5f6706]
[12c7ad3a43b8:00249] [25] python(_PyEval_EvalFrameDefault+0x5db3)[0x571143]
[12c7ad3a43b8:00249] [26] python(_PyFunction_Vectorcall+0x1b6)[0x5f5ee6]
[12c7ad3a43b8:00249] [27] python[0x59c39d]
[12c7ad3a43b8:00249] [28] python(_PyObject_MakeTpCall+0x1ff)[0x5f666f]
[12c7ad3a43b8:00249] [29] python(_PyEval_EvalFrameDefault+0x5db3)[0x571143]
[12c7ad3a43b8:00249] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node 12c7ad3a43b8 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Execution status: FAIL
2023-10-04 11:14:03,443 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 337: Stopping container.

Hi,

I started from scratch several times; I used Docker inside WSL as well as Docker Desktop for Windows, and nothing changed.

Then I tried a different model, YOLOv4, with the same strategy and steps, and that model is training, so it seems to be something related to the SSD model itself. Could it be something about my GPUs?

Wed Oct  4 16:08:20 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 536.23                 Driver Version: 536.23       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2080 Ti   WDDM  | 00000000:3B:00.0 Off |                  N/A |
| 30%   36C    P8              20W / 250W |      0MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 2080 Ti   WDDM  | 00000000:AF:00.0 Off |                  N/A |
| 30%   35C    P8               9W / 250W |    414MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A2000 12GB        WDDM  | 00000000:D8:00.0 Off |                  Off |
| 30%   35C    P8               4W /  70W |     51MiB / 12282MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

Update: the YOLOv4 model training stopped just after the first epoch finished:

INFO: Starting Training Loop.
Epoch 1/80
842/842 [==============================] - 513s 610ms/step - loss: 17319.5907
INFO: 2 root error(s) found.
  (0) Unknown: ncclCommInitRank failed: unhandled cuda error
	 [[node MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_loss_0 (defined at /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
  (1) Unknown: ncclCommInitRank failed: unhandled cuda error
	 [[node MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_loss_0 (defined at /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
	 [[MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_loss_0/_16717]]
0 successful operations.
0 derived errors ignored.

Hi,

Please, any help?

Edit: I disabled two of the three GPUs and it worked! It seems that having multiple GPUs in the host PC, even when they are not used in the training command (--gpus 1), affects the process in some way. I also tried with just the two GeForce RTX 2080 Ti cards alone, but with the same results.

So, is there something else I need to do if I want to use several GPUs on the same PC?
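
In case it helps: an alternative to disabling devices in Windows might be pinning the training to specific GPUs from the launcher. I have not verified these flags on 5.0, and the spec/results paths below are only placeholders, so treat this as a sketch:

!tao model ssd train --gpus 1 --gpu_index 0 \
-e $SPECS_DIR/<train_spec>.txt \
-r $USER_EXPERIMENT_DIR/<results_dir> \
-k $KEY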

Glad to know 1 GPU works on WSL.
For multi-GPU on WSL, please check whether the NCCL test works:

$ git clone https://github.com/NVIDIA/nccl-tests.git
$ cd nccl-tests/
$ make
$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3
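
If the multi-GPU runs of the test fail on WSL2, one workaround that is often suggested (not TAO-specific, and not verified for this exact setup) is to disable NCCL peer-to-peer transfers before re-running the test or the training:

$ export NCCL_P2P_DISABLE=1    # fall back to non-P2P transfers
$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3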

For more info on WSL, you can search the TAO forum. For example:
TLT 3.0 & WSL2 issues - #11 by Morganh.

From the nvidia-smi output, your machine has 3 GPUs. You installed Windows on this machine, and currently you have WSL installed on top of Windows, right?

That’s right.

I conducted more tests over the weekend. I installed Ubuntu 20.04, and training with two GPUs proceeded without any issues. This leads me to believe that the problem lies with WSL, the Nvidia drivers, or CUDA.

Now that I have TAO running smoothly on an Ubuntu server, several test ideas come to mind. I appreciate your patience in advance – I’m sure I’ll have many questions!

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.