Please provide the following information when requesting support.
• Hardware (T4/V100/Xavier/Nano/etc)
Windows 10 + WSL + DOCKER + GPU
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc)
Object detection SSD
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
5.0.0-tf2
• Training spec file (If you have one, please share it here)
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)
Hi,
I’m facing some issues with the Jupyter Notebooks, and I’m starting to think that maybe I am doing something wrong:
I started with the notebooks included in the getting started package, but they seemed to be out of date. Then I found a post where someone pointed to a direct download of the TAO 5.0 updated notebooks, but they have the same errors.
I am trying to run the SSD example, and the first issue that I encounter is that in the step for installing TAO (!pip3 install nvidia-tao), it installs the 4.0 version. This is what TAO info says:
Configuration of the TAO Toolkit Instance
dockers: ['nvidia/tao/tao-toolkit']
format_version: 2.0
toolkit_version: 4.0.1
published_date: 03/06/2023
This seemed strange to me, but I tried to continue with the example.
The next error I encountered is when the notebook tries to convert the dataset; the command does not work:
!tao model ssd dataset_convert \
    -d $SPECS_DIR/ssd_tfrecords_kitti_train.txt \
    -o $DATA_DOWNLOAD_DIR/ssd/tfrecords/kitti_train \
    -r $USER_EXPERIMENT_DIR/
It works if you remove “model” from it, but is this how it is supposed to work? Or are there other changes I need to be aware of?
I am asking this because, if I keep going, when I try to train the model it downloads the 4.0-tf1 docker image and gives me segmentation errors, so I am not sure whether it is supposed to work like this or I am doing something wrong.
And one last thing: is the key needed to run the training nvidia_tlt, or my NGC key?
Someone pointed out in another post that the TAO 5.0 version is installed if you create an environment with Python 3.7 instead of 3.6. I started from scratch and now the commands from the notebook work as expected, but please, could you update the software requirements in the quickstart guide?
python >=3.6.9, <3.7 (not needed if you use the TAO Toolkit API)
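Since the launcher version you get silently depends on the Python version, a small pre-install guard can catch this. A minimal sketch; the version bounds for the 5.0 launcher are inferred from this thread, not from official docs:

```python
import sys

def python_version_ok(version=None, min_version=(3, 7), max_version=(4, 0)):
    """Return True if `version` (major, minor) is within [min_version, max_version)."""
    if version is None:
        version = sys.version_info[:2]
    return min_version <= tuple(version) < max_version

if __name__ == "__main__":
    if not python_version_ok():
        # On Python < 3.7, `pip3 install nvidia-tao` falls back to the 4.x launcher.
        sys.exit(
            "Python %d.%d detected: pip will install the nvidia-tao 4.x "
            "launcher; use Python >= 3.7 for TAO 5.0" % sys.version_info[:2]
        )
    print("Python version OK for the TAO 5.0 launcher")
```

Running this before `pip3 install nvidia-tao` would have flagged the 3.6 environment immediately.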
The bad news: I have the exact same error while trying to train the model:
DALI daliCreatePipeline(&pipe_handle_, serialized_pipeline.c_str(), serialized_pipeline.length(), max_batch_size, num_threads, device_id, exec_separated, prefetch_queue_depth_, cpu_prefetch_queue_depth, prefetch_queue_depth_, enable_memory_stats_) failed: Critical error when building pipeline:
Error when constructing operator: decoders__Image encountered:
Error in thread 0: nvml error (3): The nvml requested operation is not available on target device
Current pipeline object is no longer valid.
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 0 on node af56a70c5a4c exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Execution status: FAIL
I am afraid your images path is not available inside docker.
Please double check the ~/.tao_mounts.json file for the mapping.
You can also run the following command to check:
$ tao model ssd run ls /workspace/tao-experiments/data/kitti_split/training
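For reference, a minimal ~/.tao_mounts.json maps host directories to container paths like this (the host paths below are examples only; substitute your own):

```json
{
    "Mounts": [
        {
            "source": "/home/user/tao-experiments",
            "destination": "/workspace/tao-experiments"
        },
        {
            "source": "/home/user/tao-experiments/specs",
            "destination": "/workspace/tao-experiments/specs"
        }
    ]
}
```

Every path referenced in the spec file must fall under one of the `destination` prefixes, or the container will not see it.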
The spec file is the same, and if I execute the command
!tao model ssd run ls /workspace/tao-experiments/data/kitti_split/training
It shows the Image and Label folders; if I create a folder inside the training folder on the host PC, it shows up in the docker image too, and the image and label folders are full of images and txt files. So the paths are apparently OK, right?
I’m going to check whether the notebook is exactly the same; could it be something related to the user/login in the Docker image? This is the tao_mounts.json:
Another question: I’m using WSL with Ubuntu 20.04, but the docker image runs in Docker Desktop for Windows with WSL integration, and it seems to work. Is this the correct architecture, or do I need to install docker-ce within WSL?
I started from scratch several times; I tried Docker within WSL and Docker Desktop for Windows, and nothing changed.
Then I tried a different model, YOLOv4, with the same strategy and steps, and that model is training, so it seems to be something related to the model itself. Could it be something about my GPUs?
Edit: I disabled two of the 3 GPUs and it worked! It seems that having multiple GPUs in the host PC, even when they are not used in the training command (--gpus 1), affects the process in some way. I tried with just the two GeForce RTX 2080 Ti cards, but with the same results.
So, is there something else I need to do if I want to use several GPUs on the same PC?
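One thing I plan to try (this is just an idea, not something confirmed anywhere in this thread): pinning CUDA_VISIBLE_DEVICES through the Envs section of ~/.tao_mounts.json, so the container only sees the GPUs I want, instead of physically disabling them:

```json
{
    "Mounts": [
        {
            "source": "/home/user/tao-experiments",
            "destination": "/workspace/tao-experiments"
        }
    ],
    "Envs": [
        {
            "variable": "CUDA_VISIBLE_DEVICES",
            "value": "0"
        }
    ]
}
```

The host path above is illustrative, and I’m assuming the launcher version I have supports the Envs section.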
From the nvidia-smi output, your machine has 3 GPUs. You installed Windows on this machine, and currently you have installed WSL on top of Windows, right?
I conducted more tests over the weekend. I installed Ubuntu 20.04, and training with two GPUs proceeded without any issues. This leads me to believe that the problem lies with WSL, the Nvidia drivers, or CUDA.
Now that I have TAO running smoothly on an Ubuntu server, several test ideas come to mind. I appreciate your patience in advance – I’m sure I’ll have many questions!