Faster_RCNN sample dataset_convert command raises 'Docker instantiation failed with error: 500 Server Error: Internal Server Error'

I have a problem activating the dataset_convert tool for the faster_rcnn sample downloaded from NGC as part of the TAO samples package.
I don't think it is a problem specific to the faster_rcnn network or to dataset_convert, but rather something with the docker activation on my particular machine.

• Hardware -
GeForce RTX 3090
Ubuntu 18.04 x64
NVIDIA driver 470.74
• Network Type - Faster_rcnn
• TLT Version:

Configuration of the TAO Toolkit Instance
dockers:
nvidia/tao/tao-toolkit-tf:
docker_registry: nvcr.io
docker_tag: v3.21.08-py3
tasks:
1. augment
2. bpnet
3. classification
4. detectnet_v2
5. dssd
6. emotionnet
7. faster_rcnn
8. fpenet
9. gazenet
10. gesturenet
11. heartratenet
12. lprnet
13. mask_rcnn
14. multitask_classification
15. retinanet
16. ssd
17. unet
18. yolo_v3
19. yolo_v4
20. converter
nvidia/tao/tao-toolkit-pyt:
docker_registry: nvcr.io
docker_tag: v3.21.08-py3
tasks:
1. speech_to_text
2. speech_to_text_citrinet
3. text_classification
4. question_answering
5. token_classification
6. intent_slot_classification
7. punctuation_and_capitalization
nvidia/tao/tao-toolkit-lm:
docker_registry: nvcr.io
docker_tag: v3.21.08-py3
tasks:
1. n_gram
format_version: 1.0
toolkit_version: 3.21.08
published_date: 08/17/2021

• Training spec file - As provided by the TAO samples v1.2.0
• How to reproduce the issue:

  1. Successfully perform all installation steps described here:
    tao_toolkit_quick_start_guide
    The installed Docker version is:

Docker version 20.10.10, build b485636
I also successfully checked the docker via:

sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi

  2. Open faster_rcnn.ipynb using the jupyter notebook command

  3. Successfully perform all steps up to the command:

!tao faster_rcnn dataset_convert --gpu_index $GPU_INDEX -d $SPECS_DIR/frcnn_tfrecords_kitti_trainval.txt \
-o $DATA_DOWNLOAD_DIR/tfrecords/kitti_trainval/kitti_trainval

The following error is raised:

2021-11-01 17:10:17,233 [INFO] root: Registry: ['nvcr.io']
Docker instantiation failed with error: 500 Server Error: Internal Server Error ("driver failed programming external connectivity on endpoint loving_bartik (10eba83b455ca5b712df335b610cd57c9f66c7657503fc3b2db3f52ce7f0e637): Error starting userland proxy: listen tcp4 0.0.0.0:8888: bind: address already in use")

When I perform this command:

docker container ls -a

I’m getting an empty table.
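Since no container shows up, presumably some host process (rather than a container) is holding port 8888. A quick way to check which one, assuming lsof and ss are installed on the host:

sudo lsof -i :8888

or:

sudo ss -tlnp | grep 8888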

Please advise.

Why did you run the nvidia/cuda:11.0-base docker? Can you run the tao docker instead?
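For example, a minimal sanity check against the TAO image itself (image name and tag taken from the tao info output above) would be something like:

docker run --rm --gpus all nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.08-py3 nvidia-smi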

Thanks for your quick response,

As I tried to explain above, I used the nvidia/cuda:11.0-base docker just to verify that docker and the nvidia container runtime were installed correctly, as requested at this link:
nvidia runtime container

This sanity docker ran successfully and showed all the information for the NVIDIA driver I previously installed on my machine.

But when I try to run TAO docker faster_rcnn commands, such as dataset_convert or train for example, I get the error I mentioned above.

To narrow this down, please run on your host instead of in the Jupyter notebook.
$ tao faster_rcnn run /bin/bash

Then,
# dataset_convert -d $SPECS_DIR/frcnn_tfrecords_kitti_trainval.txt \
-o $DATA_DOWNLOAD_DIR/tfrecords/kitti_trainval/kitti_trainval

Thanks,
I tried to perform the following command:

tao faster_rcnn run /bin/bash

And the following error was raised:

2021-11-03 11:23:50,780 [INFO] root: Registry: ['nvcr.io']
Docker instantiation failed with error: 500 Server Error: Internal Server Error ("driver failed programming external connectivity on endpoint sweet_raman (422a904fd1d9769589619deb6f5a93210d5a1062ebb85c60fa52e8f468483884): Error starting userland proxy: listen tcp4 0.0.0.0:8888: bind: address already in use")

So, for now I stopped and didn't try to perform the dataset_convert command you suggested.

Regards,

Please restart the docker and try again.
$ systemctl restart docker

Unfortunately, restarting docker didn't help.
I’m still getting the same error.

Please search for help via Google. One related topic, for example: Error response from daemon: driver failed programming external connectivity on endpoint nginx-proxy (669659d666e6b6164716c6009cc1f1b413f2130e8d6238db341769bce23620fa): Error starting userland proxy: Bind for 0.0.0.0:80: unexpected error (Failure EADDRINUSE) Error: failed to start containers: nginx-proxy · Issue #839 · nginx-proxy/nginx-proxy · GitHub

Hello,
It seems that there is contention between the jupyter notebook sample, which is started by default on port 8888, and tao, which tries to use the same port for its docker containers.

If I close the faster rcnn sample jupyter notebook and then run the command:

tao faster_rcnn run /bin/bash

it seems to start successfully; I'm getting the following:

tao faster_rcnn run /bin/bash
2021-11-07 11:42:48,112 [INFO] root: Registry: ['nvcr.io']
groups: cannot find name for group ID 1000
I have no name!@9eb0380d5eae:/workspace$

And now I can perform commands inside the docker.

I tried to start the jupyter notebook with a different port value, like this:

jupyter notebook --port=8889

And I saw that the opened link has the following address:

http://localhost:8889/notebooks/faster_rcnn.ipynb

And I still got the same error about port 8888, despite the fact that I opened the notebook on port 8889:

tao faster_rcnn dataset_convert --gpu_index $GPU_INDEX -d $SPECS_DIR/frcnn_tfrecords_kitti_trainval.txt \
                     -o $DATA_DOWNLOAD_DIR/tfrecords/kitti_trainval/kitti_trainval

2021-11-07 11:51:50,450 [INFO] root: Registry: ['nvcr.io']
Docker instantiation failed with error: 500 Server Error: Internal Server Error ("driver failed programming external connectivity on endpoint wonderful_varahamihira (6258e3bd73106ee3c12f179ccaff6922c1160de66a73cc57138afddf2e891d58): Bind for 0.0.0.0:8888 failed: port is already allocated")
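Interestingly, the wording changed from "address already in use" to "port is already allocated", which, as far as I understand Docker's errors, means the port is now held by Docker on behalf of another container rather than by a host process. Assuming the publish filter is available (it is in Docker 20.10), this should show which container that is:

docker ps --filter publish=8888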

@orong13
Can you share your ~/.tao_mounts.json? Thanks.

Also, how did you trigger the jupyter notebook previously? By running '$ jupyter notebook --port=8888'?

This is my ~/.tao_mounts.json:

{
    "Mounts": [
        {
            "source": "/home/Services/Dleware/Data/TAO/Experiments/data",
            "destination": "/workspace/tao-experiments/data"
        },
        {
            "source": "/home/Services/Dleware/Data/TAO/Experiments/faster_rcnn/Results",
            "destination": "/workspace/tao-experiments/results"
        },
        {
            "source": "/home/Installations/NVIDIA/TAO/Samples/cv_samples_v1.2.0/faster_rcnn/specs",
            "destination": "/workspace/tao-experiments/specs"
        }
    ],
    "Envs": [
        {
            "variable": "CUDA_DEVICE_ORDER",
            "value": "PCI_BUS_ID"
        }
    ],
    "DockerOptions": {
        "shm_size": "16G",
        "ulimits": {
            "memlock": -1,
            "stack": 67108864
        },
        "user": "1000:1000",
        "ports": {
            "8888": 8888
        }
    }
}
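(As far as I understand the launcher, that "ports" entry makes every TAO container publish host port 8888, i.e. roughly the equivalent of:

docker run -p 8888:8888 <image>

where <image> stands for the TAO docker image, so any host process already listening on 8888, such as a jupyter server, would cause the bind failure.)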

I used to start the jupyter notebook from the faster rcnn sample directory, which includes the faster_rcnn.ipynb file:

jupyter notebook

When you run into the above, can you run the below command successfully inside the container?
faster_rcnn dataset_convert --gpu_index $GPU_INDEX -d $SPECS_DIR/frcnn_tfrecords_kitti_trainval.txt \
-o $DATA_DOWNLOAD_DIR/tfrecords/kitti_trainval/kitti_trainval

Yep,
I can run it.

Actually, if I change the ports section in ~/.tao_mounts.json from 8888 to 8889, for example, all faster rcnn sample commands in the jupyter notebook work well.

From my point of view, to summarise the topic:
There was contention between the jupyter notebook port and the tao mounts port; both of them tried to use 8888.
When I changed the port inside the mounts file, all jupyter faster rcnn commands started to work OK.
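For anyone hitting the same issue, the only change needed was the ports entry inside the DockerOptions section of ~/.tao_mounts.json shown above, e.g.:

"ports": {
    "8889": 8889
}

(8889 is just an example; any free port should work, as long as it differs from the one the jupyter notebook server is bound to.)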

Thanks,

Thanks for the info. Appreciate your work!

Thank you for your support!

Now, when I tried to perform the following command:

!tao faster_rcnn prune --gpu_index $GPU_INDEX -m $USER_EXPERIMENT_DIR/frcnn_kitti_resnet18.epoch12.tlt \
           -o $USER_EXPERIMENT_DIR/model_1_pruned.tlt  \
           -eq union  \
           -pth 0.2 \
           -k $KEY

I’m getting the following error:

Invalid decryption. Unable to open file (file signature not found). The key used to load the model is incorrect.
2021-11-08 13:30:39,504 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

I already set the $KEY env variable and I used it successfully for previous sample commands.

Please clarify: which key shall I set?

Thanks,

Did you use the pretrained model from NGC?
If yes, please make sure the key is correct according to the model card.

Yes, I'm using the pretrained model from NGC.
I downloaded it using the sample commands:

ngc registry model download-version nvidia/tao/pretrained_object_detection:resnet18
cp pretrained_object_detection_vresnet18/resnet_18.hdf5 $LOCAL_EXPERIMENT_DIR
rm -rf pretrained_object_detection_vresnet18
ls -rlt $LOCAL_EXPERIMENT_DIR

Please clarify: what is the model card? How do I know which key is the correct one?

OK, what you downloaded is an hdf5-format pretrained model. It is not a purpose-built model, so there is no specific NGC key.
I suggest you double-check the $KEY. It should be the same as the one you used to train the model $USER_EXPERIMENT_DIR/frcnn_kitti_resnet18.epoch12.tlt.
Also, please make sure $USER_EXPERIMENT_DIR/frcnn_kitti_resnet18.epoch12.tlt is available.

Thanks for the clarification.
I'm using the $KEY which I set to the key value of my personal NGC account (a very long character string).
Is it correct to do so?

I saw these commands:

!sed -i 's/$KEY/'"$KEY/g" $LOCAL_SPECS_DIR/default_spec_resnet18.txt
!cat $LOCAL_SPECS_DIR/default_spec_resnet18.txt

Do these commands make the connection between the frcnn_kitti_resnet18.epoch12.tlt file and my $KEY value?

I realised that I didn't execute them, so I will clear all previous training results, first perform these commands, and after that retrain the model.

Am I right?

Thanks,

Yes, the command replaces the $KEY placeholder in the spec file with your own key. If you did not execute the command, you can check the existing key in the spec file default_spec_resnet18.txt and use it directly to run pruning, etc.
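For example, to see which key is currently embedded in the spec (assuming the encryption key field in the faster_rcnn sample spec is named enc_key):

grep enc_key $LOCAL_SPECS_DIR/default_spec_resnet18.txt

If it still shows the literal string $KEY, the sed substitution above was never run; if it shows an actual key string, that is the value to pass as -k when pruning.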