Faster_RCNN sample dataset_convert command raises 'Docker instantiation failed with error: 500 Server Error: Internal Server Error'

I have a problem activating the dataset_convert tool for the faster_rcnn sample downloaded from NGC as part of the TAO samples package.
I don't think it is a problem specific to the faster_rcnn network or to dataset_convert, but rather something with the docker activation on my particular machine.

• Hardware -
GeForce RTX 3090
Ubuntu 18.04 x64
NVIDIA driver 470.74
• Network Type - Faster_rcnn
• TLT Version:

Configuration of the TAO Toolkit Instance
dockers:
nvidia/tao/tao-toolkit-tf:
docker_registry: nvcr.io
docker_tag: v3.21.08-py3
tasks:
1. augment
2. bpnet
3. classification
4. detectnet_v2
5. dssd
6. emotionnet
7. faster_rcnn
8. fpenet
9. gazenet
10. gesturenet
11. heartratenet
12. lprnet
13. mask_rcnn
14. multitask_classification
15. retinanet
16. ssd
17. unet
18. yolo_v3
19. yolo_v4
20. converter
nvidia/tao/tao-toolkit-pyt:
docker_registry: nvcr.io
docker_tag: v3.21.08-py3
tasks:
1. speech_to_text
2. speech_to_text_citrinet
3. text_classification
4. question_answering
5. token_classification
6. intent_slot_classification
7. punctuation_and_capitalization
nvidia/tao/tao-toolkit-lm:
docker_registry: nvcr.io
docker_tag: v3.21.08-py3
tasks:
1. n_gram
format_version: 1.0
toolkit_version: 3.21.08
published_date: 08/17/2021

• Training spec file - As provided by the TAO samples v1.2.0
• How to reproduce the issue:

  1. Successfully perform all installation steps described here:
    tao_toolkit_quick_start_guide
    The installed Docker version is:

Docker version 20.10.10, build b485636
I also successfully checked the docker via:

sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi

  2. Open faster_rcnn.ipynb using the jupyter notebook command

  3. Successfully perform all steps up to the command:

!tao faster_rcnn dataset_convert --gpu_index $GPU_INDEX -d $SPECS_DIR/frcnn_tfrecords_kitti_trainval.txt \
-o $DATA_DOWNLOAD_DIR/tfrecords/kitti_trainval/kitti_trainval

The following error is raised:

2021-11-01 17:10:17,233 [INFO] root: Registry: ['nvcr.io']
Docker instantiation failed with error: 500 Server Error: Internal Server Error ("driver failed programming external connectivity on endpoint loving_bartik (10eba83b455ca5b712df335b610cd57c9f66c7657503fc3b2db3f52ce7f0e637): Error starting userland proxy: listen tcp4 0.0.0.0:8888: bind: address already in use")

When I perform this command:

docker container ls -a

I’m getting an empty table.
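Since no container shows up, presumably some host process (rather than a container) is holding port 8888. A quick way to check which one, assuming lsof and ss are installed on the host:

sudo lsof -i :8888

or:

sudo ss -tlnp | grep 8888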

Please advise.

Why did you run the nvidia/cuda:11.0-base docker? Can you run the tao docker instead?
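For example, a minimal sanity check against the TAO image itself (image name and tag taken from the tao info output above) would be something like:

docker run --rm --gpus all nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.08-py3 nvidia-smi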

Thanks for your quick response,

As I tried to explain above, I used the nvidia/cuda:11.0-base docker just to verify that docker and the nvidia container runtime were installed correctly, as requested at this link:
nvidia runtime container

This sanity docker ran successfully and showed all the information for the NVIDIA driver I previously installed on my machine.

But when I try to run TAO docker faster_rcnn commands, such as dataset_convert or train for example, I get the error I mentioned above.

To narrow this down, please run on your host instead of in the Jupyter notebook.
$ tao faster_rcnn run /bin/bash

Then,
# dataset_convert -d $SPECS_DIR/frcnn_tfrecords_kitti_trainval.txt \
-o $DATA_DOWNLOAD_DIR/tfrecords/kitti_trainval/kitti_trainval

Thanks,
I tried to perform the following command:

tao faster_rcnn run /bin/bash

And the following error was raised:

2021-11-03 11:23:50,780 [INFO] root: Registry: ['nvcr.io']
Docker instantiation failed with error: 500 Server Error: Internal Server Error ("driver failed programming external connectivity on endpoint sweet_raman (422a904fd1d9769589619deb6f5a93210d5a1062ebb85c60fa52e8f468483884): Error starting userland proxy: listen tcp4 0.0.0.0:8888: bind: address already in use")

So, for now I stopped and didn't try to perform the dataset_convert command you suggested.

Regards,

Please restart the docker and try again.
$ systemctl restart docker

Unfortunately, restarting docker didn't help.
I’m still getting the same error.

Please search for help via Google. One related topic, for example: Error response from daemon: driver failed programming external connectivity on endpoint nginx-proxy (669659d666e6b6164716c6009cc1f1b413f2130e8d6238db341769bce23620fa): Error starting userland proxy: Bind for 0.0.0.0:80: unexpected error (Failure EADDRINUSE) Error: failed to start containers: nginx-proxy · Issue #839 · nginx-proxy/nginx-proxy · GitHub

Hello,
It seems that there is contention between the jupyter notebook sample, which is started by default on port 8888, and tao, which tries to use the same port for its docker containers.

If I close the faster rcnn sample jupyter notebook and then run the command:

tao faster_rcnn run /bin/bash

it seems to start successfully; I'm getting the following:

tao faster_rcnn run /bin/bash
2021-11-07 11:42:48,112 [INFO] root: Registry: ['nvcr.io']
groups: cannot find name for group ID 1000
I have no name!@9eb0380d5eae:/workspace$

And now I can perform commands inside the docker.

I tried to start the jupyter notebook with a different port value, like this:

jupyter notebook --port=8889

And I saw that the opened link has the following address:

http://localhost:8889/notebooks/faster_rcnn.ipynb

And I still got the same error about port 8888, despite the fact that I opened the notebook on port 8889:

tao faster_rcnn dataset_convert --gpu_index $GPU_INDEX -d $SPECS_DIR/frcnn_tfrecords_kitti_trainval.txt \
                     -o $DATA_DOWNLOAD_DIR/tfrecords/kitti_trainval/kitti_trainval

2021-11-07 11:51:50,450 [INFO] root: Registry: ['nvcr.io']
Docker instantiation failed with error: 500 Server Error: Internal Server Error ("driver failed programming external connectivity on endpoint wonderful_varahamihira (6258e3bd73106ee3c12f179ccaff6922c1160de66a73cc57138afddf2e891d58): Bind for 0.0.0.0:8888 failed: port is already allocated")
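Interestingly, the wording changed from "address already in use" to "port is already allocated", which, as far as I understand Docker's errors, means the port is now held by Docker on behalf of another container rather than by a host process. Assuming the publish filter is available (it is in Docker 20.10), this should show which container that is:

docker ps --filter publish=8888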

@orong13
Can you share your ~/.tao_mounts.json? Thanks.

Also, how did you trigger the jupyter notebook previously? By running '$ jupyter notebook --port=8888'?

This is my ~/.tao_mounts.json:

{
    "Mounts": [
        {
            "source": "/home/Services/Dleware/Data/TAO/Experiments/data",
            "destination": "/workspace/tao-experiments/data"
        },
        {
            "source": "/home/Services/Dleware/Data/TAO/Experiments/faster_rcnn/Results",
            "destination": "/workspace/tao-experiments/results"
        },
        {
            "source": "/home/Installations/NVIDIA/TAO/Samples/cv_samples_v1.2.0/faster_rcnn/specs",
            "destination": "/workspace/tao-experiments/specs"
        }
    ],
    "Envs": [
        {
            "variable": "CUDA_DEVICE_ORDER",
            "value": "PCI_BUS_ID"
        }
    ],
    "DockerOptions": {
        "shm_size": "16G",
        "ulimits": {
            "memlock": -1,
            "stack": 67108864
        },
        "user": "1000:1000",
        "ports": {
            "8888": 8888
        }
    }
}
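(As far as I understand the launcher, that "ports" entry makes every TAO container publish host port 8888, i.e. roughly the equivalent of:

docker run -p 8888:8888 <image>

where <image> stands for the TAO docker image, so any host process already listening on 8888, such as a jupyter server, would cause the bind failure.)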

I used to start the jupyter notebook from the faster rcnn sample directory, which includes the faster_rcnn.ipynb file:

jupyter notebook

When you run into the above, can you run the below command successfully inside the container?
faster_rcnn dataset_convert --gpu_index $GPU_INDEX -d $SPECS_DIR/frcnn_tfrecords_kitti_trainval.txt \
-o $DATA_DOWNLOAD_DIR/tfrecords/kitti_trainval/kitti_trainval

Yep,
I can run it.

Actually, if I change the ports section in ~/.tao_mounts.json from 8888 to 8889, for example, all faster rcnn sample commands in the jupyter notebook work well.

From my point of view, to summarise the topic:
There was contention between the jupyter notebook port and the tao mounts port; both of them tried to use 8888.
When I changed the port inside the mounts file, all jupyter faster rcnn commands started to work OK.
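For anyone hitting the same issue, the only change needed was the ports entry inside the DockerOptions section of ~/.tao_mounts.json shown above, e.g.:

"ports": {
    "8889": 8889
}

(8889 is just an example; any free port should work, as long as it differs from the one the jupyter notebook server is bound to.)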

Thanks,

Thanks for the info. Appreciate your work!

Thank you for your support!

Now, when I tried to perform the following command:

!tao faster_rcnn prune --gpu_index $GPU_INDEX -m $USER_EXPERIMENT_DIR/frcnn_kitti_resnet18.epoch12.tlt \
           -o $USER_EXPERIMENT_DIR/model_1_pruned.tlt  \
           -eq union  \
           -pth 0.2 \
           -k $KEY

I’m getting the following error:

Invalid decryption. Unable to open file (file signature not found). The key used to load the model is incorrect.
2021-11-08 13:30:39,504 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

I already set the $KEY env variable and I used it successfully for previous sample commands.

Please clarify: which key shall I set?

Thanks,

Did you use the pretrained model from NGC?
If yes, please make sure the key is correct according to the model card.

Yes, I'm using the pretrained model from NGC.
I downloaded it using the sample commands:

ngc registry model download-version nvidia/tao/pretrained_object_detection:resnet18
cp pretrained_object_detection_vresnet18/resnet_18.hdf5 $LOCAL_EXPERIMENT_DIR
rm -rf pretrained_object_detection_vresnet18
ls -rlt $LOCAL_EXPERIMENT_DIR

Please clarify: what is the model card? How do I know which key is the correct one?

OK, what you downloaded is an hdf5-format pretrained model. It is not a purpose-built model, so there is no specific NGC key.
I suggest you double-check the $KEY. It should be the same as the one you used to train the model $USER_EXPERIMENT_DIR/frcnn_kitti_resnet18.epoch12.tlt.
Also, please make sure $USER_EXPERIMENT_DIR/frcnn_kitti_resnet18.epoch12.tlt is available.

Thanks for the clarification.
I'm using the $KEY which I set to the key value of my personal NGC account (a very long character string).
Is it correct to do so?

I saw these commands:

!sed -i 's/$KEY/'"$KEY/g" $LOCAL_SPECS_DIR/default_spec_resnet18.txt
!cat $LOCAL_SPECS_DIR/default_spec_resnet18.txt

Do these commands make the connection between the frcnn_kitti_resnet18.epoch12.tlt file and my $KEY value?

I realised that I didn't execute them, so I will clear all previous training results, first perform these commands, and after that retrain the model.

Am I right?

Thanks,

Yes, the command replaces the $KEY placeholder in the spec file with your own key. If you did not execute the command, you can check the existing key in the spec file default_spec_resnet18.txt and use it directly to run pruning, etc.
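For example, to see which key is currently embedded in the spec (assuming the encryption key field in the faster_rcnn sample spec is named enc_key):

grep enc_key $LOCAL_SPECS_DIR/default_spec_resnet18.txt

If it still shows the literal string $KEY, the sed substitution above was never run; if it shows an actual key string, that is the value to pass as -k when pruning.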