I have a problem activating the dataset_convert tool for the faster_rcnn sample downloaded from NGC as part of the TAO samples package.
I don't think it is a problem specific to the faster_rcnn network or to dataset_convert, but rather something with the docker activation on my specific machine.
As I tried to explain above, I used the nvidia/cuda:11.0-base docker image just to verify that docker together with the NVIDIA container runtime were installed correctly, as described in this link: nvidia runtime container
This sanity docker ran successfully and showed all the NVIDIA driver information for the driver I previously installed on my machine.
But when I try to run TAO faster_rcnn commands such as dataset_convert or train, I get the error I mentioned above.
It seems there is a contention between the jupyter notebook sample, which is started by default on port 8888, and tao, which tries to use the same port for its docker images.
If I close the faster_rcnn sample jupyter notebook and then run the command:
tao faster_rcnn run /bin/bash
it seems to start; I get the following:
tao faster_rcnn run /bin/bash
2021-11-07 11:42:48,112 [INFO] root: Registry: ['nvcr.io']
groups: cannot find name for group ID 1000
I have no name!@9eb0380d5eae:/workspace$
And now I can run commands inside the docker container.
I tried to start the jupyter notebook on an arbitrary port, like this:
jupyter notebook --port=8889
And I saw that the opened link has the following address:
And still I got the same error on port 8888, despite the fact that the notebook was now open on port 8889:
tao faster_rcnn dataset_convert --gpu_index $GPU_INDEX -d $SPECS_DIR/frcnn_tfrecords_kitti_trainval.txt \
2021-11-07 11:51:50,450 [INFO] root: Registry: ['nvcr.io']
Docker instantiation failed with error: 500 Server Error: Internal Server Error ("driver failed programming external connectivity on endpoint wonderful_varahamihira (6258e3bd73106ee3c12f179ccaff6922c1160de66a73cc57138afddf2e891d58): Bind for 0.0.0.0:8888 failed: port is already allocated")
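Before retrying, it can help to confirm what is actually holding port 8888. This is just a minimal bash sketch (the `port_in_use` helper is my own illustration, and it relies on bash's /dev/tcp pseudo-device); tools like `sudo lsof -i :8888` or `docker ps` will show the owning process directly:

```shell
#!/usr/bin/env bash
# Sketch: check whether a TCP port on localhost is already bound.
# Assumes bash (dash/sh do not provide /dev/tcp).
port_in_use() {
  # Try to open a connection in a subshell; success means something listens there.
  (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null
}

if port_in_use 8888; then
  echo "port 8888 is taken (likely the running jupyter notebook)"
else
  echo "port 8888 is free"
fi
```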
When you run into the above, can you run the below command successfully?
faster_rcnn dataset_convert --gpu_index $GPU_INDEX -d $SPECS_DIR/frcnn_tfrecords_kitti_trainval.txt
Actually, if I change the Ports section in ~/.tao_mounts.json from 8888 to 8889, for example, all the faster_rcnn sample commands in the jupyter notebook work well.
From my point of view, to summarise the topic:
There was a contention between the jupyter notebook port and the tao mounts port; both tried to use 8888.
When I changed the port inside the mounts file, all the jupyter faster_rcnn commands started to work OK.
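For reference, this is roughly what the edited file looks like. The mount paths below are placeholders, and I am assuming the Mounts/DockerOptions layout used by my TAO launcher version; check your own ~/.tao_mounts.json for the exact structure:

```json
{
    "Mounts": [
        {
            "source": "/home/<user>/tao-experiments",
            "destination": "/workspace/tao-experiments"
        }
    ],
    "DockerOptions": {
        "ports": {
            "8888": 8889
        }
    }
}
```

With this mapping, the container's port 8888 is published on host port 8889, so it no longer collides with the notebook already bound to 8888.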
Invalid decryption. Unable to open file (file signature not found). **The key used to load the model is incorrect**.
2021-11-08 13:30:39,504 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
I already set the $KEY environment variable and used it successfully for previous sample commands.
Yes I’m using the pretrained model from ngc.
I downloaded it using the sample commands:
ngc registry model download-version nvidia/tao/pretrained_object_detection:resnet18
cp pretrained_object_detection_vresnet18/resnet_18.hdf5 $LOCAL_EXPERIMENT_DIR
rm -rf pretrained_object_detection_vresnet18
ls -rlt $LOCAL_EXPERIMENT_DIR
Please clarify: what is the model card? How do I know which key is the correct one?
OK, what you downloaded is an hdf5-format pretrained model. It is not a purpose-built model, so there is no specific NGC key.
I suggest you double-check $KEY. It should be the same key you used when you trained the model $USER_EXPERIMENT_DIR/frcnn_kitti_resnet18.epoch12.tlt.
Also, please make sure $USER_EXPERIMENT_DIR/frcnn_kitti_resnet18.epoch12.tlt is available.
Yes, in that command you should replace the key with your own. If you did not run the training command yourself, you can check the existing key in the spec file default_spec_resnet18.txt and use it directly to run pruning, etc.
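As a hedged sketch of that check, the snippet below pulls the key out of the spec file. It assumes the spec stores the key on a line like `enc_key: 'tlt'`; the field name or quoting in your default_spec_resnet18.txt may differ, so adjust the pattern if needed:

```shell
# Extract the encryption key from the spec file (assumption: a line of the
# form  enc_key: '<value>'  exists in default_spec_resnet18.txt).
KEY=$(sed -n "s/.*enc_key: *'\([^']*\)'.*/\1/p" default_spec_resnet18.txt)
export KEY
echo "using key: $KEY"
```

Once $KEY matches the key in the spec, the pruning and export commands should be able to open the .tlt model.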