Caffe failed with py-faster-rcnn demo.py on TX1

I want to run the Faster R-CNN demo on the TX1.

https://github.com/rbgirshick/py-faster-rcnn

After solving many problems, I still get an error with gpu_nms.
Has anyone else tried to run this Caffe neural network on the TX1?
If you have solved this problem, please give me a hand.
Many thanks.

I am running Python 3.5.1.
Caffe installed without errors, with CUDA 8.0 and OpenCV 3.1.0.

Traceback (most recent call last):
  File "tools/demo.py", line 18, in <module>
    from fast_rcnn.test import im_detect
  File "/home/ubuntu/Desktop/py-faster-rcnn/tools/../lib/fast_rcnn/test.py", line 17, in <module>
    from fast_rcnn.nms_wrapper import nms
  File "/home/ubuntu/Desktop/py-faster-rcnn/tools/../lib/fast_rcnn/nms_wrapper.py", line 9, in <module>
    from nms.gpu_nms import gpu_nms
ImportError: /home/ubuntu/Desktop/py-faster-rcnn/tools/../lib/nms/gpu_nms.so: undefined symbol: _Py_ZeroStruct

How can I solve this _Py_ZeroStruct error?

Hi,

Faster R-CNN has its own Caffe repo (it contains some self-implemented layers), so you need to compile the Caffe nested in py-faster-rcnn rather than BVLC Caffe.

Faster R-CNN does work well on Jetson TX1 with 24.2. You can follow these steps:

  1. Use Jetpack to install CUDA-8.0, cuDNN v5.1, opencv4Tegra 2.4.13

  2. Clone code

git clone --recursive https://github.com/rbgirshick/py-faster-rcnn.git
  3. Caffe and pycaffe
  • Update the following files to the BVLC Caffe versions in order to support cuDNN v5
    -> You can simply git clone BVLC Caffe and copy these files over directly

include/caffe/util/:
cudnn.hpp

src/caffe/layers/:
cudnn_conv_layer.cu
cudnn_relu_layer.cpp
cudnn_relu_layer.cu
cudnn_sigmoid_layer.cpp
cudnn_sigmoid_layer.cu
cudnn_tanh_layer.cpp
cudnn_tanh_layer.cu

include/caffe/layers/:
cudnn_relu_layer.hpp
cudnn_sigmoid_layer.hpp
cudnn_tanh_layer.hpp

  • Modify the configuration
cp Makefile.config.example Makefile.config

edit Makefile.config

+++ USE_CUDNN := 1
+++ WITH_PYTHON_LAYER := 1
--- INCLUDE_DIRS := $(PYTHON_INCLUDE) /usr/local/include
+++ INCLUDE_DIRS := $(PYTHON_INCLUDE) /usr/local/include /usr/include/hdf5/serial/

edit Makefile

--- LIBRARIES += glog gflags protobuf boost_system boost_filesystem m hdf5_hl hdf5
+++ LIBRARIES += glog gflags protobuf boost_system boost_filesystem m hdf5_serial_hl hdf5_serial
  • Build
make
make pycaffe
  4. Faster R-CNN
cd $FRCN_ROOT/lib
make
cd $FRCN_ROOT
./data/scripts/fetch_faster_rcnn_models.sh
cd $FRCN_ROOT
./tools/demo.py
  5. Possible errors

numpy error:

sudo apt-get install python-numpy

cython error:

sudo apt-get install cython

Checksum incorrect error when running data/scripts/fetch_faster_rcnn_models.sh:

Edit data/scripts/fetch_faster_rcnn_models.sh

--- wget $URL -O $FILE
+++ wget --no-check-certificate $URL -O $FILE

ImportError: No module named easydict:

sudo apt-get install python-pip
sudo pip install easydict

locale.Error: unsupported locale setting:

sudo apt-get install language-pack-id
export LC_ALL="en_US.UTF-8"
export LC_CTYPE="en_US.UTF-8"
sudo dpkg-reconfigure locales

ImportError: No module named cv2:

sudo apt-get install python-opencv

ImportError: No module named skimage.io:

sudo apt-get install libfreetype6-dev
sudo pip install scikit-image

‘GDK_IS_DISPLAY (display)’ failed:

sudo apt-get install libv4l-dev
sudo apt-get install xorg
export DISPLAY=:1
startx &

ImportError: No module named google.protobuf.internal:

sudo apt-get install python-protobuf

ImportError: Cairo backend requires that cairocffi or pycairo is installed:

sudo apt-get install python-dev
sudo apt-get install libffi-dev
sudo pip install cffi
sudo pip install cairocffi

ImportError: No module named yaml:

sudo pip install pyyaml
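
Rather than hitting these ImportErrors one at a time, the modules can be checked up front. The following is a minimal sketch, not part of py-faster-rcnn: the list is taken from the error messages above, and "numpy" and "Cython" are assumed to be the import names behind the numpy and cython errors.

```python
import importlib.util

# Module names from the ImportError messages above. "numpy" and "Cython"
# are assumptions about the import names behind the numpy/cython errors.
DEMO_MODULES = [
    "numpy", "Cython", "easydict", "cv2", "skimage.io",
    "google.protobuf.internal", "cairocffi", "yaml",
]


def missing_modules(names):
    """Return the names from `names` that cannot be located for import."""
    missing = []
    for name in names:
        try:
            # For dotted names, find_spec imports the parent package first
            # and raises ModuleNotFoundError if the parent is absent.
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    for name in missing_modules(DEMO_MODULES):
        print("missing:", name)
```

Run it with the same interpreter you use for tools/demo.py, since Python 2 and Python 3 have separate package directories.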

Thanks for your help, AastaLLL!!

Something in $FRCN_ROOT/lib still causes the error.

I am using Python 3.5.1 and have gone through all the steps above,
but the _Py_ZeroStruct error still occurs.

In $FRCN_ROOT/lib/setup.py, replace lines 123 and 140 with:

+++ include_dirs = ['/usr/lib/python3/dist-packages/numpy/core/include', '/usr/include/python3.5m']

This solves the _Py_ZeroStruct problem.
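
For background: _Py_ZeroStruct is the symbol behind Py_False in Python 2, and Python 3 does not export it, so this error most likely means gpu_nms.so was compiled against Python 2 headers and then loaded by Python 3. The paths in the setup.py edit above are specific to that install. As a hedged sketch (the helper name python_include_dirs is made up for illustration and is not part of setup.py), the same directories can be queried from the interpreter that will run the demo:

```python
import sysconfig


def python_include_dirs():
    """Return header directories for the interpreter running this script."""
    # Directory containing Python.h for this exact interpreter version.
    dirs = [sysconfig.get_paths()["include"]]
    try:
        import numpy  # the py-faster-rcnn extensions also need numpy headers
        dirs.append(numpy.get_include())
    except ImportError:
        # numpy is not installed for this interpreter; report only Python's dir
        pass
    return dirs


if __name__ == "__main__":
    for d in python_include_dirs():
        print(d)
```

Running this with python3 should print directories matching the version you intend to build for. If it prints Python 2 paths, the build is picking up the wrong interpreter.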

With that fixed, running the demo produces a new error:

Loaded network /home/ubuntu/Desktop/py-faster-rcnn/data/faster_rcnn_models/VGG16_faster_rcnn_final.caffemodel
Killed

How do I solve this?

Switching to the ZF model avoids this error:

python3 tools/demo.py --net zf

But I would still like to know why the VGG16 model doesn’t work on the TX1.

I have a question for the NVIDIA engineers.
Is it possible to use TensorRT to train an R-CNN network?
I have read that TensorRT runs faster than Caffe on the TX1.
Could you help me pass the verification?

Hi,

TensorRT is aimed at accelerating inference, so it doesn’t support back-propagation and can’t be used for training.

If you want to train networks on a desktop, DIGITS is a good choice.

Hi,

I think the VGG16 failure is caused by running out of GPU memory, since the model needs about 2360 MB of GPU memory to run.
You may need to close all unnecessary programs to free enough GPU memory for this model.
It loads successfully in my environment, where I run only demo.py.

You can use this program to query GPU memory usage:

#include <cstdlib>
#include <iostream>
#include <unistd.h>
#include <cuda_runtime.h>

int main()
{
    // Report GPU memory usage once per second until interrupted (Ctrl+C)
    size_t free_byte;
    size_t total_byte;

    while (true)
    {
        // cudaMemGetInfo is a CUDA runtime API call; it returns values in bytes
        cudaError_t cuda_status = cudaMemGetInfo(&free_byte, &total_byte);

        if (cudaSuccess != cuda_status) {
            std::cout << "Error: cudaMemGetInfo fails, " << cudaGetErrorString(cuda_status) << std::endl;
            exit(1);
        }

        double free_db = (double)free_byte;
        double total_db = (double)total_byte;
        double used_db = total_db - free_db;

        // Convert bytes to MB for display
        std::cout << "GPU memory usage: used = " << used_db/1024.0/1024.0 << " MB, free = "
                  << free_db/1024.0/1024.0 << " MB, total = " << total_db/1024.0/1024.0 << " MB" << std::endl;
        sleep(1);
    }

    return 0;
}

compile with

nvcc test.cu -o test

run as

./test
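
On the TX1 the GPU shares the same physical RAM as the CPU, so a bare "Killed" right after the model loads is consistent with the Linux out-of-memory killer ending the process. As a rough cross-check without CUDA, here is a small sketch that reads /proc/meminfo. The function names are made up for illustration, and the 2360 MB figure is the estimate quoted above, not a measured requirement.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values in kB."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def fits_in_memory(info, needed_mb):
    """Return True if the reported available memory covers needed_mb."""
    # MemAvailable is missing on older kernels, so fall back to MemFree
    available_kb = info.get("MemAvailable", info.get("MemFree", 0))
    return available_kb / 1024.0 >= needed_mb


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        info = parse_meminfo(f.read())
    # 2360 MB is the VGG16 estimate quoted in this thread
    print("enough memory for VGG16:", fits_in_memory(info, 2360))
```

MemFree understates what the kernel can reclaim from caches, so treat a negative result on an older kernel as a hint rather than a verdict.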

Thanks for your help. It’s very useful. Really appreciate!!

DIGITS really caught my attention. It looks like a highly intuitive way to train models.
I have a couple of basic questions:

  1. Can I use DIGITS to train a complex Caffe model, such as R-CNN?
  2. Can its results be used with other frameworks, such as Theano, TensorFlow, and so on?

Sorry, I am new to this, but I really want to learn. Maybe you can walk me through it.
Many thanks.

Hi,

DIGITS does not currently support R-CNN, but it includes DetectNet, a network that predicts object bounding boxes directly.
If your use case is object localization or object detection, DetectNet is a good choice.

Train it on DIGITS:

For inference, use this sample code, which uses TensorRT to speed up inference (it takes a caffemodel as input).

DIGITS currently can’t export models for other frameworks, but you can use a third-party program to convert the caffemodel to your target framework. For example, search for ‘caffe model to tensorflow’; there are several programs that can help you do this.

Feel free to ask any question here : )

Hi,
I followed the steps up to make in the Build step and got the following:

PROTOC src/caffe/proto/caffe.proto
CXX .build_release/src/caffe/proto/caffe.pb.cc
CXX src/caffe/syncedmem.cpp
In file included from src/caffe/syncedmem.cpp:1:0:
./include/caffe/common.hpp:4:32: fatal error: boost/shared_ptr.hpp: No such file or directory
compilation terminated.
Makefile:564: recipe for target ‘.build_release/src/caffe/syncedmem.o’ failed
make: *** [.build_release/src/caffe/syncedmem.o] Error 1

Hi,

Thanks for your feedback.

This is related to the boost library. Please try this command to check if it is okay.

sudo apt-get install --no-install-recommends libboost-all-dev

Thanks and please let us know the result.

Thanks @AastaLLL
I still get the same error. Maybe that is because apt-get update in JetPack 2.3 gave me the following:

W: file:///var/cuda-repo-8-0-local/Release.gpg: Signature by key 889BEE522DA690103C4B085ED88C3D385C37D3BE uses weak digest algorithm (SHA1)
W: Invalid ‘Date’ entry in Release file /var/lib/apt/lists/_var_cuda-repo-8-0-local_Release
W: Invalid ‘Date’ entry in Release file /var/lib/apt/lists/_var_libopencv4tegra-repo_Release
W: Invalid ‘Date’ entry in Release file /var/lib/apt/lists/_var_nv-gie-repo-6-rc-cuda8.0_Release
W: Invalid ‘Date’ entry in Release file /var/lib/apt/lists/_var_visionworks-repo_Release
W: Invalid ‘Date’ entry in Release file /var/lib/apt/lists/_var_visionworks-sfm-repo_Release
W: Invalid ‘Date’ entry in Release file /var/lib/apt/lists/_var_visionworks-tracking-repo_Release
E: Failed to fetch http://us.archive.ubuntu.com/ubuntu/dists/xenial/main/binary-arm64/Packages 404 Not Found [IP: 91.189.91.23 80]
E: Failed to fetch http://us.archive.ubuntu.com/ubuntu/dists/xenial-updates/main/binary-arm64/Packages 404 Not Found [IP: 91.189.91.23 80]
E: Failed to fetch http://us.archive.ubuntu.com/ubuntu/dists/xenial-backports/main/binary-arm64/Packages 404 Not Found [IP: 91.189.91.23 80]
E: Some index files failed to download. They have been ignored, or old ones used instead.

Hi,

Thanks for your feedback.
The invalid ‘Date’ warnings can be safely ignored.

Let’s track further installation on topic 1004976:
https://devtalk.nvidia.com/default/topic/1004976/jetson-tx1/faster-r-cnn-on-jetson-tx1/

Thanks.

I am on a TX2 flashed with JetPack 3.1, but I hope I can ask for help here.

  1. After installing Faster R-CNN, I could run demo.py with the test images. However, it does not display the images. The error message is: “Couldn’t find foreign struct converter for ‘cairo.Context’”

  2. With the same setup, I also tried to run a video application, and it failed with the error message
    “Gtk-ERROR **: GTK+ 2.x symbols detected. Using GTK+ 2.x and GTK+ 3 in the same process is not supported”

Hi h.ito,

Please file your issue on the TX2 board: https://devtalk.nvidia.com/default/board/188/jetson-tx2/

Thanks