Why do we receive SIGSEGV when importing deploy.prototxt with TensorRT 3.0?

We trained a Caffe network and model and want to use them for inference, but a segmentation fault occurs.

Thread 1 "zftech_detectne" received signal SIGSEGV, Segmentation fault.
0x00007fffec5f7a52 in nvinfer1::Network::validate(nvinfer1::cudnn::HardwareContext const&, bool, bool, int) const ()
   from /usr/lib/x86_64-linux-gnu/libnvinfer.so.4
(gdb) bt
#0  0x00007fffec5f7a52 in nvinfer1::Network::validate(nvinfer1::cudnn::HardwareContext const&, bool, bool, int) const ()
   from /usr/lib/x86_64-linux-gnu/libnvinfer.so.4
#1  0x00007fffec5e4ce6 in nvinfer1::builder::buildEngine(nvinfer1::CudaEngineBuildConfig&, nvinfer1::cudnn::HardwareContext const&, nvinfer1::Network const&)
    () from /usr/lib/x86_64-linux-gnu/libnvinfer.so.4
#2  0x00007fffec5c1e11 in nvinfer1::Builder::buildCudaEngine(nvinfer1::INetworkDefinition&) () from /usr/lib/x86_64-linux-gnu/libnvinfer.so.4
#3  0x00000000004047a9 in caffeToGIEModel (deployFile="deploy.prototxt",
    modelFile="snapshot_iter_37100.caffemodel",
    outputs=std::vector of length 1, capacity 1 = {...}, maxBatchSize=4,
    gieModelStream=@0x7fffffffe248: 0x0) at zftechDetectNet.cpp:98
#4  0x0000000000404f17 in main (argc=1, argv=0x7fffffffe3f8)
    at zftechDetectNet.cpp:153

Our platform is Ubuntu 16.04 + TensorRT 3.0 + CUDA 9.0 + Tesla P4 + Driver 384.90.
Why does Network::validate fail? AastaLLL mentioned in https://devtalk.nvidia.com/default/topic/1008935/jetson-tx1/error-with-concatenate-layer-in-tensorrt2/1 that "code will be the same, but please re-compile your model with the aarch64 TensorRT library to make it compatible with the TX1 GPU architecture." Can .prototxt and .caffemodel files trained on one platform be deployed on another? Embedded systems always use cross-compilation; what about different TensorRT or CUDA versions?
The files we are trying to import seem too big to upload as forum attachments; we will keep following up on this issue. Thanks.
snapshot_iter.zip (21.2 MB)
detectnet.zip (21.2 MB)

When I uploaded an attachment for the first time, there was no progress bar; the bar only appeared after the first file uploaded successfully. The behavior is unpredictable if I upload another file before the previous one finishes uploading; they fail together.
The attachment div disappears if I click "edit" and "save changes"; I need to refresh the forum page to see the attachments again.
The attachment problem can be worked around, so this reply can be ignored.

Since I can't delete the reply, I'll just leave it blank.

Hi,

Could you share some information about your training environment?

TensorRT doesn't support models trained with NvCaffe-0.16.
The serialization method changed in NvCaffe-0.16, and TensorRT doesn't support it yet.

Please use NvCaffe-0.15 to avoid this incompatibility issue.
Thanks

The training environment was set up by someone else. Is there any way to retrieve the NvCaffe version from the .prototxt/.caffemodel files?
Anyway, we have set up an NvCaffe 0.15.14 environment and are re-training on it; hopefully the result can be loaded by TensorRT 3.x. Thanks.

Since we failed again, we need to provide more information about our training environment.

Our NvCaffe is 0.15.14 now, but we trained the standard DetectNet in a DIGITS 6.1.0 environment. It seems DIGITS models are usually deployed on the Jetson TX2, but we are using a Tesla P4.

The "googlenet.caffemodel" in the TensorRT 3.0 samples is 40 MB, but the "snapshot_iter_3100.caffemodel" we trained for DetectNet is only 23 MB. Should we train GoogLeNet with pure NvCaffe for the Tesla P4? Can we not use DIGITS for this, or does it need some special setup?

A colleague on our team has successfully run jetson-inference with the detection function on Ubuntu 16.04 + K2200. I am asking him for the .prototxt/.caffemodel files with which he succeeded at inference, and I will test with those files.
Result: the same .prototxt/.caffemodel files that run inference in his jetson-inference PC environment cause a segmentation fault when imported into my Tesla P4 environment.
NvCaffe version aside, why can't the standard DetectNet from DIGITS be deployed to TensorRT 3.0 + Tesla P4?

Hi haifengli, the Python layers need to be removed from the DIGITS DetectNet (after training and before loading with TensorRT). Please refer to this step: https://github.com/dusty-nv/jetson-inference#detectnet-patches-for-tensorrt
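For readers hitting the same issue: in the stock DIGITS DetectNet deploy.prototxt, the deployment-time Python layer looks roughly like the block below (the exact names and python_param contents are quoted from memory and may differ in your file; check your own prototxt). Commenting out or deleting the whole layer leaves "coverage" and "bboxes" as the outputs TensorRT can build:

```
# Deployment-time clustering layer -- not supported by TensorRT's Caffe
# parser, so remove or comment it out before building the engine:
# layer {
#   name: "cluster"
#   type: "Python"
#   bottom: "coverage"
#   bottom: "bboxes"
#   top: "bbox-list"
#   python_param {
#     module: "caffe.layers.detectnet.clustering"
#     layer: "ClusterDetections"
#   }
# }
```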

We completed the Python-layer removal steps by commenting those layers out with #, but still had no success. Another colleague trained the MNIST LeNet with DIGITS, and it runs successfully in TensorRT.
I have uploaded the exact files that cause this topic's problem as an attachment.
devtalk.zip (21.8 MB)

Hi,

Sorry for the late reply.

I have checked your model on a P4 server, and it runs correctly.
Could you double-check?

vickteam@p4station:~/TensorRT2.1/TensorRT-2.1.2/bin$ ./giexec --deploy=deploy.prototxt --output=bboxes --output=coverage
deploy: deploy.prototxt
output: bboxes
output: coverage
Input "data": 3x360x640
Output "bboxes": 4x22x40
Output "coverage": 1x22x40
name=data, bindingIndex=0, buffers.size()=3
name=bboxes, bindingIndex=1, buffers.size()=3
name=coverage, bindingIndex=2, buffers.size()=3
Average over 10 runs is 5.87547 ms.
Average over 10 runs is 5.87309 ms.
Average over 10 runs is 5.87225 ms.
Average over 10 runs is 5.85517 ms.
Average over 10 runs is 5.85287 ms.
Average over 10 runs is 5.85774 ms.
Average over 10 runs is 5.85375 ms.
Average over 10 runs is 5.85751 ms.
Average over 10 runs is 5.85822 ms.
Average over 10 runs is 5.86075 ms.

Thanks.

I tried your command with and without the --model option, and both work. Then I compared your command-line options with my C++ code, and found:

I set the output name to the layer name, but it needs to be the output blob name. I hadn't noticed the distinction because the layer and the blob are both called "prob" in the "googlenet.prototxt" file.
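A quick way to avoid mixing up layer names and blob names is to derive the output blob names from the prototxt itself. The helper below is an illustrative sketch, not part of TensorRT or the original code: it assumes a plain text scan of deploy.prototxt is good enough, and treats every top blob that no layer consumes as a bottom as a network output. Those are the names to pass to markOutput() or giexec's --output option.

```cpp
#include <regex>
#include <set>
#include <string>
#include <vector>

// List the top blobs that no layer consumes as a bottom; these are the
// network's output blobs, i.e. the names TensorRT expects as outputs.
std::vector<std::string> outputBlobs(const std::string& prototxt)
{
    std::regex topRe("top:\\s*\"([^\"]+)\"");
    std::regex bottomRe("bottom:\\s*\"([^\"]+)\"");

    // Collect every blob consumed as an input by some layer.
    std::set<std::string> bottoms;
    for (std::sregex_iterator it(prototxt.begin(), prototxt.end(), bottomRe), end;
         it != end; ++it)
        bottoms.insert((*it)[1]);

    // Any top that is never a bottom is a network output
    // (duplicates from in-place layers like ReLU are skipped).
    std::vector<std::string> outputs;
    std::set<std::string> seen;
    for (std::sregex_iterator it(prototxt.begin(), prototxt.end(), topRe), end;
         it != end; ++it) {
        std::string top = (*it)[1];
        if (!bottoms.count(top) && !seen.count(top)) {
            seen.insert(top);
            outputs.push_back(top);
        }
    }
    return outputs;
}
```

For a DetectNet-style deploy file this yields the blob names "coverage" and "bboxes" rather than the producing layers' names, which is exactly the distinction that caused the crash here.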

Thank you very much; we found the bug in our code with your help.