Object detection failed to run on TX2, based on tensorflow/models

Hi,
I’m trying to implement object detection on TX2.

I installed TensorFlow on the TX2 from source: https://github.com/jetsonhacks/installTensorFlowTX2 . I tried the TensorFlow models code (https://github.com/tensorflow/models/tree/master/research/object_detection) with pre-trained models. It worked with the ssd_mobilenet_v1_coco and ssd_inception_v2_coco models. But with other models such as rfcn_resnet101_coco and faster_rcnn_resnet101_coco, the code worked on my PC using the CPU but failed to launch on the TX2.
I got the following errors:

2017-11-03 08:16:11.868600: E tensorflow/stream_executor/cuda/cuda_driver.cc:1068] failed to synchronize the stop event: CUDA_ERROR_LAUNCH_FAILED
2017-11-03 08:16:11.868751: E tensorflow/stream_executor/cuda/cuda_timer.cc:54] Internal: error destroying CUDA event in context 0x9f61400: CUDA_ERROR_LAUNCH_FAILED
2017-11-03 08:16:11.868794: E tensorflow/stream_executor/cuda/cuda_timer.cc:59] Internal: error destroying CUDA event in context 0x9f61400: CUDA_ERROR_LAUNCH_FAILED
2017-11-03 08:16:11.868999: F tensorflow/stream_executor/cuda/cuda_dnn.cc:2045] failed to enqueue convolution on stream: CUDNN_STATUS_EXECUTION_FAILED

Is there anything wrong with my driver?

Best regards!

Hi,

Could you help us check the memory status via tegrastats?

sudo ~/tegrastats

Please share the tegrastats data when this error occurs.
Thanks.

The tegrastats data is as follows:

RAM 3091/7851MB (lfb 1015x4MB) SWAP 0/8192MB (cached 0MB) cpu [2%@345,0%@1981,100%@1980,0%@348,0%@346,0%@348] EMC 0%@1866 APE 150 GR3D 0%@114
RAM 3123/7851MB (lfb 1007x4MB) SWAP 0/8192MB (cached 0MB) cpu [1%@345,0%@2035,100%@2035,0%@345,0%@345,1%@349] EMC 0%@1866 APE 150 GR3D 0%@114
RAM 3138/7851MB (lfb 1003x4MB) SWAP 0/8192MB (cached 0MB) cpu [3%@345,0%@2036,100%@2034,0%@348,0%@345,0%@348] EMC 0%@1866 APE 150 GR3D 0%@114
RAM 3154/7851MB (lfb 999x4MB) SWAP 0/8192MB (cached 0MB) cpu [2%@345,0%@2035,100%@2034,0%@348,0%@345,0%@345] EMC 0%@1866 APE 150 GR3D 0%@114
RAM 3169/7851MB (lfb 995x4MB) SWAP 0/8192MB (cached 0MB) cpu [0%@345,0%@2010,100%@2008,2%@346,0%@349,0%@348] EMC 0%@1866 APE 150 GR3D 0%@114
RAM 3186/7851MB (lfb 991x4MB) SWAP 0/8192MB (cached 0MB) cpu [3%@345,0%@2035,100%@2036,0%@345,0%@348,1%@348] EMC 0%@1866 APE 150 GR3D 0%@114
RAM 3217/7851MB (lfb 983x4MB) SWAP 0/8192MB (cached 0MB) cpu [2%@345,0%@2006,100%@2006,0%@345,0%@348,0%@349] EMC 0%@1866 APE 150 GR3D 0%@114
RAM 3325/7851MB (lfb 957x4MB) SWAP 0/8192MB (cached 0MB) cpu [2%@345,0%@2034,100%@2035,0%@345,0%@348,0%@349] EMC 0%@1866 APE 150 GR3D 0%@114
RAM 3326/7851MB (lfb 957x4MB) SWAP 0/8192MB (cached 0MB) cpu [2%@346,0%@2011,100%@2008,2%@348,0%@349,0%@348] EMC 0%@1866 APE 150 GR3D 0%@114
RAM 3326/7851MB (lfb 956x4MB) SWAP 0/8192MB (cached 0MB) cpu [1%@345,0%@2010,100%@2008,0%@348,1%@348,0%@348] EMC 0%@1866 APE 150 GR3D 0%@114
RAM 3327/7851MB (lfb 956x4MB) SWAP 0/8192MB (cached 0MB) cpu [1%@345,0%@2035,100%@2035,1%@348,1%@348,1%@348] EMC 0%@1866 APE 150 GR3D 0%@114
RAM 3328/7851MB (lfb 956x4MB) SWAP 0/8192MB (cached 0MB) cpu [3%@345,0%@2010,100%@2005,0%@345,0%@348,0%@349] EMC 0%@1866 APE 150 GR3D 0%@114
RAM 3379/7851MB (lfb 943x4MB) SWAP 0/8192MB (cached 0MB) cpu [2%@345,0%@2034,100%@2034,0%@348,0%@345,0%@345] EMC 0%@1866 APE 150 GR3D 0%@114
RAM 3800/7851MB (lfb 838x4MB) SWAP 0/8192MB (cached 0MB) cpu [2%@345,0%@2009,100%@2012,0%@348,0%@349,0%@346] EMC 2%@1866 APE 150 GR3D 0%@114
RAM 4122/7851MB (lfb 757x4MB) SWAP 0/8192MB (cached 0MB) cpu [3%@345,0%@2034,100%@2034,0%@345,0%@348,0%@348] EMC 3%@1866 APE 150 GR3D 0%@114
RAM 3759/7851MB (lfb 821x4MB) SWAP 0/8192MB (cached 0MB) cpu [4%@345,0%@2034,100%@2036,0%@348,1%@346,0%@348] EMC 4%@1866 APE 150 GR3D 0%@114
RAM 4026/7851MB (lfb 782x4MB) SWAP 0/8192MB (cached 0MB) cpu [3%@345,0%@2010,100%@2036,0%@345,0%@348,0%@348] EMC 4%@1866 APE 150 GR3D 0%@114
RAM 4282/7851MB (lfb 715x4MB) SWAP 0/8192MB (cached 0MB) cpu [0%@345,0%@2009,97%@2010,0%@348,0%@348,1%@349] EMC 4%@1866 APE 150 GR3D 0%@114
RAM 4528/7851MB (lfb 652x4MB) SWAP 0/8192MB (cached 0MB) cpu [0%@345,0%@2035,100%@2009,1%@348,1%@349,0%@348] EMC 6%@1866 APE 150 GR3D 0%@114
RAM 4528/7851MB (lfb 652x4MB) SWAP 0/8192MB (cached 0MB) cpu [0%@345,0%@2035,100%@2035,0%@345,0%@349,3%@348] EMC 10%@1866 APE 150 GR3D 0%@114
RAM 4562/7851MB (lfb 644x4MB) SWAP 0/8192MB (cached 0MB) cpu [4%@652,0%@346,75%@345,9%@652,6%@652,0%@652] EMC 7%@1866 APE 150 GR3D 47%@114
RAM 4586/7851MB (lfb 632x4MB) SWAP 0/8192MB (cached 0MB) cpu [24%@960,0%@345,0%@345,39%@960,20%@960,27%@961] EMC 6%@1866 APE 150 GR3D 0%@114
RAM 4805/7851MB (lfb 560x4MB) SWAP 0/8192MB (cached 0MB) cpu [2%@959,24%@345,0%@345,35%@961,20%@963,6%@959] EMC 4%@1866 APE 150 GR3D 17%@114
RAM 4989/7851MB (lfb 492x4MB) SWAP 0/8192MB (cached 0MB) cpu [1%@1547,0%@345,0%@345,0%@1565,68%@1564,6%@1564] EMC 4%@1866 APE 150 GR3D 0%@114
RAM 5191/7851MB (lfb 435x4MB) SWAP 0/8192MB (cached 0MB) cpu [2%@345,36%@1706,39%@1707,1%@348,5%@345,6%@349] EMC 3%@1866 APE 150 GR3D 36%@114
RAM 5287/7851MB (lfb 408x4MB) SWAP 0/8192MB (cached 0MB) cpu [0%@345,2%@2013,91%@2007,3%@348,1%@348,5%@348] EMC 4%@1866 APE 150 GR3D 99%@114
RAM 5287/7851MB (lfb 408x4MB) SWAP 0/8192MB (cached 0MB) cpu [0%@345,0%@2034,100%@2035,1%@348,0%@349,0%@348] EMC 10%@1866 APE 150 GR3D 99%@624
RAM 5457/7851MB (lfb 355x4MB) SWAP 0/8192MB (cached 0MB) cpu [29%@1997,45%@2034,8%@2035,2%@1996,0%@1996,4%@1997] EMC 7%@1866 APE 150 GR3D 99%@114
RAM 5457/7851MB (lfb 355x4MB) SWAP 0/8192MB (cached 0MB) cpu [0%@345,100%@2035,0%@2034,3%@349,0%@349,0%@348] EMC 18%@1866 APE 150 GR3D 99%@624
RAM 5463/7851MB (lfb 354x4MB) SWAP 0/8192MB (cached 0MB) cpu [0%@345,14%@2035,2%@2036,3%@349,4%@348,3%@349] EMC 9%@1866 APE 150 GR3D 0%@1134
RAM 2012/7851MB (lfb 1003x4MB) SWAP 0/8192MB (cached 0MB) cpu [1%@2023,47%@345,0%@345,1%@2025,15%@2030,2%@2030] EMC 5%@1866 APE 150 GR3D 0%@318
RAM 2011/7851MB (lfb 1003x4MB) SWAP 0/8192MB (cached 0MB) cpu [0%@345,0%@345,0%@345,4%@348,0%@348,0%@346] EMC 2%@1866 APE 150 GR3D 0%@216
RAM 2011/7851MB (lfb 1003x4MB) SWAP 0/8192MB (cached 0MB) cpu [41%@345,0%@345,0%@345,33%@345,1%@345,2%@346] EMC 1%@1866 APE 150 GR3D 0%@114
RAM 1351/7851MB (lfb 1115x4MB) SWAP 0/8192MB (cached 0MB) cpu [38%@2023,1%@2034,4%@2035,8%@2026,1%@2029,12%@2029] EMC 3%@1866 APE 150 GR3D 0%@114
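
For reading a log like the one above, here is a minimal sketch (not part of the original report) that pulls the peak RAM use and peak GPU load out of tegrastats lines. The function names and regular expressions are illustrative assumptions based only on the line format shown in this thread; other tegrastats versions may print different fields.

```python
import re

# Patterns for the "RAM used/totalMB" and "GR3D load%@freq" fields,
# assumed from the tegrastats lines quoted in this thread.
RAM_RE = re.compile(r"RAM (\d+)/(\d+)MB")
GR3D_RE = re.compile(r"GR3D (\d+)%@(\d+)")


def parse_tegrastats_line(line):
    """Return (ram_used_mb, ram_total_mb, gpu_load_pct), or None if the line does not match."""
    ram = RAM_RE.search(line)
    gpu = GR3D_RE.search(line)
    if ram is None or gpu is None:
        return None
    return int(ram.group(1)), int(ram.group(2)), int(gpu.group(1))


def summarize(lines):
    """Return (peak_ram_mb, total_ram_mb, peak_gpu_load_pct) over all parsable lines."""
    samples = [s for s in map(parse_tegrastats_line, lines) if s is not None]
    if not samples:
        raise ValueError("no tegrastats samples found")
    peak_ram = max(s[0] for s in samples)
    total_ram = samples[0][1]
    peak_gpu = max(s[2] for s in samples)
    return peak_ram, total_ram, peak_gpu


# Three lines copied from the log above.
SAMPLE_LOG = [
    "RAM 5287/7851MB (lfb 408x4MB) SWAP 0/8192MB (cached 0MB) cpu [0%@345,0%@2034,100%@2035,1%@348,0%@349,0%@348] EMC 10%@1866 APE 150 GR3D 99%@624",
    "RAM 5463/7851MB (lfb 354x4MB) SWAP 0/8192MB (cached 0MB) cpu [0%@345,14%@2035,2%@2036,3%@349,4%@348,3%@349] EMC 9%@1866 APE 150 GR3D 0%@1134",
    "RAM 2012/7851MB (lfb 1003x4MB) SWAP 0/8192MB (cached 0MB) cpu [1%@2023,47%@345,0%@345,1%@2025,15%@2030,2%@2030] EMC 5%@1866 APE 150 GR3D 0%@318",
]

peak_ram, total_ram, peak_gpu = summarize(SAMPLE_LOG)
print("peak RAM %d/%d MB, peak GPU load %d%%" % (peak_ram, total_ram, peak_gpu))
```

Comparing the peak RAM figure against the total gives a quick first check of whether a run came close to exhausting memory before it crashed.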

In addition, journalctl shows the following errors:

Nov 06 07:32:36 tegra-ubuntu kernel: arm-smmu 12000000.iommu: Unhandled context fault: iova=0x6769f200, fsynr=0x80013, cb=20, sid=16(0x10 - GPU), pgd=1c11dc003, pud=1c
Nov 06 07:32:36 tegra-ubuntu kernel: arm-smmu 12000000.iommu: Unhandled context fault: iova=0x70605b40, fsynr=0x80012, cb=20, sid=16(0x10 - GPU), pgd=1c11dc003, pud=1c
Nov 06 07:32:36 tegra-ubuntu kernel: gk20a 17000000.gp10b: gk20a_fifo_handle_pbdma_intr: pbdma_intr_0(0):0x00004000 PBH: 20010180 SHADOW: 20022060 M0: 8000001d 8081018
Nov 06 07:32:36 tegra-ubuntu kernel: gk20a 17000000.gp10b: gk20a_set_error_notifier_locked: error notifier set to 32 for ch 505
Nov 06 07:32:36 tegra-ubuntu kernel: gk20a 17000000.gp10b: gk20a_set_error_notifier_locked: error notifier set to 32 for ch 504
Nov 06 07:32:36 tegra-ubuntu kernel: gk20a 17000000.gp10b: gk20a_set_error_notifier_locked: error notifier set to 32 for ch 503
Nov 06 07:32:36 tegra-ubuntu kernel: gk20a 17000000.gp10b: gk20a_set_error_notifier_locked: error notifier set to 32 for ch 502

The detailed dump file can be accessed from https://drive.google.com/open?id=0BzgK-7PlFRj6T0gxTDJ1VVFwUlE

Thanks.

Hi,

The tegrastats data does not point to an out-of-memory issue.
Our guess is that some specific layers are causing the issue on Jetson.

Could you help us track down this layer?
For example, could you try the faster_rcnn_resnet50_coco network?

Thanks.

I tried the ssd_inception_v2_coco model and it failed as well, with the same exception.

I’ll try the faster_rcnn_resnet50_coco network later.

Hi deljuven,

"I’ll try faster_rcnn_resnet50_coco network later."

Any update?

Thanks

Sorry for the late update.

Here is the memory status with the faster_rcnn_resnet50_coco network:

RAM 2463/7851MB (lfb 4x2MB) SWAP 9/12288MB (cached 1MB) cpu [0%@345,0%@2035,100%@2034,0%@348,8%@349,0%@349] EMC 0%@1866 APE 150 GR3D 0%@114
RAM 2464/7851MB (lfb 4x2MB) SWAP 9/12288MB (cached 1MB) cpu [0%@345,0%@2034,100%@2035,0%@348,15%@348,1%@348] EMC 0%@1866 APE 150 GR3D 0%@114
RAM 2585/7851MB (lfb 4x2MB) SWAP 9/12288MB (cached 1MB) cpu [1%@345,0%@2007,100%@2005,0%@345,7%@349,0%@348] EMC 1%@1866 APE 150 GR3D 0%@114
RAM 2504/7851MB (lfb 4x2MB) SWAP 9/12288MB (cached 1MB) cpu [0%@345,0%@2034,100%@2034,0%@348,2%@347,0%@349] EMC 2%@1866 APE 150 GR3D 0%@114
RAM 2504/7851MB (lfb 4x2MB) SWAP 9/12288MB (cached 1MB) cpu [0%@345,0%@1975,100%@1978,9%@349,1%@348,0%@349] EMC 2%@1866 APE 150 GR3D 0%@114
RAM 2519/7851MB (lfb 4x2MB) SWAP 9/12288MB (cached 1MB) cpu [1%@345,0%@2005,100%@2007,0%@348,9%@348,0%@348] EMC 4%@1866 APE 150 GR3D 0%@114
RAM 2520/7851MB (lfb 4x2MB) SWAP 9/12288MB (cached 1MB) cpu [5%@1345,0%@345,93%@345,0%@1344,2%@1348,5%@1346] EMC 5%@1866 APE 150 GR3D 47%@114
RAM 2601/7851MB (lfb 4x2MB) SWAP 9/12288MB (cached 1MB) cpu [13%@345,2%@1264,36%@1727,22%@499,4%@498,14%@494] EMC 4%@1866 APE 150 GR3D 7%@114
RAM 2684/7851MB (lfb 4x2MB) SWAP 9/12288MB (cached 1MB) cpu [37%@806,10%@345,0%@345,1%@806,1%@806,33%@807] EMC 9%@665 APE 150 GR3D 14%@114
RAM 2823/7851MB (lfb 4x2MB) SWAP 9/12288MB (cached 1MB) cpu [21%@806,9%@653,7%@653,8%@805,29%@805,6%@806] EMC 7%@665 APE 150 GR3D 1%@114
RAM 2955/7851MB (lfb 4x2MB) SWAP 9/12288MB (cached 1MB) cpu [4%@1336,11%@345,0%@345,17%@1344,31%@1345,19%@1346] EMC 2%@1866 APE 150 GR3D 0%@114
RAM 3131/7851MB (lfb 4x2MB) SWAP 9/12288MB (cached 1MB) cpu [14%@345,27%@1981,9%@1983,17%@345,21%@348,19%@348] EMC 3%@1866 APE 150 GR3D 15%@114
RAM 5105/7851MB (lfb 622x256kB) SWAP 9/12288MB (cached 1MB) cpu [1%@345,0%@2033,96%@2035,1%@348,0%@348,1%@348] EMC 8%@1866 APE 150 GR3D 0%@114
RAM 6150/7851MB (lfb 4x2MB) SWAP 9/12288MB (cached 1MB) cpu [2%@345,13%@2006,100%@2005,6%@348,1%@345,15%@348] EMC 9%@1866 APE 150 GR3D 99%@114
RAM 6150/7851MB (lfb 4x2MB) SWAP 9/12288MB (cached 1MB) cpu [0%@345,0%@2035,100%@2036,2%@345,2%@349,2%@348] EMC 22%@1866 APE 150 GR3D 99%@624
RAM 6150/7851MB (lfb 4x2MB) SWAP 9/12288MB (cached 1MB) cpu [0%@345,0%@2035,100%@2035,0%@345,1%@348,1%@348] EMC 41%@1866 APE 150 GR3D 99%@1134
RAM 6150/7851MB (lfb 4x2MB) SWAP 9/12288MB (cached 1MB) cpu [11%@345,47%@2034,36%@2034,8%@349,5%@346,1%@345] EMC 37%@1866 APE 150 GR3D 99%@1300
RAM 6151/7851MB (lfb 4x2MB) SWAP 9/12288MB (cached 1MB) cpu [2%@345,62%@1260,0%@2034,1%@348,18%@347,1%@345] EMC 49%@1866 APE 150 GR3D 99%@1300
RAM 6151/7851MB (lfb 4x2MB) SWAP 9/12288MB (cached 1MB) cpu [3%@345,0%@2004,80%@2006,4%@345,6%@348,15%@349] EMC 29%@1866 APE 150 GR3D 0%@1300
RAM 3802/7851MB (lfb 4x2MB) SWAP 9/12288MB (cached 1MB) cpu [3%@1574,12%@2009,73%@2014,14%@1573,0%@1574,0%@1574] EMC 15%@1866 APE 150 GR3D 0%@624
RAM 2088/7851MB (lfb 4x2MB) SWAP 9/12288MB (cached 1MB) cpu [1%@345,10%@345,4%@345,0%@348,1%@348,7%@348] EMC 12%@1331 APE 150 GR3D 0%@420

Here is the output trace file: https://drive.google.com/open?id=1WnPazPogwZ0JDyh9lkNGhPYYd7K1cOvN

Thanks.

Hi,

It looks like an out-of-memory issue.
Could you reduce the batch size and check whether the error also occurs with the rfcn_resnet101_coco, faster_rcnn_resnet101_coco… models?

Thanks.

I tried the resnet50 model and trained with batch_size=1 on the board, but training failed to start: the process was killed by the OOM killer. I also tried the pre-trained faster_rcnn_resnet50_coco model listed at https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md and this model also failed to run.
I created an 8GB swap but it still failed with an out-of-memory error.

Has anyone successfully run the resnet models on TX2? Any advice would be appreciated.

Hi,

If you want to use the GPU for training, the required memory must be GPU accessible.
Swap memory can only be accessed by the CPU.

Currently, the maximum GPU memory on TX2 is 8GB.
Thanks.
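
Since swap does not help the GPU, one commonly used mitigation is to cap how much memory TensorFlow claims for itself. Below is a minimal sketch, assuming TensorFlow 1.x (as used in this thread); the helper names and the 2000 MB system reserve are illustrative assumptions, not measured values, and capping memory will not help if the model genuinely needs more than the cap.

```python
def gpu_memory_fraction(total_mb, reserved_for_system_mb):
    """Fraction of total memory TensorFlow may claim, leaving a reserve for the OS.

    Both arguments are in MB. The reserve is a guess to be tuned per board.
    """
    if not 0 <= reserved_for_system_mb < total_mb:
        raise ValueError("reserve must be non-negative and smaller than total memory")
    return (total_mb - reserved_for_system_mb) / float(total_mb)


def make_session_config(fraction):
    """Build a TensorFlow 1.x session config that limits GPU memory use."""
    import tensorflow as tf  # imported lazily; assumes TensorFlow 1.x is installed

    config = tf.ConfigProto()
    # Allocate GPU memory on demand instead of grabbing it all at start-up.
    config.gpu_options.allow_growth = True
    # Upper bound on the share of GPU memory this process may use.
    config.gpu_options.per_process_gpu_memory_fraction = fraction
    return config


# 7851 MB is the total reported by tegrastats in this thread;
# the 2000 MB reserve is a hypothetical starting point.
fraction = gpu_memory_fraction(7851, 2000)
print("per_process_gpu_memory_fraction = %.2f" % fraction)

# Usage on the board (not executed here):
#   sess = tf.Session(graph=detection_graph, config=make_session_config(fraction))
# where detection_graph is whatever graph the inference script already loads.
```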

I tried using the TensorFlow Object Detection API with Faster R-CNN and got an OOM error; this problem seems to have no solution so far. My model was trained on Windows with a Titan X 1080 Ti, where it works fine. I also noticed that launching a Python inference script starts two processes.

Hi,

Could you share more information about your problem?

There are two initial suggestions for you first:
1. Set batch size to ONE
2. Use a smaller model

Thanks.
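
For the first suggestion above, the batch size used for training with the Object Detection API is set in the pipeline config file. A minimal sketch for rewriting it from Python follows; the helper name `set_batch_size` and the sample config text are illustrative, and the regex simply rewrites every `batch_size: N` entry it finds rather than parsing the config properly.

```python
import re


def set_batch_size(config_text, batch_size=1):
    """Return config_text with every `batch_size: N` entry set to batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    new_text, count = re.subn(
        r"(\bbatch_size\s*:\s*)\d+",
        lambda match: match.group(1) + str(batch_size),
        config_text,
    )
    if count == 0:
        raise ValueError("no batch_size entry found in config")
    return new_text


# Illustrative fragment in the style of a pipeline config file.
SAMPLE_CONFIG = """train_config: {
  batch_size: 24
  num_steps: 200000
}
"""

print(set_batch_size(SAMPLE_CONFIG, 1))
```

On a real setup the text would be read from and written back to the pipeline config file passed to the training script.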

Actually, I have posted another topic here: https://devtalk.nvidia.com/default/topic/1027449/jetson-tx2/run-tensorflow-1-3-on-tx2-stuck/post/5227548/#5227548 . It is just an inference script, but it fails.

Hi shartoo518,

Topic 1027449 is closed.
If you still meet this issue, it’s recommended to post a comment there or file a new topic to track.

Thanks.