detectnet with custom model (ssd-mobilenet-v2) fails with cuTensor Error in executeCutensor: 7 (Internal cuTensor reformat failed)

Hi,

I retrained the detection model (ssd-mobilenet-v2) on my own dataset, successfully converted it to ONNX, and am now trying to run it on a Jetson Nano 4GB kit.

I used this command:

./detectnet --model=/home/jetson/detect/mb2-ssd-lite.onnx --input-blob=input_0 --output-blob=boxes --output-cvg=scores --input_URI /home/jetson/detect/test6.mp4 --input-codec=mpeg4 --log-file=/home/jetson/detect/detectNet-Log.txt --debug --log-level=debug --headless

and got this error:

[TRT]    ------------------------------------------------
[TRT]    Timing Report /home/jetson/detect/mb2-ssd-lite.onnx
[TRT]    ------------------------------------------------
[cuda]      device not ready (error 600) (hex 0x258)
[cuda]      /home/jetson/detect/AI/jetson-inference/build/aarch64/include/jetson-inference/tensorNet.h:685
[TRT]    Pre-Process   CPU   0.09969ms  CUDA   0.00000ms
[cuda]      invalid resource handle (error 400) (hex 0x190)
[cuda]      /home/jetson/detect/AI/jetson-inference/build/aarch64/include/jetson-inference/tensorNet.h:685
[TRT]    Network       CPU   0.00000ms  CUDA   0.00000ms
[TRT]    Total         CPU   0.09969ms  CUDA   0.00000ms
[TRT]    ------------------------------------------------
[gstreamer] gstDecoder -- recieved 1280x720 frame (1382400 bytes)
[TRT]    ../rtSafe/cuda/cutensorReformat.cpp (352) - cuTensor Error in executeCutensor: 7 (Internal cuTensor reformat failed)
[TRT]    FAILED_EXECUTION: std::exception
[TRT]    failed to execute TensorRT context on device GPU
[TRT]    ------------------------------------------------
[TRT]    Timing Report /home/jetson/detect/mb2-ssd-lite.onnx
[TRT]    ------------------------------------------------
[cuda]      device not ready (error 600) (hex 0x258)
[cuda]      /home/jetson/detect/AI/jetson-inference/build/aarch64/include/jetson-inference/tensorNet.h:685
[TRT]    Pre-Process   CPU   0.05813ms  CUDA   0.00000ms
[cuda]      invalid resource handle (error 400) (hex 0x190)
[cuda]      /home/jetson/detect/AI/jetson-inference/build/aarch64/include/jetson-inference/tensorNet.h:685
[TRT]    Network       CPU   0.00000ms  CUDA   0.00000ms
[TRT]    Total         CPU   0.05813ms  CUDA   0.00000ms
[TRT]    ------------------------------------------------

If I remove the --headless option, the video is displayed, but no inference results appear on it.

Full log is:

[gstreamer] initialized gstreamer, version 1.14.5.0
[gstreamer] gstDecoder -- creating decoder for /home/jetson/detect/test6.mp4
[gstreamer] gstDecoder -- discovered video resolution: 1280x720  (framerate 30.000000 Hz)
[gstreamer] gstDecoder -- discovered video caps:  video/x-h264, stream-format=(string)byte-stream, alignment=(string)au, level=(string)3.1, profile=(string)main, width=(int)1280, height=(int)720, framerate=(fraction)30/1, pixel-aspect-ratio=(fraction)1/1, interlace-mode=(string)progressive, chroma-format=(string)4:2:0, bit-depth-luma=(uint)8, bit-depth-chroma=(uint)8, parsed=(boolean)true
[gstreamer] gstDecoder -- pipeline string:
[gstreamer] filesrc location=/home/jetson/detect/test6.mp4 ! qtdemux ! queue ! h264parse ! omxh264dec ! video/x-raw ! appsink name=mysink
[video]  created gstDecoder from file:///home/jetson/detect/test6.mp4
------------------------------------------------
gstDecoder video options:
------------------------------------------------
  -- URI: file:///home/jetson/detect/test6.mp4
     - protocol:  file
     - location:  /home/jetson/detect/test6.mp4
     - extension: mp4
  -- deviceType: file
  -- ioType:     input
  -- codec:      h264
  -- width:      1280
  -- height:     720
  -- frameRate:  30.000000
  -- bitRate:    0
  -- numBuffers: 4
  -- zeroCopy:   true
  -- flipMethod: none
  -- loop:       0
------------------------------------------------
[video]  videoOptions -- failed to parse output resource URI (null)
[video]  videoOutput -- failed to parse command line options
detectnet:  failed to create output stream

detectNet -- loading detection network model from:
          -- prototxt     NULL
          -- model        /home/jetson/detect/mb2-ssd-lite.onnx
          -- input_blob   'input_0'
          -- output_cvg   'NULL'
          -- output_bbox  'boxes'
          -- mean_pixel   0.000000
          -- mean_binary  NULL
          -- class_labels NULL
          -- threshold    0.500000
          -- batch_size   1

[TRT]    TensorRT version 7.1.3
[TRT]    loading NVIDIA plugins...
[TRT]    Registered plugin creator - ::GridAnchor_TRT version 1
[TRT]    Registered plugin creator - ::NMS_TRT version 1
[TRT]    Registered plugin creator - ::Reorg_TRT version 1
[TRT]    Registered plugin creator - ::Region_TRT version 1
[TRT]    Registered plugin creator - ::Clip_TRT version 1
[TRT]    Registered plugin creator - ::LReLU_TRT version 1
[TRT]    Registered plugin creator - ::PriorBox_TRT version 1
[TRT]    Registered plugin creator - ::Normalize_TRT version 1
[TRT]    Registered plugin creator - ::RPROI_TRT version 1
[TRT]    Registered plugin creator - ::BatchedNMS_TRT version 1
[TRT]    Could not register plugin creator -  ::FlattenConcat_TRT version 1
[TRT]    Registered plugin creator - ::CropAndResize version 1
[TRT]    Registered plugin creator - ::DetectionLayer_TRT version 1
[TRT]    Registered plugin creator - ::Proposal version 1
[TRT]    Registered plugin creator - ::ProposalLayer_TRT version 1
[TRT]    Registered plugin creator - ::PyramidROIAlign_TRT version 1
[TRT]    Registered plugin creator - ::ResizeNearest_TRT version 1
[TRT]    Registered plugin creator - ::Split version 1
[TRT]    Registered plugin creator - ::SpecialSlice_TRT version 1
[TRT]    Registered plugin creator - ::InstanceNormalization_TRT version 1
[TRT]    detected model format - ONNX  (extension '.onnx')
[TRT]    desired precision specified for GPU: FASTEST
[TRT]    native precisions detected for GPU:  FP32, FP16
[TRT]    selecting fastest native precision for GPU:  FP16
[TRT]    attempting to open engine cache file /home/jetson/detect/mb2-ssd-lite.onnx.1.1.7103.GPU.FP16.engine
[TRT]    loading network plan from engine cache... /home/jetson/detect/mb2-ssd-lite.onnx.1.1.7103.GPU.FP16.engine
[TRT]    device GPU, loaded /home/jetson/detect/mb2-ssd-lite.onnx
[TRT]    Deserialize required 2829932 microseconds.
[TRT]    
[TRT]    CUDA engine context initialized on device GPU:
[TRT]       -- layers       144
[TRT]       -- maxBatchSize 1
[TRT]       -- workspace    0
[TRT]       -- deviceMemory 20514816
[TRT]       -- bindings     3
[TRT]       binding 0
                -- index   0
                -- name    'input_0'
                -- type    FP32
                -- in/out  INPUT
                -- # dims  4
                -- dim #0  1 (SPATIAL)
                -- dim #1  3 (SPATIAL)
                -- dim #2  300 (SPATIAL)
                -- dim #3  300 (SPATIAL)
[TRT]       binding 1
                -- index   1
                -- name    'scores'
                -- type    FP32
                -- in/out  OUTPUT
                -- # dims  3
                -- dim #0  1 (SPATIAL)
                -- dim #1  3000 (SPATIAL)
                -- dim #2  8 (SPATIAL)
[TRT]       binding 2
                -- index   2
                -- name    'boxes'
                -- type    FP32
                -- in/out  OUTPUT
                -- # dims  3
                -- dim #0  1 (SPATIAL)
                -- dim #1  3000 (SPATIAL)
                -- dim #2  4 (SPATIAL)
[TRT]    
[TRT]    binding to input 0 input_0  binding index:  0
[TRT]    binding to input 0 input_0  dims (b=1 c=3 h=300 w=300) size=1080000
[cuda]   cudaAllocMapped 1080000 bytes, CPU 0x100e60000 GPU 0x100e60000
[TRT]    binding to output 0 boxes  binding index:  2
[TRT]    binding to output 0 boxes  dims (b=1 c=3000 h=4 w=1) size=48000
[cuda]   cudaAllocMapped 48000 bytes, CPU 0x100d60200 GPU 0x100d60200
[TRT]    
[TRT]    device GPU, /home/jetson/detect/mb2-ssd-lite.onnx initialized.
[TRT]    detectNet -- number object classes:  4
[TRT]    detectNet -- maximum bounding boxes:  3000
[cuda]   cudaAllocMapped 1344000 bytes, CPU 0x100f68000 GPU 0x100f68000
[cuda]   cudaAllocMapped 64 bytes, CPU 0x100d6be00 GPU 0x100d6be00
[gstreamer] opening gstDecoder for streaming, transitioning pipeline to GST_STATE_PLAYING
[gstreamer] gstreamer changed state from NULL to READY ==> mysink
[gstreamer] gstreamer changed state from NULL to READY ==> capsfilter1
[gstreamer] gstreamer changed state from NULL to READY ==> omxh264dec-omxh264dec0
[gstreamer] gstreamer changed state from NULL to READY ==> h264parse1
[gstreamer] gstreamer changed state from NULL to READY ==> queue0
[gstreamer] gstreamer changed state from NULL to READY ==> qtdemux1
[gstreamer] gstreamer changed state from NULL to READY ==> filesrc0
[gstreamer] gstreamer changed state from NULL to READY ==> pipeline0
[gstreamer] gstreamer changed state from READY to PAUSED ==> capsfilter1
[gstreamer] gstreamer changed state from READY to PAUSED ==> omxh264dec-omxh264dec0
[gstreamer] gstreamer changed state from READY to PAUSED ==> h264parse1
[gstreamer] gstreamer stream status CREATE ==> src
[gstreamer] gstreamer changed state from READY to PAUSED ==> queue0
[gstreamer] gstreamer stream status ENTER ==> src
[gstreamer] gstreamer stream status CREATE ==> sink
[gstreamer] gstreamer changed state from READY to PAUSED ==> qtdemux1
[gstreamer] gstreamer changed state from READY to PAUSED ==> filesrc0
[gstreamer] gstDecoder -- onPreroll()
[gstreamer] gstreamer stream status ENTER ==> sink
[gstreamer] gstreamer message stream-start ==> pipeline0
[gstreamer] gstreamer stream status CREATE ==> src
[gstreamer] gstreamer message duration-changed ==> h264parse1
[gstreamer] gstreamer stream status ENTER ==> src
[gstreamer] gstreamer mysink taglist, video-codec=(string)"H.264\ /\ AVC";
[gstreamer] gstreamer mysink taglist, datetime=(datetime)2018-04-22T07:30:53Z, private-qt-tag=(sample){ 00000019677373740000001164617461000000010000000030:None:R3N0U2VnbWVudCwgZmxhZ3M9KEdzdFNlZ21lbnRGbGFncylHU1RfU0VHTUVOVF9GTEFHX05PTkUsIHJhdGU9KGRvdWJsZSkxLCBhcHBsaWVkLXJhdGU9KGRvdWJsZSkxLCBmb3JtYXQ9KEdzdEZvcm1hdClHU1RfRk9STUFUX1RJTUUsIGJhc2U9KGd1aW50NjQpMCwgb2Zmc2V0PShndWludDY0KTAsIHN0YXJ0PShndWludDY0KTAsIHN0b3A9KGd1aW50NjQpMTg0NDY3NDQwNzM3MDk1NTE2MTUsIHRpbWU9KGd1aW50NjQpMCwgcG9zaXRpb249KGd1aW50NjQpMCwgZHVyYXRpb249KGd1aW50NjQpMTg0NDY3NDQwNzM3MDk1NTE2MTU7AA__:YXBwbGljYXRpb24veC1nc3QtcXQtZ3NzdC10YWcsIHN0eWxlPShzdHJpbmcpaXR1bmVzOwA_, 0000001f677374640000001764617461000000010000000032313636393535:None:R3N0U2VnbWVudCwgZmxhZ3M9KEdzdFNlZ21lbnRGbGFncylHU1RfU0VHTUVOVF9GTEFHX05PTkUsIHJhdGU9KGRvdWJsZSkxLCBhcHBsaWVkLXJhdGU9KGRvdWJsZSkxLCBmb3JtYXQ9KEdzdEZvcm1hdClHU1RfRk9STUFUX1RJTUUsIGJhc2U9KGd1aW50NjQpMCwgb2Zmc2V0PShndWludDY0KTAsIHN0YXJ0PShndWludDY0KTAsIHN0b3A9KGd1aW50NjQpMTg0NDY3NDQwNzM3MDk1NTE2MTUsIHRpbWU9KGd1aW50NjQpMCwgcG9zaXRpb249KGd1aW50NjQpMCwgZHVyYXRpb249KGd1aW50NjQpMTg0NDY3NDQwNzM3MDk1NTE2MTU7AA__:YXBwbGljYXRpb24veC1nc3QtcXQtZ3N0ZC10YWcsIHN0eWxlPShzdHJpbmcpaXR1bmVzOwA_ }, container-format=(string)"ISO\ MP4/M4A";
[gstreamer] gstreamer mysink taglist, video-codec=(string)"H.264\ \(Main\ Profile\)";
[gstreamer] gstDecoder recieve caps:  video/x-raw, format=(string)NV12, width=(int)1280, height=(int)720, interlace-mode=(string)progressive, multiview-mode=(string)mono, multiview-flags=(GstVideoMultiviewFlagsSet)0:ffffffff:/right-view-first/left-flipped/left-flopped/right-flipped/right-flopped/half-aspect/mixed-mono, pixel-aspect-ratio=(fraction)1/1, chroma-site=(string)mpeg2, colorimetry=(string)bt709, framerate=(fraction)30/1
[gstreamer] gstDecoder -- recieved first frame, codec=h264 format=nv12 width=1280 height=720 size=1382400
[gstreamer] gstDecoder -- recieved 1280x720 frame (1382400 bytes)
[cuda]   cudaAllocMapped 1382400 bytes, CPU 0x1010b1000 GPU 0x1010b1000
[cuda]   cudaAllocMapped 1382400 bytes, CPU 0x101203000 GPU 0x101203000
[cuda]   cudaAllocMapped 1382400 bytes, CPU 0x101355000 GPU 0x101355000
[cuda]   cudaAllocMapped 1382400 bytes, CPU 0x1014a7000 GPU 0x1014a7000
RingBuffer -- allocated 4 buffers (1382400 bytes each, 5529600 bytes total)
[gstreamer] gstreamer changed state from READY to PAUSED ==> mysink
[gstreamer] gstreamer changed state from READY to PAUSED ==> pipeline0
[gstreamer] gstreamer message async-done ==> pipeline0
[gstreamer] gstreamer message new-clock ==> pipeline0
[gstreamer] gstreamer changed state from PAUSED to PLAYING ==> mysink
[gstreamer] gstreamer changed state from PAUSED to PLAYING ==> capsfilter1
[gstreamer] gstreamer changed state from PAUSED to PLAYING ==> omxh264dec-omxh264dec0
[gstreamer] gstreamer changed state from PAUSED to PLAYING ==> h264parse1
[gstreamer] gstreamer changed state from PAUSED to PLAYING ==> queue0
[gstreamer] gstreamer changed state from PAUSED to PLAYING ==> qtdemux1
[gstreamer] gstreamer changed state from PAUSED to PLAYING ==> filesrc0
[gstreamer] gstreamer changed state from PAUSED to PLAYING ==> pipeline0
[gstreamer] gstDecoder -- recieved 1280x720 frame (1382400 bytes)
[cuda]   cudaAllocMapped 2764800 bytes, CPU 0x1015f9000 GPU 0x1015f9000
[cuda]   cudaAllocMapped 2764800 bytes, CPU 0x10189c000 GPU 0x10189c000
[cuda]   cudaAllocMapped 2764800 bytes, CPU 0x101b3f000 GPU 0x101b3f000
[cuda]   cudaAllocMapped 2764800 bytes, CPU 0x101de2000 GPU 0x101de2000
RingBuffer -- allocated 4 buffers (2764800 bytes each, 11059200 bytes total)
[gstreamer] gstDecoder -- recieved 1280x720 frame (1382400 bytes)
[gstreamer] gstDecoder -- recieved 1280x720 frame (1382400 bytes)
[gstreamer] gstDecoder -- recieved 1280x720 frame (1382400 bytes)
[gstreamer] gstDecoder -- recieved 1280x720 frame (1382400 bytes)
[gstreamer] gstDecoder -- recieved 1280x720 frame (1382400 bytes)
[gstreamer] gstDecoder -- recieved 1280x720 frame (1382400 bytes)
[gstreamer] gstDecoder -- recieved 1280x720 frame (1382400 bytes)
[gstreamer] gstDecoder -- recieved 1280x720 frame (1382400 bytes)
[gstreamer] gstreamer mysink taglist, video-codec=(string)"H.264\ \(Main\ Profile\)", minimum-bitrate=(uint)3441840, maximum-bitrate=(uint)3441840, bitrate=(uint)2974128;
[gstreamer] gstreamer mysink taglist, video-codec=(string)"H.264\ \(Main\ Profile\)", minimum-bitrate=(uint)1334400, maximum-bitrate=(uint)3441840, bitrate=(uint)2825061;
[gstreamer] gstDecoder -- recieved 1280x720 frame (1382400 bytes)
[gstreamer] gstreamer mysink taglist, video-codec=(string)"H.264\ \(Main\ Profile\)", minimum-bitrate=(uint)1267920, maximum-bitrate=(uint)3441840, bitrate=(uint)2695300;
[gstreamer] gstDecoder -- recieved 1280x720 frame (1382400 bytes)
[gstreamer] gstDecoder -- recieved 1280x720 frame (1382400 bytes)
[gstreamer] gstreamer mysink taglist, video-codec=(string)"H.264\ \(Main\ Profile\)", minimum-bitrate=(uint)1267920, maximum-bitrate=(uint)3764880, bitrate=(uint)2777575;
[gstreamer] gstreamer mysink taglist, video-codec=(string)"H.264\ \(Main\ Profile\)", minimum-bitrate=(uint)1267920, maximum-bitrate=(uint)3764880, bitrate=(uint)2672468;
[gstreamer] gstDecoder -- recieved 1280x720 frame (1382400 bytes)
[gstreamer] gstreamer mysink taglist, video-codec=(string)"H.264\ \(Main\ Profile\)", minimum-bitrate=(uint)1240560, maximum-bitrate=(uint)3764880, bitrate=(uint)2577008;
[gstreamer] gstDecoder -- recieved 1280x720 frame (1382400 bytes)
[gstreamer] gstDecoder -- recieved 1280x720 frame (1382400 bytes)
[gstreamer] gstreamer mysink taglist, video-codec=(string)"H.264\ \(Main\ Profile\)", minimum-bitrate=(uint)1240560, maximum-bitrate=(uint)3764880, bitrate=(uint)2632665;
[gstreamer] gstreamer mysink taglist, video-codec=(string)"H.264\ \(Main\ Profile\)", minimum-bitrate=(uint)1240560, maximum-bitrate=(uint)3764880, bitrate=(uint)2555802;
[gstreamer] gstDecoder -- recieved 1280x720 frame (1382400 bytes)
[gstreamer] gstreamer mysink taglist, video-codec=(string)"H.264\ \(Main\ Profile\)", minimum-bitrate=(uint)1240560, maximum-bitrate=(uint)3764880, bitrate=(uint)2484400;
[gstreamer] gstDecoder -- recieved 1280x720 frame (1382400 bytes)
[gstreamer] gstDecoder -- recieved 1280x720 frame (1382400 bytes)
[gstreamer] gstDecoder -- recieved 1280x720 frame (1382400 bytes)
[gstreamer] gstreamer mysink taglist, video-codec=(string)"H.264\ \(Main\ Profile\)", minimum-bitrate=(uint)1240560, maximum-bitrate=(uint)3764880, bitrate=(uint)2426640;
[gstreamer] gstDecoder -- recieved 1280x720 frame (1382400 bytes)
[gstreamer] gstDecoder -- recieved 1280x720 frame (1382400 bytes)
[gstreamer] gstreamer mysink taglist, video-codec=(string)"H.264\ \(Main\ Profile\)", minimum-bitrate=(uint)1240560, maximum-bitrate=(uint)3843120, bitrate=(uint)2491025;
[gstreamer] gstreamer mysink taglist, video-codec=(string)"H.264\ \(Main\ Profile\)", minimum-bitrate=(uint)1240560, maximum-bitrate=(uint)3843120, bitrate=(uint)2438786;
[gstreamer] gstDecoder -- recieved 1280x720 frame (1382400 bytes)
[gstreamer] gstDecoder -- recieved 1280x720 frame (1382400 bytes)
[gstreamer] gstDecoder -- recieved 1280x720 frame (1382400 bytes)
[gstreamer] gstDecoder -- recieved 1280x720 frame (1382400 bytes)
[gstreamer] gstDecoder -- recieved 1280x720 frame (1382400 bytes)
[gstreamer] gstDecoder -- recieved 1280x720 frame (1382400 bytes)
[gstreamer] gstDecoder -- recieved 1280x720 frame (1382400 bytes)
[gstreamer] gstreamer mysink taglist, video-codec=(string)"H.264\ \(Main\ Profile\)", minimum-bitrate=(uint)1240560, maximum-bitrate=(uint)3843600, bitrate=(uint)2456482;
[gstreamer] gstreamer mysink taglist, video-codec=(string)"H.264\ \(Main\ Profile\)", minimum-bitrate=(uint)1201680, maximum-bitrate=(uint)3843600, bitrate=(uint)2414656;
[TRT]    ../rtSafe/cuda/cutensorReformat.cpp (352) - cuTensor Error in executeCutensor: 7 (Internal cuTensor reformat failed)
[TRT]    FAILED_EXECUTION: std::exception
[TRT]    failed to execute TensorRT context on device GPU

[TRT]    ------------------------------------------------
[TRT]    Timing Report /home/jetson/detect/mb2-ssd-lite.onnx
[TRT]    ------------------------------------------------
[TRT]    Pre-Process   CPU   0.06526ms  CUDA   3.22250ms
[cuda]      invalid resource handle (error 400) (hex 0x190)
[cuda]      home/jetson/detect/jetson-inference/build/aarch64/include/jetson-inference/tensorNet.h:685
[TRT]    Network       CPU   0.00000ms  CUDA   0.00000ms
[TRT]    Total         CPU   0.06526ms  CUDA   3.22250ms
[TRT]    ------------------------------------------------

[TRT]    note -- when processing a single image, run 'sudo jetson_clocks' before
                to disable DVFS for more accurate profiling/timing measurements

[TRT]    ../rtSafe/cuda/cutensorReformat.cpp (352) - cuTensor Error in executeCutensor: 7 (Internal cuTensor reformat failed)
[TRT]    FAILED_EXECUTION: std::exception
[TRT]    failed to execute TensorRT context on device GPU

[TRT]    ------------------------------------------------
[TRT]    Timing Report /home/jetson/detect/mb2-ssd-lite.onnx
[TRT]    ------------------------------------------------
[TRT]    Pre-Process   CPU   0.06438ms  CUDA   1.32266ms
[cuda]      invalid resource handle (error 400) (hex 0x190)
[cuda]      home/jetson/detect/jetson-inference/build/aarch64/include/jetson-inference/tensorNet.h:685
[TRT]    Network       CPU   0.00000ms  CUDA   0.00000ms
[TRT]    Total         CPU   0.06438ms  CUDA   1.32266ms
[TRT]    ------------------------------------------------

[gstreamer] gstDecoder -- recieved 1280x720 frame (1382400 bytes)
[TRT]    ../rtSafe/cuda/cutensorReformat.cpp (352) - cuTensor Error in executeCutensor: 7 (Internal cuTensor reformat failed)
[TRT]    FAILED_EXECUTION: std::exception
[TRT]    failed to execute TensorRT context on device GPU

And here is the log from the Python detectnet.py script run with the same arguments:

...
[gstreamer] gstDecoder -- recieved 1280x720 frame (1382400 bytes)
[gstreamer] gstDecoder -- recieved 1280x720 frame (1382400 bytes)
[gstreamer] gstDecoder -- recieved 1280x720 frame (1382400 bytes)
[gstreamer] gstDecoder -- recieved 1280x720 frame (1382400 bytes)
[gstreamer] gstDecoder -- recieved 1280x720 frame (1382400 bytes)
CUTENSOR ERROR: some argument is NULL.
[TRT]    ../rtSafe/cuda/cutensorReformat.cpp (352) - cuTensor Error in executeCutensor: 7 (Internal cuTensor reformat failed)
[TRT]    FAILED_EXECUTION: std::exception
[TRT]    failed to execute TensorRT context on device GPU
[gstreamer] gstDecoder -- recieved 1280x720 frame (1382400 bytes)
[gstreamer] gstreamer mysink taglist, video-codec=(string)"H.264\ \(Main\ Profile\)", minimum-bitrate=(uint)1240560, maximum-bitrate=(uint)3843600, bitrate=(uint)2456482;
[gstreamer] gstreamer mysink taglist, video-codec=(string)"H.264\ \(Main\ Profile\)", minimum-bitrate=(uint)1201680, maximum-bitrate=(uint)3843600, bitrate=(uint)2414656;
[gstreamer] gstDecoder -- recieved 1280x720 frame (1382400 bytes)
[gstreamer] gstDecoder -- recieved 1280x720 frame (1382400 bytes)
[gstreamer] gstDecoder -- recieved 1280x720 frame (1382400 bytes)
[gstreamer] gstDecoder -- recieved 1280x720 frame (1382400 bytes)
[gstreamer] gstDecoder -- recieved 1280x720 frame (1382400 bytes)
[gstreamer] gstDecoder -- recieved 1280x720 frame (1382400 bytes)
Traceback (most recent call last):
  File "detectnet.py", line 63, in <module>
[gstreamer] gstreamer mysink taglist, video-codec=(string)"H.264\ \(Main\ Profile\)", minimum-bitrate=(uint)1201680, maximum-bitrate=(uint)3843600, bitrate=(uint)2365026;
    detections = net.Detect(img, overlay=opt.overlay)
Exception: jetson.inference -- detectNet.Detect() encountered an error classifying the image
PyTensorNet_Dealloc()
jetson.utils -- PyVideoSource_Dealloc()
[gstreamer] gstDecoder -- stopping pipeline, transitioning to GST_STATE_NULL
[gstreamer] gstDecoder -- onPreroll()
[gstreamer] gstreamer changed state from PLAYING to PAUSED ==> capsfilter1
[gstreamer] gstreamer changed state from PLAYING to PAUSED ==> omxh264dec-omxh264dec0
[gstreamer] gstreamer changed state from PLAYING to PAUSED ==> h264parse1
[gstreamer] gstreamer changed state from PLAYING to PAUSED ==> queue0
[gstreamer] gstreamer changed state from PLAYING to PAUSED ==> qtdemux1
[gstreamer] gstreamer changed state from PLAYING to PAUSED ==> filesrc0
[gstreamer] gstreamer changed state from PLAYING to PAUSED ==> pipeline0
[gstreamer] gstDecoder -- pipeline stopped
jetson.utils -- PyVideoOutput_Dealloc()
jetson.utils -- PyCudaMemory_Dealloc()

I also attached the log file and model file for your reference.
detectNet-Log.txt

**Update:** I ran the command /usr/src/tensorrt/bin/trtexec --onnx=/home/jetson/detect/mb2-ssd-lite.onnx --verbose and its output reported PASS.

mb2-ssd-lite.zip


[11/16/2020-13:15:29] [I] Starting inference threads
[11/16/2020-13:15:32] [I] Warmup completed 5 queries over 200 ms
[11/16/2020-13:15:32] [I] Timing trace has 123 queries over 3.07144 s
[11/16/2020-13:15:32] [I] Trace averages of 10 runs:
[11/16/2020-13:15:32] [I] Average on 10 runs - GPU latency: 28.9624 ms - Host latency: 29.1034 ms (end to end 29.1151 ms, enqueue 7.66055 ms)
[11/16/2020-13:15:32] [I] Average on 10 runs - GPU latency: 24.3881 ms - Host latency: 24.5172 ms (end to end 24.5276 ms, enqueue 10.1836 ms)
[11/16/2020-13:15:32] [I] Average on 10 runs - GPU latency: 24.4319 ms - Host latency: 24.5609 ms (end to end 24.5718 ms, enqueue 6.74628 ms)
[11/16/2020-13:15:32] [I] Average on 10 runs - GPU latency: 24.4466 ms - Host latency: 24.5759 ms (end to end 24.5864 ms, enqueue 8.47657 ms)
[11/16/2020-13:15:32] [I] Average on 10 runs - GPU latency: 24.4543 ms - Host latency: 24.5837 ms (end to end 24.5945 ms, enqueue 6.29825 ms)
[11/16/2020-13:15:32] [I] Average on 10 runs - GPU latency: 24.4601 ms - Host latency: 24.5901 ms (end to end 24.6006 ms, enqueue 6.54089 ms)
[11/16/2020-13:15:32] [I] Average on 10 runs - GPU latency: 24.4666 ms - Host latency: 24.5964 ms (end to end 24.6071 ms, enqueue 7.14116 ms)
[11/16/2020-13:15:32] [I] Average on 10 runs - GPU latency: 24.4557 ms - Host latency: 24.5844 ms (end to end 24.5947 ms, enqueue 6.34243 ms)
[11/16/2020-13:15:32] [I] Average on 10 runs - GPU latency: 24.4625 ms - Host latency: 24.5919 ms (end to end 24.6027 ms, enqueue 6.3897 ms)
[11/16/2020-13:15:32] [I] Average on 10 runs - GPU latency: 24.573 ms - Host latency: 24.707 ms (end to end 24.7175 ms, enqueue 6.37815 ms)
[11/16/2020-13:15:32] [I] Average on 10 runs - GPU latency: 24.552 ms - Host latency: 24.6826 ms (end to end 24.6932 ms, enqueue 6.35994 ms)
[11/16/2020-13:15:32] [I] Average on 10 runs - GPU latency: 24.4029 ms - Host latency: 24.5328 ms (end to end 24.5435 ms, enqueue 10.0791 ms)
[11/16/2020-13:15:32] [I] Host Latency
[11/16/2020-13:15:32] [I] min: 24.4467 ms (end to end 24.4572 ms)
[11/16/2020-13:15:32] [I] max: 34.8445 ms (end to end 34.8571 ms)
[11/16/2020-13:15:32] [I] mean: 24.9597 ms (end to end 24.9704 ms)
[11/16/2020-13:15:32] [I] median: 24.5864 ms (end to end 24.5973 ms)
[11/16/2020-13:15:32] [I] percentile: 34.8171 ms at 99% (end to end 34.8304 ms at 99%)
[11/16/2020-13:15:32] [I] throughput: 40.0464 qps
[11/16/2020-13:15:32] [I] walltime: 3.07144 s
[11/16/2020-13:15:32] [I] Enqueue Time
[11/16/2020-13:15:32] [I] min: 4.98193 ms
[11/16/2020-13:15:32] [I] max: 13.2275 ms
[11/16/2020-13:15:32] [I] median: 7.02148 ms
[11/16/2020-13:15:32] [I] GPU Compute
[11/16/2020-13:15:32] [I] min: 24.3189 ms
[11/16/2020-13:15:32] [I] max: 34.6916 ms
[11/16/2020-13:15:32] [I] mean: 24.8289 ms
[11/16/2020-13:15:32] [I] median: 24.4583 ms
[11/16/2020-13:15:32] [I] percentile: 34.6632 ms at 99%
[11/16/2020-13:15:32] [I] total compute time: 3.05395 s
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=/home/jetson/detect/mb2-ssd-lite.onnx --verbose

How can I solve this kind of error? I have searched but haven't found any solution yet.
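As background on what Detect() should be returning here: the bindings in the log report 'scores' as 1x3000x8 (per-anchor class confidences) and 'boxes' as 1x3000x4 (per-anchor box coordinates). A toy, pure-Python sketch of how such raw SSD outputs map to detections (detectNet does this on the GPU, plus overlap clustering; the data below is made up, and treating class 0 as background is the usual SSD convention):

```python
# Illustrative only: interpret raw SSD outputs of shape (anchors, classes)
# and (anchors, 4) as thresholded detections. Toy data, not the real model.

def top_detections(scores, boxes, threshold=0.5):
    """Keep anchors whose best non-background class confidence passes threshold."""
    detections = []
    for anchor_scores, box in zip(scores, boxes):
        # class 0 is conventionally background in SSD, so skip it
        best_class = max(range(1, len(anchor_scores)), key=lambda c: anchor_scores[c])
        confidence = anchor_scores[best_class]
        if confidence >= threshold:
            detections.append((best_class, confidence, box))
    return detections

# toy data: 3 anchors, 8 classes each (class 0 = background)
scores = [
    [0.90, 0.02, 0.01, 0.00, 0.00, 0.05, 0.01, 0.01],  # background wins -> dropped
    [0.10, 0.70, 0.05, 0.05, 0.02, 0.03, 0.03, 0.02],  # class 1 @ 0.70 -> kept
    [0.20, 0.10, 0.60, 0.02, 0.02, 0.02, 0.02, 0.02],  # class 2 @ 0.60 -> kept
]
boxes = [(0, 0, 10, 10), (5, 5, 50, 50), (20, 20, 80, 80)]

print(top_detections(scores, boxes))
# -> [(1, 0.7, (5, 5, 50, 50)), (2, 0.6, (20, 20, 80, 80))]
```

In the failing runs above, execution never reaches this stage: the cuTensor reformat error aborts the TensorRT context before any scores/boxes are produced.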

Hi,

Could you first run the application with cuda-memcheck to check for any memory-related issues?

$ sudo /usr/local/cuda-10.2/bin/cuda-memcheck ./detectnet --model=/home/jetson/detect/mb2-ssd-lite.onnx  ...

Thanks.

(Edited) Thank you for your reply. I ran your command and it shows:

$ sudo /usr/local/cuda/bin/cuda-memcheck ./detectnet --model=/home/jetson/detect/mb2-ssd-lite.onnx --input-blob=input_0 --output-blob=boxes --output-cvg=scores --input_URI /home/jetson/detect/jetson-inference/build/aarch64/bin/images/city_1.jpg --output_URI /home/jetson/detect/jetson-inference/build/aarch64/bin/result.jpg --debug
========= CUDA-MEMCHECK
[image] imageLoader -- found file /home/jetson/detect/jetson-inference/build/aarch64/bin/images/city_1.jpg
[video]  created imageLoader from file:///home/jetson/detect/jetson-inference/build/aarch64/bin/images/city_1.jpg
------------------------------------------------
imageLoader video options:
------------------------------------------------
  -- URI: file:///home/jetson/detect/jetson-inference/build/aarch64/bin/images/city_1.jpg
     - protocol:  file
     - location:  /home/jetson/detect/jetson-inference/build/aarch64/bin/images/city_1.jpg
     - extension: jpg
  -- deviceType: file
  -- ioType:     input
  -- codec:      unknown
  -- width:      0
  -- height:     0
  -- frameRate:  0.000000
  -- bitRate:    0
  -- numBuffers: 4
  -- zeroCopy:   true
  -- flipMethod: none
  -- loop:       0
------------------------------------------------
[video]  created imageWriter from file:///home/jetson/detect/jetson-inference/build/aarch64/bin/result.jpg
------------------------------------------------
imageWriter video options:
------------------------------------------------
  -- URI: file:///home/jetson/detect/jetson-inference/build/aarch64/bin/result.jpg
     - protocol:  file
     - location:  /home/jetson/detect/jetson-inference/build/aarch64/bin/result.jpg
     - extension: jpg
  -- deviceType: file
  -- ioType:     output
  -- codec:      unknown
  -- width:      0
  -- height:     0
  -- frameRate:  0.000000
  -- bitRate:    0
  -- numBuffers: 4
  -- zeroCopy:   true
  -- flipMethod: none
  -- loop:       0
------------------------------------------------
[OpenGL] glDisplay -- X screen 0 resolution:  1920x1080
[OpenGL] glDisplay -- X window resolution:    1920x1080
[OpenGL] glDisplay -- display device initialized (1920x1080)
[video]  created glDisplay from display://0
------------------------------------------------
glDisplay video options:
------------------------------------------------
  -- URI: display://0
     - protocol:  display
     - location:  0
  -- deviceType: display
  -- ioType:     output
  -- codec:      raw
  -- width:      1920
  -- height:     1080
  -- frameRate:  0.000000
  -- bitRate:    0
  -- numBuffers: 4
  -- zeroCopy:   true
  -- flipMethod: none
  -- loop:       0
------------------------------------------------

detectNet -- loading detection network model from:
          -- prototxt     NULL
          -- model        /home/jetson/detect/mb2-ssd-lite.onnx
          -- input_blob   'input_0'
          -- output_cvg   'NULL'
          -- output_bbox  'boxes'
          -- mean_pixel   0.000000
          -- mean_binary  NULL
          -- class_labels NULL
          -- threshold    0.500000
          -- batch_size   1

[TRT]    TensorRT version 7.1.3
[TRT]    loading NVIDIA plugins...
[TRT]    Registered plugin creator - ::GridAnchor_TRT version 1
[TRT]    Registered plugin creator - ::NMS_TRT version 1
[TRT]    Registered plugin creator - ::Reorg_TRT version 1
[TRT]    Registered plugin creator - ::Region_TRT version 1
[TRT]    Registered plugin creator - ::Clip_TRT version 1
[TRT]    Registered plugin creator - ::LReLU_TRT version 1
[TRT]    Registered plugin creator - ::PriorBox_TRT version 1
[TRT]    Registered plugin creator - ::Normalize_TRT version 1
[TRT]    Registered plugin creator - ::RPROI_TRT version 1
[TRT]    Registered plugin creator - ::BatchedNMS_TRT version 1
[TRT]    Could not register plugin creator -  ::FlattenConcat_TRT version 1
[TRT]    Registered plugin creator - ::CropAndResize version 1
[TRT]    Registered plugin creator - ::DetectionLayer_TRT version 1
[TRT]    Registered plugin creator - ::Proposal version 1
[TRT]    Registered plugin creator - ::ProposalLayer_TRT version 1
[TRT]    Registered plugin creator - ::PyramidROIAlign_TRT version 1
[TRT]    Registered plugin creator - ::ResizeNearest_TRT version 1
[TRT]    Registered plugin creator - ::Split version 1
[TRT]    Registered plugin creator - ::SpecialSlice_TRT version 1
[TRT]    Registered plugin creator - ::InstanceNormalization_TRT version 1
[TRT]    detected model format - ONNX  (extension '.onnx')
[TRT]    desired precision specified for GPU: FASTEST
[TRT]    requested fasted precision for device GPU without providing valid calibrator, disabling INT8
[TRT]    native precisions detected for GPU:  FP32, FP16
[TRT]    selecting fastest native precision for GPU:  FP16
[TRT]    attempting to open engine cache file /home/jetson/detect/mb2-ssd-lite.onnx.1.1.7103.GPU.FP16.engine
[TRT]    loading network plan from engine cache... /home/jetson/detect/mb2-ssd-lite.onnx.1.1.7103.GPU.FP16.engine
[TRT]    device GPU, loaded /home/jetson/detect/mb2-ssd-lite.onnx
[TRT]    Deserialize required 20281293 microseconds.
[TRT]    
[TRT]    CUDA engine context initialized on device GPU:
[TRT]       -- layers       144
[TRT]       -- maxBatchSize 1
[TRT]       -- workspace    0
[TRT]       -- deviceMemory 20514816
[TRT]       -- bindings     3
[TRT]       binding 0
                -- index   0
                -- name    'input_0'
                -- type    FP32
                -- in/out  INPUT
                -- # dims  4
                -- dim #0  1 (SPATIAL)
                -- dim #1  3 (SPATIAL)
                -- dim #2  300 (SPATIAL)
                -- dim #3  300 (SPATIAL)
[TRT]       binding 1
                -- index   1
                -- name    'scores'
                -- type    FP32
                -- in/out  OUTPUT
                -- # dims  3
                -- dim #0  1 (SPATIAL)
                -- dim #1  3000 (SPATIAL)
                -- dim #2  8 (SPATIAL)
[TRT]       binding 2
                -- index   2
                -- name    'boxes'
                -- type    FP32
                -- in/out  OUTPUT
                -- # dims  3
                -- dim #0  1 (SPATIAL)
                -- dim #1  3000 (SPATIAL)
                -- dim #2  4 (SPATIAL)
[TRT]    
[TRT]    binding to input 0 input_0  binding index:  0
[TRT]    binding to input 0 input_0  dims (b=1 c=3 h=300 w=300) size=1080000
[cuda]   cudaAllocMapped 1080000 bytes, CPU 0x100e60000 GPU 0x100e60000
[TRT]    binding to output 0 boxes  binding index:  2
[TRT]    binding to output 0 boxes  dims (b=1 c=3000 h=4 w=1) size=48000
[cuda]   cudaAllocMapped 48000 bytes, CPU 0x100d60200 GPU 0x100d60200
[TRT]    
[TRT]    device GPU, /home/jetson/detect/mb2-ssd-lite.onnx initialized.
[TRT]    detectNet -- number object classes:  4
[TRT]    detectNet -- maximum bounding boxes:  3000
[cuda]   cudaAllocMapped 1344000 bytes, CPU 0x100f68000 GPU 0x100f68000
[cuda]   cudaAllocMapped 64 bytes, CPU 0x100d6be00 GPU 0x100d6be00
[image] loaded '/home/jetson/detect/jetson-inference/build/aarch64/bin/images/city_1.jpg'  (1024x512, 3 channels)
[cuda]   cudaAllocMapped 1572864 bytes, CPU 0x1010c0000 GPU 0x1010c0000
CUTENSOR ERROR: some argument is NULL.
[TRT]    ../rtSafe/cuda/cutensorReformat.cpp (352) - cuTensor Error in executeCutensor: 7 (Internal cuTensor reformat failed)
[TRT]    FAILED_EXECUTION: std::exception
[TRT]    failed to execute TensorRT context on device GPU
[OpenGL] glDisplay -- set the window size to 1024x512
[OpenGL] creating 1024x512 texture (GL_RGB8 format, 1572864 bytes)
[cuda]   registered openGL texture for interop access (1024x512, GL_RGB8, 1572864 bytes)
[image] saved '/home/jetson/detect/jetson-inference/build/aarch64/bin/result.jpg'  (1024x512, 3 channels)
[cuda]      invalid resource handle (error 400) (hex 0x190)
[cuda]      /home/jetson/detect/jetson-inference/build/aarch64/include/jetson-inference/tensorNet.h:685
========= Program hit cudaErrorInvalidResourceHandle (error 400) due to "invalid resource handle" on CUDA API call to cudaEventElapsedTime. 
=========     Saved host backtrace up to driver entry point at error

[TRT]    ------------------------------------------------
[TRT]    Timing Report /home/jetson/detect/mb2-ssd-lite.onnx
=========     Host Frame:/usr/lib/aarch64-linux-gnu/libcuda.so [0x2fd95c]
[TRT]    ------------------------------------------------
=========     Host Frame:./detectnet [0x39614]
=========
[TRT]    Pre-Process   CPU   2.69094ms  CUDA 130.55162ms
[TRT]    Network       CPU   0.00000ms  CUDA   0.00000ms
[TRT]    Total         CPU   2.69094ms  CUDA 130.55162ms
[TRT]    ------------------------------------------------

[TRT]    note -- when processing a single image, run 'sudo jetson_clocks' before
                to disable DVFS for more accurate profiling/timing measurements

[image] imageLoader -- End of Stream (EOS) has been reached, stream has been closed
detectnet:  shutting down...
detectnet:  shutdown complete.
========= ERROR SUMMARY: 1 error

Note that I can run the default model ssd-mobilenet-v2.

It looks like this is where the error comes from.

Have you tried your model with trtexec yet?
If not, could you give it a try?

/usr/src/tensorrt/bin/trtexec --onnx=/home/jetson/detect/mb2-ssd-lite.onnx --verbose

It would be good to first figure out whether the error comes from TensorRT or from jetson-inference.
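Also note that trtexec defaults to FP32, while the log above shows jetson-inference selected FP16. It may be worth repeating the check in half precision as well (a sketch using the same paths as your command):

```shell
# Same model path as above; --fp16 makes trtexec build and time the
# engine in half precision, matching the mode jetson-inference selected.
/usr/src/tensorrt/bin/trtexec --onnx=/home/jetson/detect/mb2-ssd-lite.onnx --fp16 --verbose
```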

Thanks.

@AastaLLL I tried trtexec and got PASS for both models.

[11/17/2020-11:52:56] [I] Trace averages of 10 runs:
[11/17/2020-11:52:56] [I] Average on 10 runs - GPU latency: 24.367 ms - Host latency: 24.4956 ms (end to end 24.5063 ms, enqueue 10.3746 ms)
[11/17/2020-11:52:56] [I] Average on 10 runs - GPU latency: 24.4185 ms - Host latency: 24.5471 ms (end to end 24.558 ms, enqueue 8.26195 ms)
[11/17/2020-11:52:56] [I] Average on 10 runs - GPU latency: 24.4109 ms - Host latency: 24.5388 ms (end to end 24.5494 ms, enqueue 6.69408 ms)
[11/17/2020-11:52:56] [I] Average on 10 runs - GPU latency: 24.4242 ms - Host latency: 24.553 ms (end to end 24.5637 ms, enqueue 6.6713 ms)
[11/17/2020-11:52:56] [I] Average on 10 runs - GPU latency: 24.3894 ms - Host latency: 24.5175 ms (end to end 24.5284 ms, enqueue 9.44207 ms)
[11/17/2020-11:52:56] [I] Average on 10 runs - GPU latency: 24.3895 ms - Host latency: 24.518 ms (end to end 24.5288 ms, enqueue 8.86799 ms)
[11/17/2020-11:52:56] [I] Average on 10 runs - GPU latency: 24.4277 ms - Host latency: 24.5559 ms (end to end 24.5664 ms, enqueue 6.68746 ms)
[11/17/2020-11:52:56] [I] Average on 10 runs - GPU latency: 24.4214 ms - Host latency: 24.5495 ms (end to end 24.5602 ms, enqueue 7.96597 ms)
[11/17/2020-11:52:56] [I] Average on 10 runs - GPU latency: 24.4584 ms - Host latency: 24.5881 ms (end to end 24.5989 ms, enqueue 7.80803 ms)
[11/17/2020-11:52:56] [I] Average on 10 runs - GPU latency: 24.5636 ms - Host latency: 24.6951 ms (end to end 24.7057 ms, enqueue 6.26746 ms)
[11/17/2020-11:52:56] [I] Average on 10 runs - GPU latency: 24.479 ms - Host latency: 24.6078 ms (end to end 24.6186 ms, enqueue 6.23298 ms)
[11/17/2020-11:52:56] [I] Average on 10 runs - GPU latency: 24.4273 ms - Host latency: 24.5558 ms (end to end 24.5666 ms, enqueue 6.75588 ms)
[11/17/2020-11:52:56] [I] Host Latency
[11/17/2020-11:52:56] [I] min: 24.4456 ms (end to end 24.4566 ms)
[11/17/2020-11:52:56] [I] max: 24.7849 ms (end to end 24.7952 ms)
[11/17/2020-11:52:56] [I] mean: 24.5608 ms (end to end 24.5715 ms)
[11/17/2020-11:52:56] [I] median: 24.5538 ms (end to end 24.5646 ms)
[11/17/2020-11:52:56] [I] percentile: 24.7512 ms at 99% (end to end 24.7625 ms at 99%)
[11/17/2020-11:52:56] [I] throughput: 40.6964 qps
[11/17/2020-11:52:56] [I] walltime: 3.04695 s
[11/17/2020-11:52:56] [I] Enqueue Time
[11/17/2020-11:52:56] [I] min: 4.96204 ms
[11/17/2020-11:52:56] [I] max: 12.6416 ms
[11/17/2020-11:52:56] [I] median: 7.22003 ms
[11/17/2020-11:52:56] [I] GPU Compute
[11/17/2020-11:52:56] [I] min: 24.3177 ms
[11/17/2020-11:52:56] [I] max: 24.6479 ms
[11/17/2020-11:52:56] [I] mean: 24.4321 ms
[11/17/2020-11:52:56] [I] median: 24.4244 ms
[11/17/2020-11:52:56] [I] percentile: 24.6204 ms at 99%
[11/17/2020-11:52:56] [I] total compute time: 3.02958 s
&&&& PASSED

Here is the model

https://drive.google.com/drive/folders/14cAzr011guUBjSm2Z-rBjPr4UbuQqaGv

I tried both ssd-mobilenet and mobilenetv2-ssd-lite, and both produce the same error.

Note: I am using JetPack 4.4 on the Jetson Nano. I can run both models (ONNX) with jetson-inference on an x86 PC.
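In case it is relevant: the failing run loaded the engine from cache (see "loading network plan from engine cache" in the log above), and my understanding is that a plan serialized under a different TensorRT version is not guaranteed to work. I can delete the cached plan to force a rebuild from the ONNX and rule that out (path taken from the log):

```shell
# Engine-cache path taken from the log above; -f makes this a no-op if
# the file is already gone. jetson-inference regenerates the plan from
# the ONNX on the next run (the rebuild takes several minutes on a Nano).
rm -f /home/jetson/detect/mb2-ssd-lite.onnx.1.1.7103.GPU.FP16.engine
```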