Extremely slow inference with MMDetection on Jetson Xavier NX

Hi everyone,

I’m currently trying to run a very basic script on my Jetson Xavier NX to do object detection on a video with MMDetection. But whatever model I test, it takes about 1 second on average to infer a single frame (0.7 s for the best one I tried), which is extremely slow and far below the inference speed advertised on the mmdet website (~50 fps).

I also tested the MMDetection demo scripts (video_demo.py and video_gpuacc.py) and tried converting my mmdet model to a TensorRT model (fp16 and int8 tested), but I still get approximately the same results.

I really don’t know what I’m missing…

Please note that I previously worked with YOLOv3 on Darknet and had no problem like this.
My code can be seen below.

Environment

Python: 3.8.10 (default, Mar 15 2022, 12:22:08) [GCC 9.4.0]
CUDA available: True
GPU 0: Xavier
CUDA_HOME: /usr/local/cuda-11.4
NVCC: Cuda compilation tools, release 11.4, V11.4.166
GCC: aarch64-linux-gnu-gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
PyTorch: 1.11.0
PyTorch compiling details: PyTorch built with:
  - GCC 9.4
  - C++ Version: 201402
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: NO AVX
  - CUDA Runtime 11.4
  - NVCC architecture flags: -gencode;arch=compute_72,code=sm_72;-gencode;arch=compute_87,code=sm_87
  - CuDNN 8.3.2
  - Build settings: BLAS_INFO=open, BUILD_TYPE=Release, CUDA_VERSION=11.4, CUDNN_VERSION=8.3.2, CXX_COMPILER=/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, FORCE_FALLBACK_CUDA_MPI=1, LAPACK_INFO=open, TORCH_VERSION=1.11.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EIGEN_FOR_BLAS=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=OFF, USE_MPI=ON, USE_NCCL=0, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 

TorchVision: 0.11.1
OpenCV: 4.5.5
MMCV: 1.5.2
MMCV Compiler: GCC 9.4
MMCV CUDA Compiler: 11.4
MMDetection: 2.25.0+ca11860

My code

from mmdet.apis import init_detector, inference_detector
import mmcv
import cv2

config_file = 'configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py'
checkpoint_file = 'checkpoints/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth'
model = init_detector(config_file, checkpoint_file, device='cuda:0')
# wrap_fp16_model(model)

def main():
	video_reader = mmcv.VideoReader("/home/thalesgroup/Thales/medias/video_sample.mp4")
	# Create the resizable display window once, not on every frame.
	cv2.namedWindow('Processed video', 0)

	for frame in mmcv.track_iter_progress(video_reader):
		# Detect objects, then draw the detections on the frame.
		result = inference_detector(model, frame)
		frame = model.show_result(frame, result)
		mmcv.imshow(frame, 'Processed video', 1)

if __name__ == '__main__':
	main()

Any help or ideas are welcome, thanks!

Hi,

Have you maximized the device performance first?

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

Thanks.

Hello! Thanks for your answer. I just tried it, but it doesn’t change anything.
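
One way to narrow this down, as a rough sketch and not a definitive benchmark: time each stage of the loop separately, so that model inference is not lumped together with drawing and display, and leave out the first few frames, which typically include CUDA/cuDNN warm-up. The helper `time_stages` and its parameters are hypothetical names introduced here for illustration. It uses only the standard library and takes the stages as plain callables.

```python
import time
from collections import defaultdict


def time_stages(frames, stages, warmup=5):
    """Run every frame through the (name, fn) stages in order and
    return the mean wall-clock seconds per stage, ignoring the first
    `warmup` frames. Hypothetical helper, not part of MMDetection."""
    totals = defaultdict(float)
    counted = 0
    for index, frame in enumerate(frames):
        value = frame
        elapsed = {}
        for name, fn in stages:
            start = time.perf_counter()
            value = fn(value)  # output of one stage feeds the next
            elapsed[name] = time.perf_counter() - start
        if index < warmup:
            continue  # warm-up frames are run but not counted
        counted += 1
        for name, seconds in elapsed.items():
            totals[name] += seconds
    if counted == 0:
        return {}
    return {name: total / counted for name, total in totals.items()}


if __name__ == "__main__":
    # Stand-in stages so the sketch runs without a GPU.
    demo = time_stages(
        range(20),
        [("inference", lambda x: x + 1), ("draw", lambda x: x * 2)],
    )
    for name, mean in demo.items():
        print(f"{name}: {mean * 1000:.3f} ms/frame")
```

With the script from the first post, the stages could be something like `("inference", lambda f: (f, inference_detector(model, f)))` followed by `("draw", lambda p: model.show_result(p[0], p[1]))`, with `video_reader` passed as `frames`. If a stage might leave work queued on the GPU, calling `torch.cuda.synchronize()` at the end of that stage keeps the timings honest.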

Hi,

Could you share how you convert the model into TensorRT?
Do you use the flow .pth -> .onnx -> .trt?
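
For reference, the last step of that flow (.onnx to .trt) is commonly done with `trtexec`, which ships with TensorRT. The following is only a sketch of how the command line could be assembled. The file names are placeholders, `build_trtexec_command` is a hypothetical helper, the default path is an assumption about where JetPack usually places the binary, and it presumes an ONNX file has already been exported from the .pth checkpoint.

```python
import os
import subprocess

# Assumed location of trtexec on a JetPack install; adjust if it differs.
TRTEXEC = "/usr/src/tensorrt/bin/trtexec"


def build_trtexec_command(onnx_path, engine_path, precision="fp16", trtexec=TRTEXEC):
    """Return the argv list for converting an ONNX file to a TensorRT engine."""
    if precision not in ("fp32", "fp16", "int8"):
        raise ValueError(f"unsupported precision: {precision}")
    command = [trtexec, f"--onnx={onnx_path}", f"--saveEngine={engine_path}"]
    if precision != "fp32":
        command.append(f"--{precision}")  # --fp16 or --int8
    return command


if __name__ == "__main__":
    # Placeholder file names, not files from this thread.
    command = build_trtexec_command("model.onnx", "model.trt", precision="fp16")
    print(" ".join(command))
    # Only run the conversion when both the tool and the input exist.
    if os.path.exists(command[0]) and os.path.exists("model.onnx"):
        subprocess.run(command, check=True)
```

`trtexec` also reports latency and throughput for the engine it builds, which gives a baseline that is independent of the Python pipeline. As far as I know, `--int8` without calibration data is fine for timing but not for judging accuracy.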

Thanks.

Hi, I’m using a project that simplifies converting an MMDet model to a TensorRT model:

I’m not an expert on the subject; I’ve only just begun, so I don’t know whether this really works.
Is there another simple way to convert a model to TensorRT (preferably int8)?
Thanks a lot for your help.

Edit:
I tried Detectron2 models and got similar results (~1 FPS), far below what is expected. I checked GPU usage with tegrastats, and the GPU does seem to be in use (it peaks at 100% every second).
Thanks to anyone who takes the time to help me.

Hi,

We also have a sample for running Detectron2 inference with TensorRT.
Would you mind giving it a try to see if the performance improves?

Thanks.