OpenCV CUDA Functions Get Slow When MxNet Library Is Linked

Hello everyone,

We have a project that detects faces and recognizes them. The project is written in C++. We use Caffe for detecting faces and MxNet InsightFace for recognizing them. When we comment out all the recognition code, the CUDA functions in the detection algorithm run on the GPU. However, if we include the recognition code and link the MxNet library, the MxNet functions run on the GPU but the CUDA functions in the detection algorithm do not.

We inferred this from the CPU and GPU usage:
Without Recognition → All CPU cores are at about 40%, GPU is at 17%-27% (which suggests the CUDA functions run on the GPU, not on the CPU)
With Recognition → All CPU cores are at about 100%, GPU is at 0%-15% (which suggests the MxNet functions run on the GPU and the CUDA functions run on the CPU)

We also measured the processing time of the detection and recognition algorithms:
Without Recognition → The detection algorithm takes about 60ms (which is expected)
With Recognition → The detection algorithm takes about 1000ms (which is unusual) and the recognition algorithm takes about 15ms (which is expected)

We couldn’t figure out why this happens. Even if you don’t have a solution, any theories or background knowledge may save us time. Please share them with us!

Environment Information
Jetson AGX Xavier → JetPack 4.5.1
CUDA 10.2
cuDNN 8.0.0
MxNet 1.8.x
OpenCV 3.4.5


Could you first validate whether the MXNet recognition runs on the CPU or the GPU?

Based on your profiling result:
In the first experiment, the CUDA functions take about 60ms * 27% = ~16ms of GPU time to finish.
When integrated with recognition, the GPU usage drops to 0%-15%. The same ~16ms of GPU work spread over ~1000ms is about 1.6%, which falls within that range.

This suggests that MXNet is mainly working on the CPU,
and that the average GPU usage drops because of the long CPU working time.



There are 4 scenarios.

  • Neither works on the GPU → then why is there any GPU activity? Everything should be on the CPU.
  • CUDA functions work on the GPU, MxNet works on the CPU → then the CUDA functions should behave normally (all of them would execute in ~60ms), which is not what we see.
  • Both work on the GPU → then why are all CPU cores at 100%? All the processing load should be on the GPU.
  • CUDA functions work on the CPU, MxNet works on the GPU → which I think is the real state, because the detection algorithm is full of CUDA functions and it takes ~1000ms instead of the usual ~60ms.

In addition, I forgot to mention that we ran a third experiment. We linked MxNet and uncommented the recognition algorithm, exactly as in the second experiment I described. The difference is that we disabled recognition with global variables: the program includes both detection and recognition, but the recognition part is never executed. Even in this situation, the CUDA functions take ~1000ms to execute.

We suspect that linking the MxNet library somehow blocks CUDA, changes some GPU configuration, or allocates all the GPU memory, and that this causes the CUDA functions to run on the CPU. These are all just theories, but if you have any ideas, we would be pleased to hear them.

Thank you so much!


It sounds related to the implementation of MXNet/OpenCV.
Have you checked with the MXNet or OpenCV team about this before?

Also, have you tried the same pipeline on a desktop GPU?
Would you mind doing so and sharing whether the same issue occurs?