Face detection with detectMultiScale does not work when GPU is enabled

RajeshMuthuswamy · October 22, 2019, 4:33am

Installed Packages

Operating System → Ubuntu 18.04

NVIDIA Driver → Tesla Driver for Ubuntu 18.04 Version: 418.87.00

CUDA → CUDA Toolkit 10.1

CuDNN → CuDNN 7.6

CMake → 3.15.2

OpenCV → 3.4.1

Caffe → 1.0.0

Tensorflow-GPU → 1.4.0

GPU → Tesla T4

Issue while Executing Application

In CPU Mode

When we disable GPU mode and run in CPU mode, face detection works …

int flag = CV_HAAR_DO_ROUGH_SEARCH | CV_HAAR_SCALE_IMAGE;
this->face_detector_cpu.detectMultiScale(gray, faces_tmp, 1.2, 1, flag,
                                         cv::Size(param.minFaceSize, param.minFaceSize),
                                         cv::Size(param.maxFaceSize, param.maxFaceSize));

Faces are detected, but at a low frame rate (FPS).
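
For readers unfamiliar with the arguments in the CPU call above: the 1.2 is the scale factor, and the two cv::Size arguments bound the smallest and largest face that is searched for. The sketch below is a rough host-side model of which window sizes such a search visits. It is not OpenCV's actual implementation. The function name scannedWindowSizes, the 24-pixel base window, and the 40 and 200 pixel limits used in the note after it are assumptions for illustration, since the thread does not give the values of param.minFaceSize and param.maxFaceSize.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Rough model of a multi-scale sliding-window search (hypothetical helper,
// not part of OpenCV). The base window grows geometrically by scaleFactor;
// only sizes inside [minSize, maxSize] are kept.
std::vector<int> scannedWindowSizes(int baseWindow, double scaleFactor,
                                    int minSize, int maxSize) {
    assert(baseWindow > 0 && scaleFactor > 1.0 && minSize <= maxSize);
    std::vector<int> sizes;
    for (double factor = 1.0;; factor *= scaleFactor) {
        const int window = static_cast<int>(std::lround(baseWindow * factor));
        if (window > maxSize) break;      // larger than the biggest face wanted
        if (window < minSize) continue;   // smaller than the smallest face wanted
        sizes.push_back(window);
    }
    return sizes;
}
```

Calling scannedWindowSizes(24, 1.2, 40, 200) under these assumptions yields a short list of window sizes between 40 and 200 pixels. A larger scale factor or a narrower size range means fewer scales to scan, which is one of the usual levers for raising the frame rate in CPU mode.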

In GPU Mode

With GPU mode disabled, face detection in our application works fine.

With GPU mode enabled, face detection does not work: the application freezes inside the detectMultiScale function.

The relevant code, for reference:

FaceDetector::getInstance().detect(frame, faces_haar, faceParam, FaceDetectionType == "HAAR",  enable_gpu, scale);

this->face_detector_gpu->detectMultiScale(im_gpu, faces_gpu);

printertechnicalsupport · October 23, 2019, 12:26pm

Thanks for sharing this blog!! Fantastic post!!

ChrisDing · October 25, 2019, 3:12am

There is relevant information at the bottom of this thread:
https://devtalk.nvidia.com/default/topic/1045586/jetson-agx-xavier/opencv-the-function-detectmultiscale-hangs-crashes-on-xavier/

mchi · October 26, 2019, 2:02am

I can reproduce this bug with the OpenCV 3.4.0 cascadeclassifier sample on a Tesla T4, but not on a P4. This is the same issue as in the link ChrisDing provided above.
After some debugging, I came up with the attached change. With this change, cascadeclassifier no longer hangs in the GPU detectMultiScale() call and detects faces correctly.

Until you get an official fix from OpenCV (if you raise this with OpenCV), you could try the attached workaround (WAR).

WAR: OpenCV-3.4.0: fix cascadeclassifier sample hang/block on
 Turing GPU

On Turing GPU, e.g. T4, cascadeclassifier sample - samples/gpu/cascadeclassifier.cpp
hangs in below call

    cascade_gpu->detectMultiScale(resized_gpu, facesBuf_gpu);

This sample works on Tesla P4 before below change.
---
 modules/cudalegacy/src/cuda/NCVHaarObjectDetection.cu | 2 +-
 modules/cudalegacy/src/cuda/NPP_staging.cu            | 5 ++++-
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/modules/cudalegacy/src/cuda/NCVHaarObjectDetection.cu b/modules/cudalegacy/src/cuda/NCVHaarObjectDetection.cu
index e6e5e52..47982f2 100644
--- a/modules/cudalegacy/src/cuda/NCVHaarObjectDetection.cu
+++ b/modules/cudalegacy/src/cuda/NCVHaarObjectDetection.cu
@@ -173,7 +173,7 @@ __device__ Ncv32u scan1Inclusive(Ncv32u idata, volatile Ncv32u *s_Data)
 const Ncv32u MAX_GRID_DIM = 65535;

-const Ncv32u NUM_THREADS_ANCHORSPARALLEL = 64;
+const Ncv32u NUM_THREADS_ANCHORSPARALLEL = 32; //64;

#define NUM_THREADS_CLASSIFIERPARALLEL_LOG2     6
diff --git a/modules/cudalegacy/src/cuda/NPP_staging.cu b/modules/cudalegacy/src/cuda/NPP_staging.cu
index a96f44f..b29c6d3 100644
--- a/modules/cudalegacy/src/cuda/NPP_staging.cu
+++ b/modules/cudalegacy/src/cuda/NPP_staging.cu
@@ -99,11 +99,14 @@ inline __device__ T warpScanInclusive(T idata, volatile T *s_Data)
 #if __CUDA_ARCH__ >= 300
     const unsigned int laneId = cv::cuda::device::Warp::laneId();
 
+    unsigned mask = __activemask();// need to get mask for active threads
+
     // scan on shuffl functions
     #pragma unroll
     for (int i = 1; i <= (K_WARP_SIZE / 2); i *= 2)
     {
-        const T n = cv::cuda::device::shfl_up(idata, i);
+        //const T n = cv::cuda::device::shfl_up(idata, i);
+        const T n = __shfl_up_sync(mask, idata, i, 32);//need to call with proper mask
         if (laneId >= i)
               idata += n;
     }
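
The loop patched in warpScanInclusive is a warp-level inclusive prefix sum built on shuffle-up. As a minimal host-side sketch of what that loop computes, the code below models a single warp in plain C++, assuming all 32 lanes are active (the case where the mask would be 0xffffffff). The names Warp, shflUp, and kWarpSize are hypothetical stand-ins for illustration; this is not the OpenCV or CUDA code itself. For context, the unsynchronized __shfl_up family has been deprecated since CUDA 9 in favor of the *_sync variants, because Volta and later GPUs (including Turing) schedule the threads of a warp independently, which is likely why the old code misbehaves on the T4 but not on the P4.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kWarpSize = 32;
using Warp = std::array<std::uint32_t, kWarpSize>;

// Host model of a shuffle-up across a fully active warp: lane L receives the
// value held by lane L - delta; lanes below delta keep their own value.
Warp shflUp(const Warp& values, int delta) {
    assert(delta > 0 && delta < kWarpSize);
    Warp out = values;
    for (int lane = delta; lane < kWarpSize; ++lane) {
        out[lane] = values[lane - delta];
    }
    return out;
}

// Same loop shape as the patched device function: log2(32) = 5 rounds, each
// adding the value from i lanes below, but only for lanes with laneId >= i.
Warp warpScanInclusive(Warp idata) {
    for (int i = 1; i <= kWarpSize / 2; i *= 2) {
        const Warp n = shflUp(idata, i);
        for (int lane = 0; lane < kWarpSize; ++lane) {
            if (lane >= i) {
                idata[lane] += n[lane];
            }
        }
    }
    return idata;
}
```

With every lane holding 1 on input, lane k ends up holding k + 1, which is the inclusive prefix sum. On the device, each round only gives this result if all participating lanes execute the shuffle together, which is what the explicit mask argument of __shfl_up_sync is meant to guarantee.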

RajeshMuthuswamy · November 27, 2019, 2:53am

Thank you for the solution, @ChrisDing and @mchi.

I tried the workaround you posted, and GPU face detection now works with OpenCV 3.4.1.
