Face detection hangs in detectMultiScale when GPU is enabled

Installed Packages

Operating System -> Ubuntu 18.04

NVIDIA Driver -> Tesla Driver for Ubuntu 18.04 Version: 418.87.00

CUDA -> CUDA Toolkit 10.1

CuDNN -> CuDNN 7.6

CMake -> 3.15.2

OpenCV -> 3.4.1

Caffe -> 1.0.0

Tensorflow-GPU -> 1.4.0

GPU -> Tesla T4

Issue while Executing Application

In CPU Mode

When we disable GPU mode and run in CPU mode, face detection works …

int flag = CV_HAAR_DO_ROUGH_SEARCH | CV_HAAR_SCALE_IMAGE;

this->face_detector_cpu.detectMultiScale(gray, faces_tmp, 1.2, 1, flag,
                                         cv::Size(param.minFaceSize, param.minFaceSize),
                                         cv::Size(param.maxFaceSize, param.maxFaceSize));

Faces are detected, but at a low frame rate (FPS).
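Part of the low CPU frame rate comes from how many window sizes the detector has to evaluate. The sketch below is a simplified model, not OpenCV's actual implementation: it assumes a 24-pixel base window and hypothetical values of 40 and 400 for param.minFaceSize and param.maxFaceSize, and counts how many window sizes fall inside that range for a given scale factor. countScales is a made-up helper name.

```cpp
#include <cassert>

// Simplified model (an assumption, not OpenCV's real loop): the detection
// window starts at baseWindow pixels and grows by scaleFactor at each step.
// Only window sizes within [minSize, maxSize] are counted as evaluated.
int countScales(double baseWindow, double minSize, double maxSize, double scaleFactor) {
    assert(baseWindow > 0.0 && scaleFactor > 1.0);
    int count = 0;
    for (double window = baseWindow; window <= maxSize; window *= scaleFactor) {
        if (window >= minSize) {
            ++count;
        }
    }
    return count;
}
```

In this model a scale factor of 1.2, as in the call above, visits fewer window sizes than 1.1, and narrowing the min/max face size range reduces the count further. Fewer sizes generally means less work per frame.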

In GPU Mode

When we disable GPU mode and run our application, face detection works fine.

But when we enable GPU mode, face detection does not work: the application freezes inside the detectMultiScale function.

The following code is for reference:

FaceDetector::getInstance().detect(frame, faces_haar, faceParam, FaceDetectionType == "HAAR",  enable_gpu, scale);

this->face_detector_gpu->detectMultiScale(im_gpu, faces_gpu);


There is relevant information at the bottom of this thread:
https://devtalk.nvidia.com/default/topic/1045586/jetson-agx-xavier/opencv-the-function-detectmultiscale-hangs-crashes-on-xavier/

I can reproduce this bug with the OpenCV 3.4.0 cascadeclassifier sample on a Tesla T4, but not on a P4. This matches the issue in the link ChrisDing provided above.
After some debugging, I came up with the attached change. With this change, cascadeclassifier no longer hangs in the GPU detectMultiScale() call and performs face detection correctly.
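
A possible explanation for the T4/P4 difference, which this thread does not confirm: starting with compute capability 7.0 (Volta), NVIDIA GPUs use independent thread scheduling, so code that assumes all threads of a warp run in lockstep is no longer safe and the warp shuffle intrinsics need their *_sync variants. The T4 (Turing) has compute capability 7.5, while the P4 (Pascal) has 6.1. The helper below is a hypothetical illustration of that rule in plain C++; the struct, function, and constant names are not part of OpenCV or CUDA.

```cpp
#include <cassert>

// Hypothetical description of a GPU's compute capability (major.minor).
struct ComputeCapability {
    int ccMajor;
    int ccMinor;
};

// Volta (7.0) and newer schedule the threads of a warp independently, so
// legacy warp-synchronous code (e.g. shuffles without a mask) is unsafe there.
bool needsSyncWarpPrimitives(const ComputeCapability& cc) {
    assert(cc.ccMajor >= 1 && cc.ccMinor >= 0);
    return cc.ccMajor >= 7;
}

const ComputeCapability kTeslaT4{7, 5};  // Turing
const ComputeCapability kTeslaP4{6, 1};  // Pascal
```

On a real system the two numbers would come from cudaGetDeviceProperties rather than from hard-coded constants.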

Until you get an official fix from OpenCV (if you raise this with them), you could try the attached workaround (WAR).

WAR: OpenCV-3.4.0: fix cascadeclassifier sample hang/block on Turing GPU

On Turing GPUs, e.g. T4, the cascadeclassifier sample (samples/gpu/cascadeclassifier.cpp)
hangs in the call below:

    cascade_gpu->detectMultiScale(resized_gpu, facesBuf_gpu);

This sample already works on Tesla P4 without the change below.
---
 modules/cudalegacy/src/cuda/NCVHaarObjectDetection.cu | 2 +-
 modules/cudalegacy/src/cuda/NPP_staging.cu            | 5 ++++-
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/modules/cudalegacy/src/cuda/NCVHaarObjectDetection.cu b/modules/cudalegacy/src/cuda/NCVHaarObjectDetection.cu
index e6e5e52..47982f2 100644
--- a/modules/cudalegacy/src/cuda/NCVHaarObjectDetection.cu
+++ b/modules/cudalegacy/src/cuda/NCVHaarObjectDetection.cu
@@ -173,7 +173,7 @@ __device__ Ncv32u scan1Inclusive(Ncv32u idata, volatile Ncv32u *s_Data)
 const Ncv32u MAX_GRID_DIM = 65535;

-const Ncv32u NUM_THREADS_ANCHORSPARALLEL = 64;
+const Ncv32u NUM_THREADS_ANCHORSPARALLEL = 32; //64;

#define NUM_THREADS_CLASSIFIERPARALLEL_LOG2     6
diff --git a/modules/cudalegacy/src/cuda/NPP_staging.cu b/modules/cudalegacy/src/cuda/NPP_staging.cu
index a96f44f..b29c6d3 100644
--- a/modules/cudalegacy/src/cuda/NPP_staging.cu
+++ b/modules/cudalegacy/src/cuda/NPP_staging.cu
@@ -99,11 +99,14 @@ inline __device__ T warpScanInclusive(T idata, volatile T *s_Data)
 #if __CUDA_ARCH__ >= 300
     const unsigned int laneId = cv::cuda::device::Warp::laneId();
 
+    unsigned mask = __activemask();// need to get mask for active threads
+
     // scan on shuffl functions
     #pragma unroll
     for (int i = 1; i <= (K_WARP_SIZE / 2); i *= 2)
     {
-        const T n = cv::cuda::device::shfl_up(idata, i);
+        //const T n = cv::cuda::device::shfl_up(idata, i);
+        const T n = __shfl_up_sync(mask, idata, i, 32);//need to call with proper mask
         if (laneId >= i)
               idata += n;
     }
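
To make clear what the patched loop computes, here is a host-side C++ emulation of the shuffle-based inclusive scan. It is only a sketch: it assumes all 32 lanes of the warp are active (so the mask is not modeled), and shflUp and warpScanInclusiveHost are hypothetical stand-ins for the device intrinsic and the device function in NPP_staging.cu.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kWarpSize = 32;  // same value the patch passes as the shuffle width
using Warp = std::array<std::uint32_t, kWarpSize>;

// Host stand-in for a shuffle-up: lane L receives the value of lane L - delta.
// Lanes below delta have no source lane and keep their own value.
Warp shflUp(const Warp& values, int delta) {
    assert(delta > 0 && delta < kWarpSize);
    Warp out = values;
    for (int lane = delta; lane < kWarpSize; ++lane) {
        out[lane] = values[lane - delta];
    }
    return out;
}

// Same loop shape as warpScanInclusive in the diff: log2(32) = 5 shuffle steps,
// after which each lane holds the sum of its own and all lower lanes' inputs.
Warp warpScanInclusiveHost(Warp data) {
    for (int i = 1; i <= kWarpSize / 2; i *= 2) {
        const Warp shifted = shflUp(data, i);
        for (int lane = i; lane < kWarpSize; ++lane) {  // mirrors `if (laneId >= i)`
            data[lane] += shifted[lane];
        }
    }
    return data;
}
```

With an input of all ones, lane k ends up holding k + 1. On the GPU this result depends on the participating lanes executing each shuffle step together, which is what the mask passed to __shfl_up_sync expresses.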

Thank you for the solution @ChrisDing and @Michi

I tried the solution you mentioned, and it now works in OpenCV 3.4.1.