Why is OpenCV cuda::HoughCirclesDetector much slower than cv::HoughCircles on Jetson Nano?

I built OpenCV based on

with minor changes.
Below is my OpenCV build config:
local CMAKEFLAGS="
-D BUILD_EXAMPLES=OFF
-D BUILD_opencv_python2=ON
-D BUILD_opencv_python3=ON
-D CMAKE_BUILD_TYPE=RELEASE
-D CMAKE_INSTALL_PREFIX=${PREFIX}
-D CUDA_ARCH_BIN=5.3,6.2,7.2
-D CUDA_ARCH_PTX=
-D CUDA_FAST_MATH=ON
-D CUDNN_VERSION='8.0'
-D EIGEN_INCLUDE_PATH=/usr/include/eigen3
-D ENABLE_NEON=ON
-D OPENCV_DNN_CUDA=ON
-D OPENCV_ENABLE_NONFREE=ON
-D OPENCV_EXTRA_MODULES_PATH=/tmp/build_opencv/opencv_contrib/modules
-D OPENCV_GENERATE_PKGCONFIG=ON
-D WITH_CUBLAS=ON
-D WITH_CUDA=ON
-D WITH_CUDNN=ON
-D WITH_GSTREAMER=ON
-D WITH_LIBV4L=ON
-D WITH_OPENGL=ON
-D WITH_QT=ON
-D WITH_FFMPEG=ON
-D WITH_TBB=ON
-D WITH_V4L=ON"

Below is my test code, including how the processing time is measured.

cuda::HoughCirclesDetector test code:

cv::cuda::GpuMat gray_gpu;
gray_gpu.upload(gray_img);
int64 t0 = cv::getTickCount();

cv::Ptr<cv::cuda::HoughCirclesDetector> houghCircles = cv::cuda::createHoughCirclesDetector(1.0f, 7, 180, 40, 17, 80);

cv::cuda::GpuMat d_circles;

houghCircles->detect(gray_gpu, d_circles);
qDebug() << " Proc time: " <<  (cv::getTickCount() - t0) * 1000 / (int64)cv::getTickFrequency() << "ms";

The cv::HoughCircles code is as below:

int64 t0 = cv::getTickCount();
vector<cv::Vec4f> circles;
HoughCircles(gray, circles, HOUGH_GRADIENT, 1, 7,180,40,17,80);
qDebug() << " Proc time: " <<  (cv::getTickCount() - t0) * 1000 / (int64)cv::getTickFrequency() << "ms";

Both were tested with the same video.
Average processing time:
cuda::HoughCirclesDetector: 8 ms
cv::HoughCircles: 2 ms

I noticed this post:

which says:
Please build openCV with correct GPU architecture. sm=53 for Nano.

Could you please let me know whether there is a build configuration issue?
Where can I set the GPU architecture to sm=53 for the Jetson Nano?

Thanks

CUDA arch should be set in CMake with CUDA_ARCH_BIN=5.3 for TX1/Nano.

I'm not sure, but IIRC the GPU version was returning many concentric circles, while the CPU version was merging them into a single one. This may make a difference.
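To compare the two outputs on equal footing, one option is to post-process both results so that circles sharing roughly the same centre are collapsed into one. The sketch below is only an illustration of that idea, not what either OpenCV implementation does internally. The `Circle` struct and `mergeConcentric` are hypothetical names, and `minDist` is assumed to play the same role as the minimum centre distance (7 in the test code).

```cpp
#include <cmath>
#include <vector>

// Hypothetical plain circle type: centre (x, y) and radius r.
struct Circle { float x, y, r; };

// Keeps the first circle seen for each centre. Any later circle whose centre
// lies closer than minDist to an already kept centre is dropped.
std::vector<Circle> mergeConcentric(const std::vector<Circle>& in, float minDist) {
    std::vector<Circle> out;
    for (const Circle& c : in) {
        bool duplicate = false;
        for (const Circle& k : out) {
            if (std::hypot(c.x - k.x, c.y - k.y) < minDist) {
                duplicate = true;
                break;
            }
        }
        if (!duplicate) out.push_back(c);
    }
    return out;
}
```

If both versions return a similar number of circles after this step, the difference in circle count is unlikely to explain the timing gap.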

Also note that cv::cuda may take some time to set up, so the first iteration may be slower, but subsequent ones may be much faster.
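A minimal sketch of a timing helper that keeps the setup cost out of the measurement: it runs a few untimed warm-up iterations first, then averages the remaining runs and reports fractional milliseconds. `averageMs` is a hypothetical name, and it uses only the standard library so that it makes no assumptions about the OpenCV build.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>

// Hypothetical helper: calls `work` `warmup` times without timing it, then
// returns the average wall-clock time in milliseconds over `iters` timed calls.
double averageMs(const std::function<void()>& work,
                 std::size_t warmup, std::size_t iters) {
    for (std::size_t i = 0; i < warmup; ++i) work();  // setup cost lands here
    if (iters == 0) return 0.0;

    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (std::size_t i = 0; i < iters; ++i) work();
    const std::chrono::duration<double, std::milli> elapsed = clock::now() - start;
    return elapsed.count() / static_cast<double>(iters);
}
```

For the GPU case, `work` would be a lambda that only calls `houghCircles->detect(gray_gpu, d_circles)`, with the detector created once beforehand. In the test code above, `createHoughCirclesDetector` sits inside the timed region, so its cost is counted on every measurement.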

Thank you for your kind reply.
My OpenCV build has
-D CUDA_ARCH_BIN=5.3,6.2,7.2
which includes 5.3, but also others: 6.2 and 7.2.
Do you think my build is OK, or do I need to remove 6.2 and 7.2 and rebuild it?

Based on my test, both GPU and CPU versions return many concentric circles.

Your build is OK. It may also run on a TX2 (6.2) or Xavier (7.2).
As for the number of circles, I noticed that 3 years ago, and many things have changed since then.
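As a small illustration of why the multi-architecture list is fine, the sketch below checks whether a given compute capability appears in a comma-separated list like the one passed to CUDA_ARCH_BIN. `archListContains` is a hypothetical helper that does an exact string match only. It does not model how CUDA actually selects binaries at run time.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: returns true if `target` (e.g. "5.3") is one of the
// entries in a comma-separated CUDA_ARCH_BIN-style list such as "5.3,6.2,7.2".
bool archListContains(const std::string& archBin, const std::string& target) {
    std::istringstream ss(archBin);
    std::string entry;
    while (std::getline(ss, entry, ',')) {
        if (entry == target) return true;
    }
    return false;
}
```

With the list from this thread, `archListContains("5.3,6.2,7.2", "5.3")` is true, so the Nano's architecture is covered. The extra entries only add binaries for other devices.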