Please provide the following info (tick the boxes after creating this topic):
Software Version
DRIVE OS 6.0.10.0
[1] DRIVE OS 6.0.8.1
DRIVE OS 6.0.6
DRIVE OS 6.0.5
DRIVE OS 6.0.4 (rev. 1)
DRIVE OS 6.0.4 SDK
other
Target Operating System
[1] Linux
QNX
other
Hardware Platform
DRIVE AGX Orin Developer Kit (940-63710-0010-300)
DRIVE AGX Orin Developer Kit (940-63710-0010-200)
DRIVE AGX Orin Developer Kit (940-63710-0010-100)
DRIVE AGX Orin Developer Kit (940-63710-0010-D00)
DRIVE AGX Orin Developer Kit (940-63710-0010-C00)
[1] DRIVE AGX Orin Developer Kit (not sure of its part number)
other
SDK Manager Version
2.1.0
other
Host Machine Version
native Ubuntu Linux 20.04 Host installed with SDK Manager
native Ubuntu Linux 20.04 Host installed with DRIVE OS Docker Containers
native Ubuntu Linux 18.04 Host installed with DRIVE OS Docker Containers
other
Hi team,
I am working on the DRIVE AGX Orin. I compiled CUDA-accelerated face detection code on the target, but I am getting a very low frame rate. I measured GPU utilization with tegrastats: the code is using 99% of the GPU, yet the frame rate is very low and the frames are lagging. Any help would be appreciated.
Dear @akshay.tupkar,
Could you profile the application using nsys to get more insight? What is the expected FPS for your application? Does your face detection use a DNN, or is it a plain CUDA kernel implementation? If it is a CUDA kernel, we need to see the kernel execution times to set the right expectation for FPS. If it is a DNN, what inference time do you see with trtexec?
Dear @SivaRamaKrishnaNV
The code is not using any DNN module for face detection. It is just a plain CUDA implementation.
I am attaching the code for your reference. face_detection_cuda.txt (2.6 KB)
Dear @akshay.tupkar,
The code appears to be based on OpenCV rather than just a CUDA kernel. I see CPU ↔ GPU memory transfers in the code, which contribute to the delay. Also, I am not sure about OpenCV's internal CUDA implementation. The data transfers can be avoided if you use the DW framework.
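For illustration, the per-frame pattern in OpenCV CUDA code usually looks something like the sketch below (not your exact code; the detection step is a placeholder). Each upload()/download() is a CPU ↔ GPU copy that costs time on every frame:

```
#include <opencv2/core/cuda.hpp>
#include <opencv2/cudaimgproc.hpp>
#include <opencv2/imgproc.hpp>

void processFrame(const cv::Mat& frame)
{
    cv::cuda::GpuMat d_frame, d_gray;

    // Host -> device copy: paid for every single frame.
    d_frame.upload(frame);

    // GPU-side work (e.g. color conversion before the detector).
    cv::cuda::cvtColor(d_frame, d_gray, cv::COLOR_BGR2GRAY);

    // ... face detection on d_gray ...

    // Device -> host copy: results come back to the CPU for drawing/rendering.
    cv::Mat result;
    d_gray.download(result);
}
```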
Note that we don't officially support OpenCV, and hence we don't evaluate its performance on Orin.
If you have an image-processing CUDA kernel and want to use the DW framework, please see DriveWorks SDK Reference: Image Capture Sample for how to integrate a CUDA kernel into your DW application pipeline.
I did not understand your question. Are you asking about a CUDA API for memory data transfer? If so, please see the cudaMemcpy* calls in the CUDA documentation.
Could you check the time spent in the memory transfer calls between CPU ↔ GPU and in the face detection call? This helps set the right expectation for FPS.
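One simple way to get those numbers is to bracket each stage with CUDA events, roughly like this (a minimal sketch; detectFacesKernel, the buffer pointers/sizes, and the launch configuration are placeholders for your actual code):

```
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical detection kernel; stands in for the real implementation.
__global__ void detectFacesKernel(const unsigned char* in, unsigned char* out,
                                  int width, int height);

void timeOneFrame(const unsigned char* h_in, unsigned char* d_in,
                  unsigned char* d_out, unsigned char* h_out,
                  size_t frameBytes, size_t resultBytes,
                  int width, int height, dim3 grid, dim3 block)
{
    cudaEvent_t t0, t1, t2, t3;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventCreate(&t2); cudaEventCreate(&t3);

    cudaEventRecord(t0);
    cudaMemcpy(d_in, h_in, frameBytes, cudaMemcpyHostToDevice);     // CPU -> GPU
    cudaEventRecord(t1);
    detectFacesKernel<<<grid, block>>>(d_in, d_out, width, height); // detection
    cudaEventRecord(t2);
    cudaMemcpy(h_out, d_out, resultBytes, cudaMemcpyDeviceToHost);  // GPU -> CPU
    cudaEventRecord(t3);
    cudaEventSynchronize(t3);

    float h2dMs = 0.f, kernelMs = 0.f, d2hMs = 0.f;
    cudaEventElapsedTime(&h2dMs,    t0, t1);
    cudaEventElapsedTime(&kernelMs, t1, t2);
    cudaEventElapsedTime(&d2hMs,    t2, t3);
    printf("H2D %.2f ms, kernel %.2f ms, D2H %.2f ms\n", h2dMs, kernelMs, d2hMs);

    cudaEventDestroy(t0); cudaEventDestroy(t1);
    cudaEventDestroy(t2); cudaEventDestroy(t3);
}
```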
FYI, on Tegra the CPU and GPU are on the same SoC and share the same physical DRAM. Depending on the type of memory allocation, the CPU ↔ GPU data transfers can be avoided. Please see CUDA for Tegra to learn about the different types of memory that can be allocated using CUDA.

Also, we have NvStreams, which can be used to share data across different APIs such as NvMedia (used to capture camera frames), CUDA (used to process the data), and OpenGL (used to render the output). When NvStreams is used, no additional data transfer is needed to move data across these modules. Our DW framework uses these APIs internally to avoid unnecessary data transfers. Please check DriveWorks SDK Reference: Image for more details and look at the image streamer API usage in the DW samples. OpenCV may not perform these optimizations, and hence you see additional data transfers when moving data across CPU → GPU → OpenGL (for rendering).
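As a minimal sketch of the idea (assuming a hypothetical detectFacesKernel; see the CUDA for Tegra app note to pick the allocation type that fits your access pattern), managed memory lets the CPU and the integrated GPU work on the same buffer without any cudaMemcpy:

```
#include <cuda_runtime.h>

// Hypothetical detection kernel; stands in for the real one.
__global__ void detectFacesKernel(unsigned char* frame, int width, int height);

int main()
{
    const int width = 1920, height = 1080;
    const size_t frameBytes = static_cast<size_t>(width) * height;

    // Managed (unified) memory: on Tegra the CPU and iGPU share the same DRAM,
    // so both sides can access this buffer without an explicit cudaMemcpy.
    unsigned char* frame = nullptr;
    cudaMallocManaged(&frame, frameBytes);

    // ... the capture path / CPU writes the frame into 'frame' ...

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    detectFacesKernel<<<grid, block>>>(frame, width, height);
    cudaDeviceSynchronize();   // CPU can read the results directly after this

    cudaFree(frame);
    return 0;
}
```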
I see your use case as: capture camera frames (1) → process on the GPU (2) → render with GL (3).
The sample object detector sample demonstrates how to integrate a DNN into the DW framework. If you have a DNN that can do face detection, you can leverage this sample. In the above pipeline, step 2 uses a TRT engine and runs on the GPU.
If instead you have a CV algorithm implemented as CUDA kernels that can do face detection, then in the above pipeline step 2 will be a set of GPU kernels that receive the data as a CUDA image from the camera (NvMedia image) via dwImageStreamer.
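A rough sketch of what step 2 could look like once a frame is available as a CUDA image on the consumer side of dwImageStreamer (the kernel and its launch configuration are placeholders, and the exact receive/return calls and signatures should be taken from the image streamer sample in your DRIVE OS version):

```
#include <dw/image/Image.h>

// Hypothetical face-detection kernel working directly on the pitched CUDA image plane.
__global__ void detectFacesKernel(uint8_t* pixels, size_t pitch,
                                  uint32_t width, uint32_t height);

// 'frameHandle' is the dwImageHandle_t received on the CUDA side of the
// dwImageStreamer (see the image streamer sample for the receive/return calls).
void runDetection(dwImageHandle_t frameHandle)
{
    dwImageCUDA* frameCUDA = nullptr;
    dwImage_getCUDA(&frameCUDA, frameHandle);   // CUDA view of the frame, no copy

    dim3 block(16, 16);
    dim3 grid((frameCUDA->prop.width  + block.x - 1) / block.x,
              (frameCUDA->prop.height + block.y - 1) / block.y);

    detectFacesKernel<<<grid, block>>>(
        static_cast<uint8_t*>(frameCUDA->dptr[0]),
        frameCUDA->pitch[0],
        frameCUDA->prop.width,
        frameCUDA->prop.height);
}
```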