Please provide the following info (tick the boxes after creating this topic):
Software Version
DRIVE OS 6.0.10.0
[1] DRIVE OS 6.0.8.1
DRIVE OS 6.0.6
DRIVE OS 6.0.5
DRIVE OS 6.0.4 (rev. 1)
DRIVE OS 6.0.4 SDK
other
Target Operating System
[1] Linux
QNX
other
Hardware Platform
DRIVE AGX Orin Developer Kit (940-63710-0010-300)
DRIVE AGX Orin Developer Kit (940-63710-0010-200)
DRIVE AGX Orin Developer Kit (940-63710-0010-100)
DRIVE AGX Orin Developer Kit (940-63710-0010-D00)
DRIVE AGX Orin Developer Kit (940-63710-0010-C00)
[1] DRIVE AGX Orin Developer Kit (not sure of its part number)
other
SDK Manager Version
2.1.0
other
Host Machine Version
native Ubuntu Linux 20.04 Host installed with SDK Manager
native Ubuntu Linux 20.04 Host installed with DRIVE OS Docker Containers
native Ubuntu Linux 18.04 Host installed with DRIVE OS Docker Containers
other
Hi team,
I am working on the DRIVE AGX Orin. I compiled CUDA-accelerated face detection code on the target, but I am getting a very low frame rate. I measured GPU utilization with tegrastats: the code is using 99% of the GPU, yet the frame rate is very low and the frames are lagging. Any help would be appreciated.
Dear @akshay.tupkar,
Could you profile the application using nsys to get more insight? What is the expected FPS for your application? Does your face detection use a DNN, or is it a plain CUDA kernel implementation? If it is a CUDA kernel, we need to see the kernel execution times to set the right expectation for FPS. If it is a DNN, what inference time do you see with trtexec?
Dear @SivaRamaKrishnaNV
The code is not using any DNN module for face detection. It is just a plain CUDA implementation.
I am attaching the code for your reference. face_detection_cuda.txt (2.6 KB)
Dear @akshay.tupkar,
The code appears to be based on OpenCV rather than just a CUDA kernel. I see CPU ↔ GPU memory transfers in the code, which contribute to the delay. Also, I am not sure about OpenCV's internal CUDA implementation. The data transfers can be avoided if you use the DW framework.
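For illustration, the per-frame pattern in OpenCV CUDA code usually looks something like the sketch below (not your exact code; the detection step is a placeholder). Each upload()/download() is a CPU ↔ GPU copy that costs time on every frame:

```
#include <opencv2/core/cuda.hpp>
#include <opencv2/cudaimgproc.hpp>
#include <opencv2/imgproc.hpp>

void processFrame(const cv::Mat& frame)
{
    cv::cuda::GpuMat d_frame, d_gray;

    // Host -> device copy: paid for every single frame.
    d_frame.upload(frame);

    // GPU-side work (e.g. color conversion before the detector).
    cv::cuda::cvtColor(d_frame, d_gray, cv::COLOR_BGR2GRAY);

    // ... face detection on d_gray ...

    // Device -> host copy: results come back to the CPU for drawing/rendering.
    cv::Mat result;
    d_gray.download(result);
}
```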
Note that we don't officially support OpenCV, and hence we don't evaluate its performance on Orin.
If you have an image-processing CUDA kernel and want to use the DW framework, please see DriveWorks SDK Reference: Image Capture Sample for how to integrate a CUDA kernel into your DW application pipeline.
I did not understand your question. Are you asking about a CUDA API for memory data transfer? If so, please see the cudaMemcpy* calls in the CUDA documentation.
Could you check the time spent in the memory transfer calls between CPU ↔ GPU and in the face detection call? This helps set the right expectation for FPS.
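One simple way to get those numbers is to bracket each stage with CUDA events, roughly like this (a minimal sketch; detectFacesKernel, the buffer pointers/sizes, and the launch configuration are placeholders for your actual code):

```
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical detection kernel; stands in for the real implementation.
__global__ void detectFacesKernel(const unsigned char* in, unsigned char* out,
                                  int width, int height);

void timeOneFrame(const unsigned char* h_in, unsigned char* d_in,
                  unsigned char* d_out, unsigned char* h_out,
                  size_t frameBytes, size_t resultBytes,
                  int width, int height, dim3 grid, dim3 block)
{
    cudaEvent_t t0, t1, t2, t3;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventCreate(&t2); cudaEventCreate(&t3);

    cudaEventRecord(t0);
    cudaMemcpy(d_in, h_in, frameBytes, cudaMemcpyHostToDevice);     // CPU -> GPU
    cudaEventRecord(t1);
    detectFacesKernel<<<grid, block>>>(d_in, d_out, width, height); // detection
    cudaEventRecord(t2);
    cudaMemcpy(h_out, d_out, resultBytes, cudaMemcpyDeviceToHost);  // GPU -> CPU
    cudaEventRecord(t3);
    cudaEventSynchronize(t3);

    float h2dMs = 0.f, kernelMs = 0.f, d2hMs = 0.f;
    cudaEventElapsedTime(&h2dMs,    t0, t1);
    cudaEventElapsedTime(&kernelMs, t1, t2);
    cudaEventElapsedTime(&d2hMs,    t2, t3);
    printf("H2D %.2f ms, kernel %.2f ms, D2H %.2f ms\n", h2dMs, kernelMs, d2hMs);

    cudaEventDestroy(t0); cudaEventDestroy(t1);
    cudaEventDestroy(t2); cudaEventDestroy(t3);
}
```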
FYI, on Tegra the CPU and GPU are on the same SoC and share the same physical DRAM. Depending on the type of memory allocation, the CPU ↔ GPU data transfers can be avoided. Please see CUDA for Tegra to learn about the different types of memory that can be allocated using CUDA.

Also, we have NvStreams, which can be used to share data across different APIs such as NvMedia (used to capture camera frames), CUDA (used to process the data), and OpenGL (used to render the output). When NvStreams is used, no additional data transfer is needed to move data across these modules. Our DW framework uses these APIs internally to avoid unnecessary data transfers. Please check DriveWorks SDK Reference: Image for more details and look at the image streamer API usage in the DW samples. OpenCV may not perform these optimizations, and hence you see additional data transfers when moving data across CPU → GPU → OpenGL (for rendering).
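As a minimal sketch of the idea (assuming a hypothetical detectFacesKernel; see the CUDA for Tegra app note to pick the allocation type that fits your access pattern), managed memory lets the CPU and the integrated GPU work on the same buffer without any cudaMemcpy:

```
#include <cuda_runtime.h>

// Hypothetical detection kernel; stands in for the real one.
__global__ void detectFacesKernel(unsigned char* frame, int width, int height);

int main()
{
    const int width = 1920, height = 1080;
    const size_t frameBytes = static_cast<size_t>(width) * height;

    // Managed (unified) memory: on Tegra the CPU and iGPU share the same DRAM,
    // so both sides can access this buffer without an explicit cudaMemcpy.
    unsigned char* frame = nullptr;
    cudaMallocManaged(&frame, frameBytes);

    // ... the capture path / CPU writes the frame into 'frame' ...

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    detectFacesKernel<<<grid, block>>>(frame, width, height);
    cudaDeviceSynchronize();   // CPU can read the results directly after this

    cudaFree(frame);
    return 0;
}
```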
I see your use case as: capture camera frames (1) → process on the GPU (2) → render with GL (3).
The sample object detector sample demonstrates how to integrate a DNN into the DW framework. If you have a DNN that can do face detection, you can leverage this sample. In the above pipeline, step 2 uses a TRT engine and runs on the GPU.
If instead you have a CV algorithm implemented as CUDA kernels that can do face detection, then in the above pipeline step 2 will be a set of GPU kernels that receive the data as a CUDA image from the camera (NvMedia image) via dwImageStreamer.
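A rough sketch of what step 2 could look like once a frame is available as a CUDA image on the consumer side of dwImageStreamer (the kernel and its launch configuration are placeholders, and the exact receive/return calls and signatures should be taken from the image streamer sample in your DRIVE OS version):

```
#include <dw/image/Image.h>

// Hypothetical face-detection kernel working directly on the pitched CUDA image plane.
__global__ void detectFacesKernel(uint8_t* pixels, size_t pitch,
                                  uint32_t width, uint32_t height);

// 'frameHandle' is the dwImageHandle_t received on the CUDA side of the
// dwImageStreamer (see the image streamer sample for the receive/return calls).
void runDetection(dwImageHandle_t frameHandle)
{
    dwImageCUDA* frameCUDA = nullptr;
    dwImage_getCUDA(&frameCUDA, frameHandle);   // CUDA view of the frame, no copy

    dim3 block(16, 16);
    dim3 grid((frameCUDA->prop.width  + block.x - 1) / block.x,
              (frameCUDA->prop.height + block.y - 1) / block.y);

    detectFacesKernel<<<grid, block>>>(
        static_cast<uint8_t*>(frameCUDA->dptr[0]),
        frameCUDA->pitch[0],
        frameCUDA->prop.width,
        frameCUDA->prop.height);
}
```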