RE: Performance issues after refactoring CUDA code to avoid managed memory

I was advised to repost my question in this subforum; please see it below:

In the app I’m building, I have to work with a legacy library that acts as a CUDA-based video encoder. Additionally, there are two more libraries using CUDA for drawing on a captured frame and running Torch inference.

Recently, I discovered that when the encoder library runs in parallel with the other two, I encounter sporadic SEGFAULT crashes. However, no crashes occur when processing is done sequentially (e.g., inference → paint → encode). As someone new to CUDA, I investigated and learned that the Jetson platform I’m using (Xavier) doesn’t support concurrent managed memory access: host code must not touch managed allocations while any kernel is running on the GPU.
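
For reference, this limitation can be checked at runtime by querying the corresponding device attributes; a minimal sketch, assuming device 0:

```cpp
// Minimal check of the relevant device attributes (assuming device 0).
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int concurrentManaged = 0;
    int pageableAccess = 0;
    cudaDeviceGetAttribute(&concurrentManaged, cudaDevAttrConcurrentManagedAccess, 0);
    cudaDeviceGetAttribute(&pageableAccess, cudaDevAttrPageableMemoryAccess, 0);
    // On Xavier, concurrentManagedAccess is expected to be 0: the CPU must not
    // touch managed memory while any kernel is running on the GPU.
    std::printf("concurrentManagedAccess = %d\n", concurrentManaged);
    std::printf("pageableMemoryAccess    = %d\n", pageableAccess);
    return 0;
}
```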

The libraries I wrote (Torch inference and painting) use cudaMalloc for device-side allocation and explicit cudaMemcpy for data transfer. However, the encoder library relies exclusively on cudaMallocManaged memory, and that memory is accessed from both device and host code via memset, memcpy, and the subscript operator ([]).
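
Roughly, the access pattern looks like the following sketch (hypothetical names, not the actual encoder code):

```cpp
// Hypothetical sketch of the pattern described above: a single
// cudaMallocManaged allocation is touched from both host and device code
// through the same pointer.
#include <cstring>
#include <cuda_runtime.h>

__global__ void encodeBlock(unsigned char* frame, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) frame[i] ^= 0x80;  // stand-in for real encoder work
}

void process(size_t n) {
    unsigned char* frame = nullptr;
    cudaMallocManaged(&frame, n);

    std::memset(frame, 0, n);  // host-side access through the same pointer
    frame[0] = 42;             // host-side subscript access

    encodeBlock<<<(n + 255) / 256, 256>>>(frame, n);
    cudaDeviceSynchronize();   // on Xavier, host code must not touch `frame`
                               // while this (or any other) kernel is running

    unsigned char first = frame[0];  // host-side read, safe only after sync
    (void)first;
    cudaFree(frame);
}
```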

To address this, I refactored the encoder library to use cudaMalloc with explicit copying. Since host code still needed to access device memory, I used thrust::device_ptr wrappers.

This resolved the crashes but caused a severe performance drop: from 50–60 FPS to 3–5 FPS. I suspect this slowdown happens because every operator[] access through a thrust::device_ptr triggers a separate single-element device-host transfer. While a substantial rework isn’t feasible at the moment, I’m seeking ways to restore acceptable performance.
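
The refactored host-side path effectively does something like this sketch (illustrative names):

```cpp
// thrust::device_ptr::operator[] returns a device_reference; every read or
// write of it issues its own single-element device<->host transfer, so a
// per-byte loop like this turns into millions of tiny copies per frame.
#include <thrust/device_ptr.h>

void slowHostPath(unsigned char* d_frame, size_t n) {
    thrust::device_ptr<unsigned char> frame(d_frame);
    for (size_t i = 0; i < n; ++i) {
        frame[i] = static_cast<unsigned char>(frame[i] / 2);  // ~2 transfers per byte
    }
}
```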

My questions:

  1. Was switching to cudaMalloc reasonable, or should I have used pinned memory or another approach? (As I understand it, pinned memory also triggers device-host copying.)
  2. One solution I’m considering is explicitly copying data to host memory before processing in host code and then uploading it back to the device afterward. Is this the best approach, or are there better alternatives?

Any insights or suggestions are appreciated!

UPD: I started following approach #2 and already got some FPS improvements. But the code requires more work, as there are still bits of host code here and there that try to access the (now cudaMalloc’ed) device memory through the subscript operator [].
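
For one buffer, approach #2 boils down to something like this sketch (illustrative names): one bulk download, all host-side work on a plain host buffer, then one bulk upload.

```cpp
#include <vector>
#include <cuda_runtime.h>

void stagedHostPath(unsigned char* d_frame, size_t n) {
    std::vector<unsigned char> h_frame(n);

    cudaMemcpy(h_frame.data(), d_frame, n, cudaMemcpyDeviceToHost);  // one copy down

    for (size_t i = 0; i < n; ++i) {
        h_frame[i] = static_cast<unsigned char>(h_frame[i] / 2);     // host-side work
    }

    cudaMemcpy(d_frame, h_frame.data(), n, cudaMemcpyHostToDevice);  // one copy up
}
```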

Hi,

1.
What kind of encoder do you use?
You can refer to this document to find one suitable for your use case.
But on Jetson, the camera pipeline usually chooses pinned memory.

2. This might not answer your question but Jetson has a hardware encoder.
Maybe you can check if the following sample can meet your requirements first.

MMAPI:

/usr/src/jetson_multimedia_api/samples/03_video_cuda_enc/

DeepstreamSDK:
https://docs.nvidia.com/metropolis/deepstream/dev-guide/

Thanks.

Thanks!

What kind of encoder do you use?

It’s a proprietary H.264-compliant encoder implemented in CUDA.

You can refer to this document to find one suitable for your use case.

I have reviewed this before. However, as someone who’s never dealt with CUDA/GPU programming, it left me with a specific question: if the memory is unified and allocated on the same physical chip, as on Xavier, why is pageable host memory not accessible from the GPU, and vice versa, why is device memory not accessible from the CPU? I’m not getting this… what is “unified” about this memory then? What exactly does cudaMallocManaged do to make its allocations seamlessly accessible from both (this succinct answer is starting to shed some light on it for me, though)?

I keep reading about this, though (BTW, I would appreciate any pointers to good CUDA programming resources, paid or free)…

This might not answer your question but Jetson has a hardware encoder.
Maybe you can check if the following sample can meet your requirements first.

Thanks, yes, I’m aware of that one, but our focus currently is on making the implementation we already have work on Xavier.

Hi,

Unified memory has two copies, one on the CPU and the other on the GPU.
Although the synchronization is done by the GPU driver, the mechanism can induce some overhead if synchronization is required frequently.

Pinned memory is CPU memory but cannot be swapped out.
Since it won’t be swapped out, the pointer is fixed to allow GPU access.
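
For example, a minimal sketch of a pinned, mapped allocation (illustrative names; on older CUDA versions cudaSetDeviceFlags(cudaDeviceMapHost) may be required before any other runtime call):

```cpp
// Pinned (page-locked) memory that is also mapped for device access; on
// Jetson the host and device pointers refer to the same physical DRAM, so
// no explicit cudaMemcpy is needed.
#include <cuda_runtime.h>

__global__ void touch(unsigned char* p, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1;
}

void pinnedExample(size_t n) {
    cudaSetDeviceFlags(cudaDeviceMapHost);           // allow mapped allocations

    unsigned char* h_buf = nullptr;
    cudaHostAlloc(&h_buf, n, cudaHostAllocMapped);   // pinned + mapped

    unsigned char* d_buf = nullptr;
    cudaHostGetDevicePointer(&d_buf, h_buf, 0);      // device alias of h_buf

    h_buf[0] = 1;                                    // host write, no copy
    touch<<<(n + 255) / 256, 256>>>(d_buf, n);
    cudaDeviceSynchronize();                         // make kernel writes visible
    unsigned char v = h_buf[0];                      // host read sees the update
    (void)v;

    cudaFreeHost(h_buf);
}
```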

Thanks.
