RE: Performance issues after refactoring CUDA code to avoid managed memory

I was advised to repost my question in this subforum; please see it below:

In the app I’m building, I have to work with a legacy library that acts as a CUDA-based video encoder. Additionally, there are two more libraries using CUDA for drawing on a captured frame and running Torch inference.

Recently, I discovered that when the encoder library runs in parallel with the other two, I encounter sporadic SEGFAULT crashes. However, no crashes occur when processing is done sequentially (e.g., inference → paint → encode). As someone new to CUDA, I investigated and learned that the Jetson platform I’m using (Xavier) doesn’t support concurrent managed memory access: host code must not touch managed allocations while any kernel is running on the GPU.
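
For reference, this limitation can be checked at runtime by querying the corresponding device attributes; a minimal sketch, assuming device 0:

```cpp
// Minimal check of the relevant device attributes (assuming device 0).
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int concurrentManaged = 0;
    int pageableAccess = 0;
    cudaDeviceGetAttribute(&concurrentManaged, cudaDevAttrConcurrentManagedAccess, 0);
    cudaDeviceGetAttribute(&pageableAccess, cudaDevAttrPageableMemoryAccess, 0);
    // On Xavier, concurrentManagedAccess is expected to be 0: the CPU must not
    // touch managed memory while any kernel is running on the GPU.
    std::printf("concurrentManagedAccess = %d\n", concurrentManaged);
    std::printf("pageableMemoryAccess    = %d\n", pageableAccess);
    return 0;
}
```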

The libraries I wrote (Torch inference and painting) use cudaMalloc for device-side allocation and explicit cudaMemcpy for data transfer. However, the encoder library relies exclusively on cudaMallocManaged memory, and that memory is accessed from both device and host code via memset, memcpy, and the subscript operator ([]).
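
Roughly, the access pattern looks like the following sketch (hypothetical names, not the actual encoder code):

```cpp
// Hypothetical sketch of the pattern described above: a single
// cudaMallocManaged allocation is touched from both host and device code
// through the same pointer.
#include <cstring>
#include <cuda_runtime.h>

__global__ void encodeBlock(unsigned char* frame, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) frame[i] ^= 0x80;  // stand-in for real encoder work
}

void process(size_t n) {
    unsigned char* frame = nullptr;
    cudaMallocManaged(&frame, n);

    std::memset(frame, 0, n);  // host-side access through the same pointer
    frame[0] = 42;             // host-side subscript access

    encodeBlock<<<(n + 255) / 256, 256>>>(frame, n);
    cudaDeviceSynchronize();   // on Xavier, host code must not touch `frame`
                               // while this (or any other) kernel is running

    unsigned char first = frame[0];  // host-side read, safe only after sync
    (void)first;
    cudaFree(frame);
}
```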

To address this, I refactored the encoder library to use cudaMalloc with explicit copying. Since host code still needed to access device memory, I used thrust::device_ptr wrappers.

This resolved the crashes but caused a severe performance drop: from 50–60 FPS to 3–5 FPS. I suspect this slowdown happens because every operator[] access through a thrust::device_ptr triggers a separate single-element device-host transfer. While a substantial rework isn’t feasible at the moment, I’m seeking ways to restore acceptable performance.
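
The refactored host-side path effectively does something like this sketch (illustrative names):

```cpp
// thrust::device_ptr::operator[] returns a device_reference; every read or
// write of it issues its own single-element device<->host transfer, so a
// per-byte loop like this turns into millions of tiny copies per frame.
#include <thrust/device_ptr.h>

void slowHostPath(unsigned char* d_frame, size_t n) {
    thrust::device_ptr<unsigned char> frame(d_frame);
    for (size_t i = 0; i < n; ++i) {
        frame[i] = static_cast<unsigned char>(frame[i] / 2);  // ~2 transfers per byte
    }
}
```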

My questions:

  1. Was switching to cudaMalloc reasonable, or should I have used pinned memory or another approach? (As I understand it, pinned memory also triggers device-host copying.)
  2. One solution I’m considering is explicitly copying data to host memory before processing in host code and then uploading it back to the device afterward. Is this the best approach, or are there better alternatives?

Any insights or suggestions are appreciated!

UPD: I started following approach #2 and already got some FPS improvements. But the code requires more work, as there are still bits of host code here and there that try to access the (now cudaMalloc’ed) device memory through the subscript operator [].
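
For one buffer, approach #2 boils down to something like this sketch (illustrative names): one bulk download, all host-side work on a plain host buffer, then one bulk upload.

```cpp
#include <vector>
#include <cuda_runtime.h>

void stagedHostPath(unsigned char* d_frame, size_t n) {
    std::vector<unsigned char> h_frame(n);

    cudaMemcpy(h_frame.data(), d_frame, n, cudaMemcpyDeviceToHost);  // one copy down

    for (size_t i = 0; i < n; ++i) {
        h_frame[i] = static_cast<unsigned char>(h_frame[i] / 2);     // host-side work
    }

    cudaMemcpy(d_frame, h_frame.data(), n, cudaMemcpyHostToDevice);  // one copy up
}
```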

Hi,

1.
What kind of encoder do you use?
You can refer to this document to find one suitable for your use case.
But on Jetson, the camera pipeline usually chooses pinned memory.

2. This might not answer your question but Jetson has a hardware encoder.
Maybe you can check if the following sample can meet your requirements first.

MMAPI:

/usr/src/jetson_multimedia_api/samples/03_video_cuda_enc/

DeepstreamSDK:
https://docs.nvidia.com/metropolis/deepstream/dev-guide/

Thanks.

Thanks!

What kind of encoder do you use?

It’s a proprietary H.264-compliant encoder implemented in CUDA.

You can refer to this document to find one suitable for your use case.

I have reviewed this before. However, as someone who’s never dealt with CUDA/GPU programming, it left me with a specific question: if the memory is unified and allocated on the same physical chip, as on Xavier, why is pageable host memory not accessible from the GPU, and vice versa, why is device memory not accessible from the CPU? I’m not getting this… what is “unified” about this memory then? What exactly does cudaMallocManaged do to make its allocations seamlessly accessible from both (this succinct answer is starting to shed some light on it for me, though)?

I keep reading about this, though (BTW, I would appreciate any pointers to good CUDA programming resources, paid or free)…

This might not answer your question but Jetson has a hardware encoder.
Maybe you can check if the following sample can meet your requirements first.

Thanks, yes, I’m aware of that one, but our focus currently is on making the implementation we already have work on Xavier.

Hi,

Unified memory has two copies, one on the CPU and the other on the GPU.
Although the synchronization is done by the GPU driver, the mechanism can induce some overhead if synchronization is required frequently.

Pinned memory is CPU memory but cannot be swapped out.
Since it won’t be swapped out, the pointer is fixed to allow GPU access.
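
For example, a minimal sketch of a pinned, mapped allocation (illustrative names; on older CUDA versions cudaSetDeviceFlags(cudaDeviceMapHost) may be required before any other runtime call):

```cpp
// Pinned (page-locked) memory that is also mapped for device access; on
// Jetson the host and device pointers refer to the same physical DRAM, so
// no explicit cudaMemcpy is needed.
#include <cuda_runtime.h>

__global__ void touch(unsigned char* p, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1;
}

void pinnedExample(size_t n) {
    cudaSetDeviceFlags(cudaDeviceMapHost);           // allow mapped allocations

    unsigned char* h_buf = nullptr;
    cudaHostAlloc(&h_buf, n, cudaHostAllocMapped);   // pinned + mapped

    unsigned char* d_buf = nullptr;
    cudaHostGetDevicePointer(&d_buf, h_buf, 0);      // device alias of h_buf

    h_buf[0] = 1;                                    // host write, no copy
    touch<<<(n + 255) / 256, 256>>>(d_buf, n);
    cudaDeviceSynchronize();                         // make kernel writes visible
    unsigned char v = h_buf[0];                      // host read sees the update
    (void)v;

    cudaFreeHost(h_buf);
}
```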

Thanks.
