Performance issues after refactoring CUDA code to avoid managed memory

In the app I’m building, I have to work with a legacy library that acts as a CUDA-based video encoder. Additionally, there are two more libraries using CUDA for drawing on a captured frame and running Torch inference.

Recently, I discovered that when the encoder library runs in parallel with the other two, I encounter sporadic segfaults. However, no crashes occur when processing is done sequentially (e.g., inference → paint → encode). As someone new to CUDA, I investigated and learned that the Jetson platform I’m using (Xavier) doesn’t support concurrent access to managed memory: the CPU must not touch a managed allocation while any kernel is running.

The libraries I wrote (Torch inference and painting) use cudaMalloc for device-side allocation and explicit cudaMemcpy for data transfer. However, the encoder library relies exclusively on cudaMallocManaged memory, with this memory being accessed in both device and host code via memset, memcpy, and subscript operators ([]).
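For context, that access pattern looks roughly like the sketch below: a single cudaMallocManaged pointer touched from both host and device code. The buffer name, size, and kernel are made up purely for illustration, not taken from the encoder library.

```cpp
#include <cuda_runtime.h>
#include <cstring>
#include <cstdio>

// Hypothetical device-side writer, standing in for the encoder's kernels.
__global__ void fillKernel(unsigned char* buf, size_t n, unsigned char v) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = v;
}

int main() {
    const size_t n = 1 << 20;
    unsigned char* frame = nullptr;

    // One allocation visible to both host and device.
    cudaMallocManaged(&frame, n);

    // Host-side access via memset / subscripting, as described above.
    std::memset(frame, 0, n);
    frame[0] = 0xFF;

    // Device-side access through the very same pointer.
    fillKernel<<<(n + 255) / 256, 256>>>(frame, n, 0x80);
    cudaDeviceSynchronize();   // must complete before the host touches `frame` again

    printf("%d\n", (int)frame[42]);
    cudaFree(frame);
    return 0;
}
```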

To address this, I refactored the encoder library to use cudaMalloc with explicit copying. Since host code still needed to access device memory, I used thrust::device_ptr wrappers.

This resolved the crashes but caused a severe performance drop: from 50–60 FPS to 3–5 FPS. I suspect this slowdown happens because every operator[] call using thrust::device_ptr triggers a device-host memory transfer. While a substantial rework isn’t feasible at the moment, I’m seeking ways to restore acceptable performance.
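To make the suspected bottleneck concrete, here is a minimal sketch of that pattern (hypothetical buffer and loop, not the actual encoder code): each host-side dereference of a thrust::device_ptr becomes its own tiny synchronous device-to-host copy.

```cpp
#include <cuda_runtime.h>
#include <thrust/device_ptr.h>
#include <cstdio>

int main() {
    const size_t n = 1 << 20;
    unsigned char* d_buf = nullptr;
    cudaMalloc(&d_buf, n);                       // device-only allocation (uninitialized)
    thrust::device_ptr<unsigned char> p(d_buf);  // host-side wrapper

    // Each dereference from host code issues a synchronous single-element
    // device-to-host copy, so this loop performs n round trips.
    unsigned long long sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += p[i];                             // 1-byte D2H copy per iteration

    printf("%llu\n", sum);
    cudaFree(d_buf);
    return 0;
}
```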

My questions:

  1. Was switching to cudaMalloc reasonable, or should I have used pinned memory or another approach? (As I understand it, pinned memory also triggers device-host copying.)
  2. One solution I’m considering is explicitly copying data to host memory before processing in host code and then uploading it back to the device afterward. Is this the best approach, or are there better alternatives?

Any insights or suggestions are appreciated!

Pinned memory will not automatically trigger device-host copying; any copying is done explicitly.

Your approach in 2 sounds good, as long as all of the memory is actually needed on both device and host for their respective steps (e.g., if out of 4 MB you only need 4 bytes, then a targeted cudaMemcpy of that region, or managed memory, would be more efficient than copying the whole 4 MB).
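A minimal sketch of both variants (hypothetical buffer size, offset, and a placeholder hostProcess function): a full download/process/upload round trip, and a targeted copy when only a small region is needed on the host.

```cpp
#include <cuda_runtime.h>
#include <vector>

// Placeholder for the host-side processing step (hypothetical).
static void hostProcess(unsigned char* data, size_t n) {
    for (size_t i = 0; i < n; ++i) data[i] ^= 0xFF;
}

int main() {
    const size_t n = 4 * 1024 * 1024;           // 4 MB device buffer
    unsigned char* d_buf = nullptr;
    cudaMalloc(&d_buf, n);

    // Variant A: the whole buffer is needed on the host.
    // One bulk D2H copy, host processing, one bulk H2D copy.
    std::vector<unsigned char> h_buf(n);
    cudaMemcpy(h_buf.data(), d_buf, n, cudaMemcpyDeviceToHost);
    hostProcess(h_buf.data(), n);
    cudaMemcpy(d_buf, h_buf.data(), n, cudaMemcpyHostToDevice);

    // Variant B: only a small region (e.g. 4 bytes at some offset) is needed,
    // so copy just that region instead of the whole 4 MB.
    unsigned char header[4];
    const size_t offset = 128;                  // hypothetical offset
    cudaMemcpy(header, d_buf + offset, sizeof(header), cudaMemcpyDeviceToHost);
    (void)header;

    cudaFree(d_buf);
    return 0;
}
```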

However: Does Nvidia Jetson Xavier not have unified memory between GPU and CPU?

Thanks for your inputs!
I’m currently following this path and am already seeing FPS improvements (using thrust::device_vector and the reduce/transform algorithms on them).
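A minimal sketch of that pattern (made-up data and transformation): the buffer stays in a thrust::device_vector, the element-wise work and the reduction run as kernels, and only the final scalar crosses to the host.

```cpp
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <cstdio>

int main() {
    // Data lives on the device for the whole pipeline.
    thrust::device_vector<float> d_vec(1 << 20, 1.0f);

    // Element-wise operation runs as a kernel; no per-element host access.
    thrust::transform(d_vec.begin(), d_vec.end(), d_vec.begin(),
                      thrust::negate<float>());

    // The reduction also stays on the device; only the final scalar comes back.
    float sum = thrust::reduce(d_vec.begin(), d_vec.end(), 0.0f,
                               thrust::plus<float>());
    printf("%f\n", sum);
    return 0;
}
```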

As far as I know, Jetson does have unified memory. However, does this mean, for example, that malloc’ed memory can be accessed inside a CUDA kernel, and, similarly, that cudaMalloc’ed memory can be accessed from host code? In my experience that is still not the case (it results in segfaults). Only memory allocated with cudaMallocManaged can be accessed seamlessly from both.
But I can’t use that, for the reasons explained in the original post (concurrent CUDA code).
Or am I missing something here?
Thanks

A profiler will help identify where the performance issues are. It may be the introduced explicit cudaMemcpy operations; you’d like to avoid those on Jetson. In addition to cudaMallocManaged, memory allocated with cudaHostAlloc is also usable on both host and device without a cudaMemcpy, and it is not subject to the managed-memory issue. However, the caching behavior is different, so there may still be performance issues. If your memory usage is fully separable between activities, you might investigate this: Segmentation Fault when using UMA and pthreads - #9 by Robert_Crovella
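A minimal sketch of the cudaHostAlloc route (hypothetical kernel and sizes): the same pinned allocation is visible to host code directly and to device code through cudaHostGetDevicePointer, so no cudaMemcpy is needed, though the caching caveat above still applies.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical kernel touching the mapped buffer from the device side.
__global__ void incrementKernel(unsigned char* buf, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1;
}

int main() {
    // Enable mapped pinned allocations (often implicit with UVA; set before
    // any other CUDA call to be safe).
    cudaSetDeviceFlags(cudaDeviceMapHost);

    const size_t n = 1 << 20;
    unsigned char* h_buf = nullptr;

    // Pinned host allocation, mapped into the device address space.
    cudaHostAlloc(&h_buf, n, cudaHostAllocMapped);

    unsigned char* d_buf = nullptr;
    cudaHostGetDevicePointer(&d_buf, h_buf, 0);   // device view of the same memory

    h_buf[0] = 41;                                // host write, no cudaMemcpy
    incrementKernel<<<(n + 255) / 256, 256>>>(d_buf, n);
    cudaDeviceSynchronize();                      // then read back on the host
    printf("%d\n", (int)h_buf[0]);                // prints 42

    cudaFreeHost(h_buf);
    return 0;
}
```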


Also, there are lots of smart people on the Jetson forums; you may get better ideas there.

I second the recommendation to seek assistance for Jetson platforms in the sub-forums dedicated to them. The handling of memory in particular differs between these integrated products and platforms with discrete GPUs:

(1) In integrated products, the GPU loses one of its major performance advantages, which is having access to a high-performance dedicated memory with bandwidth of 5x to 10x that of the system memory of a host system.

(2) In integrated products, the GPU loses the performance disadvantage of being supplied with data through a narrow communication pipe, that is, the PCIe interconnect.

These two factors can lead to different work distribution between CPU and GPU portions of an application and significantly impact handling of memory for the GPU portion. The people in the various Jetson sub-forums are likely to have a deeper understanding of relevant trade-offs than the participants of this sub-forum.
