I was advised to repost my question in this subforum; please see it below:
In the app I’m building, I have to work with a legacy library that acts as a CUDA-based video encoder. Additionally, there are two more libraries using CUDA for drawing on a captured frame and running Torch inference.
Recently, I discovered that when the encoder library runs in parallel with the other two, I encounter sporadic SEGFAULT crashes. However, no crashes occur when processing is done sequentially (e.g., inference → paint → encode). As someone new to CUDA, I investigated and learned that the Jetson platform I’m using (Xavier) doesn’t support concurrent access to managed memory, i.e., the host must not touch it while kernels are running.
The libraries I wrote (Torch inference and painting) use `cudaMalloc` for device-side allocation and explicit `cudaMemcpy` for data transfer. However, the encoder library relies exclusively on `cudaMallocManaged` memory, which is accessed in both device and host code via `memset`, `memcpy`, and the subscript operator (`[]`).
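To make the difference concrete, here is roughly what the two patterns look like (simplified; the buffer names are just placeholders, not the real code):

```cpp
#include <cuda_runtime.h>
#include <cstring>
#include <cstdint>

void explicitPattern(const float* h_buf, float* h_out, size_t n) {
    // Pattern in my libraries: device-only allocation + explicit copies.
    float* d_buf = nullptr;
    cudaMalloc(&d_buf, n * sizeof(float));
    cudaMemcpy(d_buf, h_buf, n * sizeof(float), cudaMemcpyHostToDevice);
    // ... launch kernels that read/write d_buf ...
    cudaMemcpy(h_out, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_buf);
}

void managedPattern(size_t frameSize) {
    // Pattern in the encoder library: managed memory touched from both sides.
    uint8_t* frame = nullptr;
    cudaMallocManaged(&frame, frameSize);
    memset(frame, 0, frameSize);  // host access
    // ... kernels also read/write frame; on Xavier, host access while a
    // kernel is running is not supported, which matches my SEGFAULTs ...
    frame[0] = 0xFF;              // host access via subscript
    cudaFree(frame);
}
```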
To address this, I refactored the encoder library to use `cudaMalloc` with explicit copying. Since host code still needed to access device memory, I wrapped the raw pointers in `thrust::device_ptr`. This resolved the crashes but caused a severe performance drop: from 50–60 FPS to 3–5 FPS. I suspect the slowdown happens because every `operator[]` call on a `thrust::device_ptr` triggers a separate device-to-host transfer. While a substantial rework isn’t feasible at the moment, I’m looking for ways to restore acceptable performance.
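If I understand Thrust correctly, the problem is that each subscript dereference synchronizes and copies a single element, so a host-side loop over the frame degenerates into millions of tiny transfers. A minimal sketch of what my refactored code effectively does (function and variable names are mine, for illustration):

```cpp
#include <cuda_runtime.h>
#include <thrust/device_ptr.h>
#include <cstdint>
#include <cstddef>

// d_frame points to cudaMalloc'ed memory of frameSize bytes.
size_t sumFrameOnHost(uint8_t* d_frame, size_t frameSize) {
    thrust::device_ptr<uint8_t> p(d_frame);
    size_t sum = 0;
    for (size_t i = 0; i < frameSize; ++i) {
        sum += p[i];  // each read issues its own 1-byte D2H copy + sync
    }
    return sum;
}
```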
My questions:

- Was switching to `cudaMalloc` reasonable, or should I have used pinned memory or another approach? (As I understand it, pinned memory also involves device-host copying.)
- One solution I’m considering is explicitly copying data to host memory before processing in host code and then uploading it back to the device afterward. Is this the best approach, or are there better alternatives?
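For reference, the second option I have in mind looks roughly like this (a sketch, assuming the host-side work can be grouped into one `processOnHost` step; that helper is a placeholder for the encoder’s `memset`/`memcpy`/`[]` logic):

```cpp
#include <cuda_runtime.h>
#include <cstdint>
#include <cstddef>
#include <vector>

void processOnHost(uint8_t* data, size_t size);  // placeholder for host-side logic

void stageProcessUpload(uint8_t* d_frame, size_t frameSize) {
    // Download the whole frame once, do all host-side work on the staging
    // buffer, then upload once: two bulk transfers instead of one tiny
    // transfer per element access.
    std::vector<uint8_t> h_frame(frameSize);
    cudaMemcpy(h_frame.data(), d_frame, frameSize, cudaMemcpyDeviceToHost);
    processOnHost(h_frame.data(), frameSize);
    cudaMemcpy(d_frame, h_frame.data(), frameSize, cudaMemcpyHostToDevice);
}
```

I assume a pinned staging buffer (`cudaHostAlloc`) would make the two bulk copies faster; and since Jetson CPU and GPU share the same DRAM, mapped pinned (zero-copy) memory might avoid the copies entirely, but I haven’t verified that.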
Any insights or suggestions are appreciated!
UPD: I started following approach #2 and already got some FPS improvements. But the code requires more work, as there are still bits of host code here and there that try to access the (now `cudaMalloc`’ed) device memory via the subscript operator `[]`.