Is there any documentation explaining the "under the hood" differences between these three, specifically on the Tegra K1 SoC? I've obviously read their reference documentation, googled a lot, and read a lot of forum posts, but I still haven't found a clear explanation of where and how memory is allocated in each case, or of the performance pros and cons of each one.
My experience so far is that memory allocated with cudaMallocHost() is slower for both CPU and GPU operation (because it is uncached on the CPU side?), and I see no clear path forward for optimising my code in these two scenarios:
Few and relatively small CPU writes to a buffer (< 4 MB) that has to be read frequently by the GPU.
The GPU generating buffers (< 4 MB) that have to be entirely CPU-accessible for reading.
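For reference, here is a minimal sketch of the three allocation strategies applied to the first scenario (CPU writes a small buffer that the GPU reads often). The kernel, buffer size and launch configuration are placeholders of my own, not taken from any real code:

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>
#include <string.h>

// Placeholder kernel standing in for whatever actually reads the buffer.
__global__ void consume(const unsigned char *buf, size_t n)
{
    (void)buf; (void)n;  // a real kernel would read buf here
}

int main(void)
{
    const size_t n = 4 * 1024 * 1024;

    // (a) Pageable host memory + explicit copy: CPU writes are cached and
    //     fast, but every update costs a host->device cudaMemcpy.
    unsigned char *h = (unsigned char *)malloc(n), *d;
    cudaMalloc(&d, n);
    memset(h, 0, n);                                 // CPU writes (cached)
    cudaMemcpy(d, h, n, cudaMemcpyHostToDevice);
    consume<<<256, 256>>>(d, n);

    // (b) Pinned + mapped ("zero-copy") memory: no copy needed, but on
    //     Tegra the allocation is uncached for the CPU, which is why
    //     CPU-side accesses feel so slow.
    unsigned char *hm, *dm;
    cudaHostAlloc(&hm, n, cudaHostAllocMapped);
    cudaHostGetDevicePointer(&dm, hm, 0);
    memset(hm, 0, n);                                // CPU writes (uncached)
    consume<<<256, 256>>>(dm, n);

    // (c) Managed memory: one pointer for both sides; the driver migrates
    //     and synchronises the pages. The CPU must not touch the buffer
    //     while a kernel is in flight, hence the synchronise.
    unsigned char *m;
    cudaMallocManaged(&m, n);
    memset(m, 0, n);                                 // CPU writes
    consume<<<256, 256>>>(m, n);
    cudaDeviceSynchronize();

    free(h);
    cudaFree(d);
    cudaFreeHost(hm);
    cudaFree(m);
    return 0;
}
```

What I'd like to understand is which of (a), (b) or (c) the TK1's shared-DRAM architecture actually rewards here, since the discrete-GPU intuition behind each option doesn't obviously apply.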
Yes, I’ve checked the CUDA programming guide (I said exactly that in my first post).
In my initial post I also asked about the specific performance implications for the TK1, so no, I will not move this question to the CUDA Programming and Performance forum; I don't want them to send me back here with the same excuse.
To elaborate further, allow me to say that the multimedia ecosystem for the TK1 is a complete mess in terms of efficiency. Video display and encoding are entirely driven by GStreamer, but there is no way to send device memory to these elements from an application via the appsrc element or any other (this has been previously confirmed in this very same forum).
As a result, all we are left with is host-mapped or "pinned" memory, which is atrociously slow when read by the CPU and therefore unsuitable for encoding or even presentation, and device-to-host memory transfers, which yield fast-to-read memory for the CPU but make the transfers themselves really slow.
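To make the trade-off concrete, here is a sketch of those two options for the second scenario (GPU produces a buffer the CPU must read, e.g. to hand to an encoder). Again, the kernel and sizes are illustrative assumptions, not code from my application:

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

// Placeholder producer kernel.
__global__ void produce(unsigned char *buf, size_t n)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = (unsigned char)i;
}

int main(void)
{
    const size_t n = 4 * 1024 * 1024;
    const unsigned blocks = (unsigned)((n + 255) / 256);

    // Option A: mapped/pinned memory. The GPU writes straight into host
    // memory, but the CPU then reads it uncached: every access goes to
    // DRAM, which is the "atrociously slow" case.
    unsigned char *hm, *dm;
    cudaHostAlloc(&hm, n, cudaHostAllocMapped);
    cudaHostGetDevicePointer(&dm, hm, 0);
    produce<<<blocks, 256>>>(dm, n);
    cudaDeviceSynchronize();
    // ... CPU reads hm here, slowly ...

    // Option B: device buffer + explicit copy. CPU reads from h are cached
    // and fast afterwards, but the cudaMemcpy is paid on every frame.
    unsigned char *d, *h = (unsigned char *)malloc(n);
    cudaMalloc(&d, n);
    produce<<<blocks, 256>>>(d, n);
    cudaMemcpy(h, d, n, cudaMemcpyDeviceToHost);  // synchronises implicitly
    // ... CPU reads h here at full cached speed ...

    cudaFreeHost(hm);
    cudaFree(d);
    free(h);
    return 0;
}
```

On a SoC where CPU and GPU share the same physical DRAM, paying a full device-to-host copy in option B feels absurd, which is exactly the frustration behind this question.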