Is there any documentation explaining the "under the hood" differences between these three, specifically on the Tegra K1 SoC? I've obviously read their reference documentation, googled a lot, and read a lot of forum posts, but I still haven't found a clear explanation of where and how memory is allocated in each case, or of the performance pros and cons of each one.
My experience so far is that memory allocated with cudaMallocHost() is slower for both CPU and GPU operation (because it is uncached on the CPU side?), and I see no clear path forward for optimising my code in these two scenarios:
Few and relatively small CPU writes to a buffer (< 4 MB) that has to be read frequently by the GPU.
The GPU generating buffers (< 4 MB) that have to be entirely CPU-accessible for reading.
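For reference, here is a minimal sketch of the three allocation strategies applied to the first scenario (CPU writes a small buffer that the GPU reads often). The kernel, buffer size and launch configuration are placeholders of my own, not taken from any real code:

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>
#include <string.h>

// Placeholder kernel standing in for whatever actually reads the buffer.
__global__ void consume(const unsigned char *buf, size_t n)
{
    (void)buf; (void)n;  // a real kernel would read buf here
}

int main(void)
{
    const size_t n = 4 * 1024 * 1024;

    // (a) Pageable host memory + explicit copy: CPU writes are cached and
    //     fast, but every update costs a host->device cudaMemcpy.
    unsigned char *h = (unsigned char *)malloc(n), *d;
    cudaMalloc(&d, n);
    memset(h, 0, n);                                 // CPU writes (cached)
    cudaMemcpy(d, h, n, cudaMemcpyHostToDevice);
    consume<<<256, 256>>>(d, n);

    // (b) Pinned + mapped ("zero-copy") memory: no copy needed, but on
    //     Tegra the allocation is uncached for the CPU, which is why
    //     CPU-side accesses feel so slow.
    unsigned char *hm, *dm;
    cudaHostAlloc(&hm, n, cudaHostAllocMapped);
    cudaHostGetDevicePointer(&dm, hm, 0);
    memset(hm, 0, n);                                // CPU writes (uncached)
    consume<<<256, 256>>>(dm, n);

    // (c) Managed memory: one pointer for both sides; the driver migrates
    //     and synchronises the pages. The CPU must not touch the buffer
    //     while a kernel is in flight, hence the synchronise.
    unsigned char *m;
    cudaMallocManaged(&m, n);
    memset(m, 0, n);                                 // CPU writes
    consume<<<256, 256>>>(m, n);
    cudaDeviceSynchronize();

    free(h);
    cudaFree(d);
    cudaFreeHost(hm);
    cudaFree(m);
    return 0;
}
```

What I'd like to understand is which of (a), (b) or (c) the TK1's shared-DRAM architecture actually rewards here, since the discrete-GPU intuition behind each option doesn't obviously apply.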
Yes, I’ve checked the CUDA programming guide (I said exactly that in my first post).
In my initial post I also asked about the specific performance implications for the TK1, so no, I will not move this question to the CUDA Programming and Performance forum; I don't want them to send me back here with the same excuse.
To elaborate further, allow me to say that the multimedia ecosystem for the TK1 is a complete mess in terms of efficiency. Video display and encoding are entirely driven by GStreamer, but there is no way to send device memory to these elements from an application via the appsrc element or any other (this has been previously confirmed in this very same forum).
As a result, all we are left with is host-mapped or "pinned" memory, which is atrociously slow when read by the CPU and therefore unsuitable for encoding or even presentation, and device-to-host memory transfers, which yield fast-to-read memory for the CPU but make the transfers themselves really slow.
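To make the trade-off concrete, here is a sketch of those two options for the second scenario (GPU produces a buffer the CPU must read, e.g. to hand to an encoder). Again, the kernel and sizes are illustrative assumptions, not code from my application:

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

// Placeholder producer kernel.
__global__ void produce(unsigned char *buf, size_t n)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = (unsigned char)i;
}

int main(void)
{
    const size_t n = 4 * 1024 * 1024;
    const unsigned blocks = (unsigned)((n + 255) / 256);

    // Option A: mapped/pinned memory. The GPU writes straight into host
    // memory, but the CPU then reads it uncached: every access goes to
    // DRAM, which is the "atrociously slow" case.
    unsigned char *hm, *dm;
    cudaHostAlloc(&hm, n, cudaHostAllocMapped);
    cudaHostGetDevicePointer(&dm, hm, 0);
    produce<<<blocks, 256>>>(dm, n);
    cudaDeviceSynchronize();
    // ... CPU reads hm here, slowly ...

    // Option B: device buffer + explicit copy. CPU reads from h are cached
    // and fast afterwards, but the cudaMemcpy is paid on every frame.
    unsigned char *d, *h = (unsigned char *)malloc(n);
    cudaMalloc(&d, n);
    produce<<<blocks, 256>>>(d, n);
    cudaMemcpy(h, d, n, cudaMemcpyDeviceToHost);  // synchronises implicitly
    // ... CPU reads h here at full cached speed ...

    cudaFreeHost(hm);
    cudaFree(d);
    free(h);
    return 0;
}
```

On a SoC where CPU and GPU share the same physical DRAM, paying a full device-to-host copy in option B feels absurd, which is exactly the frustration behind this question.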