Different types of memory transfer change kernel execution time on Tegra X1

Dear all,
I am comparing the classic way of transferring data, pinned memory, unified memory, zero-copy memory, and UVA.

I observe that, apart from the data transfer time (which obviously changes), the execution time of the kernels that use the transferred data also changes with the type of transfer.

I cannot see why, as I am not sure whether global memory is cached.

Are the L1 and L2 caches on the GPU data caches?

If so, is the number of cache hits for each type of transfer the only factor that makes the kernel execution time differ?

Is there any explanation of why each type of data transfer caches its data differently?

I am using a Tegra X1, where the CPU and GPU share a common memory.

Thank you in advance!!
Any help will be very useful to me!

This is not an obvious topic, and there may be more recent info, but you may start with this.

I am not sure, but I think caching is not enabled for pinned/zero-copy memory. It can be a good scheme if you want the GPU to do simple operations (read input, compute only in registers without accessing memory, then write output) on buffers available from the CPU without a copy.
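
To illustrate the "read input, compute in registers, write output" pattern on a zero-copy buffer, here is a minimal sketch. The kernel name `scale`, the buffer size, and the scale factor are made up for illustration, and nothing here measures or asserts how the accesses are cached.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Report a CUDA error and leave main() if a runtime call fails.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error at %s:%d: %s\n", __FILE__,   \
                    __LINE__, cudaGetErrorString(err_));             \
            return 1;                                                \
        }                                                            \
    } while (0)

// Hypothetical kernel: each thread reads one element, works in registers,
// and writes one element. No value is read from memory twice.
__global__ void scale(const float* in, float* out, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * factor;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *h_in = nullptr, *h_out = nullptr;  // pinned, mapped host buffers
    float *d_in = nullptr, *d_out = nullptr;  // device views of the same buffers

    // Ask for mapped pinned allocations before any other CUDA work.
    CHECK(cudaSetDeviceFlags(cudaDeviceMapHost));
    CHECK(cudaHostAlloc((void**)&h_in, bytes, cudaHostAllocMapped));
    CHECK(cudaHostAlloc((void**)&h_out, bytes, cudaHostAllocMapped));
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

    // Get device pointers that alias the host buffers; no cudaMemcpy is used.
    CHECK(cudaHostGetDevicePointer((void**)&d_in, h_in, 0));
    CHECK(cudaHostGetDevicePointer((void**)&d_out, h_out, 0));

    scale<<<(n + 255) / 256, 256>>>(d_in, d_out, n, 2.0f);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());  // results are visible on the CPU after this

    printf("out[0] = %f\n", h_out[0]);
    CHECK(cudaFreeHost(h_in));
    CHECK(cudaFreeHost(h_out));
    return 0;
}
```

This should build with something like `nvcc zero_copy.cu -o zero_copy` (the file name is arbitrary).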

But if you intend to do more complex processing on the GPU that requires storing data and rereading it, then caching would probably be better, and you would use unified memory for that.
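
One way to check this on your own board is to time the same kernel on a unified (managed) buffer and on a zero-copy (mapped pinned) buffer. The sketch below is only an assumption about how such a test could look: the kernel `blur3` and the helper `timeKernel` are hypothetical names, the 3-point average is chosen only because neighbouring threads reread the same elements, and no particular result is claimed.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error at %s:%d: %s\n", __FILE__,   \
                    __LINE__, cudaGetErrorString(err_));             \
            exit(1);                                                 \
        }                                                            \
    } while (0)

// Hypothetical kernel with data reuse: every input element is read by up to
// three neighbouring threads.
__global__ void blur3(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float left  = (i > 0)     ? in[i - 1] : in[i];
    float right = (i < n - 1) ? in[i + 1] : in[i];
    out[i] = (left + in[i] + right) / 3.0f;
}

// Time one launch of blur3 with CUDA events, after one untimed warm-up launch.
static float timeKernel(const float* in, float* out, int n) {
    const int threads = 256, blocks = (n + threads - 1) / threads;
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    blur3<<<blocks, threads>>>(in, out, n);  // warm-up, not timed
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    CHECK(cudaEventRecord(start));
    blur3<<<blocks, threads>>>(in, out, n);
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaDeviceSynchronize());  // GPU is idle before the CPU touches data again
    return ms;
}

int main() {
    const int n = 1 << 22;
    const size_t bytes = n * sizeof(float);
    CHECK(cudaSetDeviceFlags(cudaDeviceMapHost));

    // Unified memory: one pointer used by both CPU and GPU.
    float *m_in = nullptr, *m_out = nullptr;
    CHECK(cudaMallocManaged(&m_in, bytes));
    CHECK(cudaMallocManaged(&m_out, bytes));
    for (int i = 0; i < n; ++i) m_in[i] = 1.0f;
    float msManaged = timeKernel(m_in, m_out, n);

    // Zero-copy: pinned host buffers mapped into the GPU address space.
    float *h_in = nullptr, *h_out = nullptr, *d_in = nullptr, *d_out = nullptr;
    CHECK(cudaHostAlloc((void**)&h_in, bytes, cudaHostAllocMapped));
    CHECK(cudaHostAlloc((void**)&h_out, bytes, cudaHostAllocMapped));
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;
    CHECK(cudaHostGetDevicePointer((void**)&d_in, h_in, 0));
    CHECK(cudaHostGetDevicePointer((void**)&d_out, h_out, 0));
    float msMapped = timeKernel(d_in, d_out, n);

    printf("unified memory  : %.3f ms\n", msManaged);
    printf("zero-copy memory: %.3f ms\n", msMapped);

    CHECK(cudaFree(m_in));
    CHECK(cudaFree(m_out));
    CHECK(cudaFreeHost(h_in));
    CHECK(cudaFreeHost(h_out));
    return 0;
}
```

The absolute numbers depend on the board, clocks, and CUDA version, so it is worth running the comparison several times before drawing conclusions.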

Someone with better knowledge may comment further.


There are two major memory types on Jetson: pinned memory and unified memory.

You can find the main difference in our CUDA documentation:
Unified Memory offers a “single-pointer-to-data” model that is conceptually similar to CUDA’s zero-copy memory. One key difference between the two is that with zero-copy allocations the physical location of memory is pinned in CPU system memory such that a program may have fast or slow access to it depending on where it is being accessed from. Unified Memory, on the other hand, decouples memory and execution spaces so that all data accesses are fast.


Thank you for your answers!!
They helped me a lot!

I would like to ask whether unified and pinned memory are available on Jetson because of the memory shared between the CPU and GPU, or because of its compute capability.

If it is because of the common memory, what other advantages does the common memory offer?

Thank you in advance!


Unified and pinned memory are available on both Jetson and x86 machines.
The difference in physical memory is handled by the CUDA driver, so users can implement their programs without managing it.

Because it has shared physical memory, Jetson does have some benefit when transferring data between the CPU and GPU.
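
If you want to see what a given device reports, the CUDA runtime exposes this through `cudaGetDeviceProperties`. The short sketch below assumes device 0 is the GPU of interest and prints the fields related to this discussion. It only reports what the driver says and makes no claim about performance.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int dev = 0;  // assumption: the GPU of interest is device 0
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, dev);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    printf("device            : %s\n", prop.name);
    printf("compute capability: %d.%d\n", prop.major, prop.minor);
    // 1 if the GPU is integrated with the host (shares physical memory).
    printf("integrated        : %d\n", prop.integrated);
    // 1 if pinned host memory can be mapped into the device address space.
    printf("canMapHostMemory  : %d\n", prop.canMapHostMemory);
    // 1 if host and device share a unified virtual address space (UVA).
    printf("unifiedAddressing : %d\n", prop.unifiedAddressing);
    // 1 if managed (unified) memory allocations are supported.
    printf("managedMemory     : %d\n", prop.managedMemory);
    return 0;
}
```

Running the same program on a Jetson and on an x86 machine with a discrete GPU shows which of these properties differ between the two.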