Different types of memory transfer change kernel execution time

Dear all,
I am exploring the differences between the classic way of transferring data, pinned memory, unified memory, zero-copy memory, and UVA memory.

I observe that, apart from the data transfer time, the execution time of the kernels that use the data also changes depending on which type of transfer I used to send it.

I cannot imagine why, as I think that global memory is not cached.

Does anyone have any idea?
I am using a Tegra X1, where the CPU and GPU share a common memory.

Thank you in advance!

Why do you say global memory is not cached?

It is cached.
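One way to see what the hardware reports is to query the device properties. The following is only a minimal sketch, assuming device 0 is the GPU of interest; it prints the `l2CacheSize` and `globalL1CacheSupported` fields of `cudaDeviceProp`, and it says nothing about how a particular allocation type is cached on a particular board.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Assumption: device 0 is the GPU we care about.
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    printf("Device: %s (compute capability %d.%d)\n",
           prop.name, prop.major, prop.minor);
    // Size of the L2 cache that sits in front of global memory.
    printf("L2 cache size: %d bytes\n", prop.l2CacheSize);
    // Whether the device supports caching global loads in L1.
    printf("Global loads cacheable in L1: %s\n",
           prop.globalL1CacheSupported ? "yes" : "no");
    return 0;
}
```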

You may want to ask questions about TX1 on the appropriate forum:

https://devtalk.nvidia.com/default/board/139/jetson-embedded-systems/

Hello txbob and thank you for your answer.
Because a professor of mine told me that the caches in the GPU are for caching instructions, not data.
Could you please recommend a link that explains that data on the GPU is cached?

Thank you in advance!

Hello txbob again,
I searched and found that the GPU’s cache is indeed for data too. Thank you for the information. I would like to ask: is the only factor that makes the kernel execution time differ the greater or smaller number of cache hits that occur with each type of transfer?
And if so, is there any explanation of why each type of data transfer caches its data differently?

Thank you in advance!

I’m not sure what all the factors are. You haven’t shown any code. Certainly pinned/zero-copy memory on TX1 should be faster than a method using cudaMalloc, since the use of cudaMalloc implies an actual copy operation on the data, which is not required in the other cases, and that has nothing to do with the cache.
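Since no code was posted, here is a minimal sketch of the kind of comparison being discussed, not the original poster's program. It runs the same kernel on a `cudaMalloc` buffer filled with `cudaMemcpy`, on a mapped (zero-copy) `cudaHostAlloc` buffer, and on a `cudaMallocManaged` buffer, and times only the kernel with CUDA events. The kernel `scale`, the helper `timeKernel`, and the buffer size are hypothetical choices made for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: scales each element in place.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Runs one warm-up launch, then times a second launch with CUDA events.
// Returns the elapsed time of the timed launch in milliseconds.
static float timeKernel(float *devPtr, int n) {
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    scale<<<blocks, threads>>>(devPtr, n, 2.0f);  // warm-up, not timed
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    scale<<<blocks, threads>>>(devPtr, n, 2.0f);  // timed launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaDeviceSynchronize();  // make sure the host may touch the data again
    return ms;
}

int main() {
    const int n = 1 << 22;  // assumed problem size
    const size_t bytes = n * sizeof(float);

    // Allow mapped pinned allocations; set before any other CUDA call.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // 1) Classic: pageable host buffer, device buffer, explicit copy.
    float *hPageable = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) hPageable[i] = 1.0f;
    float *dBuf = nullptr;
    cudaMalloc((void **)&dBuf, bytes);
    cudaMemcpy(dBuf, hPageable, bytes, cudaMemcpyHostToDevice);
    float msDevice = timeKernel(dBuf, n);

    // 2) Zero-copy: pinned host buffer mapped into the device address space.
    float *hMapped = nullptr, *dMapped = nullptr;
    cudaHostAlloc((void **)&hMapped, bytes, cudaHostAllocMapped);
    for (int i = 0; i < n; ++i) hMapped[i] = 1.0f;
    cudaHostGetDevicePointer((void **)&dMapped, hMapped, 0);
    float msZeroCopy = timeKernel(dMapped, n);

    // 3) Unified memory: one pointer usable from both host and device.
    float *managed = nullptr;
    cudaMallocManaged((void **)&managed, bytes);
    for (int i = 0; i < n; ++i) managed[i] = 1.0f;
    float msManaged = timeKernel(managed, n);

    printf("kernel on cudaMalloc buffer:        %.3f ms\n", msDevice);
    printf("kernel on zero-copy (mapped) buffer: %.3f ms\n", msZeroCopy);
    printf("kernel on managed buffer:            %.3f ms\n", msManaged);

    cudaFree(dBuf);
    cudaFreeHost(hMapped);
    cudaFree(managed);
    free(hPageable);
    return 0;
}
```

Posting something along these lines, together with the measured numbers, would make it much easier for others to say which factors are involved. Note that the timings here exclude the `cudaMemcpy`, so they isolate the kernel time from the transfer time.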

I think in the context of TX1/TX2 this topic has been discussed extensively elsewhere. You might ask your question on the TX1 forum that I already pointed out, or study some of the questions there.

OK, thank you!