Why is __device__ memory faster than global memory allocated by cudaMalloc?


The documentation says that __device__ variables are stored in global memory, but I found that on a TX2, running the same __global__ function against either cudaMalloc'd global memory or __device__ memory gives different execution times: the cudaMalloc'd global memory is 5-8 times slower than the __device__ memory.
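For context, a minimal sketch of the kind of comparison described above — the kernel names, array size, and launch configuration here are hypothetical, since the original code isn't shown; the sketch just times the same summation loop against a statically declared __device__ array versus a cudaMalloc'd buffer:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 1024

__device__ float d_static[N];  // statically declared device memory

// Sum from the __device__ array
__global__ void sum_static(float *out) {
    float acc = 0.0f;
    for (int i = 0; i < N; ++i)
        acc += d_static[i];
    out[threadIdx.x] = acc;
}

// Sum from a cudaMalloc'd buffer passed by pointer
__global__ void sum_dynamic(const float *in, float *out) {
    float acc = 0.0f;
    for (int i = 0; i < N; ++i)
        acc += in[i];
    out[threadIdx.x] = acc;
}

int main() {
    float host[N] = {0};
    float *in, *out;
    cudaMalloc(&in, N * sizeof(float));
    cudaMalloc(&out, 256 * sizeof(float));
    cudaMemcpyToSymbol(d_static, host, sizeof(host));
    cudaMemcpy(in, host, N * sizeof(float), cudaMemcpyHostToDevice);

    // Warm-up launches so the timed runs don't include one-time overheads
    sum_static<<<1, 256>>>(out);
    sum_dynamic<<<1, 256>>>(in, out);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms;

    cudaEventRecord(start);
    sum_static<<<1, 256>>>(out);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("__device__ array:  %f ms\n", ms);

    cudaEventRecord(start);
    sum_dynamic<<<1, 256>>>(in, out);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("cudaMalloc buffer: %f ms\n", ms);

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```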

Why is there such a large difference?

A brief check tells me that TX2 is based on Pascal architecture.

Looking at: Pascal Tuning Guide :: CUDA Toolkit Documentation

the following could explain the behaviour you’re seeing:

“…Pascal caches thread-local memory in the L1 cache.” and

“…GP104 follows Kepler and Maxwell in caching global loads in L2 only, unless using the LDG read-only data cache mechanism introduced in Kepler.”
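To illustrate the LDG mechanism mentioned in that quote: a global load can be routed through the read-only data cache either with the __ldg() intrinsic or by marking the pointer parameters const __restrict__ so the compiler can do it for you. This is just a sketch; the kernel below is a made-up example, not the original poster's code:

```cuda
// Scale an array, loading input through the read-only data cache.
__global__ void scale(const float * __restrict__ in,
                      float * __restrict__ out,
                      float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __ldg(&in[i]) * factor;  // explicit read-only-cache load
}
```

With const __restrict__ on both pointers, recent compilers will often emit the read-only-cache load even without the explicit __ldg() call.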


Thank you very much.
Does this mean that different architectures may behave differently with the same code?

Quite possibly, but it very much depends on the specific case. It broadly looks like the caching behaviour above has carried through to the current Ampere generation, so it's quite unlikely that you'll see performance degrade.

If you check the "Tuning Guides" for those architectures, you can see further improvements in cache size, latency, and "tweakables" as things have progressed, and if behaviour changes materially enough, you can conditionally cater to it in your code.
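Catering to an architecture conditionally can be done at compile time with the __CUDA_ARCH__ macro (e.g. 620 for the TX2's sm_62 Pascal GPU). A hedged sketch, with a hypothetical helper name:

```cuda
// Pick a load path per architecture at compile time.
__device__ float load_elem(const float * __restrict__ p, int i) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 600
    return p[i];          // Pascal and newer: rely on the default caching behaviour
#else
    return __ldg(&p[i]);  // older parts (Kepler/Maxwell): request the read-only cache
#endif
}
```

The #if branch is evaluated per compilation target, so building with multiple -gencode options produces the appropriate path for each architecture.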
