This document says that device memory is stored in global memory, but on the TX2 I found that the execution time of the same global function differs depending on which memory I use: with global memory it is 5-8 times slower than with device memory.
The following could explain the behaviour you’re seeing:
“…Pascal caches thread-local memory in the L1 cache.” and
“…GP104 follows Kepler and Maxwell in caching global loads in L2 only, unless using the LDG read-only data cache mechanism introduced in Kepler.”
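To make the second quote concrete, here is a minimal sketch of the two load paths it contrasts: an ordinary global load, and one routed through the read-only data cache via `__ldg()`. The kernel names, the driver `run_readonly_demo`, and the problem size are all made up for illustration; `__ldg()` needs compute capability 3.5 or later, and whether it actually helps on your TX2 workload is something you would have to measure.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Ordinary global load: per the quote above, cached in L2 only.
__global__ void scale_plain(const float* in, float* out, int n, float k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = k * in[i];
}

// Read-only path: const + __restrict__ allows the compiler to use LDG,
// and __ldg() requests it explicitly (compute capability >= 3.5).
__global__ void scale_ldg(const float* __restrict__ in,
                          float* __restrict__ out, int n, float k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = k * __ldg(&in[i]);
}

// Hypothetical driver: runs both kernels and checks that they agree.
bool run_readonly_demo() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    std::vector<float> h_in(n), h_plain(n), h_ldg(n);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i % 97);

    float *d_in = nullptr, *d_plain = nullptr, *d_ldg = nullptr;
    bool ok = cudaMalloc(&d_in, bytes) == cudaSuccess &&
              cudaMalloc(&d_plain, bytes) == cudaSuccess &&
              cudaMalloc(&d_ldg, bytes) == cudaSuccess;
    if (ok) {
        cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);
        const int threads = 256;
        const int blocks = (n + threads - 1) / threads;
        scale_plain<<<blocks, threads>>>(d_in, d_plain, n, 2.0f);
        scale_ldg<<<blocks, threads>>>(d_in, d_ldg, n, 2.0f);
        ok = cudaDeviceSynchronize() == cudaSuccess;
        cudaMemcpy(h_plain.data(), d_plain, bytes, cudaMemcpyDeviceToHost);
        cudaMemcpy(h_ldg.data(), d_ldg, bytes, cudaMemcpyDeviceToHost);
        for (int i = 0; ok && i < n; ++i)
            ok = h_plain[i] == h_ldg[i] && h_plain[i] == 2.0f * h_in[i];
    }
    // cudaFree accepts null pointers, so this is safe after a failed malloc.
    cudaFree(d_in);
    cudaFree(d_plain);
    cudaFree(d_ldg);
    return ok;
}
```

Call `run_readonly_demo()` from your own `main` and time the two kernels separately (for example with `cudaEvent` timers or a profiler) to see whether the read-only path changes anything for your access pattern.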
Quite possibly, but that depends very much on the individual case. Broadly, the caching behaviour above appears to have continued through to the current Ampere generation, so it’s quite unlikely that you’ll see performance degrade.
If you check out the “Tuning Guides” for those architectures, you can see further improvements in cache size, latency and “tweakables” as things have progressed. If the behaviour has changed materially enough, you can cater to it conditionally in your code.
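As a rough sketch of what catering to this conditionally could look like: device code can branch at compile time on `__CUDA_ARCH__`, and host code can query the device at run time. The helper names `load_read_only` and `print_cache_info` are hypothetical; the 350 threshold is simply the minimum compute capability for `__ldg()`, so substitute whatever architecture boundary matters for your case.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Compile-time choice (hypothetical helper): use the read-only data cache
// where __ldg() is available, and fall back to a plain load elsewhere.
__device__ __forceinline__ float load_read_only(const float* p) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 350
    return __ldg(p);
#else
    return *p;
#endif
}

// Run-time choice (hypothetical helper): inspect the device so the host
// can pick a kernel variant or launch configuration per architecture.
bool print_cache_info(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return false;
    std::printf("%s: compute capability %d.%d, L2 cache %d bytes\n",
                prop.name, prop.major, prop.minor, prop.l2CacheSize);
    return true;
}
```

The compile-time branch costs nothing at run time but is fixed per compiled architecture; the run-time query is the one to use when a single binary has to behave differently on different GPUs.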