I see Unified Cache and Device Memory counters in the kernel memory analysis, but I do not understand the difference between a global load and a device memory read. Also, how can the global loads sometimes be greater than the device memory bandwidth?
I don’t understand what you are asking in the first sentence. Regarding the second sentence, yes, global loads can be greater than device memory bandwidth if some of the global loads are hitting in one of the caches.
There are several memory spaces: shared, local (stack), and global. The first counter measures reads from the global memory space. Both the global and local memory spaces are mapped to device memory and cached in the L1 and L2 caches. The second counter measures physical reads, whether they are served from the L1 cache, the L2 cache, or device memory.
See, for example, http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/#device-memory-spaces or any CUDA textbook.
Hi BulatZiganshin, thanks for your reply.
Based on your explanation, the Unified Cache counter (Nsight -> “Analysis” console -> “Kernel Memory” tab -> “Results” -> Unified Cache) measures reads/loads served from both device memory and the caches, is that right?
Thanks for your reply txbob.
Do you mean that global loads are counted whether they are served from device memory or from the caches?
The Unified Cache and Device Memory counters mentioned in my first sentence can be found here:
Nsight -> “Analysis” console -> “Kernel Memory” tab -> “Results” -> Unified Cache
Nsight -> “Analysis” console -> “Kernel Memory” tab -> “Results” -> Device Memory
I do not understand the difference between Global Loads and Device Memory.