Hi. I am using the Nsight analysis tool to profile my program, but the memory numbers it reports are not making sense to me. I ran the "Profile" activity with "All" experiments enabled, for all of my kernels.
Under Memory Statistics for a kernel, there are the tabs Overview, Global, Local, Atomics, Shared, Caches, and Buffers.
In the Overview tab you are supposed to see how the kernel interacts with the different memory levels and types: global, local, atomic, REDs(?), shared, and texture, as well as the L1 and L2 caches. For my kernels, the reported numbers only show activity for Global and Shared, and nothing for L1 and L2, which makes no sense to me. In addition, the numbers are reported in requests, or requests/s. The other tabs tell the same story in more detail, with all fields being 0. How do I make sense of this? I am sure any kernel makes use of L1 and L2.
A table also shows how many requests each warp makes, but what exactly is meant by a request? (A request can apparently be either a load or a store.) If I think of a request as a full L1-cache-line-sized transaction (I left the compiler's L1 caching policy at its default), then the request count should be multiplied by 128 bits to give a tangible number, e.g. 1 MRequests/s × 128 bits/request = 16 MBytes/s. However, even if a request can be and is done 128 bits at a time, it might be issued twice for misaligned addresses, or some of its data might be unneeded (or both), so the effective memory usage is much less than the full request size (again assuming a request means a whole 128-bit cache-line read). Is this true for the numbers reported in the Nsight profiler?
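To make the arithmetic behind my question concrete, here is a small sketch of the conversion I am assuming. Note that the 128-bit request size and the replay/waste factors are my assumptions about what "request" means, not anything the profiler documents:

```python
# Sketch of my assumed conversion from the profiler's "requests/s" figure
# to a byte rate. ASSUMPTION: one request == one full 128-bit transaction.
REQUEST_SIZE_BITS = 128
BITS_PER_BYTE = 8

def raw_bytes_per_sec(requests_per_sec):
    """Raw traffic implied by a request rate, under the assumption above."""
    return requests_per_sec * REQUEST_SIZE_BITS / BITS_PER_BYTE

def effective_bytes_per_sec(requests_per_sec, replay_factor=1.0, useful_fraction=1.0):
    """Effective (useful) traffic: a misaligned access may be issued twice
    (replay_factor = 2), and only part of each line may be needed
    (useful_fraction < 1), so the effective rate is lower than the raw rate."""
    return raw_bytes_per_sec(requests_per_sec) * useful_fraction / replay_factor

# Example: 1 MRequests/s of perfectly aligned, fully used requests
print(raw_bytes_per_sec(1e6))  # 16000000.0, i.e. 16 MBytes/s

# Same rate, but misaligned (issued twice) and only half the data needed:
print(effective_bytes_per_sec(1e6, replay_factor=2.0, useful_fraction=0.5))  # 4000000.0
```

My question is essentially whether the profiler's numbers correspond to the raw figure or something closer to the effective one.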
Thanks in advance for any help on the matter.