Hi. I am using the Nsight analysis tool to profile my program, but the memory numbers it reports are not making sense to me. I ran the "Profile" activity with "All" experiments enabled, for all of my kernels.
Under Memory Statistics for a kernel, there are the tabs Overview, Global, Local, Atomics, Shared, Caches, and Buffers.
In the Overview tab you are supposed to see how the kernel interacts with the different memory levels and types: global, local, atomic, REDs(?), shared, and texture, as well as the L1 and L2 caches. For my kernels, the reported numbers only show activity for Global and Shared, and nothing for L1 and L2, which makes no sense to me. In addition, the numbers are reported in requests, or requests/s. The other tabs tell the same story in more detail, with all fields being 0. How do I make sense of this? I am sure any kernel makes use of L1 and L2.
A table also shows how many requests each warp makes, but what exactly is meant by a request? (A request can apparently be either a load or a store.) If I think of a request as a full L1-cache-line-sized transaction (I left the compiler's L1 caching policy at its default), then the request count should be multiplied by 128 bits to give a tangible number, e.g. 1 MRequests/s × 128 bits/request = 16 MBytes/s. However, even if a request can be and is done 128 bits at a time, it might be issued twice for misaligned addresses, or some of its data might be unneeded (or both), so the effective memory usage is much less than the full request size (again assuming a request means a whole 128-bit cache-line read). Is this true for the numbers reported in the Nsight profiler?
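To make the arithmetic behind my question concrete, here is a small sketch of the conversion I am assuming. Note that the 128-bit request size and the replay/waste factors are my assumptions about what "request" means, not anything the profiler documents:

```python
# Sketch of my assumed conversion from the profiler's "requests/s" figure
# to a byte rate. ASSUMPTION: one request == one full 128-bit transaction.
REQUEST_SIZE_BITS = 128
BITS_PER_BYTE = 8

def raw_bytes_per_sec(requests_per_sec):
    """Raw traffic implied by a request rate, under the assumption above."""
    return requests_per_sec * REQUEST_SIZE_BITS / BITS_PER_BYTE

def effective_bytes_per_sec(requests_per_sec, replay_factor=1.0, useful_fraction=1.0):
    """Effective (useful) traffic: a misaligned access may be issued twice
    (replay_factor = 2), and only part of each line may be needed
    (useful_fraction < 1), so the effective rate is lower than the raw rate."""
    return raw_bytes_per_sec(requests_per_sec) * useful_fraction / replay_factor

# Example: 1 MRequests/s of perfectly aligned, fully used requests
print(raw_bytes_per_sec(1e6))  # 16000000.0, i.e. 16 MBytes/s

# Same rate, but misaligned (issued twice) and only half the data needed:
print(effective_bytes_per_sec(1e6, replay_factor=2.0, useful_fraction=0.5))  # 4000000.0
```

My question is essentially whether the profiler's numbers correspond to the raw figure or something closer to the effective one.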
Thanks in advance for any help on the matter.