Same SOL for memory and SM Throughput

quim.aguado · October 8, 2024, 10:08am

Hello,

I have a kernel that uses memory intensively in a “gather” pattern: each thread performs one or more reads from global memory, but the addresses are not necessarily coalesced across threads. Usually the addresses read by the threads of a warp have some locality (i.e. they are not very sparse).

When profiling with ncu, I get exactly the same SOL for compute and memory.

In the throughput breakdown, I see that memory is limited by “L1: Lsuin Requests” and compute is limited by “SM: Inst Executed Pipe Lsu”.

My understanding is that when an SM executes a global memory instruction for a warp, a single request containing the information for all participating threads of the warp is sent to the L1, which then processes it in multiple pipelined stages.
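To make that model concrete, here is a minimal Python sketch (not profiler output) that counts how many distinct sectors one warp-level load would touch. The 32-byte sector size, the 4-byte element size, a sector-aligned base address, and the example index patterns are all assumptions for illustration.

```python
WARP_SIZE = 32     # threads per warp
SECTOR_BYTES = 32  # assumed sector granularity
ELEM_BYTES = 4     # assumed element size (e.g. float or int32)


def sectors_touched(indices, elem_bytes=ELEM_BYTES, sector_bytes=SECTOR_BYTES):
    """Count distinct sectors touched by one warp-level load.

    `indices` holds the element index read by each thread; the array base
    is assumed to be aligned to a sector boundary.
    """
    return len({(i * elem_bytes) // sector_bytes for i in indices})


# Thread t reads element t: fully coalesced.
coalesced = list(range(WARP_SIZE))
# Thread t reads element 8*t: every thread lands in its own sector.
strided = [t * 8 for t in range(WARP_SIZE)]
# Hypothetical gather with locality: all indices fall in a 16-element window.
local_gather = [3, 1, 7, 0, 12, 9, 15, 8] * 4

print(sectors_touched(coalesced))     # 4
print(sectors_touched(strided))       # 32
print(sectors_touched(local_gather))  # 2
```

If the model above holds, each of these three patterns is still one request from the SM to the L1; what changes is how much data the L1 has to move to serve it.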

I assume that the profiler is counting the throughput of these requests as both Compute and Memory. Is that assumption correct? If so, would this kernel be considered memory bound or compute bound?

To improve the performance of the kernel, should I try to reduce the number of requests sent to the memory subsystem?

Thank you

Greg · October 9, 2024, 7:06pm

On GV100+ with the unified L1, the following two metrics have the same rate:

Compute Throughput metric SM: Inst Executed Pipe Lsu (%)
Memory Throughput metric L1: Lsuin Requests (%)

In this case I would go to the next value down in the breakdown lists for Compute Throughput and Memory Throughput, which are:

SM: Issue Active = 52%
L1: Data Pipe Lsu Wavefronts = 58%

I would interpret this as latency bound.

The SM is only using ~50% of its issue cycles, so it is possible to issue more math or more loads.
The L1 memory system can accept more local/global/shared requests. Either shared memory is being used or there are hits in L1, since less than 1/3 of the L2 throughput is being used.
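The reasoning above can be sketched in a few lines of Python. This is only an illustration of the procedure, not something Nsight Compute exposes: the function names are made up, the 60% cut-off is an assumed threshold, and the value given to the two coupled metrics is a placeholder because the thread does not state it.

```python
# The two metrics that report the same rate on GV100+ (see above).
COUPLED = {"SM: Inst Executed Pipe Lsu", "L1: Lsuin Requests"}


def next_limiter(breakdown, ignore=COUPLED):
    """Return (name, percent) of the highest entry whose name is not in `ignore`."""
    candidates = [(name, pct) for name, pct in breakdown.items() if name not in ignore]
    return max(candidates, key=lambda item: item[1])


def classify(compute, memory, threshold=60.0):
    """Classify a kernel from its two breakdowns, skipping the coupled metrics.

    `threshold` is an assumed cut-off: if both remaining limiters are below
    it, neither side is close to saturated and latency is the likely limiter.
    """
    _, c = next_limiter(compute)
    _, m = next_limiter(memory)
    if c < threshold and m < threshold:
        return "latency bound"
    return "compute bound" if c >= m else "memory bound"


TOP = 70.0  # placeholder only: the thread does not give this number
compute = {"SM: Inst Executed Pipe Lsu": TOP, "SM: Issue Active": 52.0}
memory = {"L1: Lsuin Requests": TOP, "L1: Data Pipe Lsu Wavefronts": 58.0}

print(next_limiter(compute))      # ('SM: Issue Active', 52.0)
print(classify(compute, memory))  # latency bound
```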

I would recommend looking at the Source View page for the lines with the most stalls; I suspect the dominant stall reason is long scoreboard.
