Why is the latency of the LDL instruction so low?

2075787455 · January 21, 2025, 1:41am

Generally speaking, thread-private local memory and global memory are backed by the same device memory, and load latency is ordered global memory > cache > register. I wrote code to test the read and write latency of local memory, and the vast majority of instructions in the SASS loop are LDL instructions. The load-latency results for local memory should therefore be relatively accurate, yet they show that the latency of the LDL instruction is only three to four cycles. I'm really confused!
By the way, this is on a 4090.

njuffa · January 21, 2025, 2:48am

It is not at all clear what and how you are measuring. For answers that are not wholly speculative, you would want to show the actual code you are using.

If this is a microbenchmark simply based on STL feeding into a dependent LDL, consider the possibility of store-to-load forwarding. This allows the load data to be supplied directly from the store queue instead of taking a round trip through the L1 cache. This is a common processor optimization. Whether GPUs are known to implement it, I could not say. The stated latency is roughly in line with the STLF hypothesis.

2075787455 · January 21, 2025, 3:30am

Thank you for responding so promptly!
I am actually new to CUDA.
Here are my CUDA and SASS code.
The CUDA kernel, launched as kernel<<<1,1>>>:

__global__ void kernel (..., array result, volatile int seed) {
    volatile uint32_t array000[ARRAY_SIZE];
    ...
    volatile uint32_t array199[ARRAY_SIZE];

    uint32_t a000;
    ...
    uint32_t a199;

    #pragma unroll 1
    for (uint32_t i = 0; i < ITERS; i++) {

        a000 = array000[i % ARRAY_SIZE];
        ...
        a199 = array199[i % ARRAY_SIZE];

        if (seed > LARGE_NUM) {  // never taken
            array000[i % ARRAY_SIZE] = i;
            ...
            array199[i % ARRAY_SIZE] = i;
        }
    }
    result[0] += a000 + ... + a199;
}

SASS code:

.L0
IMAD
IADD
some instructions
LDL R0, [R13+0x8];
200 LDLs
some instructions
@P0 BRA `(.L0);

I believe the measurement error in my test is minimal because the density of the target instruction is extremely high.
The test results show that the latency of LDL is 3 to 4 cycles.
I look forward to your suggestions and ideas. Thank you.
By the way, this latency is lower than that of the L1 cache, so it shouldn't be due to caching, right?
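One thing that may be worth checking: the 200 loads in the loop above are independent of one another, so the hardware can keep many of them in flight at once. Total cycles divided by the number of loads may then reflect issue throughput rather than load-to-use latency. A minimal sketch of a dependent-chain (pointer-chasing) variant follows, in which each load's address comes from the previous load's result. The kernel name `chase`, the helper `expected_index`, and the values of `ARRAY_SIZE` and `ITERS` are all hypothetical and not taken from the code above. The timed loop also contains address arithmetic and loop overhead, so the figure it prints is only an upper bound on the latency of the load itself.

```cuda
#include <cstdio>
#include <cstdint>
#include <cuda_runtime.h>

constexpr uint32_t ARRAY_SIZE = 256;   // hypothetical size, not from the post
constexpr uint32_t ITERS      = 4096;  // hypothetical iteration count

// Each load's address depends on the previous load's result,
// so consecutive loads cannot overlap.
__global__ void chase(uint32_t start, uint32_t stride,
                      long long *cycles, uint32_t *sink)
{
    // Dynamically indexed per-thread array; assumed to be placed in
    // local memory (confirm by looking for LDL in the SASS).
    uint32_t chain[ARRAY_SIZE];
    for (uint32_t i = 0; i < ARRAY_SIZE; i++)
        chain[i] = (i + stride) % ARRAY_SIZE;  // runtime stride, not foldable at compile time

    uint32_t idx = start % ARRAY_SIZE;
    long long t0 = clock64();
    #pragma unroll 1
    for (uint32_t i = 0; i < ITERS; i++)
        idx = chain[idx];                      // dependent load
    long long t1 = clock64();

    *cycles = t1 - t0;
    *sink   = idx;                             // keeps the chain from being optimized away
}

// Host-side reference for the final index, used only as a sanity check.
uint32_t expected_index(uint32_t start, uint32_t stride, uint32_t iters)
{
    uint32_t idx = start % ARRAY_SIZE;
    for (uint32_t i = 0; i < iters; i++)
        idx = (idx + stride) % ARRAY_SIZE;
    return idx;
}

int main()
{
    long long *d_cycles = nullptr;
    uint32_t  *d_sink   = nullptr;
    cudaMalloc(&d_cycles, sizeof(long long));
    cudaMalloc(&d_sink, sizeof(uint32_t));

    chase<<<1, 1>>>(0, 1, d_cycles, d_sink);

    long long cycles = 0;
    uint32_t  sink   = 0;
    cudaError_t err = cudaMemcpy(&cycles, d_cycles, sizeof(cycles), cudaMemcpyDeviceToHost);
    if (err == cudaSuccess)
        err = cudaMemcpy(&sink, d_sink, sizeof(sink), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("cycles per dependent iteration: %.2f\n", (double)cycles / ITERS);
    printf("final index %u (expected %u)\n", sink, expected_index(0, 1, ITERS));

    cudaFree(d_cycles);
    cudaFree(d_sink);
    return sink == expected_index(0, 1, ITERS) ? 0 : 1;
}
```

Dumping the SASS with `cuobjdump -sass` on the compiled binary would show whether the timed loop really contains a single LDL per iteration and whether the compiler kept `chain` in local memory. If the per-iteration figure from this variant is much larger than 3 to 4 cycles, that would suggest the original number was a throughput measurement.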

system · February 4, 2025, 3:31am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
