Why the latency of LDL instruction so low

Generally speaking, the local memory exclusive to threads and global memory are actually in the same area, with global memory load latency > cache > register. I wrote code to test the read and write latency of local memory, and the vast majority of instructions in the SASS code loop are LDL instructions. The test results for load latency of local memory should be relatively accurate, but the results show that the latency of the LDL instruction is only three to four cycles. I’m really confusing!!!
by the way ,on 4090

It is not at all clear what and how you are measuring. For answers that are not wholly speculative, you would want to show the actual code you are using.

If this is a microbenchmark simply based on STL feeding into a dependent LDL, consider the possibility of store-to-load forwarding. This allows the load data to be supplied directly from the store queue instead of taking a round trip through the L1 cache. This is a common processor optimization. Whether GPUs are known to implement it, I could not say. The stated latency is roughly in line with the STLF hypothesis.

1 Like

Thank you for responding so promptly!
i am actually a fresher in CUDA hhh
i will show my cuda and sass code
cuda kernel<<<1,1>>>

__global__ void kernel (..., array result, volatile int seed) {
volatile uint32_t array000[ARRAY_SIZE];
...
volatile uint32_t array199[ARRAY_SIZE];

uint32_t a000;
...
uint32_t a199;

#pragma unroll 1
for (uint32_t i = 0; i < ITERS; i++) {

a000 = array000[i % ARRAY_SIZE];
...
a199 = array199[i % ARRAY_SIZE];

if (seed > LARGE_NUM) {  //never run
array000[i % ARRAY_SIZE] = i;
...
array199[i % ARRAY_SIZE] = i;
}
}
result[0] += a000 + ... + a199;
}

sass code

.L0
IMAD
IADD
some instructions
LDL R0, [R13+0x8];
200 LDLs
some instructions
@P0 BRA '(.L0);

I believe that the error in my code testing will be minimal because the density of the target instructions is extremely high.
The test results show that the latency of LDL is 3 to 4 cycles.
look forward to your suggestions and ideas. Thank you.
by the way, the latency is lower than L1cache’s. So it shouldn’t be due to the cache, right?

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.