Improving gather kernel


I have a kernel where each thread has to read four contiguous samples based on a computed Euclidean distance.
These four samples are used to interpolate a sub-sample value at the position corresponding to the distance.
I.e. essentially a gather-operation (thread-varying non-aligned indices), where four sequential values are loaded.
And this is done in a loop 168 times with different distances.
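
For concreteness, here is a minimal sketch of that access pattern. The distance function and the Catmull-Rom weights are placeholders standing in for the real (unspecified) computation, not the actual kernel:

```cuda
#include <cuda_runtime.h>

// Dummy stand-in for the real (unspecified) distance computation.
__device__ float compute_distance(int i, int pixel)
{
    return 1.5f + 0.25f * i + 0.001f * pixel;  // placeholder values only
}

__global__ void gather_interp(const float* __restrict__ samples,
                              float* __restrict__ out)
{
    int pixel = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;
    for (int i = 0; i < 168; ++i) {
        float d   = compute_distance(i, pixel);  // thread-varying distance
        int   idx = (int)d;                      // non-aligned base index
        float t   = d - (float)idx;              // sub-sample fraction
        // four contiguous samples straddling the sub-sample position
        float s0 = samples[idx - 1], s1 = samples[idx];
        float s2 = samples[idx + 1], s3 = samples[idx + 2];
        // Catmull-Rom cubic as a stand-in for the actual interpolation
        acc += s1 + 0.5f * t * ((s2 - s0)
             + t * ((2.0f * s0 - 5.0f * s1 + 4.0f * s2 - s3)
             + t * (3.0f * (s1 - s2) + s3 - s0)));
    }
    out[pixel] = acc;
}
```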

I have kind of hit a wall on how to make it faster.
Each thread computes one pixel, and they are organized so they are spatially close, meaning there is a lot of overlap in the loaded samples.

A majority of the kernel's time is spent stalled: approximately 50% long scoreboard stall (where the samples are used), 25% LG throttle, 10% wait, and 10% MIO throttle (where the LDG instructions are issued);
and the kernel has very good cache hit rates: 96% in L2 and 80% in L1.

  1. Interestingly, using a wider load (128-bit LDG) gave 5% worse performance. The indices to load are not memory-aligned, but even ignoring this (snapping to the nearest aligned index, rounded down) gave much less LG throttle but more long scoreboard stall. Maybe just due to less overlap in the read data.
  2. Prefetching (with inline asm) just made it slower, I believe due to more LG throttle and little scoreboard-stall reduction. I guess the LD/ST units are already oversubscribed.
  3. I tried warp-cooperation in different ways, but the cost of communication/synchronization seemed too high. And block-level synchronization is out of the question due to the 168 iterations required.
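
One hypothetical shape such warp cooperation could take, assuming a warp-uniform base index (which thread-varying distances generally do not provide, hence the communication cost):

```cuda
// Sketch: lanes of a warp jointly load a contiguous run of samples with one
// coalesced access, then each lane assembles its four values via shuffles.
// Assumes `base` is warp-uniform; ignores wrap-around at the warp edge.
__device__ void warp_gather4(const float* __restrict__ samples,
                             int base, float out4[4])
{
    unsigned lane = threadIdx.x & 31u;
    float v = samples[base + lane];              // one coalesced load
    for (int j = 0; j < 4; ++j)                  // lane i needs i .. i+3
        out4[j] = __shfl_sync(0xffffffffu, v, (lane + j) & 31u);
}
```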

I am still new to this, and I am not sure how to “diagnose” the problem. I get 76-80% “%Peak to SM” as well as 82% LSU utilization (by far the most used pipeline; the next is FMA at only 40%).
I should note that there are some scattered data reads: it is not just four samples that are loaded each time, but four sets of four samples, loaded from essentially strided memory (this was done to reduce the arithmetic instruction count by roughly a factor of four, since all four sets share the same distance calculation).

The main suspects are:

  1. Bandwidth? A sequential load of four samples is not really coalesced when in theory it could be; maybe this results in low bandwidth? Or maybe this is not an issue, since the L1 cache works as a coalescing buffer (threads are probably requesting the second/third/fourth sample needed by their neighbour).
  2. Cache misses? With half precision, there should effectively be twice the cache space and bandwidth. But the hit rate is already so high; can I really expect this to make a great difference?
  3. Memory latency? The strided loads mean more cache misses, but I feel like the L2 cache saves us here. Without strided loading, the hit rate was around 99% (because the data almost completely fits in cache); now it's a bit lower, but still only 27 MB is loaded from main memory (and then 710 MB from L2 to L1). If this is the reason, one might interleave the arrays before processing (i.e., SoA to AoS) to eliminate over-fetching – then a wider LDG instruction could be used as well to reduce LG throttle, I think.
  4. LD/ST instruction throughput? Maybe the LD/ST units are overworked with requests? But if this were the case, I think it would show up as LG throttle, not long scoreboard; then again, it does say everywhere that I am using around 80% of these units. I guess wider LDG instructions would help improve this?

So the solutions I have in mind are half precision and interleaving the data (permuting Nx168x4 to 4xNx168).
Both operations could be done in a single pass over the data, maybe even fused with an earlier kernel.
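
A one-pass permute plus narrowing could be sketched like this (row-major layouts and illustrative names assumed; a shared-memory tiled transpose would coalesce the writes better, but this shows the indexing):

```cuda
#include <cuda_fp16.h>

// Hypothetical fused permute + float->half conversion:
// reads in[N][168][4] (row-major) and writes out[4][N][168].
__global__ void permute_to_half(const float* __restrict__ in,
                                __half* __restrict__ out, int N)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= N * 168 * 4) return;
    int c = tid % 4;               // innermost component
    int k = (tid / 4) % 168;       // iteration index
    int n = tid / (4 * 168);       // pixel index
    // coalesced read, strided write; converts on the fly
    out[(c * N + n) * 168 + k] = __float2half(in[tid]);
}
```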

Is there anything I am overlooking? I feel like I have tried everything, so many experiments not mentioned here.

The device used was an RTX 3090. Attaching an image of the profiling.

It’s not bandwidth. You’re not saturating any of the pipes in the memory chart. The profiler output is telling you that the LSU is likely a bottleneck. This isn’t surprising with scattered reads, because those drive up the transactions per request, which increases the load on the LSU. You’re likely working with a small dataset/footprint, which is where the 97% L2 hit rate is coming from. The 80% L1 is still good. This footprint seems like it can totally reside in L2, so your objective then is to saturate those L1 and L2 pipes. But without an improvement in coalescing, I’m not sure it’s possible.

I don’t expect smaller data sizes (half precision) to help at all. It doesn’t seem like it will help with the footprint, which already “fits” in the L2, and the granularity of the transaction is going to round that up. It should have no effect (I don’t think) on the transactions-per-request multiplier.

(1) With those kinds of pipe utilization, you are doing well considering that the loads are not using an optimal base+tid access pattern. It will be hard to squeeze more performance from this code. Unless I am misinterpreting something, your code seems to be memory-latency limited, and part of the reason for that seems to be that the problem size is fairly small for a GPU this large. Ideally you would want to have tens of thousands of threads going on a 3090.

(2) Make sure all pointer arguments use the __restrict__ attribute to maximize the compiler's freedom to move loads, which increases resilience with respect to load latency.
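
A sketch of what that looks like in a kernel signature (names are illustrative):

```cuda
// const + __restrict__ promises no aliasing between the pointers, so the
// compiler is free to hoist the LDGs well before the values are consumed
// (and may route them through the read-only data cache).
__global__ void interp_kernel(const float* __restrict__ samples,
                              const float* __restrict__ distances,
                              float*       __restrict__ out)
{
    // ... loads from `samples` can now be scheduled early by the compiler
}
```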

(3) Generally speaking, from a performance perspective, SoA arrangements are preferable to AoS arrangements, on both GPUs and CPUs. Re-arranging AoS to SoA can sometimes be a royal pain in the overall program context: there is a reason structures were invented where before there were only arrays. If you can deal with the pain, it is worth a try.

(4) Reducing the storage requirement by using a narrower data type like half looks worth pursuing as an experiment, assuming the lower precision representation is adequate for the use case. Especially when utilization is high, there can be some interesting correlations in the interaction between bandwidth and latency, similar to how cars’ speed slows on a highway near maximum throughput (if memory serves, highways reach their maximum throughput at 25 mph to 35 mph). The example doesn’t perfectly translate 1:1 but similar effects do exist, due to finite length L/S queues, for example.

[Later:] I see @Robert_Crovella posted an answer while I was still typing up mine, and we don’t seem to be in perfect agreement in our respective analyses. I usually don’t do well when working with textual descriptions (I am the kind of person who prefers hands-on interaction with code to get to the bottom of things; I also like and encourage experimentation), so I will defer to his expertise here.

Thanks for the quick response, you guys are awesome.

@njuffa Yes! It is loaded as const __restrict__. I also tried other cache settings and using the non-constant/texture cache (as it is documented as having lower latency, but higher bandwidth). The data was originally int16 (values <= 8192), and it has to be processed with a convolution first, resulting in decimal values, so not a lot is lost with half precision compared to float, I think.
With the grid dimensions, 117x16x43x8 = 643968 threads are working. It says “Avg. Threads Executed” is 31 throughout most of the kernel; I hope this does not mean 1 thread is inactive all the time.

What did you mean with “the granularity of the transaction is going to round that up”?

Currently complex numbers are loaded with essentially 4x4 LDG.64 instructions. If I permute the data, it could be loaded with 2x4 LDG.128 instructions (each load gets two samples at a time, which is possible since memory is then aligned at 256-bit boundaries instead of 64-bit). Then with half precision it can be done with just 1x4 LDG.128.
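
In case it helps, the 128-bit-load idea can be expressed with an aligned 16-byte struct; my understanding is that this should compile to a single 128-bit LDG, though it is worth verifying in the SASS:

```cuda
#include <cuda_fp16.h>

// Four complex half samples = 4 x (re, im) x 2 bytes = 16 bytes.
struct alignas(16) HalfComplex4 {
    __half2 s[4];                  // each __half2 holds (re, im)
};

__device__ HalfComplex4 load4(const HalfComplex4* __restrict__ base, int idx)
{
    return base[idx];              // one aligned 16-byte (128-bit) load
}
```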

Like you say, it is not a memory-bandwidth problem, so it is probably either LSU instruction count (addressed by 1x4 LDG.128) or latency. With latency, I’m guessing it is from the L1 misses.

Latency I guess can be addressed with unrolling/pipelining/higher occupancy from the reduced register pressure of half-precision loads. But it is a bit difficult since loading only happens within a branch in my algorithm.
Maybe prefetch is an option when LSU is not being over-used. (i.e. use prefetch “for free” to hide the latency of the current in-flight load, and hopefully reduce the latency of the loads in the next iteration).
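
For reference, the inline-asm prefetch mentioned above can be wrapped like this; PTX's `prefetch.global.L2` is a non-binding hint that occupies no destination register:

```cuda
// Non-binding prefetch hint: asks the memory system to pull the line
// containing `p` toward the L2 cache, without waiting for any result.
__device__ __forceinline__ void prefetch_l2(const void* p)
{
    asm volatile("prefetch.global.L2 [%0];" :: "l"(p));
}
```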


In my experience software-initiated prefetch is largely useless. Even if you can puzzle out an optimal prefetch distance, it tends to be optimal only for a specific processor variant coupled to a specific memory subsystem, and may even be counter-productive on other hardware configurations. But I have seen reports here in the forums of people achieving non-trivial speed-ups (~10%) using prefetch when tuning for one particular platform.

The superior alternative to software-defined prefetch is hardware prefetch mechanisms that opportunistically exploit available bandwidth (so no prefetch when bandwidth is already maxed out). These are pretty much standard on CPUs these days and have been optimized over the past 20 years for various kinds of typical access patterns. Hopefully coming to GPUs real soon now.

Well, up to this point you didn’t clearly specify the data sizes involved. But I presumed that you had something like float data and were considering changing that to half data.

A miss in the L2, even for 1 byte, does not generate a transaction of 1 byte to main memory. There is a minimum transaction size, so requests smaller than the transaction size, up to requests of the transaction size, are all going to look the same on the device memory bus: you’re going to be requesting 32 bytes even if you only need 1 byte. So at first glance, changing a miss from a 4-byte miss to a 2-byte miss isn’t going to make any difference in the work associated with the L2.

Up till this point I didn’t know you were talking about complex numbers, so: 4 floats is 16 bytes, 4 complex floats is 32 bytes, 4 complex halfs is 8 bytes. A miss on any of those is going to trigger (at a minimum) 1 transaction to main memory. Naturally, alignment matters too, but a misaligned (with respect to memory segment structure) 32-byte load is going to result in 2 segment requests to main memory, just like a misaligned 16-byte or 8-byte load is going to result in 2 segment requests to main memory. So at first glance it’s not obvious to me that switching to half helps the general dynamics of the scattered-load case.

We can consider something similar at the L1, but I’m less sure of the minimum transaction-size characteristics, and I believe they have varied over time/device architectures. Once upon a time (Kepler), an L1 miss generated a minimum 128-byte transaction to the L2. Later (at least by Pascal), I’m pretty sure this got reduced to 32 bytes (minimum). And it might be lower now. I’m not sure a miss smaller than 32 bytes really matters, because the L2 cache line is still 32 bytes AFAIK. Therefore, if you intend to hit in L2, you still need all 32 bytes populated in the L2 cache line.

The usual advice in the old days was to disable L1 caching for scattered loads, particularly because of this characteristic: it reduces the miss penalty/overhead from 128 bytes to 32 bytes. I thought about suggesting it as a test/experiment, but the 80% hit rate in L1 caused me to keep my mouth shut on that idea. If you are requesting an aligned 32 bytes on a modern architecture, I think there is less benefit to disabling L1.
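
For anyone wanting to try that experiment per-load rather than via the global compile flag (`-Xptxas -dlcm=cg`), the `__ldcg()` intrinsic issues the load with the `.cg` (cache-global) hint:

```cuda
// __ldcg(): load that is cached in L2 only, not in L1 -- the per-load
// equivalent of the old "disable L1 for scattered loads" advice.
__device__ float load_bypass_l1(const float* p)
{
    return __ldcg(p);
}
```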


That would be nice.
You’re probably right; I could try to put loads in flight instead of prefetching. I just know I’ll have to fight NVCC, since things are done inside a branch - pretty much always warp-uniform, but not guaranteed.
This makes it very hard to interleave work for the next iteration, e.g. by unrolling the loop, as I just end up with even more possible execution paths :/
But basically the branch is an early continue: “if (!have_to_load_samples_and_do_interpolations) continue;”, and if one path was taken on iteration N, it will probably be the same path on iteration N+1.
Maybe I can set up a scheme with predication inside the “do work” path - i.e., assume work has to be done every iteration, then use predication to nullify work that was not needed and transition to the “do nothing” branch, which is then similarly unrolled/pipelined for instruction-level parallelism.

@Robert_Crovella My fault, I didn’t realize it had an impact. I think, due to the distance calculations, it is not going to be aligned w.r.t. cache lines unless by luck; but a warp might still load just one or a few segments, I think, since the LDG address differences/stride within a warp (set by the distance calculation) are variable and may be less than one.

From our discussions, the reason for long scoreboard stall is probably

  1. The initial L1/L2 cache misses costing time (a lower memory footprint, prefetching, or more parallelism is needed)
  2. Poor L1 throughput due to the memory access patterns. The way four samples are loaded in sequence means the path L1->SM is potentially used four times more than needed, and “%Peak to SM” is pretty high.

Number 1 is not as likely, since having 7 active warps and some ~25 instructions between an LDG and the corresponding stall means a latency of at least 7*25 = 175 cycles should be possible to hide, which should be enough to mostly hide an L1 miss (~190 cycles). Also, L2 misses are quite rare (but very costly, some ~450 un-hidden stall cycles as a ball-park estimate).

Number 2 is more likely, since “%Peak to L2” at 25% is low but “%Peak to SM” at 80% is quite high, and seems to correspond to “LSU pipeline utilization”, “Mem Busy %” and “Mem Bandwidth %”. Probably fewer, wider requests will help.

The only reason I suspect #1 is that warp-stall sampling is pretty much 90%-98% long scoreboard stall on the one instruction, suggesting all warps are “parked” there simultaneously, since they do not experience other stall reasons. The highway effect @njuffa mentioned could be a cause. Hypothesis: they are serviced by the L/S queue, but the many load requests get interleaved, meaning it is hard for any one thread to finish all its loads and continue its work – potentially creating congestion.
^ To test if it is the highway effect, I could try disabling the YIELD flag in the opcodes manually, which should eliminate any interleaving in the L/S queue. I.e., each thread greedily fires off all of its LDG requests at the top of the L/S queue and is then parked afterwards; but it is then also the first one to leave the LDG queue fully, giving it the opportunity to help hide latency for other threads. This would avoid the entanglement/scheduling failure that could be an underlying cause!

Thank you for the ideas. I’ll have to try and experiment. Especially the highway effect gives a new direction for me to investigate.