TEX is top SOL, but 99% Tex hit rate?

Hi.

I’m not entirely sure how to work around this bottleneck. TEX is my top SOL, but I have a 99% TEX hit rate and a 74% L2 hit rate.
Each thread in a wave loads a different float4; occasionally different waves load the same float4, but that’s not too common. Each wave of 32 threads loads 32 float4s from contiguous memory, and switching to this memory layout has already improved things from what I had before. But this load is what shows up as the top hot spot, with LGTHR being the top stall. I’m using HLSL SM6, so it compiles to a dx.op.rawBufferLoad.f32.
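Roughly, the load looks like this (a simplified sketch; the buffer names and placeholder ALU work are made up, the real shader does a lot more):

```hlsl
// Simplified sketch of the access pattern; gInput/gOutput are illustrative names.
StructuredBuffer<float4>   gInput  : register(t0);
RWStructuredBuffer<float4> gOutput : register(u0);

[numthreads(32, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID)
{
    // Lane i of each 32-thread wave reads element i of its block, so one wave
    // touches 32 x 16 bytes = 512 contiguous bytes.
    float4 v = gInput[dtid.x];

    // ... real ALU work happens here ...
    gOutput[dtid.x] = v * 2.0f; // placeholder
}
```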
So am I just issuing too many loads? I’m probably missing something obvious.

Any suggestions would be greatly appreciated, thanks!

I’ve edited my original post as I was incorrectly describing what my shader was doing, but the problem remains unchanged.

Hello,
Thank you for using Nsight Graphics and thank you for your questions. I’ll discuss your post with the engineering team and get back to you, and I’ll also let you know if we need any additional info from you.
Regards,

Hi,

Thanks dwoods, I look forward to the response. In the meantime, I realise my problem has only really been diagnosed with Nsight Graphics; the question itself is more about optimizing DX12 compute on NVIDIA hardware. Is there a better place to ask optimization questions? I see a DX-related forum, but it’s mostly about bugs.

Hi stinkz, welcome to the forum and thanks for providing this great level of detail.
Could you also share which GPU, and which portions of the tool you’re using? (GPU Trace, Range Profiler, Shader Profiler)

From your description, it sounds like this shader is saturating the LSU side of the pipe, via global memory accesses.

LG Throttle does imply “issuing too many loads or stores”. If every thread is loading a non-overlapping float4 and they are all contiguous, then in the best case you’d have 128-bit load instructions touching 4 cachelines per warp-wide instruction. One way to measure this approximately is to take the ratio l1tex__data_pipe_lsu_wavefronts.sum / sm__inst_executed_pipe_lsu.sum (both metrics are available via the Range Profiler’s user metrics).
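As a rough sanity check: with one float4 per lane, a 32-lane warp reads 32 × 16 bytes = 512 bytes per warp-wide load, i.e. 4 × 128-byte cachelines, so a fully coalesced pattern should report a ratio close to 4 wavefronts per LSU instruction.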

If the algorithm requires every thread to read a non-overlapping float4 exactly once, then ultimately the algorithm will be limited by either VRAM or L2 bandwidth, and L2 bandwidth is only achievable if the dataset fully fits in L2 and is still cached from previous writes.

If there is any reuse of values, groupshared memory may still provide some speedup (versus refetching).
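A minimal sketch of what that could look like, assuming a 128-thread group (the buffer and array names are made up for illustration; this only helps if threads in the same group actually reuse each other’s values):

```hlsl
// Sketch only: stage each thread's float4 in groupshared memory so that any
// reuse within the group is served from shared memory instead of extra global loads.
StructuredBuffer<float4>   gInput  : register(t0);
RWStructuredBuffer<float4> gOutput : register(u0);

groupshared float4 sTile[128];

[numthreads(128, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID, uint gi : SV_GroupIndex)
{
    sTile[gi] = gInput[dtid.x];        // one global load per thread
    GroupMemoryBarrierWithGroupSync(); // make the whole tile visible to the group

    // Any further reads of these 128 values come from groupshared memory, e.g.:
    float4 neighbour = sTile[(gi + 1) & 127];
    gOutput[dtid.x] = neighbour; // placeholder for the real computation
}
```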

I’d also encourage someone with your level of expertise to sign up for the Pro version of Nsight Graphics. It will reveal shader assembly instructions, and a whole lot more metrics.

Additional resources that may be of interest:
User Guide :: Nsight Graphics Documentation – shader stall reason advice
Kernel Profiling Guide :: Nsight Compute Documentation – block diagram of L1TEX
Requests, Wavefronts, Sectors Metrics: Understanding and Optimizing Memory-Bound Kernels with Nsight Compute | NVIDIA On-Demand – CUDA talk that explains L1TEX in detail

Hey abaliga, that’s some great feedback, thank you!

I’m on an RTX 3080, using the “Profile Shaders” and “Profile Pipeline” features of Nsight Graphics. The compute work I need to run is done in one giant dispatch, so those are really the only tools I’m using.

My code does 1 load per lane, as described in the CUDA talk you linked, where each lane’s load is right next to its neighbouring lane’s. This is the memory layout I switched to, as described in my original post, and I saw great speed improvements with it. I’m also doing 1 load per warp, so technically it’s 2 loads per lane, with one of those loads being the same load for every lane. The entire dataset being loaded isn’t too big and should fit entirely in L2, which is why I get such great L1TEX and L2 hit rates.

My ratio of l1tex__data_pipe_lsu_wavefronts.sum / sm__inst_executed_pipe_lsu.sum came out to 1251221731.5 / 332548608 = 3.76, which isn’t too bad, and I think it matches my code description above? Thanks for pointing out how to look into that; it’s super interesting!

Since my dispatch is so large, I opted to reduce its size by doing multiple iterations of work per warp, both in a rolled loop and in an unrolled loop. I believe this may have helped absorb some of the load latency with ALU work, but unfortunately not enough, and my problem remains the same.
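For context, the unrolled variant looks roughly like this (ITEMS_PER_THREAD and the buffer names are placeholders, not my real values):

```hlsl
// Rough shape of the "multiple items per thread" change; constants and names are illustrative.
#define ITEMS_PER_THREAD 4

StructuredBuffer<float4>   gInput  : register(t0);
RWStructuredBuffer<float4> gOutput : register(u0);

[numthreads(32, 1, 1)]
void CSMain(uint3 gid : SV_GroupID, uint gi : SV_GroupIndex)
{
    uint base = gid.x * 32 * ITEMS_PER_THREAD + gi;

    [unroll]
    for (uint i = 0; i < ITEMS_PER_THREAD; ++i)
    {
        // Consecutive lanes still read consecutive float4s on every iteration, and
        // the independent iterations give the scheduler ALU work to overlap with loads.
        float4 v = gInput[base + i * 32];
        gOutput[base + i * 32] = v * 2.0f; // placeholder ALU work
    }
}
```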
It still seems like I’m asking too much of the hardware. I tried rewriting the algorithm in a way that would halve the number of loads, but I saw a LOT of contention when writing out the results of the computation, which was to be expected. By doing it the “dumb” way and doubling the loads/work, I saw very little output contention, and it was significantly faster too, despite the heavy LGTHR stalls.

I will have to think about your suggestion of leveraging groupshared memory for potential refetches. I’m not doing refetches within a warp, but I am between warps, although it’s not common.

I was wondering why I couldn’t find the shader assembly. There’s a Pro version of Nsight Graphics? Sorry, but I can’t seem to find anything relating to it; could you direct me to where I can find it?

Thanks again for the input! If you have any more please let me know.

I was wondering why I couldn’t find the shader assembly. There’s a Pro version of Nsight Graphics? Sorry, but I can’t seem to find anything relating to it; could you direct me to where I can find it?

Didn’t mean to leave you hanging … please email us at NsightGraphics@nvidia.com, and share your name, company name, and either the product you’re working on (if public) or the type of software otherwise.

technically it’s 2 loads per lane, with one of those loads being the same load for every lane.

The same load means “same address on all lanes”, right? (we also call that a uniform load)
Just an idea: for the “same load for every lane”, in theory each warp could instead load a separate value per lane in one instruction and hold onto those values for 32 iterations of the algorithm. That would cut the total number of load instructions roughly in half, removing pressure from the LSU input FIFO (i.e. the LG Throttle).
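For example, a sketch only, assuming 32-wide waves and made-up buffer/function names (a wave broadcast is one way to express “hold onto them” in SM6):

```hlsl
// Sketch: replace the per-iteration uniform load with one coalesced load plus a
// wave broadcast; gBroadcastData and ProcessBlock are illustrative names only.
StructuredBuffer<float4> gBroadcastData : register(t1);

void ProcessBlock(uint blockBase, uint laneIndex)
{
    // Before: every lane loaded gBroadcastData[blockBase + i] on every iteration i.
    // After: each lane loads one element once (a single coalesced warp-wide load)...
    float4 mine = gBroadcastData[blockBase + laneIndex];

    for (uint i = 0; i < 32; ++i)
    {
        // ...and iteration i reads lane i's copy via a broadcast instead of a load.
        float4 current = WaveReadLaneAt(mine, i);
        // ... use 'current' for iteration i of the algorithm ...
    }
}
```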