Hey abaliga, that’s some great feedback, thank you!
I’m using an RTX 3080, and I’m using the “Profile Shaders” and “Profile Pipeline” features of Nsight Graphics. The compute work I need to run is done in one giant dispatch, so those are really the only tools I’m using.
My code does 1 load per lane, as described in the CUDA talk you linked, where each lane’s load sits right next to its neighbouring lane’s. This is the memory layout I switched to in my original post, and I saw great speed improvements with it. I’m also doing 1 load per warp, so technically it’s 2 loads per lane, with one of those loads hitting the same address for every lane. The data being loaded isn’t too big in total and should fit entirely in L2, which is why I get such great L1Tex and L2 hit rates.
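To make that access pattern concrete, here is a rough sketch in Python (not my actual shader) that counts how many 32-byte sectors one warp touches for each kind of load. The 4-byte element size, the function names and the 128-byte stride used for comparison are all assumptions for illustration, and it only counts sectors; it does not model wavefronts or the hardware counters themselves.

```python
SECTOR_BYTES = 32  # assumption: 32-byte memory sectors
WARP_SIZE = 32     # lanes per warp
ELEM_BYTES = 4     # hypothetical element size, not taken from my shader


def sectors_touched(addresses, elem_bytes=ELEM_BYTES):
    """Count the distinct sectors covered by one warp's byte addresses."""
    sectors = set()
    for addr in addresses:
        first = addr // SECTOR_BYTES
        last = (addr + elem_bytes - 1) // SECTOR_BYTES
        sectors.update(range(first, last + 1))
    return len(sectors)


def per_lane_addresses(base, warp_index):
    """Each lane reads the element right next to its neighbour's."""
    start = base + warp_index * WARP_SIZE * ELEM_BYTES
    return [start + lane * ELEM_BYTES for lane in range(WARP_SIZE)]


def uniform_addresses(base, warp_index):
    """Every lane in the warp reads the same element."""
    return [base + warp_index * ELEM_BYTES] * WARP_SIZE


coalesced = sectors_touched(per_lane_addresses(0, 0))
uniform = sectors_touched(uniform_addresses(0, 0))
# Comparison case: a hypothetical layout with lanes 128 bytes apart.
strided = sectors_touched([lane * 128 for lane in range(WARP_SIZE)])

print(coalesced, uniform, strided)
```

Under those assumptions the per-lane load covers 4 sectors, the per-warp load covers 1, and the strided layout covers 32, which is the kind of difference the layout change was meant to remove.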
My ratio of l1tex__data_pipe_lsu_wavefronts.sum / sm__inst_executed_pipe_lsu.sum came out to be 1251221731.5 / 332548608 = 3.76, which isn’t too bad, and I think it should match my code description above? Thanks for pointing out how to look into that, it’s super interesting!
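For anyone wanting to redo the division with their own numbers, this is all it is. The two values are the counter sums quoted above; the variable names are just mine.

```python
# Counter sums as reported by Nsight Graphics for my dispatch.
lsu_wavefronts = 1251221731.5   # l1tex__data_pipe_lsu_wavefronts.sum
lsu_instructions = 332548608    # sm__inst_executed_pipe_lsu.sum

# Average number of wavefronts per executed LSU instruction.
ratio = lsu_wavefronts / lsu_instructions
print(f"{ratio:.2f}")
```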
Since my dispatch is so large, I opted to reduce its size by doing multiple iterations of work per warp, which I tried with both a rolled loop and an unrolled loop. I believe this may have helped absorb some of the load latency with ALU work, but unfortunately not enough, and my problem remains the same.
It still seems like I’m asking too much from the hardware. I tried rewriting the algorithm in a way that would halve the number of loads, but I was seeing a LOT of contention when writing out the results of the computation, which was to be expected. By doing it the “dumb” way and doubling the loads/work, I saw incredibly little output contention, and it was significantly faster too, despite the heavy LGTHR stalls.
I will have to think about your suggestion of trying to leverage groupshared memory for potential refetches. I’m not doing refetches within each warp, but I am between warps, although it’s not common.
I was wondering why I couldn’t find the shader assembly. So there’s a Pro version of Nsight Graphics? Sorry, but I can’t seem to find anything relating to it. Could you direct me to where I can find it?
Thanks again for the input! If you have any more please let me know.