Hello,
I am working with 64-bit unsigned integers, and my shared memory accesses produce many more “L1 Wavefronts Shared Excessive” than I expected.
In my access pattern, thread i reads two values from the shared memory array: one at index i and one at index i + blockDim.x (512 in my test case).
From what I have read, because my data type is 64 bits wide, each value spans two shared memory banks, which serializes the shared memory accesses. Is that correct?
If so, is there a better access pattern? I have seen padding suggested, but I don’t see how it would help: to me it looks like it would just shift the conflicts. I am using a 1D array, but I could switch to a 2D array if that helps.
Thank you, and sorry if my question is confusing.
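To make the bank arithmetic in the question concrete, here is a small host-side Python model. It is not CUDA and not NVIDIA's actual hardware logic. It assumes 32 banks of 4 bytes each and an 8-byte-aligned uint64_t array, and `banks_for_element` and `warp_bank_map` are hypothetical helper names. It shows that each 64-bit element does span two banks, that threads 0 and 16 land on the same bank pair, and that the i + blockDim.x read has exactly the same bank layout as the i read, because 512 × 8 B = 4096 B is a multiple of 128 B.

```python
# Simplified model (assumption): 32 banks, each 4 bytes wide,
# bank = (byte_address // 4) % 32.
NUM_BANKS = 32
BANK_WIDTH = 4      # bytes per bank
ELEM_SIZE = 8       # sizeof(uint64_t)
BLOCK_DIM_X = 512   # blockDim.x from the question

def banks_for_element(index):
    """Return the two 32-bit banks occupied by the 64-bit element at `index`."""
    byte_addr = index * ELEM_SIZE
    first = (byte_addr // BANK_WIDTH) % NUM_BANKS
    return (first, (first + 1) % NUM_BANKS)

def warp_bank_map(offset):
    """Bank pairs touched by threads 0..31 of one warp reading s[i + offset]."""
    return [banks_for_element(i + offset) for i in range(32)]

first_read = warp_bank_map(0)             # s[i]
second_read = warp_bank_map(BLOCK_DIM_X)  # s[i + blockDim.x]

print(first_read[0], first_read[15], first_read[16])  # (0, 1) (30, 31) (0, 1)
print(first_read == second_read)                      # True
```

In this model, threads 0 to 15 cover all 32 banks once and threads 16 to 31 cover them again, so one warp-wide 64-bit read touches 256 B spread evenly over every bank twice.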
An LDS.64 (load 64-bit from shared) or STS.64 (store 64-bit to shared) instruction with all threads active and predicated on, accessing consecutive 64-bit values, takes 2 wavefronts. On the Source page, the L1 Wavefronts Shared Ideal column should report 2 per instruction executed, and the L1 Wavefronts Shared Excessive column should report 0 per instruction executed.
The shared memory return bandwidth on Volta and later is 128 B/cycle, except for broadcasts, which can emulate higher bandwidth. In general, if all 32 threads are active and predicated on, a 64-bit instruction (32 × 8 B = 256 B) is split into 2 separate wavefronts to fit within that return bandwidth.
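The wavefront counts above can be approximated with a rough Python model. This is a simplified lower bound under stated assumptions, not NVIDIA's exact hardware algorithm and not how Nsight Compute computes its metrics. It assumes 32 banks of 4 bytes, 128 B returned per wavefront, and that identical words are served once (broadcast). `estimate_wavefronts` is a hypothetical helper that returns an (ideal, excessive) pair, loosely mirroring the two Source columns. The strided pattern is a hypothetical contrast case that does not come from the thread.

```python
import math

NUM_BANKS = 32
BANK_WIDTH = 4       # bytes per bank
RETURN_BYTES = 128   # assumed bytes returned per wavefront (Volta+)

def words_touched(byte_addr, size):
    """32-bit word indices covered by one access of `size` bytes."""
    first = byte_addr // BANK_WIDTH
    return range(first, first + size // BANK_WIDTH)

def estimate_wavefronts(addresses, size):
    """Rough (ideal, excessive) wavefront estimate for one warp-wide access.

    addresses: non-empty list of byte addresses, one per active thread.
    ideal: unique bytes / 128 B (duplicate words count once, like a broadcast).
    estimate: at least `ideal`, and at least the number of distinct words
    any single bank has to serve. excessive = estimate - ideal.
    """
    unique_words = set()
    for addr in addresses:
        unique_words.update(words_touched(addr, size))
    per_bank = {}
    for w in unique_words:
        b = w % NUM_BANKS
        per_bank[b] = per_bank.get(b, 0) + 1
    ideal = math.ceil(len(unique_words) * BANK_WIDTH / RETURN_BYTES)
    estimate = max(ideal, max(per_bank.values()))
    return ideal, estimate - ideal

consecutive = [8 * t for t in range(32)]  # s[i] with uint64_t
broadcast = [0] * 32                      # every thread reads s[0]
strided = [16 * t for t in range(32)]     # hypothetical s[2 * i] pattern

print(estimate_wavefronts(consecutive, 8))  # (2, 0)
print(estimate_wavefronts(broadcast, 8))    # (1, 0)
print(estimate_wavefronts(strided, 8))      # (2, 2)
```

In this model, the consecutive 64-bit pattern from the question reaches the ideal of 2 wavefronts with no excess. The bank-pair split by itself is therefore not a conflict. Only a pattern in which several distinct words land in the same bank, such as the strided case, produces excessive wavefronts.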
Hello,
Yes, those are the results I got. I tried a 2D array and the bank conflicts disappeared, even though the access pattern stayed the same. What is the difference between a 1D and a 2D array in terms of bank conflicts?
Thank you
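One way to check the premise of this question is with a quick host-side Python calculation. It assumes a hypothetical declaration __shared__ uint64_t s2d[2][512], because the thread never shows the actual 2D shape. Under row-major layout, s2d[r][c] sits at exactly the same byte offset as s1d[r * 512 + c] in a 1D array. The sketch below checks this for every thread index, and it uses hypothetical helper names. Under that assumption the bank mapping is identical, so a change in the reported conflicts would most likely come from something other than the array layout. One possibility is different generated instructions, which a repro would reveal.

```python
ELEM_SIZE = 8        # sizeof(uint64_t) in bytes
COLS = 512           # hypothetical shape: __shared__ uint64_t s2d[2][512]

def offset_1d(index):
    """Byte offset of s1d[index] in a 1D uint64_t shared array."""
    return index * ELEM_SIZE

def offset_2d(row, col, cols=COLS):
    """Byte offset of s2d[row][col]; C/C++ arrays are laid out row-major."""
    return (row * cols + col) * ELEM_SIZE

def bank(byte_offset):
    """Bank index, assuming 32 banks of 4 bytes each."""
    return (byte_offset // 4) % 32

# Thread i reads s1d[i] and s1d[i + 512] (1D), or s2d[0][i] and s2d[1][i] (2D).
same = all(
    offset_1d(i) == offset_2d(0, i) and offset_1d(i + COLS) == offset_2d(1, i)
    for i in range(COLS)
)
print(same)  # True

# Padding rows to 513 elements only shifts row 1 by two banks.
print(bank(offset_2d(1, 0)), bank(offset_2d(1, 0, cols=513)))  # 0 2
```

The last line relates to the padding idea from the first post. In this model, widening each row to 513 elements moves row 1 from bank 0 to bank 2, but the consecutive reads still cover every bank exactly twice. For this particular pattern, padding shifts the mapping without changing the wavefront count.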
A minimal reproducible example will be required to investigate further.
Hi, @jomivaan
Please provide a minimal repro if you need further support. Thanks !
Sorry for not answering; I solved the issue myself.
Thanks for the reply !