When I use ncu to profile my kernel, I get the warning below:
This kernel has uncoalesced shared accesses resulting in a total of 172352896 excessive wavefronts (38% of the total 458925430 wavefronts). Check the L1 Wavefronts Shared Excessive table for the primary source locations.
I checked the table and found that the instructions causing most of the uncoalesced shared accesses were the async load instruction and ldmatrix. What is an uncoalesced shared access, and how can I avoid it?
For the async instruction, each thread requests 16B of data from gmem. Do I need to make each warp request whole 128B cache lines, as direct loading does? For ldmatrix, I don't know what to do at all.
Does "uncoalesced shared access" refer to shared memory or global memory?
I would check the offsets used in the async load instruction.
You say you load 16B per thread. Is the element size 16B, or is it 4 * 4 or 2 * 8 bytes, i.e. an element size of 4 or 8 bytes?
Taking the element size (and assuming each thread loads only one element), whole, consecutive blocks of 32 bytes (aligned to 32 bytes) should be loaded by the 32 threads. Your data either has gaps or too many totally different offsets.
Maybe global memory? I haven't heard of uncoalesced accesses happening on shared memory. Actually, I request 4 * 4B of data from global memory per thread. However, the data is only aligned to 32B. Will this cause serious loading problems?
With current CUDA architectures (Volta onwards), loading 32B blocks (even at unrelated addresses) is quite fine: you get full bandwidth, with only minor inefficiencies compared to 128B alignment.
You have to make sure that the whole 32B is loaded with one instruction and not split.
Your options are:
16B loads (e.g. uint4): two threads together load 32B, giving 16 (= 32 threads / 2 threads) completely independent addresses.
8B loads: four threads together load 32B, giving 8 (= 32 threads / 4 threads) completely independent addresses.
4B loads: eight threads together load 32B, giving 4 (= 32 threads / 8 threads) completely independent addresses.
Depending on 16B/8B/4B loads, the per-thread alignment has to be that same size, e.g. a uint4 has to be aligned to 16 bytes.
For processing you can shift the data between threads with shuffle instructions or shared memory. But the global memory access itself should keep the 32B blocks in one load instruction.