I have run into something similar to what is described in an old question:
Strange local memory usage - CUDA / CUDA Programming and Performance - NVIDIA Developer Forums
However, I do not know how to solve it or what caused it.
My code looks like this:
#pragma unroll
for (int i = lid_start; i != lid_start + io_group; i++) {
    if (io_id * 4 < WORDS) {
        u32 group_offset = (i >> (deg - 1)) << (log_len - log_stride - 1);
        u32 group_id = i & (subblock_sz - 1);
        u64 gpos = group_offset + (group_id << (log_end_stride));

        thread_data[i - lid_start] = reinterpret_cast<uint4*>(data + gpos * WORDS)[io_id];
    }
}

WarpExchangeT(temp_storage_uint4[warp_id]).StripedToBlocked(thread_data, thread_data);
a = Field::load(reinterpret_cast<u32*>(thread_data));
__syncwarp();
Here, ‘io_group’ is a compile-time constant, so the compiler should fully unroll the loop.
This code reads io_group big numbers (e.g. 256-bit) using a group of io_group threads, then uses the warp exchange to shuffle the striped words back to the threads they belong to.
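For completeness, WarpExchangeT and temp_storage_uint4 are not shown above; they follow the usual cub::WarpExchange pattern. Below is a simplified, self-contained sketch of that setup — the constants are only examples, since the real kernel ties the items-per-thread count to io_group and sizes the temp storage by the number of warps per block:

// Simplified, self-contained sketch of the exchange setup (constants are
// examples only; the real kernel derives ITEMS_PER_THREAD from io_group).
#include <cub/cub.cuh>

constexpr int ITEMS_PER_THREAD = 4;   // stands in for io_group
constexpr int WARP_THREADS     = 32;
constexpr int WARPS_PER_BLOCK  = 8;

using WarpExchangeT = cub::WarpExchange<uint4, ITEMS_PER_THREAD, WARP_THREADS>;

__global__ void exchange_sketch(const uint4 *in, uint4 *out)
{
    __shared__ typename WarpExchangeT::TempStorage temp_storage_uint4[WARPS_PER_BLOCK];

    const int warp_id = threadIdx.x / WARP_THREADS;
    const int lane    = threadIdx.x % WARP_THREADS;
    const int base    = (blockIdx.x * blockDim.x + warp_id * WARP_THREADS) * ITEMS_PER_THREAD;

    // Striped load: item k of this lane comes from base + k * WARP_THREADS + lane.
    uint4 thread_data[ITEMS_PER_THREAD];
    #pragma unroll
    for (int k = 0; k < ITEMS_PER_THREAD; ++k)
        thread_data[k] = in[base + k * WARP_THREADS + lane];

    // Rearrange so each lane ends up holding ITEMS_PER_THREAD consecutive uint4s.
    WarpExchangeT(temp_storage_uint4[warp_id]).StripedToBlocked(thread_data, thread_data);

    #pragma unroll
    for (int k = 0; k < ITEMS_PER_THREAD; ++k)
        out[base + lane * ITEMS_PER_THREAD + k] = thread_data[k];
}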
Initially, I had each thread read a u32, and everything worked well. However, when I switched to having each thread read a uint4, something strange happened: according to the ncu report, local memory is used, which results in poor performance.
I don't believe it is a lack of registers, as on sm_89 up to 255 registers are available per thread, while my kernel only uses 106. The local array should not be a problem either, because everything was fine with the u32 version, and all indices into it are assigned statically.
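(By "assigned statically" I mean there is no dynamic indexing into thread_data once the loop is unrolled. A contrived side-by-side of the behavior I have in mind — not taken from my code:)

// Toy comparison: static vs. dynamic indexing of a per-thread local array.
__global__ void static_idx(const uint4 *in, uint4 *out)
{
    uint4 buf[4];
    #pragma unroll
    for (int i = 0; i < 4; ++i)        // indices become literals -> buf can stay in registers
        buf[i] = in[threadIdx.x * 4 + i];

    uint4 acc = buf[0];
    #pragma unroll
    for (int i = 1; i < 4; ++i) {      // still only static indices
        acc.x += buf[i].x; acc.y += buf[i].y;
        acc.z += buf[i].z; acc.w += buf[i].w;
    }
    out[threadIdx.x] = acc;
}

__global__ void dynamic_idx(const uint4 *in, uint4 *out, int k)
{
    uint4 buf[4];
    #pragma unroll
    for (int i = 0; i < 4; ++i)
        buf[i] = in[threadIdx.x * 4 + i];

    out[threadIdx.x] = buf[k & 3];     // runtime index -> buf is usually demoted to local memory
}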
So what can cause this?