Strange usage of local memory

I have run into something similar to what is described in an old question:
Strange local memory usage - CUDA / CUDA Programming and Performance - NVIDIA Developer Forums
However, I do not know how to solve it or what caused it.
My code looks like this:

        #pragma unroll
        for (int i = lid_start; i != lid_start + io_group; i++) {
            if (io_id * 4 < WORDS) {
                // compute the position of the i-th big number in global memory
                u32 group_offset = (i >> (deg - 1)) << (log_len - log_stride - 1);
                u32 group_id = i & (subblock_sz - 1);
                u64 gpos = group_offset + (group_id << (log_end_stride));

                // each thread reads the io_id-th uint4 of that number
                thread_data[i - lid_start] = reinterpret_cast<uint4*>(data + gpos * WORDS)[io_id];
            }
        }
        // shuffle the striped uint4s so each thread holds one complete number
        WarpExchangeT(temp_storage_uint4[warp_id]).StripedToBlocked(thread_data, thread_data);
        a = Field::load(reinterpret_cast<u32*>(thread_data));
        __syncwarp();

Here, ‘io_group’ is a compile-time constant, so the compiler should unroll the loop.
This code reads io_group big numbers (e.g. 256-bit) with io_group threads, then uses WarpExchange to shuffle the striped data so each thread ends up with its corresponding number.
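
For reference, WarpExchangeT is CUB’s cub::WarpExchange; the surrounding setup looks roughly like this (the constant values and the WARPS_PER_BLOCK name here are illustrative, not my exact kernel):

        #include <cub/cub.cuh>

        constexpr int io_group = 8;           // illustrative value
        constexpr int WARPS_PER_BLOCK = 8;    // illustrative value

        // io_group uint4 items per thread, exchanged within one warp
        using WarpExchangeT = cub::WarpExchange<uint4, io_group>;

        __global__ void kernel(/* ... */)
        {
            // one TempStorage instance per warp in the block
            __shared__ typename WarpExchangeT::TempStorage
                temp_storage_uint4[WARPS_PER_BLOCK];

            uint4 thread_data[io_group];
            // ... loads, then StripedToBlocked as above ...
        }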

Initially, I had each thread read an int, and everything worked well. However, when I switched to having each thread read a uint4, something strange happened: according to the ncu report, local memory is used, resulting in poor performance.

I believe it’s not due to a lack of registers, as on sm_89, 255 registers should be available to each thread, while in my case only 106 are used. The local array should not be a problem either: when I used uint, everything went well, so all the array addresses are statically determined.

So what can cause this?

You may wish to verify that. Use the SASS tools, e.g. cuobjdump -sass ./my_executable, to check whether the loop is actually unrolled.
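
In addition to inspecting the SASS, you can ask ptxas to report per-thread register and local memory usage at compile time (my_kernel.cu is a placeholder for your source file):

        nvcc -arch=sm_89 -Xptxas -v my_kernel.cu

The verbose output lists the registers used plus any stack frame and spill stores/loads, which is a direct indication of local memory use.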

In my experience, 106 registers per thread is a lot, and the compiler may be choosing not to use more than that. An sm_89 GPU has 64K registers per SM, so you could use __launch_bounds__ to tell the compiler your launch intentions, which gives it license to use more registers per thread. For example, if you wanted to “clear the way” for 128 registers per thread, let the compiler know that the largest block you intend to launch is 512 threads (64K / 512 = 128), as sketched below. You can find information about launch bounds in the programming guide as well as in many forum threads.
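
A minimal sketch of what that could look like (the kernel name and signature are placeholders):

        // Promise the compiler that no more than 512 threads per block will
        // be launched, so it may budget up to 128 registers per thread on
        // sm_89 (64K registers per SM / 512 threads).
        __global__ void __launch_bounds__(512) my_kernel(unsigned int *data)
        {
            // kernel body
        }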

Those may or may not have any impact on what you are seeing. It’s usually difficult to be prescriptive without an actual test case.


Thanks for your reply. It turned out that the loop was not unrolled. That’s quite strange, because in the int case nvcc can identify the pattern and unroll it automatically, while in the uint4 case I have to give nvcc more of a hint. I changed the code to the following pattern, where the counter runs from 0 to the compile-time constant io_group so the trip count is obvious, and the local memory usage was finally eliminated.

        #pragma unroll
        for (int tti = 0; tti < io_group; tti++) {
            // counting from 0 to the compile-time constant io_group makes
            // the trip count obvious to the compiler
            int i = tti + lid_start;
            if (io_id * 4 < WORDS) {
                u32 group_offset = (i >> (deg - 1)) << (log_len - log_stride - 1);
                u32 group_id = i & (subblock_sz - 1);
                u64 gpos = group_offset + (group_id << (log_end_stride));

                thread_data[tti] = reinterpret_cast<uint4*>(data + gpos * WORDS)[io_id];
            }
        }