Strange usage of local memory

I have run into something similar to what is described in an old question:
Strange local memory usage - CUDA / CUDA Programming and Performance - NVIDIA Developer Forums
However, I do not know how to solve it or what caused it.
My code looks like this:

        #pragma unroll
        for (int i = lid_start; i != lid_start + io_group; i++) {
            if (io_id * 4 < WORDS) {
                // compute the position of the i-th big number in global memory
                u32 group_offset = (i >> (deg - 1)) << (log_len - log_stride - 1);
                u32 group_id = i & (subblock_sz - 1);
                u64 gpos = group_offset + (group_id << (log_end_stride));

                // each thread reads the io_id-th uint4 of that number
                thread_data[i - lid_start] = reinterpret_cast<uint4*>(data + gpos * WORDS)[io_id];
            }
        }
        // shuffle the striped uint4s so each thread holds one complete number
        WarpExchangeT(temp_storage_uint4[warp_id]).StripedToBlocked(thread_data, thread_data);
        a = Field::load(reinterpret_cast<u32*>(thread_data));
        __syncwarp();

Here, ‘io_group’ is a compile-time constant, so the compiler should unroll the loop.
This code reads io_group big numbers (e.g. 256-bit) with io_group threads, then uses WarpExchange to shuffle the striped data so each thread ends up with its corresponding number.
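
For reference, WarpExchangeT is CUB’s cub::WarpExchange; the surrounding setup looks roughly like this (the constant values and the WARPS_PER_BLOCK name here are illustrative, not my exact kernel):

        #include <cub/cub.cuh>

        constexpr int io_group = 8;           // illustrative value
        constexpr int WARPS_PER_BLOCK = 8;    // illustrative value

        // io_group uint4 items per thread, exchanged within one warp
        using WarpExchangeT = cub::WarpExchange<uint4, io_group>;

        __global__ void kernel(/* ... */)
        {
            // one TempStorage instance per warp in the block
            __shared__ typename WarpExchangeT::TempStorage
                temp_storage_uint4[WARPS_PER_BLOCK];

            uint4 thread_data[io_group];
            // ... loads, then StripedToBlocked as above ...
        }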

Initially, I had each thread read an int, and everything worked well. However, when I switched to having each thread read a uint4, something strange happened: according to the ncu report, local memory is used, resulting in poor performance.

I believe it’s not due to a lack of registers, as on sm_89, 255 registers should be available to each thread, while in my case only 106 are used. The local array should not be a problem either: when I used uint, everything went well, so all the array addresses are statically determined.

So what can cause this?

You may wish to verify that. Use the SASS tools, e.g. cuobjdump -sass ./my_executable, to check whether the loop is actually unrolled.
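
In addition to inspecting the SASS, you can ask ptxas to report per-thread register and local memory usage at compile time (my_kernel.cu is a placeholder for your source file):

        nvcc -arch=sm_89 -Xptxas -v my_kernel.cu

The verbose output lists the registers used plus any stack frame and spill stores/loads, which is a direct indication of local memory use.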

In my experience, 106 registers per thread is a lot, and the compiler may be choosing not to use more than that. An sm_89 GPU has 64K registers per SM, so you could use __launch_bounds__ to tell the compiler your launch intentions, which gives it license to use more registers per thread. For example, if you wanted to “clear the way” for 128 registers per thread, let the compiler know that the largest block you intend to launch is 512 threads (64K / 512 = 128), as sketched below. You can find information about launch bounds in the programming guide as well as in many forum threads.
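
A minimal sketch of what that could look like (the kernel name and signature are placeholders):

        // Promise the compiler that no more than 512 threads per block will
        // be launched, so it may budget up to 128 registers per thread on
        // sm_89 (64K registers per SM / 512 threads).
        __global__ void __launch_bounds__(512) my_kernel(unsigned int *data)
        {
            // kernel body
        }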

Those may or may not have any impact on what you are seeing. It’s usually difficult to be prescriptive without an actual test case.


Thanks for your reply. It turned out that the loop was not unrolled. That’s quite strange, because in the int case nvcc can identify the pattern and unroll it automatically, while in the uint4 case I have to give nvcc more of a hint. I changed the code to the following pattern, where the counter runs from 0 to the compile-time constant io_group so the trip count is obvious, and the local memory usage was finally eliminated.

        #pragma unroll
        for (int tti = 0; tti < io_group; tti++) {
            // counting from 0 to the compile-time constant io_group makes
            // the trip count obvious to the compiler
            int i = tti + lid_start;
            if (io_id * 4 < WORDS) {
                u32 group_offset = (i >> (deg - 1)) << (log_len - log_stride - 1);
                u32 group_id = i & (subblock_sz - 1);
                u64 gpos = group_offset + (group_id << (log_end_stride));

                thread_data[tti] = reinterpret_cast<uint4*>(data + gpos * WORDS)[io_id];
            }
        }