NVCC chooses to use local memory while there are plenty of registers it could use

I have 8 warps per block, so the maximum available register count per thread should be 255. In my kernel I have two half[8] arrays, and NVCC maps these arrays into local memory. I would understand this behavior if I were running out of registers, but my profiling results show a peak register count of only 83. Using local memory here seems entirely unnecessary.

  • 83 registers per thread is a lot, and will prevent full occupancy on the SM. 255 registers per thread would cut the maximum occupancy to 25% or less.
  • arrays can only be converted to registers if all indexing is computable at compile time (see the sketch after this list)
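
A minimal sketch of that requirement (hypothetical code, not the kernel from this thread): with a fully unrolled loop every index is a compile-time constant and the array can be scalarized into registers, while a single data-dependent index forces the whole array into local memory.

```cpp
#include <cuda_runtime.h>

// Hypothetical demo. The unrolled loop makes every index into arr[]
// a compile-time constant, so the array is eligible for registers.
// The data-dependent index in the last line defeats that.
__global__ void indexing_demo(const int *in, int *out, int i)
{
    int arr[8];

    #pragma unroll
    for (int k = 0; k < 8; ++k)        // trip count known at compile time
        arr[k] = in[threadIdx.x * 8 + k];

    out[threadIdx.x] = arr[i & 7];     // runtime index -> local memory
}
```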

I don’t agree with that. First of all, the indexing is computable at compile time. And 16 half variables take only 32 bytes of memory, which should occupy only two 128-bit registers. Even if, to sidestep bank conflicts, they occupied 16 registers, the total register count would still be under 100. And register counts between 81 and 128 give the same occupancy, so I don’t think occupancy is the root cause.

All thread-local variables are assigned to local memory by default; the compiler pulls some of them into registers as part of the code optimization process. This is guided by various heuristics that are not publicly documented and are subject to change at any time. At compile time there is no notion of occupancy, but there is a (preliminary) notion of register pressure, which is one of the data inputs to these heuristics.

For arrays, there are some hard requirements for mapping them to registers, in particular that all indexing uses compile-time constants. If those hard prerequisites are met, heuristics based on array size, register pressure, and other properties decide whether the compiler considers scalarization of the array profitable, i.e. helpful for performance. Mature, robust heuristics tend to make the right decision in 90% of cases, and at this stage in the evolution of CUDA we can safely assume that the relevant heuristics have been well tuned.

Note that GPU registers are always 32 bits wide. Unless the half-precision data comes in the form of half2 2-vectors, each half element occupies one full register. If your code is not too complicated and voluminous, you could run an experiment that scalarizes the arrays by hand to assess the resulting performance. If scalarizing the arrays turns out to reduce kernel execution time significantly (> 10%), you may want to file an enhancement request with NVIDIA.
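
For reference, a hand-scalarized half[8] array might look like the sketch below (names and the final reduction are invented; the real code in this thread uses wmma fragments). With eight named scalars there is no indexable storage left for the compiler to demote to local memory.

```cpp
#include <cuda_fp16.h>

// Illustrative only: eight named scalars replace a half[8] array.
__global__ void scalarized(const half *in, half *out)
{
    const half *p = in + threadIdx.x * 8;
    half v0 = p[0], v1 = p[1], v2 = p[2], v3 = p[3];
    half v4 = p[4], v5 = p[5], v6 = p[6], v7 = p[7];

    // ... use v0..v7 wherever a[k] appeared; here, a stand-in reduction:
    out[threadIdx.x] = __hadd(__hadd(__hadd(v0, v1), __hadd(v2, v3)),
                              __hadd(__hadd(v4, v5), __hadd(v6, v7)));
}
```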

Thanks again @njuffa.

Actually, they are not plain arrays. They are two instances of `wmma::fragment<wmma::accumulator, WMMA_M, WMMA_N, WMMA_K, half>`, which together occupy 2×16 bytes.

I call fill_fragment first, then mma_sync (see the sketch below). During fill_fragment, I found the values being stored to local memory; during mma_sync, I found the accumulator matrix being loaded from local memory.
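
A minimal reconstruction of that pattern might look like this (tile sizes, layouts, and the kernel shape are assumptions; only the accumulator fragment type comes from this thread). fill_fragment writes the accumulator and mma_sync reads and updates it; these are the accesses that showed up in local memory.

```cpp
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

constexpr int WMMA_M = 16, WMMA_N = 16, WMMA_K = 16;

__global__ void wmma_sketch(const half *A, const half *B, half *C)
{
    wmma::fragment<wmma::matrix_a, WMMA_M, WMMA_N, WMMA_K, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, WMMA_M, WMMA_N, WMMA_K, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, WMMA_M, WMMA_N, WMMA_K, half> acc_frag;

    wmma::fill_fragment(acc_frag, __float2half(0.0f)); // stores observed here
    wmma::load_matrix_sync(a_frag, A, WMMA_K);
    wmma::load_matrix_sync(b_frag, B, WMMA_K);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag); // accumulator loads observed here
    wmma::store_matrix_sync(C, acc_frag, WMMA_N, wmma::mem_row_major);
}
```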

I still need more tests to quantify the performance loss, since each local memory load is only 32 bytes and should hit in the cache most of the time. But I think this local memory behavior is unnecessary in my case.

Not true. Considering only register usage (and ignoring register allocation granularity), and assuming 64K registers per SM, 81 registers per thread gives a maximum occupancy of 809 threads per SM, whereas 128 gives a maximum occupancy of 512 threads per SM.

Somewhere between register counts of 83 and 99, the 768-threads-per-SM occupancy level would be crossed: you would go from a maximum occupancy of three 256-thread blocks to two 256-thread blocks (see the sketch below).
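
The arithmetic behind this can be checked with a few lines of host code (a sketch; 64K registers per SM and 256-thread blocks assumed, allocation granularity ignored). The crossover from three resident blocks to two falls between 85 and 86 registers per thread, inside the 83-99 range.

```cpp
#include <stdio.h>

int main(void)
{
    const int regs_per_sm = 64 * 1024;          // 65,536 registers per SM
    const int block_size  = 256;                // threads per block
    for (int r = 83; r <= 99; ++r) {
        int threads = regs_per_sm / r;          // register-limited thread count
        int blocks  = threads / block_size;     // whole 256-thread blocks that fit
        printf("%2d regs/thread -> %3d threads -> %d blocks\n",
               r, threads, blocks);
    }
    return 0;
}
```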

Hi Robert! My numbers come from the ‘Impact of Varying Register Count Per Thread’ chart in CUDA_Occupancy_Calculator.xls. Also, I’ve already selected 64 KB of shared memory per block (sm 7.5), which means shared memory usage already limits the active thread blocks per SM to 1. So the compiler should know that the maximum register count is 255.

The compiler doesn’t know those things. It does not know what you have entered in a spreadsheet, and since 64 KB can only be accessed via dynamic shared memory, whose size is specified at runtime, the compiler doesn’t know that either. The enablement function for this feature is just that, an enablement (see the sketch below); it does not guarantee that every kernel you launch will actually use that amount of shared memory.
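
That opt-in looks roughly like this (kernel name, parameters, and launch configuration are placeholders): on sm_75, dynamic shared memory beyond 48 KB must be enabled per kernel before it can be requested at launch.

```cpp
#include <cuda_runtime.h>
#include <cuda_fp16.h>

extern __global__ void my_kernel(const half *A, const half *B, half *C);

void launch(const half *A, const half *B, half *C, dim3 grid, dim3 block)
{
    cudaFuncSetAttribute(my_kernel,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         64 * 1024);                 // enable up to 64 KB
    my_kernel<<<grid, block, 64 * 1024>>>(A, B, C);  // still requested per launch
}
```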

However, the compiler could know that static shared memory per block is capped at 48 KB, and depending on the GPU architecture this might limit you to single-threadblock occupancy. But single-threadblock occupancy doesn’t automatically permit 255 registers per thread. The compiler does not know how many threads per block you intend to launch (in the general case, but see below), and since a block may contain up to 1024 threads, single-threadblock occupancy only implies an upper bound of 65536 / 1024 = 64 registers per thread without consequences.

None of this argues for allowing anything higher than 64 registers per thread in the general case, as far as I can see, and your code is already above that.

If you’re not using __launch_bounds__, it might be interesting to see whether the compiler behaves differently when you decorate your kernel with it appropriately (a sketch follows).
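
For example (kernel name and parameters are placeholders; the 256/1 values match the launch configuration discussed in this thread). With both bounds given, the compiler knows it may spend up to 255 registers per thread.

```cpp
#include <cuda_fp16.h>

__global__ void __launch_bounds__(256, 1) // maxThreadsPerBlock, minBlocksPerMultiprocessor
my_kernel(const half *A, const half *B, half *C)
{
    // ... kernel body as before ...
}
```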

You are right. I agree that the compiler doesn’t know those things, since the dynamic shared memory size varies from launch to launch.

In my experience, and cuBLAS’s as well, high occupancy doesn’t mean high performance, but high register usage usually does bring high performance. For example, the turing_h1688gemm_256x128_ldg8_tn kernel in cuBLAS uses 186 registers and runs at an occupancy of 25%, yet it almost reaches roofline performance. So I believe increasing occupancy should not be the compiler’s first priority.

By adding some conditional code that I never exercise at launch time, I can control whether the kernel uses local memory or not (a speculative sketch of the idea follows). It turns out that using local memory decreases performance by about 2% on average.
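
A speculative reconstruction of that experiment (all names invented): a flag that is always false at launch, but whose branch contains a dynamically indexed access. The mere presence of that branch forces the array into local memory at compile time, even though the branch is dead at runtime.

```cpp
#include <cuda_fp16.h>

__global__ void toggled(const half *in, half *out, bool force_local, int i)
{
    half a[8];

    #pragma unroll
    for (int k = 0; k < 8; ++k)
        a[k] = in[threadIdx.x * 8 + k];

    if (force_local)                      // always false in these tests
        out[threadIdx.x] = a[i & 7];      // dynamic index -> local memory
    else
        out[threadIdx.x] = __hadd(a[0], a[7]);
}
```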

The kernel using local memory executes 550K local memory instructions alongside 7M global memory instructions; the kernel without local memory executes just the 7M global memory instructions. It surprises me that the local memory traffic has so little impact.

Hi Robert, after many tests I’m okay with the compiler’s local memory strategy. It doesn’t degrade performance as much as I imagined. Thanks!