To reduce register pressure, I moved some frequently accessed data into shared memory. Disabling compiler optimizations significantly reduced the register usage and also greatly improved efficiency.
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -O0 -g -G -keep")
However, when I use the default compilation options, the register usage is even higher than before I used shared memory. I don’t fully understand why this is happening, but the performance is definitely better than when compiler optimizations were disabled.
In general, the CUDA compiler makes excellent trade-offs between performance and register usage. This means that attempts to squeeze register usage down are highly likely to result in decreased performance, and this often applies even if higher occupancy is achieved.
This means that one needs to inspect the full code to come up with ideas that are even worth trying. Whatever you try, all performance assessments should always use release builds with full optimization. When it comes to performance, there is no useful information to be gleaned from debug builds.
Are you saying that your debug build actually runs faster than the release build of the exact same code? I would consider that near impossible; I certainly have never run across such a case.
When you say register pressure has increased, do you mean that the kernel is now more likely to use local memory?
As you have an involved calculation, try to find out from the SASS code which values are stored in registers and whether it makes sense to keep them there.
The usual method to do that is either by using __launch_bounds__ or by using the -maxrregcount compile switch. Both of these are documented and also have numerous forum questions discussing them, so I won’t provide further discussion here.
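For reference, a minimal sketch of both mechanisms (the kernel and the numbers are only placeholders):

__global__ void __launch_bounds__(256, 2)   // at most 256 threads per block, aim for at least 2 resident blocks per SM
scaleKernel(const float* __restrict__ in, float* __restrict__ out, int n)
{
    // The compiler derives a per-thread register budget from the launch
    // bounds and will spill to local memory if it cannot stay within it.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * 2.0f;
}

The global alternative is the compile switch, which applies to every kernel in the file:

nvcc -maxrregcount=32 kernel.cu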
As already indicated, casual usage of these methods is not likely to increase the performance of your code. On the contrary, their usage is more typically (in my experience) associated with reduced performance.
That’s typical in any scenario. In my experience, serious code analysis or optimization work has little or no intersection with the use of -G. I don’t recommend it; it’s unlikely to give you meaningful guidance.
My previous post may have been unclear. The original code didn’t use shared memory, but compiler optimizations were enabled. After I changed the code to use shared memory and disabled compiler optimizations, there was a significant performance improvement. In fact, if I re-enable compiler optimizations after these changes, the performance improves even further.
Below is the data reported by Nsight Compute.
Compute (SM): 7%
Memory: 50%
This is the data obtained with compiler optimizations enabled.
My code is not fully utilizing the GPU’s capabilities; in fact, it can be considered quite inefficient.
Here are the recommendations provided by Nsight:
On average, each warp of this kernel spends 92.7 cycles being stalled waiting for a scoreboard dependency on a L1TEX (local, global, surface, texture) operation. Find the instruction producing the data being waited upon to identify the culprit. To reduce the number of cycles waiting on L1TEX data accesses verify the memory access patterns are optimal for the target architecture, attempt to increase cache hit rates by increasing data locality (coalescing), or by changing the cache configuration. Consider moving frequently used data to shared memory. This stall type represents about 90.4% of the total average of 102.5 cycles between issuing two instructions.
I believe the issue lies in the excessive use of registers, which has led to an Achieved Occupancy of only 41%. Given the GPU’s register count, each thread should ideally use 32 registers to achieve 100% SM utilization. However, currently, each thread is using 72 registers.
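(The figure of 32 assumes 65536 registers and 2048 resident threads per SM: 65536 / 2048 = 32.) For reference, a rough sketch of how the theoretical occupancy limit for a kernel can be checked through the occupancy API; the kernel and block size here are made up:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void heavyKernel(float* data, int n)   // stand-in for the real kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    int blockSize = 256;
    int blocksPerSM = 0;
    // Reports how many blocks can be resident per SM given the kernel's
    // actual register and shared-memory usage.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, heavyKernel, blockSize, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Theoretical occupancy: %.0f%%\n",
           100.0 * blocksPerSM * blockSize / prop.maxThreadsPerMultiProcessor);
    return 0;
}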
I have tried using the -maxrregcount option to limit the register usage. From the generated assembly, it appears that the registers beyond 32 are being spilled to local memory. That data was originally in shared memory, and I believe local memory is not as efficient as shared memory. I don’t quite understand why the compiler is doing this. The result is not good at all; quite the opposite.
Another question concerns Waves Per SM: is a larger value better, or a smaller one?
From the data I’ve gathered, it seems that a larger value is better. When I split the loops in my code so that each thread performs only one operation, the efficiency is highest. However, this results in a significantly larger total number of threads.
But my experience tells me that thread switching incurs overhead. The optimal approach should be to use just enough threads to fully utilize the hardware resources, and then have each thread perform as many iterations as possible. However, in practice, this approach does not achieve better efficiency than the first one, which is quite puzzling to me.
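For illustration, the two variants look roughly like this (the kernel bodies are simplified placeholders, not my real code):

// Variant 1: one element per thread, i.e. a very large total thread count
// and many waves per SM.
__global__ void onePerThread(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * in[i];
}

// Variant 2: launch "just enough" threads and let each one iterate over
// several elements with a grid-stride loop.
__global__ void gridStride(const float* in, float* out, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        out[i] = in[i] * in[i];
}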
Thank you for your reply. I have tried using -maxrregcount to limit the register usage, but the results were not good because the spilled registers are placed in local memory.
I have provided a detailed description of my issue in another reply.
Your slow performance was neither because of register pressure nor occupancy nor missing optimization settings, but because you use dynamic indices to access arrays.
By dynamic I mean indices that are not easily known at compile time and constant across all threads, or that even depend on loop iterators.
This prevents the arrays from being stored in registers, so local memory (which physically resides in global memory) has to be used instead.
Shared memory is another memory type that supports dynamic indices, but I suspect your accesses are still quite inefficient due to bank conflicts, and you probably have quite a lot of shared memory accesses, so even with bank conflicts prevented, it could still be the bottleneck in the end.
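For illustration, a minimal sketch of both points (array sizes and names are made up; the block is assumed to be 32x32 threads):

__global__ void indexingExample(const int* idx, float* out)
{
    // A per-thread array indexed with a value only known at run time:
    // the compiler typically cannot keep it in registers, so it is placed
    // in local memory, which physically resides in global memory.
    float scratch[32];
    for (int k = 0; k < 32; ++k)
        scratch[k] = (float)k;
    float a = scratch[idx[threadIdx.x]];          // dynamic index -> local memory traffic

    // Shared memory supports dynamic indexing on-chip. Padding the inner
    // dimension (33 instead of 32) is a common way to avoid bank conflicts
    // when a warp accesses a column of the tile.
    __shared__ float tile[32][33];
    tile[threadIdx.y][threadIdx.x] = a;
    __syncthreads();
    float b = tile[threadIdx.x][threadIdx.y];     // column access, conflict-free thanks to the padding

    out[threadIdx.y * 32 + threadIdx.x] = b;      // single-block sketch
}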