I recently worked on a CUDA application and used the live register count in the Nsight Compute Source/SASS page to identify the register hotspot. I noticed that most SASS instructions used around 100 live registers, except the CALL.ABS.NOINC instruction, which used 249 live registers (see the attached screenshot). Does anyone have an idea what the CALL.ABS.NOINC instruction is? Is there any way we can avoid it? Thanks
CALL refers to a function call.
If you use nvdisasm you will be able to see the target of that particular call (or you may be able to quickly deduce it from the source file itself). In that nvdisasm doc section, you can see a CALL.ABS.NOINC disassembled instruction in the example.
The compiler normally attempts to inline function calls, but one of the cases where it may not inline a function call is if you are compiling with -rdc=true. If that is the case, then the way to get the register usage down would be to allow the call to be inlined by moving the function definition into the same compilation unit as the call to that function.
There might be other cases where the call is not inlined. The extra register usage may be due to the parameters stored for the function call or registers used by the called function.
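As an illustration of the difference (a minimal sketch with made-up file and function names, not the actual code under discussion):

```cuda
// scale.cuh -- hypothetical helper whose definition is visible to the caller
__device__ __forceinline__ float scale(float x, float f)
{
    return x * f;   // definition visible here: the compiler can inline it
}

// main.cu
#include "scale.cuh"

// With the definition in the same compilation unit (and without -rdc=true),
// the call to scale() is typically inlined, so no CALL.ABS.NOINC appears.
// If scale() were only *declared* here and defined in a separate .cu file
// compiled with -rdc=true, the SASS would instead contain a true call.
__global__ void kernel(float *out, const float *in, float f, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = scale(in[i], f);
    }
}
```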
That looks like a call to a subroutine that performs floating-point division. These need a fair number of registers (even more so in the case of double-precision division). There is nothing that can be done about these subroutines; they already use as few registers as possible for a high-performance implementation.
If the register pressure constitutes a real problem, you could try replacing the division with a multiplication by the reciprocal (the reciprocal subroutine requires fewer registers than the division subroutine). You could also try compiling with
-ftz=true or even
-use_fast_math, which however trigger a wider set of effects that you may not desire for this code (see the documentation).
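A sketch of the suggested reciprocal rewrite (the kernels and names here are made up for illustration):

```cuda
// Before: each element performs a full-precision division, which calls
// the division subroutine.
__global__ void normalize_div(float *v, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] = v[i] / s;
}

// After: compute the reciprocal once per thread; the reciprocal
// subroutine needs fewer registers, and the per-element work becomes
// a multiply. Note the result may differ from true division in the
// last bits, so verify accuracy is acceptable for your use case.
__global__ void normalize_rcp(float *v, float s, int n)
{
    float r = 1.0f / s;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] = v[i] * r;
}
```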
Indeed, the occupancy is limited by register usage; that’s why I started to investigate the register-usage hotspot. Many thanks for the suggestion. I will try adding the compiler flags and see if we can alleviate the register pressure issue. By the way, I could not see the link to the document you referred to. Thanks
Hi, njuffa, thanks for your insightful comments. Can you update the document link? Thanks
Does inlining usually use fewer registers than a function call? A related question: when I write a kernel function, should I put all the source code within the kernel instead of wrapping part of it in device functions and calling those device functions from the kernel?
Inlining has a number of benefits. It certainly has the potential to use fewer registers than a function call. However, in the case of floating-point division, you don’t have much control over the compiler’s behavior here.
Of course, it is good coding practice to use functions within C++ for code modularity. If you make function calls in kernel code to functions that are defined in the same compilation unit (file/module) and don’t compile with
-rdc=true, there should be no downside. I also personally wouldn’t be afraid to use
-rdc=true. It’s an important feature, and you should only be concerned about it if you can demonstrate that the additional function-call overhead it may impose is actually important, performance-wise, for your kernel code.
In my case, the device functions called in the kernel code are defined in a separate file. What would be the performance penalty for this coding style, if there is any? Would you suggest putting the device functions in the same compilation unit/file as the kernel code?
It’s best to try it out. You can find questions about this on various forums if you do some research. You can demonstrate edge cases that show almost no overhead all the way to 90% overhead. So it really depends on your code.
If they are defined in a separate file but you are
#include-ing that file, there is no concern there. The key differentiator is whether you are compiling with
-rdc=true or not.
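To spell out the two build modes (file names are hypothetical):

```shell
# Hypothetical layout: kernel.cu calls device functions defined in helpers.cu

# Separate compilation: cross-file device calls remain true calls
# (CALL.ABS.NOINC in the SASS)
nvcc -rdc=true -c kernel.cu helpers.cu
nvcc -rdc=true kernel.o helpers.o -o app

# Whole-program compilation: if kernel.cu instead #includes the helper
# definitions, everything is in one compilation unit and can be inlined
nvcc kernel.cu -o app
```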
Of course, none of this discussion applies to your usage of floating-point division.
Got it. Let me search the forums and play with different setups. It is interesting to learn how things are translated by the compiler.
This is an example, though there is not much hard data there. Also,
nvcc recently introduced link-time optimization options that attempt to mitigate some of these effects.
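For reference, a device link-time-optimization build looks roughly like this in newer CUDA toolkits (file names are hypothetical):

```shell
# With -dlto, device code compiled under -rdc=true can still be
# optimized (including inlining across files) at device link time
nvcc -rdc=true -dlto -c kernel.cu helpers.cu
nvcc -rdc=true -dlto kernel.o helpers.o -o app
```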
Thanks for the example. Will read it carefully. Very interesting.
I think I demonstrated a while ago that, from a performance perspective, single-precision floating-point division would benefit (fairly significantly for the use case discussed in a past forum thread) from having the fast path inlined and only the slow path remain as a called subroutine.
However, I would expect that if that approach were adopted, it would probably not help with register pressure at all, because the slow path would remain the “fat point” of static register usage. Even if the code were inlined entirely (not attractive for various reasons, such as instruction-cache usage), the need for a bunch of temporary variables inside division algorithms would not go away.
The underlying issue is a conscious trade-off made by the GPU architects to minimize hardware complexity by forcing all divisions (integer and floating-point) to be implemented in software and thus move complexity to that side. Which, when viewed across the totality of use cases that benefit from GPU acceleration, is absolutely the right decision to make.
First, an update on the earlier suggestion of using -use_fast_math: it does reduce the live register count on the Source page and improves performance by 30% (though I will need to make sure the answer is still correct).
A follow-up question on a related topic (please advise if I should create a new topic for this question). Since register usage is the limiting factor for the low occupancy in this case, I tried two approaches to control the register count per thread: 1) set maxrregcount 64 in the NVCC compiler flags; 2) use __launch_bounds__(512, 2). My understanding is that both approaches should limit the register count per thread to 64. However, the first approach works well, while the second caused the program to crash. Any suggestion on where I can find the cause of the crash? Thanks
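For reference, here is how the second approach would look in source (a sketch with a made-up kernel; the register arithmetic assumes an SM with 65536 registers):

```cuda
// Up to 512 threads per block, at least 2 resident blocks per SM.
// On an SM with 65536 registers this caps register usage at roughly
// 65536 / (512 * 2) = 64 registers per thread, similar to
// -maxrregcount=64 but applied to this kernel only.
__global__ void __launch_bounds__(512, 2) my_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}
```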
What is the exact nature of this “crash”? Does the kernel launch fail with a “too many resources requested” error? If so, what does the host code do in response to a failed kernel launch?
In general I would advise against the usage of
__launch_bounds__ with modern GPUs. In all likelihood what will happen is that increased occupancy is traded for reduced code execution speed, with the result that there is no increase in actual performance. Occupancy correlates only loosely with performance.
An experiment cannot hurt, but the performance results are not necessarily portable to a different GPU, especially if that GPU has a different architecture.
The program crashed not at the kernel launch but at the beginning, where a cudaMalloc is followed by a cudaMemcpy. The experiment shows that adjusting maxrregcount can improve performance by 10%. But you are right that this is not portable, and a different number might be needed when running on a different architecture.
“Crash” is a very imprecise description, and therefore best avoided in discussions of program behavior. I assume what happens here is that the
cudaMalloc fails and sets an appropriate status, which however is not checked, so that execution proceeds to the
cudaMemcpy, which is passed an invalid pointer, causing a segmentation fault that terminates the program.
If so: at the moment I fail to see any connection between the use of
__launch_bounds__ and the failure of a
cudaMalloc call. Generally speaking, it is a best practice to check the status return of calls to allocation functions and have the program take appropriate action.
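One common way to implement that status checking (a sketch; the macro name is arbitrary):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA runtime call so a failure is reported at the call
// site instead of surfacing later as a segfault on an invalid pointer.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                   \
                    cudaGetErrorString(err_), __FILE__, __LINE__);        \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

int main()
{
    float *d_buf = nullptr;
    // If this allocation fails, we find out here instead of at the
    // later cudaMemcpy that would receive an invalid pointer.
    CUDA_CHECK(cudaMalloc(&d_buf, 1024 * sizeof(float)));
    float h_buf[1024] = {0};
    CUDA_CHECK(cudaMemcpy(d_buf, h_buf, sizeof(h_buf),
                          cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaFree(d_buf));
    return 0;
}
```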
Yes, that is exactly what I have observed, a segmentation fault at cudaMemcpy. I will follow your suggestion and check the return status for all CUDA calls and keep you posted for more details. Thanks!