CALL.ABS.NOINC instruction in SASS

beiwang · September 14, 2021, 8:01pm

I recently worked on a CUDA application and used live registers # at Night Compute source/SASS page to identify the register hotspot. I’ve noticed that most instructions at SASS used around 100 live registers except the CALL.ABS.NOINC instruction which used 249 live registers (see the attached screenshot). Does anyone have idea about the CALL.ABS.NOINC call? Is there any way we can avoid it? Thanks

Robert_Crovella · September 14, 2021, 8:10pm

call refers to a function call

if you use nvdisasm you will be able to see the target of that particular call (or you may be able to quickly deduce it from the source file itself. In that nvdisasm doc section, you can see a call.abs.noinc disassembled instruction in the example.)

The compiler normally attempts to inline function calls, but one of the cases were it may not inline a function call is if you are using -rdc=true. If that is the case, then the way to get the register usage down would be to allow the call to be inlined by moving the function target to the same compilation unit as the call to that function.

There might be other cases where the call is not inlined. The extra register usage may be due to the parameters stored for the function call or registers used by the called function.

njuffa · September 14, 2021, 8:12pm

That looks like a call to a called subroutine that performs floating-point division. These need a fair amount of registers (even more so in the case of double-precision division). There is nothing that can done about these subroutines; they already use as few registers as is possible for a high-performance implementation.

If the register pressure constitutes a real problem, you could try replacing the division by multiplication with the reciprocal (the reciprocal subroutine requires fewer registers compared to the division subroutine). You could also try compiling with -ftz=true or even -use_fast_math=true, which however trigger a wider set of effects that you may not desire for this code (see the documentation).

beiwang · September 14, 2021, 8:18pm

Indeed. The occupancy is limited by register usage. That’s why I started to investigate the hotspot for register usage. Many thanks for the suggestion. I will try to add the compiling flags and see if we can alleviate register pressure issue. btw, I could not see the link to the document you referred to. Thanks

beiwang · September 14, 2021, 8:25pm

Hi, njuffa, thanks for your insightful comments. Can you update the document link? Thanks

njuffa · September 14, 2021, 9:03pm

beiwang · September 15, 2021, 5:33pm

Does inlining usually use less registers than a function call? Another related question, when I write a kernel function, should I put all source code within the kernel instead of wrapping part of them in device functions and calling those device functions in the kernel?

Robert_Crovella · September 15, 2021, 5:50pm

inlining has a number of benefits. It certainly has the potential to use less registers than a function call. However, in the case of floating point division, you don’t have much control over compiler behavior here.

Robert_Crovella · September 15, 2021, 6:16pm

Of course, it is good coding practice to use functions within C++ for code modularity. If you make function calls in kernel code to functions that are defined in the same compilation unit (file/module) and don’t compile with -rdc=true there should be no downside. I also personally wouldn’t be afraid to use -rdc=true. It’s an important feature, and you should only be concerned about it if you can demonstrate that the additional function call overhead that it may impose is actually important, performance-wise, for your kernel code.

beiwang · September 15, 2021, 6:29pm

In my case, the device functions called in a kernel code are defined in a seperate file. What would be the performance penalty for this coding style, if there is any? Would you suggest put the device functions in the same compilation unit/file as the kernel code?

Robert_Crovella · September 15, 2021, 6:31pm

It’s best to try it out. You can find questions about this on various forums if you do some research. You can demonstrate edge cases that show almost no overhead all the way to 90% overhead. So it really depends on your code.

If they are defined in a separate file but you are #include -ing that file, there is no concern there. The key differentiator is if you are compiling with -dc or -rdc=true, or not.

Of course, none of this discussion applies to your usage of floating-point division.

beiwang · September 15, 2021, 6:40pm

Got it. Let me search on the forums and play with different setups. It is interesting to learn how things are translated behind the compiler.

Robert_Crovella · September 15, 2021, 6:47pm

This is an example. Not much hard data there though. Also, nvcc recently introduced link-time-optimization options to attempt to help mitigate some of the effects.

beiwang · September 15, 2021, 7:49pm

Thanks for the example. Will read it carefully. Very interesting.

njuffa · September 15, 2021, 8:13pm

I think I demonstrated some while ago that, from a performance perspective, single-precision floating-point division would benefit (fairly significantly for the use case discussed in a past forum thread) from having the fast path inlined and only the slow path remain as a called subroutine.

However, I would expect that if that approach were to be adopted, it would probably not help with register pressure at all, because the slow path would remain as the “fat point” of static register usage . Even if the code were to be inlined entirely (not attractive for various reasons such as instruction cache usage) the need for a bunch of temporary variables inside division algorithms would not go away.

The underlying issue is a conscious trade-off made by the GPU architects to minimize hardware complexity by forcing all divisions (integer and floating-point) to be implemented in software and thus move complexity to that side. Which, when viewed across the totality of use cases that benefit from GPU acceleration, is absolutely the right decision to make.

beiwang · September 21, 2021, 6:26pm

First an update about the earlier suggestion for using -use_fast_math. It does reduce the live register # in the source page and improve the performance by 30% ( though I will need to make sure the answer is correct.)

A follow up question for a related topic (please advise if I should create a new topic for this question). Since the register usage is the limiting factor for low occupancy in this case, I try to use two approaches to control the register # per thread. 1) set maxrregcount 64 in the NVCC compiler flag 2) use launch_bounds(512, 2). My understanding is both approaches should limit the register # per thread to 64. However, the first approach works well, while the second caused the program to crash. Any suggestion of where I can possibly find the cause of crash? Thanks

njuffa · September 21, 2021, 6:31pm

What is the exact nature of this “crash”? Does the kernel launch fail with a “too many resources requested” error? If so, what does the host code do in response to a failed kernel launch?

In general I would advise against the usage of -maxrregcount or __launch_bounds__ with modern GPUs. In all likelihood what will happen is that increased occupancy is traded for reduced coda execution speed, with the result that there is no increase in actual performance. Occupancy correlates only loosely with performance.

An experiment cannot hurt, but the performance results are not necessarily portable to a different GPU, especially if that GPU has a different architecture.

beiwang · September 21, 2021, 7:40pm

The program crashed not at the kernel launch but at the beginning where cudaMalloc is followed by a cudaMemcpy. The experiment shows that adjusting the maxrregcount could improve the performance by 10%. But you are right this is not portable and a different number might be needed when running on different architectures.

njuffa · September 21, 2021, 7:48pm

“Crash” is a very imprecise description, and therefore best avoided in discussions of program behavior. I assume what happens here is that the cudaMalloc fails and sets an appropriate status, which however is not checked, so that the execution proceeds to the cudaMemcpy it is passed an invalid pointer which causes a segmentation fault which causes the program to terminate.

If so: At the moment I fail to see any connection between the use of __launch_bounds__ and the failure of a cudaMalloc call. Generally speaking, it is a best practice to check the status return of calls to allocation functions and have the program take appropriate action.

beiwang · September 21, 2021, 8:05pm

Yes, that is exactly what I have observed, a segmentation fault at cudaMemcpy. I will follow your suggestion and check the return status for all CUDA calls and keep you posted for more details. Thanks!

Topic		Replies	Views
Contraditory register count report when calling a non-inlined function Nsight Compute	6	1357	April 26, 2024
Register usage spike in SASS with divison slow/full path CUDA Programming and Performance cuda	13	258	December 12, 2025
Register usage of a device function for vector rotation CUDA Programming and Performance	14	797	June 12, 2022
Register usage and mixing small and large kernels CUDA NVCC Compiler	0	61	June 23, 2025
Register usage Registers used for code that doesn't get run? (if's) CUDA Programming and Performance	7	1724	March 4, 2009
Register usage for same function with different launches CUDA Programming and Performance	5	884	March 18, 2014
Live registers at the start of the kernel? Nsight Compute	2	1562	May 27, 2024
Inline functions not inlined in CUDA 6.5? CUDA Programming and Performance	7	6041	November 29, 2014
Used Registers vs Live Registers CUDA Programming and Performance	14	3638	June 28, 2020
Problem with reducing registers CUDA Programming and Performance	6	689	June 22, 2011

CALL.ABS.NOINC instruction in SASS

Related topics