How may I output source code information in the assembly output

asterix_obelix · March 6, 2024, 7:16am

Hi, I have an application that I am trying to profile. After some profiling, I identified register pressure as the issue, and I wanted to look at the assembly code to better understand which source lines were causing the register spillage. I compiled this for A100 GPUs with CUDA 12.0, adding the following to my CMAKE_CUDA_FLAGS: -g --save-temps -lineinfo. However, the PTX files contain no information about the source code lines. Instead there are a bunch of $L__info_string* references. Below is a snippet of what I am getting. What could I be doing wrong?

$L__BB1_5:
.loc 4 336 9, function_name $L__info_string5, inlined_at 4 353 25
cvt.u32.u64 %r50, %rd9;
shr.u64 %rd26, %rd143, %r50;
.loc 4 354 21, function_name $L__info_string4, inlined_at 4 363 20
.loc 4 346 9, function_name $L__info_string8, inlined_at 4 354 21
mul.lo.s64 %rd59, %rd26, %rd10;
sub.s64 %rd27, %rd23, %rd59;
.loc 3 1865 9, function_name $L__info_string2, inlined_at 2 759 203
cvt.u32.u64 %r51, %rd22;
.loc 3 1865 52, function_name $L__info_string2, inlined_at 2 759 203
add.s32 %r17, %r5, %r51;
setp.ge.s32 %p4, %r42, %r43;
.loc 2 759 219, function_name $L__info_string11, inlined_at 1 16 49
.loc 2 708 9, function_name $L__info_string12, inlined_at 2 759 219
.loc 5 304 5, function_name $L__info_string13, inlined_at 2 708 9
@%p4 bra $L__BB1_13;

striker159 · March 6, 2024, 8:09am

You are doing nothing wrong. What you show is the line information.

Typically you would use a tool like cuobjdump or nvdisasm (CUDA Binary Utilities) to get the annotated assembly from the compiled program or object file.
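
A minimal sketch of that workflow, assuming a hypothetical one-kernel file saxpy_demo.cu and an A100 target (sm_80). The file name and kernel are illustrative and do not come from the thread. The -lineinfo flag keeps line tables in the cubin, which nvdisasm --print-line-info can then use to annotate the SASS.

```shell
# Write a tiny illustrative kernel (hypothetical, not from the original post).
cat > saxpy_demo.cu <<'EOF'
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}
EOF

if command -v nvcc >/dev/null 2>&1; then
    # Compile device code only and keep line tables; sm_80 assumes an A100.
    nvcc -arch=sm_80 -lineinfo -cubin -o saxpy_demo.cubin saxpy_demo.cu
    # Plain SASS disassembly of the cubin.
    cuobjdump --dump-sass saxpy_demo.cubin
    # SASS annotated with source file and line information.
    nvdisasm --print-line-info saxpy_demo.cubin
else
    echo "nvcc not found; skipping compilation and disassembly"
fi
```

For a complete executable, cuobjdump --dump-sass can be pointed at the binary directly, whereas nvdisasm works on cubin files, which can be extracted first with cuobjdump -xelf all.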

njuffa · March 6, 2024, 8:10am

The code shown above is PTX. This is a compiler intermediate format and virtual ISA. PTX code uses virtual registers, which are created in SSA (static single assignment) fashion; that is, a new register is used for each new instruction output created.

Allocation of physical registers occurs as part of the PTX to SASS (machine code) translation as instruction selection and register allocation are GPU architecture dependent. This work is done by ptxas, which is an optimizing compiler. When looking at register pressure, what is relevant is therefore SASS (e.g. from cuobjdump --dump-sass). For examining the “fat parts” of SASS in terms of register usage, you may want to take a look at using nvdisasm --print-life-ranges.

High register pressure != register spillage. High register usage may lead to register spillage, but often it does not. When you add -Xptxas -v to the nvcc command line, what do the resulting basic usage statistics look like?
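
As a rough sketch of the two checks suggested above, the commands below ask ptxas for its usage statistics and then print register live ranges from the resulting cubin. The file saxpy_demo.cu and its kernel are hypothetical stand-ins, and sm_80 assumes an A100.

```shell
# Write a tiny illustrative kernel (hypothetical, not from the original post).
cat > saxpy_demo.cu <<'EOF'
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}
EOF

if command -v nvcc >/dev/null 2>&1; then
    # -Xptxas -v makes ptxas report, per kernel, the registers used plus
    # the stack frame, spill store, and spill load sizes in bytes.
    nvcc -arch=sm_80 -lineinfo -Xptxas -v -cubin -o saxpy_demo.cubin saxpy_demo.cu
    # Show which registers are live at each SASS instruction.
    nvdisasm --print-life-ranges saxpy_demo.cubin
else
    echo "nvcc not found; skipping compilation and disassembly"
fi
```

In the ptxas report, nonzero spill stores or spill loads indicate actual spilling. A high register count on its own does not.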

asterix_obelix · March 6, 2024, 8:24am

Thanks. Am I misremembering this, or was annotating assembly code always an extra step with CUDA? I was assuming I could get an annotated version of the PTX code the same way I do for x86 upon compilation, e.g.

.Ltmp13:
%bb.10:
#DEBUG_VALUE: init:this ← $rbx
#DEBUG_VALUE: init:reactor_type ← [DW_OP_LLVM_entry_value 1] $esi
.loc 15 23 29 is_stmt 1 Submodules/PelePhysics/Source/Reactions/ReactorCvode.cpp:23:29
leaq 476(%rbx), %rdx
.Ltmp14:
leaq 64(%rsp), %rdi
.loc 15 23 6 is_stmt 0 Submodules/PelePhysics/Source/Reactions/ReactorCvode.cpp:23:6
movl $.L.str.7, %esi
xorl %ecx, %ecx
callq _ZNK5amrex9ParmParse5queryEPKcRii

rs277 · March 6, 2024, 8:26am

In addition to the above advice, if you’re using Nsight Compute, you can correlate the source with either PTX or SASS; example here.

asterix_obelix · March 6, 2024, 8:31am

Thanks, I will try using both. I know there is register spillage because I already looked at the code a while back, but back then I was working on different hardware with a different set of profiling tools. But you are right, the two terms shouldn’t be used interchangeably.

njuffa · March 6, 2024, 8:40am

It is not clear from the information provided how it was determined that register spilling occurs. I’ll note that the use of local memory in SASS by itself is not a reliable indication of register spilling.

Robert_Crovella · March 6, 2024, 2:25pm

nvcc has a -src-in-ptx switch. To get the desired output you must also use either -G or -lineinfo on the compilation command line, along with -src-in-ptx.
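
A minimal sketch of that combination, assuming a hypothetical file saxpy_demo.cu and an sm_80 target; neither name comes from the thread. The -ptx flag stops after PTX generation, so the interleaved source lines can be read directly in the output file.

```shell
# Write a tiny illustrative kernel (hypothetical, not from the original post).
cat > saxpy_demo.cu <<'EOF'
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}
EOF

if command -v nvcc >/dev/null 2>&1; then
    # -src-in-ptx only takes effect together with -lineinfo (or -G).
    nvcc -arch=sm_80 -lineinfo -src-in-ptx -ptx -o saxpy_demo.ptx saxpy_demo.cu
    # The PTX now carries the source text as comments next to the .loc directives.
    cat saxpy_demo.ptx
else
    echo "nvcc not found; skipping PTX generation"
fi
```

In the setup described in the original question, this would presumably mean adding -src-in-ptx to CMAKE_CUDA_FLAGS alongside the existing -lineinfo and --save-temps.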
