Inline functions not inlined in CUDA 6.5?

EDIT: What I actually did was a typing error which resulted in optimizing out some of sqrt() calls.

Hello,

I have a piece of code which extensively uses float4 variables to speed up memory transactions. Since CUDA does not support any arithmetic operations on vector data types, I have used Nvidia’s helper_math.h and made a few similar wrappers for __shfl_xor(float4, unsigned) and similarly for sqrt() and operator +=. All wrappers are declared inline host device like the others from Nvidia’s helper_math.h.

The problem is, even with -O3, -Xcompiler -O3 and --restrict nvcc flags the functions seem not to be inlined. I mean, it seems that there are function calls in the PTX to a _Z4sqrt6float4() which corresponds to my sqrt() wrapper.

We’ve been taught that all function calls within a kernel are inlined into one kernel function. But even if there is an inlinement after this PTX code is generated (is there?), the kernel is sped up by 25% if I manually insert the sqrt() wrapper’s body in place where it is called. So this doesn’t even seem to be a compiler’s optimization (btw, all the mentioned compiler flags have absolutely no effect on the execution time).

My example code:

inline host device float4 sqrt(float4 a) {
float4 b;
b.x = sqrt(a.x);
b.y = sqrt(a.y);
b.z = sqrt(a.z);
b.w = sqrt(a.w);
return b;
}

global void kernel(float4* p) {

for (…) {
float4 a = p[blockIdx.x*blockDim.x + threadIdx.x];

float4 b = sqrt(a);
}

}

Compiled using the following command (CUDA 6.5):

nvcc -arch=compute_50 -code=compute_50,sm_50 -prec-sqrt=false -Xcompiler -O3,-Wall --restrict -O3 framework.cu -o framework

Am I doing anything wrong or is it an nvcc bug?
Thank you in advance for any help.

Try using the forceinline qualifier and see if that makes a difference.

No, not at all.

sqrt() is for 64-bit doubles, so you should be using sqrtf() instead for 32-bit floats

That probably is not causing your issue however

What you’re doing wrong is looking at the PTX. Look at the SASS. It’s rarely a good idea to judge code based on the PTX.

Another suggestion I have is to include a simple, complete code, that others can copy, paste and compile, and inspect. If you want help, the smart move is to make it easy for others to help you. This also helps to preclude the possibility of a difference between what others are trying to do (guess at) and what you intend to do.

Here’s a simple, complete code built around the snippets you have shown:

#include <vector_types.h>

inline __host__ __device__ float4 my_sqrt(float4 a) {
  float4 b;
  b.x = sqrt(a.x);
  b.y = sqrt(a.y);
  b.z = sqrt(a.z);
  b.w = sqrt(a.w);
  return b;
}

__global__ void kernel(float4* p) {

  float4 a = p[blockIdx.x*blockDim.x + threadIdx.x];
  float4 b = my_sqrt(a);
  p[blockIdx.x*blockDim.x + threadIdx.x] = b;

}
int main(){

  float4 *d_data = NULL;
  kernel<<<1,1>>>(d_data);
  cudaDeviceSynchronize();
  return 0;
}

note that I have chosen to give my function a name other than sqrt, to ease in disambiguation later.

I chose to compile with:

nvcc -O3 -arch=sm_50 -lineinfo -cubin -o t608.cubin t608.cu

I then disassembled the resultant SASS with:

nvdisasm --print-line-info t608.cubin

and the relevant part of the output (the kernel code) is:

//--------------------- .text._Z6kernelP6float4   --------------------------
        .section        .text._Z6kernelP6float4,"ax",@progbits
        .sectioninfo    @"SHI_REGISTERS=16"
        .align  32
        .global         _Z6kernelP6float4
        .type           _Z6kernelP6float4,@function
        .size           _Z6kernelP6float4,(.L_36 - _Z6kernelP6float4)
        .other          _Z6kernelP6float4,@"STO_CUDA_ENTRY STV_DEFAULT"
_Z6kernelP6float4:
.text._Z6kernelP6float4:
        /*0008*/                   MOV R1, c[0x0][0x20];
        //## File "/home-2/robertc/misc/t608.cu", line 14
        /*0010*/                   S2R R0, SR_CTAID.X;
        /*0018*/                   S2R R2, SR_TID.X;
        /*0028*/                   MOV R3, c[0x0][0x8];
        /*0030*/                   XMAD R2, R3.reuse, R0.reuse, R2;
        /*0038*/                   XMAD.MRG R0, R3, R0.H1, RZ;
        /*0048*/                   XMAD.PSL.CBCC R0, R3.H1, R0.H1, R2;
        /*0050*/                   SHL.W R2, R0.reuse, 0x4;
        /*0058*/                   SHR.U32 R0, R0, 0x1c;
        /*0068*/                   IADD R2.CC, R2, c[0x0][0x140];
        /*0070*/                   IADD.X R3, R0, c[0x0][0x144];
        /*0078*/                   LDG.E.128 R8, [R2];
        //## File "/shared/apps/cuda/CUDA-v6.5.14/include/device_functions.h", line 3964
        /*0088*/                   MOV R4, R11;
        /*0090*/                   CAL `($_Z6kernelP6float4$__cuda_sm20_sqrt_rn_f32);
        /*0098*/                   MOV R7, R4;
        /*00a8*/                   MOV R4, R10;
        /*00b0*/                   CAL `($_Z6kernelP6float4$__cuda_sm20_sqrt_rn_f32);
        /*00b8*/                   MOV R6, R4;
        /*00c8*/                   MOV R4, R9;
        /*00d0*/                   CAL `($_Z6kernelP6float4$__cuda_sm20_sqrt_rn_f32);
        /*00d8*/                   MOV R5, R4;
        /*00e8*/                   MOV R4, R8;
        /*00f0*/                   CAL `($_Z6kernelP6float4$__cuda_sm20_sqrt_rn_f32);
        //## File "/home-2/robertc/misc/t608.cu", line 16
        /*00f8*/                   STG.E.128 [R2], R4;
        /*0108*/                   DEPBAR {0};
        //## File "/home-2/robertc/misc/t608.cu", line 18
        /*0110*/                   EXIT;
        .weak           $_Z6kernelP6float4$__cuda_sm20_sqrt_rn_f32
        .type           $_Z6kernelP6float4$__cuda_sm20_sqrt_rn_f32,@function
        .size           $_Z6kernelP6float4$__cuda_sm20_sqrt_rn_f32,($_Z6kernelP6float4$__cuda_sm20_sqrt_rn_f32_slowpath - $_Z6kernelP6float4$__cuda_sm20_sqrt_rn_f32)
$_Z6kernelP6float4$__cuda_sm20_sqrt_rn_f32:
        /*0118*/                   IADD32I R0, R4, -0xd000000;
        /*0128*/                   ISETP.LE.U32.AND P0, PT, R0, c[0x2][0x0], PT;
        /*0130*/               @P0 BRA `(.L_1);
        /*0138*/                   CAL `($_Z6kernelP6float4$__cuda_sm20_sqrt_rn_f32_slowpath);
        /*0148*/                   BRA `(.L_2);
.L_1:
        /*0150*/                   MUFU.RSQ R13, R4;
        /*0158*/                   FMUL.FTZ R14, R13, R4;
        /*0168*/                   F2F.FTZ.F32.F32 R15, -R14;
        /*0170*/                   FMUL.FTZ R13, R13, 0.5;
        /*0178*/                   FFMA R0, R15, R14, R4;
        /*0188*/                   FFMA R4, R0, R13, R14;
.L_2:
        /*0190*/                   RET;
        .weak           $_Z6kernelP6float4$__cuda_sm20_sqrt_rn_f32_slowpath
        .type           $_Z6kernelP6float4$__cuda_sm20_sqrt_rn_f32_slowpath,@function
        .size           $_Z6kernelP6float4$__cuda_sm20_sqrt_rn_f32_slowpath,(.L_36 - $_Z6kernelP6float4$__cuda_sm20_sqrt_rn_f32_slowpath)
$_Z6kernelP6float4$__cuda_sm20_sqrt_rn_f32_slowpath:
        /*0198*/                   LOP.AND.NZ P0, RZ, R4, c[0x2][0x4];
        /*01a8*/              @!P0 BRA `(.L_3);
        /*01b0*/                   FSETP.LT.FTZ.AND P0, PT, R4, RZ, PT;
        /*01b8*/              @!P0 BRA `(.L_4);
        /*01c8*/                   MOV32I R4, 0x7fffffff;
        /*01d0*/                   BRA `(.L_3);
.L_4:
        /*01d8*/                   FSETP.LE.FTZ.AND P0, PT, |R4|, +INF , PT;
        /*01e8*/               @P0 BRA `(.L_5);
        /*01f0*/                   FADD.FTZ R4, R4, 1;
        /*01f8*/                   BRA `(.L_3);
.L_5:
        /*0208*/                   FSETP.EQ.FTZ.AND P0, PT, |R4|, +INF , PT;
        /*0210*/               @P0 BRA `(.L_3);
        /*0218*/                   FFMA R0, R4, 1.84467440737095516160e+19, RZ;
        /*0228*/                   MUFU.RSQ R11, R0;
        /*0230*/                   FMUL.FTZ R13, R11, R0;
        /*0238*/                   F2F.FTZ.F32.F32 R14, -R13;
        /*0248*/                   FMUL.FTZ R11, R11, 0.5;
        /*0250*/                   FFMA R0, R14, R13, R0;
        /*0258*/                   FFMA R0, R0, R11, R13;
        /*0268*/                   FMUL.FTZ R4, R0, 2.3283064365386962891e-10;
.L_3:
        /*0270*/                   RET;
.L_6:
        /*0278*/                   BRA `(.L_6);
.L_36:

(the line 14 that is referenced on line 12 of the above output refers to the source code, it is the first non-whitespace line in the kernel body code)

So the function is inlined. There is no CAL to my_sqrt or anything like that, and the only CAL instructions in the kernel code pertain to CUDA math library routines to perform the basic sqrt function on POD types.

Regarding your statement about modifying the PTX and seeing a 25% speedup, I have no idea what you did. If you want to post enough information so that someone could reproduce your results, you may get more useful help.

If you want to learn how to use the binary utilities like nvdisasm, start here:

http://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#abstract

note that -prec-sqrt=false will affect the above output, but not in any way that pertains to inlining of my_sqrt

and if you want to see what it looks like when the my_sqrt function is actually called from kernel code, change the inline to noinline in the source code I have posted, and repeat the steps.

I can’t reproduce the reported issue. I compiled Kyselejsyrecek’s code with -arch=sm_50 -prec-sqrt=false and find that the function sqrt(float4) was inlined as desired. See disassembly at the end.

I am not sure where you learned this, but your source was/is mistaken. It was not even completely true pre-ABI, because various code provided by NVIDIA came in the form of called subroutines using compiler internal calling conventions. For example, you will find that if your code contains integer divisions with variable divisor the resulting machine code (SASS) contains a subroutine invoked with the CAL instruction when built for an sm_1x target. You can easily verify this by running the executable produced by nvcc through cuobjdmp --dump-sass.

In general emulated PTX instructions such as integer and floating-point division may use inlined code or called subroutines as the NVIDIA engineers see fit. There are trade-offs between call overhead and code size to be considered when making these decisions.

For user code, inlining is at the discretion of the compiler (which has heuristics controlling this) but programmers can also affect inlining behavior by use of the noinline and forceinline function attributes. As with most C++ compilers regardless of platform, the inline attribute is largely ignored as far as actual inlining goes, since it is a hint.

Some examples (CUDA 6.5, architecture >= sm_20):

  • 32-bit integer division is an inlined sequence of about 15 instructions. 64-bit integer division on the other hand is a called subroutine of about 70 instructions. For sm_50 the instruction count is higher for either case but I don't know what the respective count is off the top of my head
  • The approximate single-precision square root produced when compiling with -prec-sqrt=false is an inlined code sequence, while the IEEE-rounded single-precision square root produced when compiling with -prec-sqrt=true is a called subroutine.
  • While functions from the CUDA math library are generally declared __forceinline__ some of the internal functions use the __noinline__ attribute and do not get inlined, in particular the slow-path code in the argument reduction for the trigonometric functions sin(), cos(), tan(), sincos().
arch = sm_50
code version = [1,7]
producer = cuda
host = windows
compile_size = 64bit
identifier = hello.cu

        code for sm_50
                Function : _Z6kernelI6float4EvPT_PKS1_
        .headerflags    @"EF_CUDA_SM50 EF_CUDA_PTX_SM(EF_CUDA_SM50)"
                                                                                                  /* 0x003fb400e3a007e6 */
        /*0008*/                   MOV R1, c[0x0][0x20];                                          /* 0x4c98078000870001 */
        /*0010*/                   S2R R0, SR_TID.X;                                              /* 0xf0c8000002170000 */
        /*0018*/                   ISETP.GT.U32.AND P0, PT, R0, 0x7f, PT;                         /* 0x3668038007f70007 */
                                                                                                  /* 0x001f9040fe4007fd */
        /*0028*/               @P0 EXIT;                                                          /* 0xe30000000000000f */
        /*0030*/                   SHL.W R2, R0.reuse, 0x4;                                       /* 0x3848008000470002 */
        /*0038*/                   SHR R0, R0, 0x1c;                                              /* 0x3829000001c70000 */
                                                                                                  /* 0x001ec400fc4007ed */
        /*0048*/                   IADD R4.CC, R2, c[0x0][0x148];                                 /* 0x4c10800005270204 */
        /*0050*/                   IADD.X R5, R0, c[0x0][0x14c];                                  /* 0x4c10080005370005 */
        /*0058*/                   LDG.E.128 R4, [R4];                                            /* 0xeed6200000070404 */
                                                                                                  /* 0x001fc460fe2007e1 */
        /*0068*/                   IADD R2.CC, R2, c[0x0][0x140];                                 /* 0x4c10800005070202 */
        /*0070*/                   FSETP.LT.AND P0, PT, |R7|.reuse, 1.1754943508222875e-038, PT;  /* 0x36b1038080070787 */
        /*0078*/                   FMUL R8, R7, 16777216;                                         /* 0x3868004b80070708 */
                                                                                                  /* 0x001fc800fe2207f1 */
        /*0088*/                   FSETP.LT.AND P1, PT, |R6|.reuse, 1.1754943508222875e-038, PT;  /* 0x36b103808007068f */
        /*0090*/                   FMUL R9, R6, 16777216;                                         /* 0x3868004b80070609 */
        /*0098*/                   FSETP.LT.AND P2, PT, |R5|, 1.1754943508222875e-038, PT;        /* 0x36b1038080070597 */
                                                                                                  /* 0x001fc400fe2007e7 */
        /*00a8*/                   FSETP.LT.AND P3, PT, |R4|, 1.1754943508222875e-038, PT;        /* 0x36b103808007049f */
        /*00b0*/                   SEL R3, R8, R7, P0;                                            /* 0x5ca0000000770803 */
        /*00b8*/                   FMUL R7, R5, 16777216;                                         /* 0x3868004b80070507 */
                                                                                                  /* 0x001fc000fc8007f0 */
        /*00c8*/                   SEL R6, R9, R6, P1;                                            /* 0x5ca0008000670906 */
        /*00d0*/                   MUFU.RSQ R3, R3;                                               /* 0x5080000000570303 */
        /*00d8*/                   FMUL R9, R4, 16777216;                                         /* 0x3868004b80070409 */
                                                                                                  /* 0x003fd000fe00071d */
        /*00e8*/                   MUFU.RSQ R6, R6;                                               /* 0x5080000000570606 */
        /*00f0*/                   SEL R5, R7, R5, P2;                                            /* 0x5ca0010000570705 */
        /*00f8*/                   MUFU.RCP R6, R6;                                               /* 0x5080000000470606 */
                                                                                                  /* 0x001cd000e08007f0 */
        /*0108*/                   SEL R4, R9, R4, P3;                                            /* 0x5ca0018000470904 */
        /*0110*/                   MUFU.RSQ R5, R5;                                               /* 0x5080000000570505 */
        /*0118*/                   MUFU.RSQ R4, R4;                                               /* 0x5080000000570404 */
                                                                                                  /* 0x005cc401e0800274 */
        /*0128*/                   MUFU.RCP R7, R3;                                               /* 0x5080000000470307 */
        /*0130*/                   MUFU.RCP R5, R5;                                               /* 0x5080000000470505 */
        /*0138*/                   MUFU.RCP R4, R4;                                               /* 0x5080000000470404 */
                                                                                                  /* 0x011fc404fe2007f1 */
        /*0148*/               @P1 FMUL R6, R6, 0.000244140625;                                   /* 0x3868003980010606 */
        /*0150*/                   IADD.X R3, R0, c[0x0][0x144];                                  /* 0x4c10080005170003 */
        /*0158*/               @P0 FMUL R7, R7, 0.000244140625;                                   /* 0x3868003980000707 */
                                                                                                  /* 0x0003c402fc400fe1 */
        /*0168*/               @P2 FMUL R5, R5, 0.000244140625;                                   /* 0x3868003980020505 */
        /*0170*/               @P3 FMUL R4, R4, 0.000244140625;                                   /* 0x3868003980030404 */
        /*0178*/                   STG.E.128 [R2], R4;                                            /* 0xeede200000070204 */
                                                                                                  /* 0x001ffc00ffe007e3 */
        /*0188*/                   DEPBAR {0};                                                    /* 0xf0f0000000070001 */
        /*0190*/                   EXIT;                                                          /* 0xe30000000007000f */
        /*0198*/                   BRA 0x198;                                                     /* 0xe2400fffff87000f */
                                                                                                  /* 0x001f8000fc0007e0 */
        /*01a8*/                   NOP;                                                           /* 0x50b0000000070f00 */
        /*01b0*/                   NOP;                                                           /* 0x50b0000000070f00 */
        /*01b8*/                   NOP;                                                           /* 0x50b0000000070f00 */
                ............................................

I am sorry for claiming nonsenses.
What I probably did was a typo which resulted in optimizing out some of the sqrt() calls.

Thank you anyway for your answers, they were helpful for me.

Kyselejsyrecek, check whether you are building with -G. In order to prepare code for debugging, it looks like the compiler has to disable all inlining (this includes ignoring forceinline). At least this is what some quick experiments of mine seem to show. Presumably this is done so users can set breakpoints on the entry point of functions.