SIMD intrinsics with NVRTC

It seems that when using NVRTC to compile CUDA source code to PTX at runtime (instead of using the offline NVCC compiler), the SIMD video intrinsics like __vadd2, __vavgs2, etc. don’t get compiled to the corresponding PTX instructions.

For example, this program:

__global__ void kernel(unsigned int* a, unsigned int* b, unsigned int* c, int n) {
	int idx = threadIdx.x;
	a[idx] = __vavgs2(b[idx], c[idx]);
}
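
For reference, PTX like the listing below can be generated with a plain nvcc invocation along these lines (a sketch; the exact flags are an assumption, and note that no architecture target is specified):

nvcc -ptx kernel.cu -o kernel.ptx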

With NVCC, it compiles to:

.visible .entry _Z6kernelPjS_S_i(
        .param .u64 _Z6kernelPjS_S_i_param_0,
        .param .u64 _Z6kernelPjS_S_i_param_1,
        .param .u64 _Z6kernelPjS_S_i_param_2,
        .param .u32 _Z6kernelPjS_S_i_param_3
)
{

        ld.param.u64    %rd1, [_Z6kernelPjS_S_i_param_0];
        ld.param.u64    %rd2, [_Z6kernelPjS_S_i_param_1];
        ld.param.u64    %rd3, [_Z6kernelPjS_S_i_param_2];
        cvta.to.global.u64      %rd4, %rd1;
        cvta.to.global.u64      %rd5, %rd3;
        cvta.to.global.u64      %rd6, %rd2;
        mov.u32         %r5, %tid.x;
        mul.wide.s32    %rd7, %r5, 4;
        add.s64         %rd8, %rd6, %rd7;
        ld.global.u32   %r2, [%rd8];
        add.s64         %rd9, %rd5, %rd7;
        ld.global.u32   %r3, [%rd9];
        mov.u32         %r4, 0;
        vavrg2.s32.s32.s32 %r1,%r2,%r3,%r4;
        add.s64         %rd10, %rd4, %rd7;
        st.global.u32   [%rd10], %r1;
        ret;
}

But with NVRTC (nvrtcCompileProgram), it produces code that emulates the intrinsic:

.visible .entry _Z6kernelPjS_S_i(
	.param .u64 _Z6kernelPjS_S_i_param_0,
	.param .u64 _Z6kernelPjS_S_i_param_1,
	.param .u64 _Z6kernelPjS_S_i_param_2,
	.param .u32 _Z6kernelPjS_S_i_param_3
)
{
	.reg .b32 	%r<5>;
	.reg .b64 	%rd<11>;


	ld.param.u64 	%rd1, [_Z6kernelPjS_S_i_param_0];
	ld.param.u64 	%rd2, [_Z6kernelPjS_S_i_param_1];
	ld.param.u64 	%rd3, [_Z6kernelPjS_S_i_param_2];
	cvta.to.global.u64 	%rd4, %rd1;
	cvta.to.global.u64 	%rd5, %rd3;
	cvta.to.global.u64 	%rd6, %rd2;
	mov.u32 	%r4, %tid.x;
	mul.wide.s32 	%rd7, %r4, 4;
	add.s64 	%rd8, %rd6, %rd7;
	ld.global.nc.u32 	%r2, [%rd8];
	add.s64 	%rd9, %rd5, %rd7;
	ld.global.nc.u32 	%r3, [%rd9];
	// inline asm
	{                      
	.reg .u32 a,b,c,r,s,t,u,v;
	mov.b32 a,%r2;          
	mov.b32 b,%r3;          
	and.b32 u,a,0xfffefffe;
	and.b32 v,b,0xfffefffe;
	xor.b32 s,a,b;         
	and.b32 t,a,b;         
	shr.u32 u,u,1;         
	shr.u32 v,v,1;         
	and.b32 c,s,0x00010001;
	and.b32 s,s,0x80008000;
	and.b32 t,t,0x00010001;
	add.u32 r,u,v;         
	add.u32 r,r,t;         
	xor.b32 r,r,s;         
	shr.u32 t,r,15;        
	not.b32 t,t;           
	and.b32 t,t,c;         
	add.u32 r,r,t;         
	mov.b32 %r1,r;          
	}
	// inline asm
	add.s64 	%rd10, %rd4, %rd7;
	st.global.u32 	[%rd10], %r1;
	ret;
}

Is this intended behavior, or is there some option to make NVRTC use the vavrg2.s32.s32.s32 instruction?

In addition, after trying to use inline asm to force the SIMD instructions (in a more complex program), the program seemed to become slower than with the emulation. That program was compiled with NVRTC, and the resulting PTX was then compiled to CUBIN using cuLinkAddData. Is this expected, or does cuLinkAddData possibly not handle the SIMD instructions as well?
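
The inline asm wrapper looked roughly like this (a sketch, not the exact code; vavgs2_forced is a made-up name):

__device__ unsigned int vavgs2_forced(unsigned int a, unsigned int b) {
	unsigned int r;
	// Emit the SIMD video instruction directly. The third source
	// operand is the optional addend, set to zero here, matching
	// the code NVCC generates for the intrinsic.
	asm("vavrg2.s32.s32.s32 %0, %1, %2, %3;"
	    : "=r"(r)
	    : "r"(a), "r"(b), "r"(0));
	return r;
}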

Which SIMD intrinsics are actually backed by hardware instructions depends on the GPU architecture. If memory serves, only GPUs with compute capability 3.x offer the complete set of hardware instructions, while later GPU architectures only support a drastically reduced subset and therefore require emulation for many intrinsics.

So the first thing you would want to check is whether, in your comparison, online and offline compilation are configured for the exact same architecture target. My guess is that you are not specifying an architecture target for nvcc, which therefore defaults to sm_30, while NVRTC presumably (my guess, I don’t know off the top of my head) defaults to the architecture of the GPU currently in the system (likely not an sm_3x device).
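
For example, to pin both compilers to the same target (a sketch; source is assumed to hold the kernel text, and the architecture should match your GPU):

#include <nvrtc.h>

// Offline equivalent: nvcc -ptx -arch=compute_30 kernel.cu
nvrtcProgram prog;
nvrtcCreateProgram(&prog, source, "kernel.cu", 0, NULL, NULL);
const char* opts[] = { "--gpu-architecture=compute_30" };
nvrtcCompileProgram(prog, 1, opts);

If the two targets match and the generated PTX still differs, the architecture theory can be ruled out.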

after trying to use inline asm to force the SIMD instructions the program seemed to become slower than with the emulation

This is a further indication that your GPU does not have hardware backing for the particular intrinsic you are using. There are two ways emulation can happen: (1) emulation code for the SIMD intrinsic is emitted at the PTX level; (2) emulation code for the SIMD instruction is emitted at the SASS level. Because the instruction is functionally more general than the intrinsic, its emulation is more involved and therefore slower. By forcing the use of the SIMD instruction at the PTX level, you are encountering case (2).

In my experience with the byte-wise SIMD intrinsics, even on architectures where they have to be emulated, they frequently still provide a performance advantage over their discrete equivalents. Whether this extends to the halfword-wise intrinsics I do not know; you would have to experiment.
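
To make “discrete equivalents” concrete: a per-byte reimplementation of, say, __vadd4 would look something like the sketch below (vadd4_discrete is a made-up name). The compiler’s emulation of the intrinsic typically beats this because it processes all four bytes at once with a few masking tricks:

__device__ unsigned int vadd4_discrete(unsigned int a, unsigned int b) {
	unsigned int r = 0;
	// Add each byte lane separately, discarding carries between lanes.
	for (int i = 0; i < 32; i += 8) {
		unsigned int s = (((a >> i) & 0xffu) + ((b >> i) & 0xffu)) & 0xffu;
		r |= s << i;
	}
	return r;
}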

Looks like vabsdiff4 is the only surviving hardware SIMD instruction on the current (compute capability 7.x, 8.0) cards.
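
So on those parts, something like this should still compile down to a single SIMD instruction (a minimal sketch):

__global__ void absdiff_kernel(const unsigned int* a, const unsigned int* b,
                               unsigned int* out, int n) {
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < n)
		out[i] = __vabsdiff4(a[i], b[i]);  // per-byte |a - b|
}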

Multiple instructions = emulation; see the arithmetic instructions throughput table:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/#arithmetic-instructions