It seems that when using NVRTX to compile CUDA source code to PTX at runtime (instead of using the offline NVCC compiler), the SIMD video instructions like `__vadd2`

, `__vavgs2`

etc don’t get compiled to the corresponding PTX instruction.

For example this program:

```
__global__ void kernel(unsigned int* a, unsigned int* b, unsigned int* c, int n) {
int idx = threadIdx.x;
a[idx] = __vavgs2(b[idx], c[idx]);
}
```

With NVCC, it compiles to:

```
.visible .entry _Z6kernelPjS_S_i(
.param .u64 _Z6kernelPjS_S_i_param_0,
.param .u64 _Z6kernelPjS_S_i_param_1,
.param .u64 _Z6kernelPjS_S_i_param_2,
.param .u32 _Z6kernelPjS_S_i_param_3
)
{
ld.param.u64 %rd1, [_Z6kernelPjS_S_i_param_0];
ld.param.u64 %rd2, [_Z6kernelPjS_S_i_param_1];
ld.param.u64 %rd3, [_Z6kernelPjS_S_i_param_2];
cvta.to.global.u64 %rd4, %rd1;
cvta.to.global.u64 %rd5, %rd3;
cvta.to.global.u64 %rd6, %rd2;
mov.u32 %r5, %tid.x;
mul.wide.s32 %rd7, %r5, 4;
add.s64 %rd8, %rd6, %rd7;
ld.global.u32 %r2, [%rd8];
add.s64 %rd9, %rd5, %rd7;
ld.global.u32 %r3, [%rd9];
mov.u32 %r4, 0;
vavrg2.s32.s32.s32 %r1,%r2,%r3,%r4;
add.s64 %rd10, %rd4, %rd7;
st.global.u32 [%rd10], %r1;
ret;
}
```

But with NVRTX (`nvrtcCompileProgram`

), it produces code that emulates the intrinsic:

```
.visible .entry _Z6kernelPjS_S_i(
.param .u64 _Z6kernelPjS_S_i_param_0,
.param .u64 _Z6kernelPjS_S_i_param_1,
.param .u64 _Z6kernelPjS_S_i_param_2,
.param .u32 _Z6kernelPjS_S_i_param_3
)
{
.reg .b32 %r<5>;
.reg .b64 %rd<11>;
ld.param.u64 %rd1, [_Z6kernelPjS_S_i_param_0];
ld.param.u64 %rd2, [_Z6kernelPjS_S_i_param_1];
ld.param.u64 %rd3, [_Z6kernelPjS_S_i_param_2];
cvta.to.global.u64 %rd4, %rd1;
cvta.to.global.u64 %rd5, %rd3;
cvta.to.global.u64 %rd6, %rd2;
mov.u32 %r4, %tid.x;
mul.wide.s32 %rd7, %r4, 4;
add.s64 %rd8, %rd6, %rd7;
ld.global.nc.u32 %r2, [%rd8];
add.s64 %rd9, %rd5, %rd7;
ld.global.nc.u32 %r3, [%rd9];
// inline asm
{
.reg .u32 a,b,c,r,s,t,u,v;
mov.b32 a,%r2;
mov.b32 b,%r3;
and.b32 u,a,0xfffefffe;
and.b32 v,b,0xfffefffe;
xor.b32 s,a,b;
and.b32 t,a,b;
shr.u32 u,u,1;
shr.u32 v,v,1;
and.b32 c,s,0x00010001;
and.b32 s,s,0x80008000;
and.b32 t,t,0x00010001;
add.u32 r,u,v;
add.u32 r,r,t;
xor.b32 r,r,s;
shr.u32 t,r,15;
not.b32 t,t;
and.b32 t,t,c;
add.u32 r,r,t;
mov.b32 %r1,r;
}
// inline asm
add.s64 %rd10, %rd4, %rd7;
st.global.u32 [%rd10], %r1;
ret;
}
```

Is this intended behavior, or is there some option to make it use the `vavrg2.s32.s32.s32`

instruction.

In addition, after trying the use inline-asm to make it use the SIMD instructions (on a more complex program), the program seemed to become slower than with the emulation. That program was compiled using NVRTX, and then the PTX compiled to CUBIN using `cuLinkAddData`

. Is this expected, or does `cuLinkAddData`

possibly not handle the SIMD instructions as well?