NVVM compiler and vector types

Hello all,

My kernel code looks like that:

__kernel void showcase(const float4 some_const, global float4* some_output)
{
     float4 b = some_const;
     if(b.y < 0.f)
         b.z = -b.z;
     some_output[0] = b;
}

and the corresponding PTX output looks like

//
// Generated by NVIDIA NVVM Compiler
// Compiler built on Thu Sep 12 07:12:40 2013 (1378962760)
// Driver 
//

.version 3.0
.target sm_20, texmode_independent
.address_size 32


.entry showcase(
	.param .align 16 .b8 showcase_param_0[16],
	.param .u32 .ptr .global .align 16 showcase_param_1
)
{
	.reg .f32 	%f<20>;
	.reg .pred 	%p<3>;
	.reg .s32 	%r<3>;


	mov.b32	%r1, showcase_param_0;
	ld.param.u32 	%r2, [showcase_param_1];
	ld.param.v4.f32 	{%f16, %f17, %f18, %f19}, [showcase_param_0];
	setp.lt.f32 	%p1, %f17, 0f00000000;
	not.pred 	%p2, %p1;
	@%p2 bra 	BB0_2;

	neg.f32 	%f3, %f18;
	mov.f32 	%f16, %f16;
	mov.f32 	%f17, %f17;
	mov.f32 	%f18, %f3;
	mov.f32 	%f19, %f19;

BB0_2:
	st.global.v4.f32 	[%r2], {%f16, %f17, %f18, %f19};
	ret;
}

My question is why does it rewrite all four components of the float4 variable? When I outcomment the redundant mov instructions after the branch, the modified listing works well, so they seem to be definitely redundant to me.

Specs: GTX 470 + Driver 327.23, Windows 8.1

The PTX code is only intermediate assembly code. It is later compiled to final assembly for the exact GPU you are using, either at compile time, or just in time, just before the kernel is executed. You can look at the final assembly using the “cuobjdump -sass” command. It should not contain the redundant moves. More info: http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/.

Perhaps (I’m not using NVVM) this post http://stackoverflow.com/questions/19125998/apparently-redundant-operations-in-disassembled-cuda-microcode provides useful information to you.

Thanks for the answers. My first guess was that PTX probably issues these instructions to be backwards-compatible with very old non-scalar archs, but the earliest arch is G80, which is already scalar.
Perhaps it is because of auto-vectorization implementation of LVVM.

The PTX “.version 3.0” instruction set in your listing indicates that you’re using the CUDA Toolkit 4.2.
You might want to try that with CUDA 5.0 or 5.5 as well. I’ve seen cases where CUDA 5.0 generated better code than 4.2.