Questions about the PTX file

I'm a little confused by the PTX file and have three questions about it.

(1)
I think we can load 16 bytes from local memory with a single instruction, so I pack 16 bytes of data (8 short ints) into one structure and load/store them together:

value[i+j] = v_out;

value is an array in local memory, and v_out is a structure like this:
struct __align__(16) value_group {
short int h0;
short int h1;
short int h2;
short int h3;
short int e0;
short int e1;
short int e2;
short int e3;
};

But in the PTX file there are 8 store instructions instead of just one:

st.local.s16    [%r43+0], %rh22;    // id:418 __cuda___cuda_value1648+0x0
mov.s16     %rh23, %rh16;           //
st.local.s16    [%r43+2], %rh23;    // id:419 __cuda___cuda_value1648+0x0
mov.s16     %rh24, %rh19;           //
st.local.s16    [%r43+4], %rh24;    // id:420 __cuda___cuda_value1648+0x0
st.local.s16    [%r43+6], %r164;    // id:421 __cuda___cuda_value1648+0x0
mov.s16     %rh25, %rh15;           //
st.local.s16    [%r43+8], %rh25;    // id:422 __cuda___cuda_value1648+0x0
mov.s16     %rh26, %rh18;           //
st.local.s16    [%r43+10], %rh26;   // id:423 __cuda___cuda_value1648+0x0
mov.s16     %rh27, %rh21;           //
st.local.s16    [%r43+12], %rh27;   // id:424 __cuda___cuda_value1648+0x0
mov.s16     %rh28, %rh12;           //
st.local.s16    [%r43+14], %rh28;   // id:425 __cuda___cuda_value1648+0x0
add.u32     %r43, %r43, 16;         //

Could someone explain why this is?
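For reference, here is a minimal sketch of one common workaround, not something confirmed for the compiler version used above. As far as I know, `__align__(16)` only guarantees the alignment of the structure; it does not by itself make the compiler merge eight 16-bit member stores into one wide store. Packing the fields into a built-in `uint4` and storing through a `uint4` pointer gives the compiler a type it can move as a single 128-bit vector. The helpers `pack2` and `store_group` and the kernel `fill` are hypothetical names introduced for this illustration, and the example writes to global memory so the result can be checked on the host.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Same layout as in the post: eight 16-bit fields, 16-byte aligned.
struct __align__(16) value_group {
    short h0, h1, h2, h3;
    short e0, e1, e2, e3;
};

static_assert(sizeof(value_group) == sizeof(uint4), "value_group must be 16 bytes");

// Hypothetical helper: pack two shorts into one 32-bit word (lo in the low half).
__device__ __forceinline__ unsigned int pack2(short lo, short hi)
{
    return static_cast<unsigned int>(static_cast<unsigned short>(lo)) |
           (static_cast<unsigned int>(static_cast<unsigned short>(hi)) << 16);
}

// Hypothetical helper: write all eight fields with one 16-byte assignment.
__device__ __forceinline__ void store_group(value_group *dst, const value_group &v)
{
    uint4 packed = make_uint4(pack2(v.h0, v.h1), pack2(v.h2, v.h3),
                              pack2(v.e0, v.e1), pack2(v.e2, v.e3));
    // dst is 16-byte aligned because of __align__(16) on the struct.
    *reinterpret_cast<uint4 *>(dst) = packed;
}

__global__ void fill(value_group *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    value_group v;
    v.h0 = static_cast<short>(i);     v.h1 = static_cast<short>(i + 1);
    v.h2 = static_cast<short>(i + 2); v.h3 = static_cast<short>(i + 3);
    v.e0 = static_cast<short>(i + 4); v.e1 = static_cast<short>(i + 5);
    v.e2 = static_cast<short>(i + 6); v.e3 = static_cast<short>(i + 7);
    store_group(&out[i], v);
}

int main()
{
    const int n = 64;
    value_group *d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(value_group)) != cudaSuccess) return 1;

    fill<<<1, n>>>(d_out, n);

    value_group h_out[n];
    if (cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost) != cudaSuccess) return 1;
    cudaFree(d_out);

    // Check the first and last field of every element.
    bool ok = true;
    for (int i = 0; i < n; ++i)
        ok = ok && h_out[i].h0 == i && h_out[i].e3 == i + 7;
    printf(ok ? "ok\n" : "mismatch\n");
    return ok ? 0 : 1;
}
```

Whether this actually produces a single vector store depends on the compiler, so it is worth regenerating the PTX (for example with `nvcc -ptx`) and looking at the store instructions again.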

(2)
When I add an int variable and a short variable (I have lots of these operations in my code), there are a lot of CVT instructions in the PTX file, like this:
cvt.s32.s16 %r160, %rh12; //
Are these instructions really executed in hardware? I ask because after I changed the short type to int, there were no CVT instructions in the PTX file, yet the execution time was almost the same as before.

(3)
Can someone tell me the difference between these two kinds of registers: %rh? and %r?

I really appreciate any help.

AFAIK, CVT stands for “convert”. It may well be executed in hardware, where the hardware just truncates a 32-bit number to form a 16-bit one, or extends a 16-bit number to 32 bits (sign-extension in the case of cvt.s32.s16)… Just my own guess…
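To make the CVT question concrete, here is a small sketch of what the conversion means at the source level. In C and C++, a short operand is promoted to int before an addition, and `cvt.s32.s16` is the PTX way of expressing that sign extension. The kernel name `add_mixed` and the test data are made up for this illustration. Whether a separate conversion is actually executed on the GPU is decided when the PTX is translated to machine code, so this only shows the semantics, not the cost.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Mixed-width add: the short is widened to int before the addition.
__global__ void add_mixed(const short *s, const int *a, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        short narrow = s[i];
        int wide = narrow;      // sign extension, the step cvt.s32.s16 expresses
        out[i] = a[i] + wide;
    }
}

int main()
{
    const int n = 4;
    const short h_s[n] = {-1, 2, -300, 32767};
    const int h_a[n] = {10, 20, 30, 40};
    const int expected[n] = {9, 22, -270, 32807};
    int h_out[n] = {0, 0, 0, 0};

    short *d_s = nullptr;
    int *d_a = nullptr;
    int *d_out = nullptr;
    if (cudaMalloc(&d_s, sizeof(h_s)) != cudaSuccess) return 1;
    if (cudaMalloc(&d_a, sizeof(h_a)) != cudaSuccess) return 1;
    if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) return 1;
    cudaMemcpy(d_s, h_s, sizeof(h_s), cudaMemcpyHostToDevice);
    cudaMemcpy(d_a, h_a, sizeof(h_a), cudaMemcpyHostToDevice);

    add_mixed<<<1, n>>>(d_s, d_a, d_out, n);

    if (cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost) != cudaSuccess) return 1;
    cudaFree(d_s);
    cudaFree(d_a);
    cudaFree(d_out);

    // Negative shorts must stay negative after widening.
    bool ok = true;
    for (int i = 0; i < n; ++i)
        ok = ok && (h_out[i] == expected[i]);
    printf(ok ? "ok\n" : "mismatch\n");
    return ok ? 0 : 1;
}
```

The negative inputs are the interesting cases: if the 16-bit value were zero-extended instead of sign-extended, -1 + 10 would not come out as 9.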