But I don’t understand why I need to cast output to uint4*. When I define float4 myself, I don’t need that cast and it works fine.
BTW, the type of output is half8* and the type of input is half2, e.g.,
uint4 and float4 have native support for 128-bit loads/stores.
But four consecutive half2 members in a struct don’t automatically use 128-bit vector loads. The compiler instead chooses to load the struct members individually, even though the size and alignment requirements of that struct are identical to those of a uint4/float4 type.
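To make the cast concrete, here is a minimal sketch of how the uint4 reinterpretation is typically written. The half8 layout (four half2 members, 16-byte aligned) and the kernel name copy_half8 are assumptions for illustration, not the code from the original question.

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Hypothetical half8: four half2 members, 16 bytes in total.
// __align__(16) gives it the same alignment requirement as uint4.
struct __align__(16) half8 {
    half2 a, b, c, d;
};

static_assert(sizeof(half8) == sizeof(uint4), "half8 must be 128 bits wide");

// Copies n half8 elements from in to out.
// Going through uint4* asks for one 128-bit load and one 128-bit store
// per element instead of separate accesses for each half2 member.
__global__ void copy_half8(const half8* in, half8* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        uint4 v = reinterpret_cast<const uint4*>(in)[i];  // single vector load
        reinterpret_cast<uint4*>(out)[i] = v;             // single vector store
    }
}
```

A launch such as `copy_half8<<<(n + 255) / 256, 256>>>(d_in, d_out, n);` on cudaMalloc'ed buffers would exercise it. To check what the compiler actually generated, dump the SASS (for example with `cuobjdump -sass`) and look for 128-bit load/store instructions.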
Other than directly after a vectored load, or directly before a vectored store: no. If you care about performance, you wouldn’t want to. It would add an onerous constraint to the compiler’s register allocation algorithm, with negative impact on register pressure and instruction scheduling.
I’m not sure it’s a win. The efficient way to deal with half types is via the half2 type, since it occupies a single 32-bit register. After loading a struct of 8 half values, you’d have to be careful how you handled the struct components thereafter.
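One way to keep the components in half2 form after a 128-bit load is sketched below. The kernel name add_half8 and the uint4-typed pointers are assumptions for illustration. `__hadd2` is the packed half2 add from cuda_fp16.h, and half arithmetic in device code generally requires compute capability 5.3 or higher.

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Adds two arrays of 8-half vectors, each stored as one 128-bit uint4.
// Every 32-bit word of the uint4 is treated as one half2, so the math
// runs as four packed half2 additions per element.
__global__ void add_half8(const uint4* a, const uint4* b, uint4* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        uint4 va = a[i];  // 128-bit load
        uint4 vb = b[i];  // 128-bit load
        uint4 vr;

        // View the loaded registers as four half2 values each.
        const half2* pa = reinterpret_cast<const half2*>(&va);
        const half2* pb = reinterpret_cast<const half2*>(&vb);
        half2* pr = reinterpret_cast<half2*>(&vr);

        #pragma unroll
        for (int k = 0; k < 4; ++k) {
            pr[k] = __hadd2(pa[k], pb[k]);  // two half additions at once
        }

        out[i] = vr;  // 128-bit store
    }
}
```

The vector type is only used at the load and the store. In between, each half2 is an independent 32-bit value, which leaves the compiler free to allocate registers as it sees fit.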