How to define a vector type (half8)?

Hi all,

I try to define half8 type to load 8 halves with single instruction (LDG.128).

I define a half8 like this: (compiled with -arch=sm_75)

#include <cuda_fp16.h>

typedef struct __align__(16) {
   half2 x;
   half2 y;
   half2 z;
   half2 w;
} half8;

However, when I try to use half8 I defined to do vectorized loading,

output[0] = reinterpret_cast<half8*>(input)[0];

the generated code still use 4 LDG.E.SYS rather than one single LDG.E.128.

Any ideas?


Keep the half8 struct that you originally defined, typecast output and input point to uint4 just for doing the copying.

((uint4*)output)[0] = ((uint4*)input)[0];

Thank you cbuchner1!

It works!

But I don’t understand why I need to cast output to uint4*. Because when I define float4 myself, I don’t need to cast output to uint4* and it works fine.

BTW, the type of output is half8* and type of intput is half2, e.g.,

__global__ void foo(half2* input, half8* output)

Thank you again~

uint4 and float4 have native support for 128 bit loads/stores.

But four consecutive half2 members in a struct don’t automatically use 128 bit vector loads - the compiler instead chooses to load struct members individually even though the size of that struct and its alignment requirements are identical to a uint4/float4 type.

This makes sense to me. Thanks!

And sometimes I prefer to have “aligned” registers, for example, I want my half8 to be stored in
R4, R5, R6, R7
rather than
R0, R2, R3, R4

It seems that compiler will do this if it thinks to do 128bit loading is helpful.

Is my understanding correct? Or is there any other ways to guarantee that a 128bit struct is stored in 4 consecutive registers?

Other than directly after a vectored load, or directly before a vectored store: no. If you care about performance, you wouldn’t want to. It would add an onerous constraint to the compiler’s register allocation algorithm, with negative impact on register pressure and instruction scheduling.

If you do a structure of 8 half types (rather than 4 half2 vector types), the compiler can generate a 128 bit load for that struct.

Nice. Robert wins this thread.

I’m not sure its a win. The efficient way to deal with half types is via the half2 type since it occupies a 32-bit register. After loading a struct of 8 half types, you’d have to be careful how you handled the struct components thereafter.

Confirmed. Thanks

But to define half8 as 8 halves makes it hard to leverage half2 instructions like HFMA2/HADD2.