uint8 on CUDA

Hello,

I’m trying to create a uint8 aggregate typedef:

[codebox]typedef union {
    struct __builtin_align__(16) {
        int4 a;
        int4 b;
    };
    struct __builtin_align__(16) {
        int x0, x1, x2, x3, x4, x5, x6, x7;
    };
} my_uint8;[/codebox]

But its performance is poor. For example, accessing myuint8.a.y is a lot faster than accessing myuint8.x1.

Is there any way to implement an efficient uint8 in CUDA without losing performance?

I found this in the CUDA release notes:

“For maximum performance when using multiple byte sizes to access the
same data, coalesce adjacent loads and stores when possible rather
than using a union or individual byte accesses. Accessing the data via
a union may result in the compiler reserving extra memory for the object,
and accessing the data as individual bytes may result in non-coalesced
accesses. This will be improved in a future compiler release.”

Indeed, the compiler places the object from the code above in local memory, which is much slower than registers.
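
One workaround I am considering, as a rough sketch only: drop the union and keep just the two int4 halves, then reach individual elements through an accessor whose index is a compile-time constant. The names my_int8, elem and scale below are made up for illustration, and I have not verified that this keeps the object in registers on every compiler version. It is simply the "avoid the union" advice from the release notes applied to my case.

```cuda
#include <cuda_runtime.h>

// Hypothetical replacement for the union: two int4 halves, no aliased view.
struct __align__(16) my_int8 {
    int4 a;  // elements 0..3
    int4 b;  // elements 4..7
};

// Accessor with a compile-time index (hypothetical helper).
// Because I is a template constant, each instantiation reduces to a
// direct reference to one named member.
template <int I>
__host__ __device__ __forceinline__ int& elem(my_int8& v) {
    static_assert(I >= 0 && I < 8, "index out of range");
    if (I == 0) return v.a.x;
    if (I == 1) return v.a.y;
    if (I == 2) return v.a.z;
    if (I == 3) return v.a.w;
    if (I == 4) return v.b.x;
    if (I == 5) return v.b.y;
    if (I == 6) return v.b.z;
    return v.b.w;
}

// Example kernel (illustrative only): copy the value in, modify two
// elements through the accessor, and copy it back out.
__global__ void scale(my_int8* data, int n, int k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    my_int8 v = data[i];        // read the whole 32-byte object
    elem<1>(v) *= k;            // same element as v.a.y
    elem<6>(v) += elem<1>(v);   // same element as v.b.z
    data[i] = v;                // write the whole object back
}
```

With this, elem<1>(v) stands in for myuint8.x1 and elem<6>(v) for myuint8.x6. Whether local memory is actually avoided can be checked by compiling with nvcc -Xptxas -v and looking at the reported lmem usage.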