Hello,
I’m trying to create a uint8 aggregate type (eight 32-bit integers) with a typedef:
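[codebox]typedef union {
    // View 1: two int4 halves (a.x .. a.w, b.x .. b.w)
    struct __builtin_align__(16) {
        int4 a;
        int4 b;
    };
    // View 2: the same 32 bytes as eight scalar ints
    struct __builtin_align__(16) {
        int x0, x1, x2, x3, x4, x5, x6, x7;
    };
} my_uint8;[/codebox]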
But its performance is poor. For example, accessing myuint8.a.y is much faster than accessing myuint8.x1, even though both name the same element.
Is there a way to build an efficient uint8 type in CUDA without losing performance?
I found this in the CUDA release notes:
“For maximum performance when using multiple byte sizes to access the
same data, coalesce adjacent loads and stores when possible rather
than using a union or individual byte accesses. Accessing the data via
a union may result in the compiler reserving extra memory for the object,
and accessing the data as individual bytes may result in non-coalesced
accesses. This will be improved in a future compiler release.”
In fact, with the code above the compiler places the object in local memory, which is much slower than registers.
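One way to follow the release-note advice might be to drop the union and keep only the two int4 halves, reading elements through their named members. The sketch below is only an illustration under that assumption: `my_uint8_plain`, `sum8`, and `sum_kernel` are hypothetical names, not part of CUDA, and whether the object really stays in registers depends on the compiler version, so it is worth checking the local memory usage reported by `nvcc -Xptxas -v`.

```cuda
#include <cuda_runtime.h>

// Hypothetical replacement for the union: two int4 halves and no second,
// aliasing view of the same bytes.
struct __align__(16) my_uint8_plain {
    int4 a;  // elements 0..3
    int4 b;  // elements 4..7
};

// Sum all eight elements using only named members, so there is no union
// access and no runtime indexing into the object.
__host__ __device__ inline int sum8(const my_uint8_plain &v)
{
    return v.a.x + v.a.y + v.a.z + v.a.w
         + v.b.x + v.b.y + v.b.z + v.b.w;
}

// Each thread copies one element as two int4 halves (rather than eight
// scalar ints) and writes the sum of its eight components.
__global__ void sum_kernel(const my_uint8_plain *in, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        my_uint8_plain v;
        v.a = in[i].a;  // first 16-byte half
        v.b = in[i].b;  // second 16-byte half
        out[i] = sum8(v);
    }
}
```

With this layout the element called x1 in the union becomes `v.a.y`, and x5 becomes `v.b.y`. The kernel would be launched in the usual way, for example `sum_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);` with `d_in` and `d_out` being device buffers allocated by the caller.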