I have a problem when loading a vector type such as uchar4 from a char *, with something like:
char * buf;
uchar4 val = *(uchar4 *)&buf[threadIdx.x];
It seems that CUDA won’t assume that buf is aligned to 4 bytes, so it breaks the load down into 4 separate loads of one byte each. The only workaround I have found is to load the value as an int, but that doesn’t work for larger sizes (such as int2, for example).
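To make the int workaround concrete, here is a minimal sketch of what I mean. The names load_as_int and run_load_as_int are hypothetical, and it assumes that buf comes from cudaMalloc and that each thread reads at a byte offset that is a multiple of 4 (so I index with 4 * threadIdx.x here rather than threadIdx.x):

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each thread reads 4 consecutive bytes with one 32-bit load
// and unpacks them into a uchar4.
__global__ void load_as_int(const char *buf, uchar4 *out)
{
    // Assumption: buf is at least 4-byte aligned (true for cudaMalloc memory)
    // and the byte offset 4 * threadIdx.x keeps the address 4-byte aligned.
    const unsigned int word =
        *reinterpret_cast<const unsigned int *>(&buf[4 * threadIdx.x]);

    // The GPU is little-endian, so the lowest byte of the word is the first byte.
    uchar4 val;
    val.x = static_cast<unsigned char>(word & 0xFFu);
    val.y = static_cast<unsigned char>((word >> 8) & 0xFFu);
    val.z = static_cast<unsigned char>((word >> 16) & 0xFFu);
    val.w = static_cast<unsigned char>((word >> 24) & 0xFFu);
    out[threadIdx.x] = val;
}

// Hypothetical host-side check: returns true if every thread read back its 4 bytes.
bool run_load_as_int()
{
    const int threads = 32;
    unsigned char h_in[4 * threads];
    for (int i = 0; i < 4 * threads; ++i) h_in[i] = static_cast<unsigned char>(i);

    char *d_in = nullptr;
    uchar4 *d_out = nullptr;
    if (cudaMalloc(&d_in, sizeof(h_in)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, threads * sizeof(uchar4)) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    load_as_int<<<1, threads>>>(d_in, d_out);

    uchar4 h_out[threads];
    // This copy also waits for the kernel and reports any launch/execution error.
    const cudaError_t err =
        cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int t = 0; t < threads; ++t) {
        if (h_out[t].x != h_in[4 * t] || h_out[t].y != h_in[4 * t + 1] ||
            h_out[t].z != h_in[4 * t + 2] || h_out[t].w != h_in[4 * t + 3]) {
            return false;
        }
    }
    return true;
}
```

Calling run_load_as_int() from main should return true on a working device. This covers the 4-byte case only; it does not help with wider types such as int2.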
Is there a way to explicitly tell the compiler that a given pointer is guaranteed to be aligned to some boundary?
Interestingly, it seems that OpenCL does assume this (probably because with OpenCL you have to pass a cl_mem object rather than a raw pointer).
Thanks