I’m reading variable-length codes (unary codes) in CUDA.
The input is an unsigned char * array in big-endian order, MSB first.
I must read codewords that are at most 24 bits long and may be offset by up to 7 bits from a byte boundary.
On the CPU I use assembler to read 32 bits as an integer at the byte boundary. Due to x86’s little-endian encoding my int ends up in the wrong byte order, so I use the bswap instruction to convert x86’s little-endian byte order to the correct big-endian byte order, then shift left by the bit offset, then use the bsr instruction to find the highest set bit (from which the leading-zero count follows), and so on…
What about CUDA?
In CUDA counting leading zeros is even simpler using __clz() intrinsic.
However, first I must read 4 big-endian bytes from the unsigned char * array into a 32-bit int register.
From what I know, CUDA behaves like x86 here: when I read through an int-cast pointer, it assumes little-endian and so reads the bytes in the wrong order in my case.
Is there any way to read a 32-bit int into a CUDA int register from a pointer that points to a big-endian-aligned integer?
Assuming A, B, C and D are one byte each and we have a char/byte array:
unsigned char * Data = ABCD…
unsigned int Word = *(unsigned int *)Data;
will produce an unsigned int with the wrong byte order:
unsigned int Word = DCBA;