__clz and Endianness for reading unary code

I’m reading a variable-length code (unary code) in CUDA.
The input data is an unsigned char * array; it is big-endian, MSB first.

I must read codewords that are at most 24 bits long and can start at an offset of up to 7 bits from a byte boundary.

On the CPU I’m using assembler to read 32 bits as an integer at a byte boundary. Due to x86’s little-endian byte order my int comes out in the wrong order, so I use the bswap instruction to convert x86’s little-endian byte order to the correct big-endian byte order, then shift left by the bit offset, then use the bsr instruction to count the leading zeros, and so on…
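
In C, the equivalent of that assembler path would be something like this sketch (GCC’s __builtin_bswap32 and __builtin_clz stand in for the actual bswap/bsr instructions; illustration only):

#include <string.h>

// Sketch of the CPU path: unaligned 32-bit load at the byte boundary,
// byte swap to big-endian, drop the sub-byte offset, count leading zeros.
static inline unsigned int leading_zeros_at(const unsigned char *data,
                                            unsigned int bitOffset)
{
    unsigned int word;
    memcpy(&word, data + (bitOffset >> 3), sizeof(word)); // little-endian load
    word = __builtin_bswap32(word);                       // restore stream order
    word <<= (bitOffset & 7);                             // align to bit offset
    return __builtin_clz(word);                           // undefined if word == 0
}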

What about CUDA?
In CUDA, counting leading zeros is even simpler using the __clz() intrinsic.
However, first I must read 4 big-endian bytes from the unsigned char * array into a 32-bit int register.
As far as I know, CUDA behaves like x86 when I read through an int-cast pointer: it assumes little-endian data and, in my case, reads the bytes in the wrong order.

QUESTION:

Is there any way to read a 32-bit int into a CUDA int register from a pointer pointing to a big-endian aligned integer?

Assuming A, B, C and D are one byte each and we have a char/byte array:
unsigned char * Data = ABCD…

reading
unsigned int Word = *(unsigned int *)&Data[0];

will produce an unsigned int with the wrong byte order:
unsigned int Word = DCBA;

You can re-use your CPU approach in CUDA by using __byte_perm (word, 0, 0x0123) in place of BSWAP. __byte_perm() has hardware support on sm_2x GPUs; it is emulated on sm_1x platforms. I can’t say how good the emulation performance would be for the BSWAP case on sm_1x, so I would say just give it a try. If it is too slow, you could always supply your own BSWAP emulation.

One difference between x86 CPUs and NVIDIA GPUs is that on the GPU all data must be naturally aligned, otherwise the behavior of a memory access is undefined. So when you cast the char* to an int*, make sure the pointer is 4-byte aligned.
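
For instance, the load-plus-swap could look like this minimal sketch (it assumes data + byteOffset is 4-byte aligned, per the note above):

// Minimal sketch: load an aligned 32-bit word and swap it to big-endian
// order with __byte_perm(); selector 0x0123 reverses the four bytes.
__device__ unsigned int load_be32(const unsigned char *data, unsigned int byteOffset)
{
    unsigned int w = *(const unsigned int *)(data + byteOffset); // little-endian load
    return __byte_perm(w, 0, 0x0123);                            // BSWAP equivalent
}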

My int pointers are not 4-byte aligned, as the codeword length varies from 1 to 24 bits and the int pointer can start at any byte offset.
To maintain 4-byte alignment I would need to use 64-bit integers for reading.

However, thank you for that hint.
I will try __byte_perm and compare it against 4 separate char reads:

unsigned int Word = ((Data[ByteOffset] << 24) | (Data[ByteOffset + 1] << 16) | (Data[ByteOffset + 2] << 8) | Data[ByteOffset + 3]) << BitOffset;

You might also want to look into reading the entire stream as pairs of aligned words, while extracting the desired 1 to 3 code bytes via __byte_perm() on the fly. __byte_perm() can return four arbitrary bytes from a total of eight input bytes passed in via two 4-byte words.
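
For example (untested sketch), a single __byte_perm() can both select the four relevant bytes from the pair and reverse them into big-endian order:

// W1 holds stream bytes 0..3 (as loaded, little-endian), W2 holds stream
// bytes 4..7; k is the byte offset of the codeword within the pair (0..4).
// The selector picks bytes k..k+3 and reverses them into big-endian order.
__device__ unsigned int pick_be32(unsigned int W1, unsigned int W2, unsigned int k)
{
    return __byte_perm(W1, W2, 0x0123 + k * 0x1111);
}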

[Later:]

Here is some bitfield extraction code I dug up that you may find useful:

// extracts a bit field up to 25 bits in length from a 64-bit integer
// startbit must be in 0...63, startbit + length must be in 0...64,
// length must be in 0..25
__device__ __forceinline__ unsigned int bfe_25_64 (unsigned long long a,
                                                   unsigned int startbit,
                                                   unsigned int length)
{
    unsigned int result;
    asm ("{\n\t"
         ".reg .u32 byte_shift, new_start, select, bytes, alo, ahi;\n\t"
         "shr.u32         byte_shift, %1, 3;\n\t"
         "and.b32         new_start, %1, 7;\n\t"
         "mad.lo.u32      select, byte_shift, 0x1111, 0x3210;\n\t"
         "mov.b64         {alo, ahi}, %2;\n\t"
         "prmt.b32        bytes, alo, ahi, select;\n\t"
         "bfe.u32         %0, bytes, new_start, %3;\n\t"
         "}"
         : "=r"(result)
         : "r"(startbit), "l"(a), "r"(length));
    return result;
}
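
Hypothetical usage from device code (startbit counts from the least significant bit of the 64-bit input):

unsigned long long a = 0x0123456789ABCDEFULL;  // example input
unsigned int field = bfe_25_64 (a, 13, 20);    // extracts bits 13..32 of a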

Thanks again for the new advice, I didn’t know CUDA could be programmed directly in assembler. Is there a reference manual for this?

However, since the data is read-only, I switched to texture memory to utilize the on-chip cache.

As textures cannot be fetched as long long ints, I had to use ints and combine the results.

The simple approach of reading 4 separate chars turned out to be significantly faster.

// Variant 1: four separate char fetches
texture<unsigned char, cudaTextureType1D, cudaReadModeElementType> DataTexture;

...

    unsigned int BO = BitOffset >> 3;  // byte offset

    unsigned int Word = (tex1Dfetch(DataTexture, BO + 0) << 24) |
                        (tex1Dfetch(DataTexture, BO + 1) << 16) |
                        (tex1Dfetch(DataTexture, BO + 2) << 8)  |
                        (tex1Dfetch(DataTexture, BO + 3));

    Word = Word << (BitOffset & 7);

    unsigned int Zeros = __clz(Word);

// Variant 2: two aligned int fetches combined via __byte_perm
texture<unsigned int, cudaTextureType1D, cudaReadModeElementType> DataTexture;

...

    unsigned int BO = BitOffset >> 3;  // byte offset
    unsigned int BA = BitOffset >> 5;  // unsigned int texture index (aligned)

    // fetch two ints from the texture
    unsigned int W1 = tex1Dfetch(DataTexture, BA);
    unsigned int W2 = tex1Dfetch(DataTexture, BA + 1);

    // reorder bytes to big-endian
    W1 = __byte_perm(W1, 0, 0x0123);
    W2 = __byte_perm(W2, 0, 0x0123);

    // combine into a 64-bit value
    unsigned long long W = ((unsigned long long)W1 << 32) | W2;

    // shift the required four bytes into the low 32 bits
    W >>= 8 * (4 - (BO - (BA << 2)));

    // take the low word
    unsigned int Word = (unsigned int)W;

    Word = Word << (BitOffset & 7);

    unsigned int Zeros = __clz(Word);

Support for inline PTX assembly is a new feature in CUDA 4.0. There should be a PDF “Using Inline PTX Assembly in CUDA” installed as part of the toolkit that describes it.

64-bit data types can be read through the texture path by using an “int2” texture. This is commonly used for reading double-precision elements via the texture path, by re-interpreting the pair of integers returned by the texture fetch using __hiloint2double().
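
A minimal sketch of that idiom, here re-assembling an unsigned 64-bit integer instead of a double (this assumes the 64-bit values are stored in the GPU's native little-endian order):

// Read 64-bit elements through the texture path as int2 and recombine;
// v.x is the low word, v.y the high word for little-endian storage.
texture<int2, cudaTextureType1D, cudaReadModeElementType> Data64Texture;

__device__ unsigned long long fetch_u64(int index)
{
    int2 v = tex1Dfetch(Data64Texture, index);
    return ((unsigned long long)(unsigned int)v.y << 32) | (unsigned int)v.x;
}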

The expensive part of your second code snippet is the 64-bit right shift of “W”. There is no hardware support for this operation, so it must be emulated using 32-bit instructions. If I remember correctly, this takes eight or nine instructions on Fermi, meaning this shift alone is approximately as expensive as calling the bfe_25_64() function I showed.