Char to uint32_t pointer recasting oddity: single thread exhibits different behavior than others

I’m currently working on a problem that allocates a (char *) buffer on the device. Each thread then reads a “chunk” of this buffer based on its thread index (chunkNum = buffer + index * chunkSize).
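
Roughly, the setup looks like this (a simplified sketch; the names here are not my exact code):

// Simplified sketch of the setup (not my exact code):
__global__ void process(char *buffer, int chunkSize)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    const char *input = buffer + index * chunkSize;   // this thread's chunk
    // ... each chunk is then reinterpreted as 32-bit words ...
}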

The offending line is the following:
uint32_t *word = (uint32_t *)(input + maxW * 4); //maxW is 0 in all cases, this is confirmed

The kernel receives an initial buffer = ‘abcabcabc’.

Thread one receives input ‘abc’ (&buffer[0]), two receives input ‘abc’ (&buffer[3]), and three receives input ‘abc’ (&buffer[6]).

When running the offending line I receive the following output for each thread:
thread one: ‘abca’ (<- *word)
thread two: ‘abca’ (<- *word)
thread three: ‘bcab’ (<- *word)

Why is this? Why am I not receiving ‘abc{garbage value}’? Even if I allocate an extra two elements in the buffer and set the buffer to ‘abcabcabc’ || 0x99 || 0x99, my third thread still gives me ‘bcab’. Thanks for any help; I find this horribly puzzling.

Similar to many RISC CPUs, the GPU requires all accesses to be naturally aligned. This means that a 4-byte int object has to be 4-byte aligned, an 8-byte double object has to be 8-byte aligned, and so on. Misaligned accesses lead to undefined behavior. The output you observe is consistent with the hardware simply ignoring the low-order address bits, so each thread loads the aligned word containing its start address: &buffer[3] effectively rounds down to &buffer[0] (‘abca’), and &buffer[6] rounds down to &buffer[4] (‘bcab’). Therefore, care must be taken when converting a pointer to data with a lesser alignment requirement (e.g. char) to a pointer to data with a stricter alignment requirement (e.g. uint32_t) and then dereferencing the second pointer.

You could either ensure, by design, that any char* passed to your function is 4-byte aligned (that is typically not possible for functions that are part of a library called by many different codes), or the function needs to examine the alignment, work byte-wise up to the next alignment boundary, and from that point on work word-wise. To extract words at misaligned offsets, you can read the data in smaller aligned chunks and then combine them using shift plus add (or shift plus OR).
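
For example, a minimal sketch of the byte-wise variant (the helper name is made up for illustration), assuming little-endian byte order as used by the GPU:

// Hypothetical helper: load a 32-bit word from a possibly misaligned char
// pointer by reading four individually aligned bytes and combining them
// with shift plus OR. Assumes little-endian byte order.
__device__ uint32_t load_u32_unaligned(const char *p)
{
    uint32_t b0 = (unsigned char)p[0];
    uint32_t b1 = (unsigned char)p[1];
    uint32_t b2 = (unsigned char)p[2];
    uint32_t b3 = (unsigned char)p[3];
    return b0 | (b1 << 8) | (b2 << 16) | (b3 << 24);
}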

Ahh. Shoot! I completely didn’t think of that.

Regarding the eventual performance of the code, what would be the optimal way of ensuring alignment?

  1. Prior to passing my input to the CUDA kernel, make sure the host CPU 4-byte aligns every incoming message.

  2. In the CUDA kernel itself, use shift + OR/ADD to read in smaller chunks? I assume these operations are considerably speedy in the kernel. The device itself is of compute capability 1.1, so I’m aware that 32-bit multiplies are slow, but I assume adds, shifts, etc. on unsigned integers are still extremely fast?

Thanks!

Obviously, any approach that avoids doing work at runtime will be superior. So if you can copy the data from the host to a suitably aligned buffer on the device and then use only aligned accesses, that would be best.
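
As a rough sketch of that approach (all names here are made up for illustration, not taken from your code): pad each per-thread chunk to a multiple of 4 bytes when copying, so every chunk starts on a 4-byte boundary in device memory.

#include <cuda_runtime.h>
#include <stddef.h>

// Hypothetical host-side sketch: copy packed chunks from the host into a
// device buffer in which every chunk starts on a 4-byte boundary
// (cudaMalloc itself returns pointers with much stronger alignment).
char *copy_chunks_aligned(const char *h_data, size_t numChunks, size_t chunkSize)
{
    size_t paddedChunkSize = (chunkSize + 3) & ~(size_t)3;   // round up to a multiple of 4
    char *d_buf = 0;
    cudaMalloc((void **)&d_buf, numChunks * paddedChunkSize);
    for (size_t i = 0; i < numChunks; i++) {
        cudaMemcpy(d_buf + i * paddedChunkSize,   // aligned start of chunk i on the device
                   h_data + i * chunkSize,        // packed chunk i on the host
                   chunkSize, cudaMemcpyHostToDevice);
    }
    return d_buf;   // the kernel then uses: input = d_buf + index * paddedChunkSize
}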

However, your initial description suggests that even in a situation where the first element (i.e. the start of the buffer) is suitably aligned, the processing proceeds at non-aligned offsets, which then requires the use of the gather technique. As a micro-optimization on Fermi-class machines, I would suggest combining the chunks via left shift (with a constant shift factor) plus add, as this maps nicely to the ISCADD instruction in the hardware.
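
A sketch of that gather technique (again, the names are made up for illustration; it assumes little-endian byte order and that reading up to one word past the requested data is permissible):

// Hypothetical sketch: gather a 32-bit word at an arbitrary byte offset from
// two aligned 32-bit loads, combining the pieces with shifts and an add.
// With a compile-time-constant offset the shift amounts are constants, which
// maps nicely to ISCADD; the pieces do not overlap, so add and OR are equivalent.
__device__ uint32_t load_u32_gather(const char *p)
{
    const uint32_t *base = (const uint32_t *)((size_t)p & ~(size_t)3);  // aligned base
    unsigned int shift = ((size_t)p & 3) * 8;                           // 0, 8, 16, or 24
    uint32_t lo = base[0];
    if (shift == 0)
        return lo;                                 // already aligned: one load suffices
    uint32_t hi = base[1];                         // next aligned word
    return (lo >> shift) + (hi << (32 - shift));
}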

The overhead of collecting/splitting to ensure wide, aligned accesses will typically still be lower than processing the data a byte at a time. However, the code can become a bit involved when potentially misaligned accesses to multiple buffers are involved. If you enjoy reading assembly language, here is a worked example in the form of an optimized strcpy() for SPARC that I wrote many years ago:

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/sparcv9/gen/strcpy.s