I’m processing a stream of characters, so as suggested in section 5.1.2.1 in the CUDA Programming Guide, I am reading the character stream in aligned packets instead of one by one. Everything works fine when I do the same thing as the alignedTypes SDK demo:
typedef struct __align__(4) {
unsigned char a,b,c,d;
} alignedChars;
I do a single load from global memory inside my kernel, and when I look at the generated PTX, it does in fact generate the vector load instruction:
ld.global.v4.u8
However, if I try to make the character packet an array:
typedef struct __align__(4) {
unsigned char a[4];
} alignedChars;
The PTX is serialized into four separate 8-bit loads using the instruction:
ld.global.u8
The overall performance degrades accordingly as well.
What gives? The array version seems like a stronger semantic hint that I want the 4 chars contiguous and aligned, and yet it isn’t generating the expected PTX. Are my expectations off here?
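For reference, here is a minimal sketch of the two variants side by side, in case anyone wants to reproduce this. The names (PackedChars, ArrayChars, PACKET_ALIGN, layoutsMatch, copyPacked, copyArray) are placeholders for this post, not my actual code. The kernels are guarded so the file also builds as plain host C++, where alignas stands in for __align__. The PTX noted in the comments is what I observed as described above, not something the sketch itself guarantees.

```cpp
#include <cassert>
#include <cstddef>

// __align__ is a CUDA keyword; fall back to alignas when built as plain C++.
#ifdef __CUDACC__
#define PACKET_ALIGN __align__(4)
#else
#define PACKET_ALIGN alignas(4)
#endif

// Variant 1: four scalar members, as in the alignedTypes demo.
struct PACKET_ALIGN PackedChars {
    unsigned char a, b, c, d;
};

// Variant 2: the same four bytes declared as an array.
struct PACKET_ALIGN ArrayChars {
    unsigned char a[4];
};

// Host-side sanity check: both variants are 4 bytes with 4-byte alignment,
// so any difference in the generated loads is not due to layout.
inline bool layoutsMatch() {
    return sizeof(PackedChars) == 4 && sizeof(ArrayChars) == 4 &&
           alignof(PackedChars) == 4 && alignof(ArrayChars) == 4;
}

#ifdef __CUDACC__
// One packet load per thread. In my build this load showed up as
// ld.global.v4.u8 in the PTX.
__global__ void copyPacked(const PackedChars* in, PackedChars* out, std::size_t n) {
    std::size_t i = blockIdx.x * static_cast<std::size_t>(blockDim.x) + threadIdx.x;
    if (i < n) {
        PackedChars p = in[i];  // the single global load in question
        out[i] = p;
    }
}

// Same kernel with the array variant. In my build this load showed up as
// four separate ld.global.u8 instructions.
__global__ void copyArray(const ArrayChars* in, ArrayChars* out, std::size_t n) {
    std::size_t i = blockIdx.x * static_cast<std::size_t>(blockDim.x) + threadIdx.x;
    if (i < n) {
        ArrayChars p = in[i];   // the single global load in question
        out[i] = p;
    }
}
#endif
```

Compiling this with nvcc -ptx and comparing the loads in the two kernels should show whether the difference reproduces on other toolkit versions.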