Aligned arrays: how to do aligned loads into arrays?

I’m processing a stream of characters, so as suggested in section 5.1.2.1 in the CUDA Programming Guide, I am reading the character stream in aligned packets instead of one by one. Everything works fine when I do the same thing as the alignedTypes SDK demo:

typedef struct __align__(4) {
   unsigned char a,b,c,d;
} alignedChars;

I do a single load from global memory inside my kernel, and when I look at the generated PTX, it does in fact generate the vector load instruction:

ld.global.v4.u8 

However, if I try to make the character packet an array:

typedef struct __align__(4) {
   unsigned char a[4];
} alignedChars;

The generated PTX then serializes the access into four separate 8-bit loads, each using the instruction:

ld.global.u8

Overall performance degrades accordingly.

What gives? The array version seems like a stronger semantic hint that I want the 4 chars contiguous and aligned, and yet it isn’t generating the expected PTX. Are my expectations off here?

Looks like a compiler bug to me. I would have expected the same as you.
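
Until that is sorted out, one possible workaround is to do the load yourself as a single 32-bit word and only then split it into the array. What follows is a rough sketch, not something I have checked against your compiler version. The names unpackWord, sumPackets, words, sums and numWords are made up for illustration, and the macro guards at the top exist only so the helper can be compiled as plain C++ for a quick host-side check. It assumes the character buffer is 4-byte aligned and that its length is a multiple of 4.

```cpp
// Let the helper below also build as plain C++ for a quick host-side check.
#ifndef __CUDACC__
#define __host__
#define __device__
#define __align__(n) alignas(n)
#endif

// Array form of the packet from the question.
struct __align__(4) alignedChars {
    unsigned char a[4];
};

// Hypothetical helper: split one 32-bit word into its four bytes.
// a[0] is the lowest byte, which is the first byte in memory on a
// little-endian device.
__host__ __device__ inline alignedChars unpackWord(unsigned int word)
{
    alignedChars p;
    p.a[0] = static_cast<unsigned char>( word        & 0xFFu);
    p.a[1] = static_cast<unsigned char>((word >>  8) & 0xFFu);
    p.a[2] = static_cast<unsigned char>((word >> 16) & 0xFFu);
    p.a[3] = static_cast<unsigned char>((word >> 24) & 0xFFu);
    return p;
}

#ifdef __CUDACC__
// Hypothetical kernel: each thread reads one 4-character packet as a single
// 32-bit word and writes the sum of its characters.
// Assumes `words` is the character stream viewed as 32-bit words, so the
// buffer must be 4-byte aligned and hold numWords * 4 characters.
__global__ void sumPackets(const unsigned int* words, unsigned int* sums, int numWords)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numWords) {
        alignedChars p = unpackWord(words[i]);
        sums[i] = p.a[0] + p.a[1] + p.a[2] + p.a[3];
    }
}
#endif
```

A buffer returned by cudaMalloc satisfies the alignment assumption. Since the array is only indexed with constants, I would expect the compiler to keep it in registers, but it is worth compiling with nvcc -ptx and checking that the four ld.global.u8 instructions are gone.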