aligned and misaligned structures

In CUDA SDK:…ml#alignedTypes

I got the following output:

Testing misaligned types…

Avg. time: 196.819717 ms / Copy throughput: 0.236593 GB/s.

Testing aligned types…

Avg. time: 5.652469 ms / Copy throughput: 8.238193 GB/s.

The results show about a 35x difference between RGBA8_misaligned and RGBA8. However, I don’t understand the real difference between them. They are defined as:

typedef struct {
    unsigned char r, g, b, a;
} RGBA8_misaligned;

typedef struct __align__(4) {
    unsigned char r, g, b, a;
} RGBA8;

Both structs have a size of 4 bytes.

I understand that a struct with a size of 3 bytes may need to be padded to 4 bytes, and that the starting address in physical memory affects speed too, but neither seems to be the reason here.

So what makes the difference?

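It turns out that adding the __align__ keyword leads to optimized PTX code: one vector load and one vector store instead of four byte-sized loads and four byte-sized stores.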

// with __align__ keyword
{%r7,%r8,%r9,%r10}, [%rd5+0];
[%rd8+0], {%r7,%r8,%r9,%r10};

// without __align__ keyword
%rh4, [%rd5+0];
[%rd8+0], %rh4;
%rh5, [%rd5+1];
[%rd8+1], %rh5;
%rh6, [%rd5+2];
[%rd8+2], %rh6;
%rh7, [%rd5+3];
[%rd8+3], %rh7;

wow! I didn’t know this! Thanks for pointing this out :)