aligned and misaligned structures

From the alignedTypes sample in the CUDA SDK: http://www.nvidia.com/content/cudazone/cud…ml#alignedTypes

I got the following output:

Testing misaligned types…

RGBA8_misaligned…
Avg. time: 196.819717 ms / Copy throughput: 0.236593 GB/s.
TEST PASSED

Testing aligned types…

RGBA8…
Avg. time: 5.652469 ms / Copy throughput: 8.238193 GB/s.
TEST PASSED

The results show roughly a 35x difference between RGBA8_misaligned and RGBA8. However, I don’t understand what actually differs between them. They are defined as:

typedef struct {
    unsigned char r, g, b, a;
} RGBA8_misaligned;

typedef struct __align__(4) {
    unsigned char r, g, b, a;
} RGBA8;

Both structs have a size of 4 bytes.

I understand that a 3-byte struct might need to be padded to 4 bytes, and that the starting address in physical memory also affects speed, but neither seems to be the reason here.

So what makes the difference?

It turns out that adding the __align__ keyword leads to optimized PTX code: one vectorized 4-byte load and store instead of four separate byte loads and stores.

// with __align__ keyword

    ld.global.v4.u8         {%r7,%r8,%r9,%r10}, [%rd5+0];
    st.global.v4.u8         [%rd8+0], {%r7,%r8,%r9,%r10};

// without __align__ keyword

    ld.global.u8            %rh4, [%rd5+0];
    st.global.u8            [%rd8+0], %rh4;
    ld.global.u8            %rh5, [%rd5+1];
    st.global.u8            [%rd8+1], %rh5;
    ld.global.u8            %rh6, [%rd5+2];
    st.global.u8            [%rd8+2], %rh6;
    ld.global.u8            %rh7, [%rd5+3];
    st.global.u8            [%rd8+3], %rh7;
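
One way to see why the compiler treats the two types differently is to compare their alignment rather than their size. The following is a rough host-side C++ sketch, not code from the SDK sample. It assumes the standard alignas(4) is a fair stand-in for CUDA's __align__(4), and print_layout is a hypothetical helper added only for illustration. Without an alignment guarantee, the compiler has to assume the struct may start at any byte address, so it cannot safely merge the four byte accesses into a single 4-byte access.

```cpp
#include <cstddef>
#include <cstdio>

// Same members as RGBA8_misaligned in the thread: four 1-byte fields
// and no alignment hint, so the type is only guaranteed 1-byte alignment.
struct RGBA8_misaligned {
    unsigned char r, g, b, a;
};

// Assumption: alignas(4) plays the role of CUDA's __align__(4) here,
// so the sketch builds with any C++11 host compiler.
struct alignas(4) RGBA8 {
    unsigned char r, g, b, a;
};

// Both types occupy 4 bytes; only the alignment requirement differs.
static_assert(sizeof(RGBA8_misaligned) == sizeof(RGBA8),
              "both structs should have the same size");
static_assert(alignof(RGBA8) == 4,
              "alignas(4) should give a 4-byte alignment requirement");

// Hypothetical helper: prints size and alignment of both types.
void print_layout() {
    std::printf("RGBA8_misaligned: sizeof=%zu alignof=%zu\n",
                sizeof(RGBA8_misaligned), alignof(RGBA8_misaligned));
    std::printf("RGBA8:            sizeof=%zu alignof=%zu\n",
                sizeof(RGBA8), alignof(RGBA8));
}
```

Calling print_layout() from main should show the same sizeof for both types and a different alignof. That matches the PTX above: the aligned type can be moved as one unit, while the unaligned one is copied byte by byte.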

Wow! I didn’t know this. Thanks for pointing it out :)