aligned and misaligned structures

Bin_Li · September 29, 2009, 7:46pm

In the CUDA SDK alignedTypes sample: http://www.nvidia.com/content/cudazone/cuda_sdk/CUDA_Basic_Topics.html#alignedTypes

I got the following output:

Testing misaligned types…
…
RGBA8_misaligned…
Avg. time: 196.819717 ms / Copy throughput: 0.236593 GB/s.
TEST PASSED
…
Testing aligned types…
…
RGBA8…
Avg. time: 5.652469 ms / Copy throughput: 8.238193 GB/s.
TEST PASSED
…

The results show roughly a 35x difference between RGBA8_misaligned and RGBA8, but I don't understand what actually differs between them. They are defined as:

typedef struct {
    unsigned char r, g, b, a;
} RGBA8_misaligned;

typedef struct __align__(4) {
    unsigned char r, g, b, a;
} RGBA8;

Both structs have a size of 4 bytes.

I understand that a 3-byte struct may need to be padded to 4 bytes, and that the starting address in physical memory affects speed too, but neither seems to be the reason here.

So what makes the difference?

Bin_Li · September 29, 2009, 9:20pm

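It turns out that adding the __align__ keyword leads to optimized PTX code: the whole struct is moved with one vector load and one vector store instead of four separate byte loads and stores.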

// with __align__ keyword

    ld.global.v4.u8         {%r7,%r8,%r9,%r10}, [%rd5+0];
    st.global.v4.u8         [%rd8+0], {%r7,%r8,%r9,%r10};

// without __align__ keyword

    ld.global.u8            %rh4, [%rd5+0];
    st.global.u8            [%rd8+0], %rh4;
    ld.global.u8            %rh5, [%rd5+1];
    st.global.u8            [%rd8+1], %rh5;
    ld.global.u8            %rh6, [%rd5+2];
    st.global.u8            [%rd8+2], %rh6;
    ld.global.u8            %rh7, [%rd5+3];
    st.global.u8            [%rd8+3], %rh7;
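
To make the reason a bit more concrete, here is a minimal host-side sketch (plain C++, not taken from the SDK). It assumes that standard alignas(4) behaves like CUDA's __align__(4) for this purpose, and whole_element_access_ok is a hypothetical helper name made up for illustration. The point it tries to show: the two structs have the same size but different alignment guarantees, and as far as I understand, the compiler can only emit a single 4-byte-wide access when it knows every element starts on a 4-byte boundary.

```cpp
#include <cstddef>

// Host-side stand-ins for the two SDK structs. Standard C++ alignas(4) is
// used here as an assumed equivalent of CUDA's __align__(4).
struct RGBA8_misaligned {
    unsigned char r, g, b, a;
};

struct alignas(4) RGBA8 {
    unsigned char r, g, b, a;
};

// Both types occupy 4 bytes...
static_assert(sizeof(RGBA8_misaligned) == 4, "four 1-byte members, no padding");
static_assert(sizeof(RGBA8) == 4, "alignas(4) does not change the size here");

// ...but only RGBA8 is guaranteed to start on a 4-byte boundary.
static_assert(alignof(RGBA8_misaligned) == 1, "alignment of unsigned char");
static_assert(alignof(RGBA8) == 4, "raised by alignas(4)");

// Hypothetical helper (not part of the SDK): true when every object of type T
// is guaranteed to start at an address that is a multiple of its own size,
// which is the guarantee a single whole-element load or store relies on.
template <typename T>
constexpr bool whole_element_access_ok() {
    return alignof(T) % sizeof(T) == 0;
}
```

With these definitions, whole_element_access_ok<RGBA8>() is true and whole_element_access_ok<RGBA8_misaligned>() is false, which matches the PTX above: without the alignment guarantee the compiler falls back to byte-by-byte accesses.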

_teju · September 30, 2009, 5:30am

wow! I didn’t know this! Thanks for pointing this out :)
