Question about CUDA sample "alignedTypes"

Dear forum,
The CUDA 9.2 sample alignedTypes compares access performance for aligned and misaligned structs. I'm wondering what the difference is between these two structs:

typedef struct
{
    unsigned char r, g, b, a;
} RGBA8_misaligned;

typedef struct __align__(4)
{
    unsigned char r, g, b, a;
} RGBA8;

The sample code uses cudaMalloc to allocate byte arrays and copies the input array into an output array. Since both the input and output arrays are aligned to at least a 256-byte boundary, and both structs have the same layout, I'm curious what the point of __align__(4) is and what causes the difference in bandwidth. Why is the copy of the aligned version consistently faster than that of the misaligned version? Thank you.

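At compile time, the compiler does not know some of the things you point out. In particular, it cannot see that the pointers passed to the kernel come from cudaMalloc and are therefore 256-byte aligned; it only sees the struct type, whose natural alignment is 1 byte. Without an alignment guarantee specified at compile time, the compiler must generate the "safer" code that does a bytewise copy of the structure.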
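To make the compile-time view concrete, here is a minimal sketch, not taken from the sample's source. It reuses the two struct definitions from the question (the aligned type is called RGBA8 here, which is an assumption) and checks what the type system actually promises. It assumes compilation with nvcc in C++11 mode or later, since __align__ comes from the CUDA toolkit headers.

```cuda
#include <cstdio>

// Struct definitions as in the question. The name RGBA8 for the aligned
// variant is an assumption made for this sketch.
typedef struct
{
    unsigned char r, g, b, a;
} RGBA8_misaligned;

typedef struct __align__(4)
{
    unsigned char r, g, b, a;
} RGBA8;

// Both types occupy 4 bytes with the same member layout.
static_assert(sizeof(RGBA8_misaligned) == 4, "four 1-byte members");
static_assert(sizeof(RGBA8) == 4, "four 1-byte members");

// Only the annotated type guarantees a 4-byte-aligned address.
static_assert(alignof(RGBA8_misaligned) == 1, "alignment of unsigned char");
static_assert(alignof(RGBA8) == 4, "raised by __align__(4)");

int main()
{
    printf("RGBA8_misaligned: size %zu, align %zu\n",
           sizeof(RGBA8_misaligned), alignof(RGBA8_misaligned));
    printf("RGBA8:            size %zu, align %zu\n",
           sizeof(RGBA8), alignof(RGBA8));
    return 0;
}
```

The layout is identical, so the only thing __align__(4) changes is the guarantee the compiler may rely on when it sees a pointer to the type.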

With the alignment guarantee, the compiler can generate a more efficient struct-wide copy (e.g. as a char4 or similar, a single 4-byte load and store).

This should be verifiable with inspection of the generated SASS code.

cuobjdump -sass /usr/local/cuda/samples/bin/x86_64/linux/release/alignedTypes
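If you would rather inspect something smaller than the full sample, a standalone file along the following lines should show the same effect. This is a rough sketch under stated assumptions, not the sample's actual kernel: copyKernel, copyMatches, the launch configuration, and the type name RGBA8 are all made up for illustration.

```cuda
#include <cstdio>
#include <cstring>
#include <vector>
#include <cuda_runtime.h>

// Struct definitions as in the question (RGBA8 is an assumed name).
typedef struct { unsigned char r, g, b, a; } RGBA8_misaligned;
typedef struct __align__(4) { unsigned char r, g, b, a; } RGBA8;

// Hypothetical copy kernel: each thread copies whole structs in a
// grid-stride loop. The struct assignment is where the compiler chooses
// between one 4-byte access and four 1-byte accesses.
template <typename T>
__global__ void copyKernel(T *dst, const T *src, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
    {
        dst[i] = src[i];
    }
}

// Hypothetical host helper: copies n elements on the device and returns
// true if the output matches the input byte for byte.
template <typename T>
bool copyMatches(int n)
{
    std::vector<unsigned char> in(n * sizeof(T)), out(n * sizeof(T), 0);
    for (size_t i = 0; i < in.size(); ++i)
        in[i] = static_cast<unsigned char>(i * 7 + 1);

    T *dSrc = nullptr, *dDst = nullptr;
    if (cudaMalloc(&dSrc, in.size()) != cudaSuccess) return false;
    if (cudaMalloc(&dDst, in.size()) != cudaSuccess) { cudaFree(dSrc); return false; }

    cudaMemcpy(dSrc, in.data(), in.size(), cudaMemcpyHostToDevice);
    copyKernel<T><<<64, 256>>>(dDst, dSrc, n);
    cudaError_t launchErr = cudaGetLastError();
    cudaError_t copyErr =
        cudaMemcpy(out.data(), dDst, out.size(), cudaMemcpyDeviceToHost);

    cudaFree(dSrc);
    cudaFree(dDst);
    return launchErr == cudaSuccess && copyErr == cudaSuccess &&
           std::memcmp(in.data(), out.data(), in.size()) == 0;
}

int main()
{
    const int n = 1 << 20;
    bool okAligned = copyMatches<RGBA8>(n);
    bool okMisaligned = copyMatches<RGBA8_misaligned>(n);
    printf("RGBA8 copy %s\n", okAligned ? "ok" : "FAILED");
    printf("RGBA8_misaligned copy %s\n", okMisaligned ? "ok" : "FAILED");
    return (okAligned && okMisaligned) ? 0 : 1;
}
```

Both instantiations produce the same result, so the difference only shows up in the generated code. After building with nvcc, running cuobjdump -sass on the resulting binary should list one kernel per struct type; the exact instructions depend on the target architecture and toolkit version.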

@txbob Thank you. Very interesting to know!

On sm_35, the aligned version emits:

LD.E R2, [R2];
ST.E [R4], R2;

whereas the vanilla version emits:

LD.E.U8 R11, [R4+0x3];
LD.E.U8 R9, [R4+0x2];
LD.E.U8 R7, [R4+0x1];
LD.E.U8 R6, [R4];
ST.E.U8 [R2+0x3], R11;
ST.E.U8 [R2+0x2], R9; 
ST.E.U8 [R2+0x1], R7; 
ST.E.U8 [R2], R6;

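That explains the performance gap: the misaligned version issues four byte-wide loads and four byte-wide stores per element, where the aligned version needs only one 32-bit load and one 32-bit store.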