Will the compiler optimise these memory accesses?

Assuming Array is a float array in global memory

If I had
x = TID * 4
a = Array[ x ]
b = Array[ x + 1 ]
c = Array[ x + 2 ]
d = Array[ x + 3 ]

Will the compiler fetch the 4 elements of Array at once,
and therefore fetch the 4 elements per thread for a whole warp as 1 IO of 128 bytes?

Thanks in advance :)

No, it won’t.

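Each thread will perform four separate 4-byte loads, and none of them is coalesced, because adjacent threads read addresses 16 bytes apart.

In this situation it would make more sense to load a float4 vector type and have each thread issue a single 16-byte load. For a full warp that is 32 × 16 = 512 bytes, which will most likely get broken up into 4 (or more if unaligned) 128-byte transactions.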

Example gist here.
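For readers without the link, here is a minimal sketch of the float4 idea. It is not the contents of the gist. The names copyScalar, copyVec4 and runBothCopies are made up for illustration, and the buffers are allocated as float4 from the start, so alignment is not an issue here.

```cuda
#include <cstring>
#include <vector>
#include <cuda_runtime.h>

// Pattern from the question: four separate 4-byte loads (and stores) per thread.
__global__ void copyScalar(const float* in, float* out, int nThreads)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= nThreads) return;
    int x = tid * 4;
    out[x]     = in[x];
    out[x + 1] = in[x + 1];
    out[x + 2] = in[x + 2];
    out[x + 3] = in[x + 3];
}

// float4 variant: one 16-byte load and one 16-byte store per thread.
__global__ void copyVec4(const float4* in, float4* out, int nThreads)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= nThreads) return;
    out[tid] = in[tid];
}

// Hypothetical host helper: runs both kernels on the same input and
// returns true only if both outputs match the input exactly.
bool runBothCopies(int nThreads)
{
    if (nThreads <= 0) return false;
    std::vector<float4> hIn(nThreads), hScalar(nThreads), hVec(nThreads);
    for (int i = 0; i < nThreads; ++i)
        hIn[i] = make_float4(4.0f * i, 4.0f * i + 1.0f, 4.0f * i + 2.0f, 4.0f * i + 3.0f);

    size_t bytes = nThreads * sizeof(float4);
    float4 *dIn = nullptr, *dScalar = nullptr, *dVec = nullptr;
    bool ok = cudaMalloc(&dIn, bytes) == cudaSuccess
           && cudaMalloc(&dScalar, bytes) == cudaSuccess
           && cudaMalloc(&dVec, bytes) == cudaSuccess;
    if (ok) {
        cudaMemcpy(dIn, hIn.data(), bytes, cudaMemcpyHostToDevice);
        int block = 128;
        int grid = (nThreads + block - 1) / block;
        // Viewing float4 storage as plain float is always safe alignment-wise.
        copyScalar<<<grid, block>>>(reinterpret_cast<const float*>(dIn),
                                    reinterpret_cast<float*>(dScalar), nThreads);
        copyVec4<<<grid, block>>>(dIn, dVec, nThreads);
        ok = cudaMemcpy(hScalar.data(), dScalar, bytes, cudaMemcpyDeviceToHost) == cudaSuccess
          && cudaMemcpy(hVec.data(), dVec, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    // cudaFree on a null pointer performs no operation.
    cudaFree(dIn); cudaFree(dScalar); cudaFree(dVec);
    if (!ok) return false;
    return std::memcmp(hIn.data(), hScalar.data(), bytes) == 0
        && std::memcmp(hIn.data(), hVec.data(), bytes) == 0;
}
```

Calling runBothCopies(1024) from main should return true on a working device. Both kernels produce the same result; the difference only shows up in the generated load/store instructions and in a profiler.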

And here’s a dump of what your snippet looks like under sm_35 and nvcc 5.5:

        /*0008*/                S2R R4, SR_TID.X;
        /*0010*/                ISCADD R5, R4, c[0x0][0x140], 0x4;
        /*0018*/                LD R3, [R5];
        /*0020*/                ISCADD R4, R4, c[0x0][0x144], 0x4;
        /*0028*/                LD R2, [R5+0x4];
        /*0030*/                LD R1, [R5+0x8];
        /*0038*/                LD R0, [R5+0xc];
        /*0048*/                ST [R4], R3;
        /*0050*/                ST [R4+0x4], R2;
        /*0058*/                ST [R4+0x8], R1;
        /*0060*/                ST [R4+0xc], R0;
        /*0068*/                EXIT ;
        /*0070*/                BRA `(.L_1);

However, I am guessing the cache will do a pretty good job minimizing the impact of this suboptimal code.

Didn’t want to use float4 as it would involve way too much code changing. Will now consider writing something to automate the changing of the code.

Oops, knew it was 512 bytes per warp, thought 1 thing and automatically typed 128…

On the newer devices these accesses will go through the L2 cache, so the performance difference should not be large.

A quick and dirty way that will require less editing: you can use “on the fly” type casting.

If your array is A of size 4*N, then in the kernel call, instead of (A, …) you can use ((float4 *) A, …). This means that in the kernel definition the array parameter must also be declared as float4, i.e. __global__ void kernel(float4 *A, …). This typecasting worked for me when I did an in-place transform and cast a real array to complex, which is in fact a float2.
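A rough sketch of that cast at the launch site, assuming the buffer came from cudaMalloc (which returns pointers aligned to at least 256 bytes) and its length is a multiple of 4. The kernel name scale4 and the helpers isAligned16 and scaleInPlace are hypothetical, chosen only for this example. The cast is not safe for a pointer offset into the array by a number of floats that is not a multiple of 4, since a float4 access needs 16-byte alignment.

```cuda
#include <cstdint>
#include <vector>
#include <cuda_runtime.h>

// Kernel declared with a float4 pointer; each thread handles 4 consecutive floats.
__global__ void scale4(float4* A, int n4, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n4) return;
    float4 v = A[i];                      // one 16-byte load
    v.x *= s; v.y *= s; v.z *= s; v.w *= s;
    A[i] = v;                             // one 16-byte store
}

// float4 accesses require a 16-byte aligned address.
bool isAligned16(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// The host keeps treating the data as float; only the launch casts.
bool scaleInPlace(std::vector<float>& h, float s)
{
    if (h.size() % 4 != 0) return false;  // size must be 4*N
    if (h.empty()) return true;           // nothing to do
    int n4 = static_cast<int>(h.size() / 4);
    size_t bytes = h.size() * sizeof(float);

    float* dA = nullptr;
    if (cudaMalloc(&dA, bytes) != cudaSuccess) return false;
    bool ok = isAligned16(dA);
    if (ok) {
        cudaMemcpy(dA, h.data(), bytes, cudaMemcpyHostToDevice);
        int block = 128;
        int grid = (n4 + block - 1) / block;
        scale4<<<grid, block>>>((float4*)dA, n4, s);   // the "on the fly" cast
        ok = cudaMemcpy(h.data(), dA, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dA);
    return ok;
}
```

The rest of the host code does not change with this approach; only the kernel signature and the kernel body need to know about float4.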