Will compiler optimise these memory accesses

kbam · July 11, 2013, 2:31am

Assuming Array is a float array in global memory

If I had
x = TID * 4
a = Array[ x ]
b = Array[ x + 1 ]
c = Array[ x + 2 ]
d = Array[ x + 3 ]

Will the compiler fetch the 4 elements of Array at once,
and therefore fetch the 4 elements for a warp as 1 IO of 128 bytes?

Thanks in advance :)

allanmac · July 11, 2013, 4:12am

No, it won’t.

Each thread will issue four separate 4-byte loads, and each of them is uncoalesced because neighbouring threads are 16 bytes apart.

In this situation it would make more sense to load a float4 vector type and have each thread issue a single 16-byte load, which for a warp will most likely get broken up into 4 (or more, if unaligned) 128-byte transactions.

Example gist here.
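
For reference, a minimal sketch of the float4 idea, written separately from the gist and not taken from it. The kernel names (copy4_scalar, copy4_vec) and the host helper (compare_copy4) are hypothetical, and it assumes the element count is a multiple of 4 and the buffers are 16-byte aligned (which cudaMalloc allocations are):

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <cstdint>
#include <vector>

// Scalar version: four separate 4-byte loads and stores per thread,
// mirroring the snippet in the question.
__global__ void copy4_scalar(const float* in, float* out, int n4)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n4) return;
    int x = tid * 4;
    float a = in[x];
    float b = in[x + 1];
    float c = in[x + 2];
    float d = in[x + 3];
    out[x]     = a;
    out[x + 1] = b;
    out[x + 2] = c;
    out[x + 3] = d;
}

// Vector version: one 16-byte load and one 16-byte store per thread.
__global__ void copy4_vec(const float4* in, float4* out, int n4)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n4) return;
    out[tid] = in[tid];
}

// Hypothetical host helper: runs both kernels on n4 groups of 4 floats and
// returns the number of mismatching elements, or -1 on a CUDA error.
int compare_copy4(int n4)
{
    const size_t n = static_cast<size_t>(n4) * 4;
    const size_t bytes = n * sizeof(float);
    std::vector<float> h_in(n), h_scalar(n), h_vec(n);
    for (size_t i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    float* d_in = nullptr;
    float* d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return -1; }

    // float4 accesses require 16-byte alignment.
    assert(reinterpret_cast<std::uintptr_t>(d_in) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(d_out) % 16 == 0);

    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);
    const int threads = 128;
    const int blocks = (n4 + threads - 1) / threads;

    copy4_scalar<<<blocks, threads>>>(d_in, d_out, n4);
    cudaError_t e1 = cudaMemcpy(h_scalar.data(), d_out, bytes, cudaMemcpyDeviceToHost);

    cudaMemset(d_out, 0, bytes);  // clear so the second result is not stale
    copy4_vec<<<blocks, threads>>>(reinterpret_cast<const float4*>(d_in),
                                   reinterpret_cast<float4*>(d_out), n4);
    cudaError_t e2 = cudaMemcpy(h_vec.data(), d_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    if (e1 != cudaSuccess || e2 != cudaSuccess) return -1;

    int mismatches = 0;
    for (size_t i = 0; i < n; ++i)
        if (h_scalar[i] != h_in[i] || h_vec[i] != h_in[i]) ++mismatches;
    return mismatches;
}
```

Both kernels move the same data; the difference should only show up in the generated load/store instructions, which can be checked with cuobjdump -sass.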

And here’s a dump of what your snippet looks like under sm_35 and nvcc 5.5:

mem4:
.text.mem4:
        /*0008*/                S2R R4, SR_TID.X;
        /*0010*/                ISCADD R5, R4, c[0x0][0x140], 0x4;
        /*0018*/                LD R3, [R5];
        /*0020*/                ISCADD R4, R4, c[0x0][0x144], 0x4;
        /*0028*/                LD R2, [R5+0x4];
        /*0030*/                LD R1, [R5+0x8];
        /*0038*/                LD R0, [R5+0xc];
        /*0048*/                ST [R4], R3;
        /*0050*/                ST [R4+0x4], R2;
        /*0058*/                ST [R4+0x8], R1;
        /*0060*/                ST [R4+0xc], R0;
        /*0068*/                EXIT ;
.L_1:
        /*0070*/                BRA `(.L_1);
.L_19:

However, I am guessing the cache will do a pretty good job minimizing the impact of this suboptimal code.

kbam · July 11, 2013, 4:56am

Thanks.
Didn’t want to use float4 as it would involve way too much code changing. Will now consider writing something to automate the changing of the code.

Oops, knew it was 512 bytes per warp, thought 1 thing and automatically typed 128…

pasoleatis · July 11, 2013, 5:08am

On the new devices these accesses will use the L2 cache, so there should not be a big change.

A quick and dirty way which will require less editing: you can use an “on the fly” type cast.

If your array is A of size (4*N), then in the kernel call, instead of (A, …) you can use ((float4 *) A, …). This means that the kernel definition must also declare the array that way: __global__ void kernel(float4 *A, …). This typecasting worked for me when I did an in-place transform and cast the data from real to complex, which is in fact a float2.
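
A rough sketch of that cast at the launch site, under a few assumptions: the kernel name (scale4), the host helper (scale_on_device) and the scaling operation are made up for illustration, the array length is a multiple of 4, and the pointer comes straight from cudaMalloc so it is 16-byte aligned:

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <cstdint>
#include <vector>

// The kernel is declared with float4*, although the caller allocated the
// buffer as a plain float array of size 4*N.
__global__ void scale4(float4* A, int n4, float s)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n4) return;
    float4 v = A[tid];   // one 16-byte load instead of four 4-byte loads
    v.x *= s;
    v.y *= s;
    v.z *= s;
    v.w *= s;
    A[tid] = v;          // one 16-byte store
}

// Hypothetical host helper: scales h in place on the device.
// Returns false on a CUDA error.
bool scale_on_device(std::vector<float>& h, float s)
{
    assert(h.size() % 4 == 0);   // the cast only covers whole groups of 4
    if (h.empty()) return true;
    const int n4 = static_cast<int>(h.size() / 4);
    const size_t bytes = h.size() * sizeof(float);

    float* d_A = nullptr;        // still a float* on the host side
    if (cudaMalloc(&d_A, bytes) != cudaSuccess) return false;
    assert(reinterpret_cast<std::uintptr_t>(d_A) % 16 == 0);

    cudaMemcpy(d_A, h.data(), bytes, cudaMemcpyHostToDevice);
    const int threads = 128;
    const int blocks = (n4 + threads - 1) / threads;

    // The "on the fly" cast: only the launch argument changes.
    scale4<<<blocks, threads>>>((float4*)d_A, n4, s);

    cudaError_t err = cudaMemcpy(h.data(), d_A, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_A);
    return err == cudaSuccess;
}
```

The cast is only safe if the pointer is 16-byte aligned; a pointer offset into the middle of an allocation (for example d_A + 1) would break that assumption.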
