CUDA Pro Tip: Optimize for Pointer Aliasing

Originally published at: https://developer.nvidia.com/blog/cuda-pro-tip-optimize-pointer-aliasing/

Often cited as the main reason that naïve C/C++ code cannot match FORTRAN performance, pointer aliasing is an important topic to understand when considering optimizations for your C/C++ code. In this tip I will describe what pointer aliasing is and a simple way to alter your code so that it does not harm your application…

Hi Jeremy,

If I use plain (restrict annotated) pointer arguments for a kernel directly, the behavior is what I expect and what you described. Still, if I use restrict annotated pointers inside a struct, it’s compiling, but the compiler doesn’t seem to take advantage of this.
See "__restrict__ seems to be ignored for base pointers in structs. having base pointers with restrict as kernel arguments directly works as expected" for tiny examples.

So to put it in short: Is there a trick to specify such annotation using struct of arrays in a way that nvcc takes advantage of this?

Thank you and kind regards,
Klaus

PS: It’s not only nvcc; clang behaves very similarly. So it might be a front-end problem in clang that nvcc inherits, but I am just speculating here.
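For readers with the same struct-of-arrays question, here is a minimal sketch in plain C of one thing that could be tried: forward the struct members to a helper whose parameters carry the restrict qualifier directly, since the post above reports that restrict on direct arguments behaves as expected. The struct `Arrays` and the functions `update` and `update_arrays` are hypothetical names for illustration. Whether nvcc or clang actually exploits the qualifier after inlining is an assumption that nothing in this thread confirms, so the generated code would need to be inspected. In CUDA the qualifier would be spelled `__restrict__`.

```c
#include <stddef.h>

/* Hypothetical struct-of-arrays container; names are for illustration only. */
struct Arrays {
    float *a;
    float *b;
    const float *c;
};

/* Helper whose parameters carry the restrict promise directly.
   The caller must guarantee that a, b and c do not overlap. */
static void update(float *restrict a, float *restrict b,
                   const float *restrict c, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        a[i] += c[i];
        b[i] += c[i];
    }
}

/* Entry point takes the struct and forwards its members to the helper. */
void update_arrays(struct Arrays s, size_t n)
{
    update(s.a, s.b, s.c, n);
}
```

The arithmetic is identical to indexing through the struct directly; only the place where the no-aliasing promise is stated changes.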

I do love the restrict keyword (and const, despite being told in a previous life that you didn't really need the const keyword). The Restrict Contract is an amusing way of putting down what the keyword means. A revised version was on my door for a while: http://cellperformance.beyo...

Why use pointers anyway if L2 is big enough?

Is there a pragma to specify a loop has no loop-dependencies instead?

When you write

"By giving a pointer the restrict property, the programmer is promising the compiler that any data accessed through that pointer is not accessed in any other way. In other words, the compiler doesn’t have to worry about aliasing when using a pointer with the restrict property."

This is not true.

Here is a trivial example:

void example1(float *restrict a, float *restrict b, float *c, int i) {
    c[i] = a[i] + b[i];
}

In this example, a and b may alias (a variant of this example is given in the C99 standard to show how two restricted pointers may still alias).

C99/C11 require that the restricted storage be modified in the restrict block for any guarantee to hold.
In fact, there are all sorts of messy, strange requirements on restrict that make it almost impossible for users to reason about whether the compiler may still treat their pointers as aliasing. This is one reason that C++ does not have restrict in the standard yet. After conforming support was implemented in GCC/LLVM, the number of issues where the compiler could not do what the user wanted, and the user became seriously confused as to why two restricted pointers were still considered aliasing, made those involved sit down and think about whether it was really the right model.
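To make the point above concrete, here is a small sketch in standard C99. The function name `add_readonly` and the helper `doubled_second_element` are made up for illustration. Since nothing reachable through a or b is modified inside the function, passing the same array for both is well defined, and the compiler cannot conclude from restrict alone that a and b are distinct.

```c
#include <stddef.h>

/* a and b are only read; c is the only pointer written through.
   Because no object accessed through a or b is modified here,
   the restrict rules do not forbid a and b from overlapping. */
void add_readonly(float *restrict a, float *restrict b, float *c, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

/* Hypothetical helper: passes the same array as both a and b. */
float doubled_second_element(void)
{
    float x[3] = {1.0f, 2.0f, 3.0f};
    float out[3] = {0.0f, 0.0f, 0.0f};
    add_readonly(x, x, out, 3);   /* defined behavior: x is never modified */
    return out[1];                /* 2.0f + 2.0f */
}
```

Passing out as a or b as well would be a different matter, because then an object read through a restrict pointer would also be modified through c.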

I've been trying to understand pointer aliasing and performance, and am glad I came across this article. However, I tried compiling example1 with and without __restrict in Visual Studio (Microsoft Optimizing Compiler Version 18.00.30723.0). Admittedly I'm very new to attempting to read x86 assembly, but here's what I get (both are built in the release configuration, with optimization enabled).

In both cases we have:
_TEXT SEGMENT
_a$ = 8; size = 4
_b$ = 12; size = 4
_c$ = 16; size = 4
_i$ = 20; size = 4

Original
; 3 : void example1(float* a, float* b, float* c, int i) {

push ebp
mov ebp, esp

; 4 : a[i] = a[i] + c[i];

mov edx, DWORD PTR _i$[ebp]
mov ecx, DWORD PTR _c$[ebp]
mov eax, DWORD PTR _a$[ebp]
movss xmm0, DWORD PTR [ecx+edx*4]
addss xmm0, DWORD PTR [eax+edx*4]
movss DWORD PTR [eax+edx*4], xmm0

; 5 : b[i] = b[i] + c[i];

mov eax, DWORD PTR _b$[ebp]
movss xmm0, DWORD PTR [ecx+edx*4]
addss xmm0, DWORD PTR [eax+edx*4]
movss DWORD PTR [eax+edx*4], xmm0

; 6 : }

pop ebp
ret 0

__restrict
; 3 : void example1(float* __restrict a, float* __restrict b, float* __restrict c, int i) {

push ebp
mov ebp, esp

; 4 : a[i] = a[i] + c[i];

mov edx, DWORD PTR _i$[ebp]
mov eax, DWORD PTR _c$[ebp]
mov ecx, DWORD PTR _a$[ebp]
movss xmm1, DWORD PTR [eax+edx*4]

; 5 : b[i] = b[i] + c[i];

mov eax, DWORD PTR _b$[ebp]
movss xmm0, DWORD PTR [ecx+edx*4]
addss xmm0, xmm1
addss xmm1, DWORD PTR [eax+edx*4]
movss DWORD PTR [ecx+edx*4], xmm0
movss DWORD PTR [eax+edx*4], xmm1

; 6 : }

pop ebp
ret 0

In either case there are the same total operations (1 push, 1 pop, 5 mov, 4 movss, 2 addss), just in a slightly different order. It's not obvious to me how one is advantageous compared to the other.

I've updated those two sentences to hopefully be true. The fact is that using __restrict does help performance in both the CPU and GPU examples demonstrated here, on two different compilers. Unlike the example you give, our examples all write to a restricted pointer.

Pointer aliasing and loop dependencies are two different but related issues. Even if there are no loop dependencies, pointers may alias, and vice versa.
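A small sketch in C of the two cases, with function names invented for illustration. The first loop has a loop-carried dependency even though only one pointer is involved. The second has no loop-carried dependency when out and in are the same array, yet the two pointers alias completely.

```c
#include <stddef.h>

/* Loop-carried dependency, no aliasing: a single pointer, but iteration i
   reads the value that iteration i - 1 wrote. */
void running_sum(float *a, size_t n)
{
    for (size_t i = 1; i < n; ++i)
        a[i] += a[i - 1];
}

/* Aliasing without a loop-carried dependency: when called with out == in,
   the pointers alias, yet each iteration touches only index i. */
void scale_by_two(float *out, const float *in, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = 2.0f * in[i];
}
```

Note that the compiler cannot tell from the signature of `scale_by_two` how the pointers overlap. A call with out pointing one element past in would create a dependency between iterations, which is why the two issues are related.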

Did you compare wall clock run time?

I did not. The blog post implies that pointer aliasing will result in inefficient machine code because additional load operations are necessary (one additional operation in example1, if I read it correctly). However, the assembly code I'm seeing doesn't support that. That's more what I'm trying to understand: why there are or aren't additional loads, rather than whether a reordering of the same operations makes any difference.

In any event it would be really insightful if this blog entry included some assembly language examples to really drive the point home.

Thanks for the reply.

It's worth remembering that the compiler is free to optimize however it likes, and additional information may not be used. The actual instructions generated are going to depend on both your compiler and your target architecture.

For example, if I compile the first example for a compute capability 3.5 GPU using nvcc on Linux, I get the following SASS assembler (viewed using: cuobjdump --dump-sass):

Without restrict:

MOV R1, c[0x0][0x44];
MOV R9, c[0x0][0x158];
MOV32I R10, 0x4;
IMAD.U32.U32 R4.CC, R9, R10, c[0x0][0x140];
IMAD.HI.X R5, R9, R10, c[0x0][0x144];
IMAD.U32.U32 R6.CC, R9, R10, c[0x0][0x150];
LD.E R3, [R4];

IMAD.HI.X R7, R9, R10, c[0x0][0x154];
LD.E R0, [R6];
IMAD.U32.U32 R8.CC, R9, R10, c[0x0][0x148];
IMAD.HI.X R9, R9, R10, c[0x0][0x14c];
FADD R2, R3, R0;
ST.E [R4], R2;
LD.E R0, [R6];

LD.E R3, [R8];
FADD R0, R3, R0;
ST.E [R8], R0;

--

With restrict:

MOV R1, c[0x0][0x44];
MOV R7, c[0x0][0x158];
MOV32I R8, 0x4;
IMAD.U32.U32 R2.CC, R7, R8, c[0x0][0x150];
IMAD.HI.X R3, R7, R8, c[0x0][0x154];
LDG.E R3, [R2];
IMAD.U32.U32 R4.CC, R7, R8, c[0x0][0x140];

IMAD.HI.X R5, R7, R8, c[0x0][0x144];
LD.E R0, [R4];
IMAD.U32.U32 R6.CC, R7, R8, c[0x0][0x148];
IMAD.HI.X R7, R7, R8, c[0x0][0x14c];
TEXDEPBAR 0x0;
FADD R2, R0, R3;
ST.E [R4], R2;

LD.E R0, [R6];
FADD R0, R0, R3;
ST.E [R6], R0;

-----

In this case the compiler generates four loads in the non-restrict version, and two normal loads and one texture cached load in the restrict version. The latter is clearly preferred.
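The reason for the extra load can be shown in plain C. This is a sketch only: `example1` follows the source lines quoted in the compiler listings earlier in the thread, while `example1_cached` is a hypothetical hand-written variant that reads c[i] once. Neither uses restrict, so both are well defined even when the pointers overlap, and they give different answers when a and c are the same array. That difference is why the compiler must reload c[i] unless restrict tells it the overlap cannot happen.

```c
/* Same body as example1 in the listings: c[i] appears twice in the source. */
void example1(float *a, float *b, float *c, int i)
{
    a[i] = a[i] + c[i];
    b[i] = b[i] + c[i];   /* c[i] is read again: the store to a[i] may have changed it */
}

/* Hypothetical variant that loads c[i] a single time and reuses the value.
   This is the transformation restrict would permit the compiler to make. */
void example1_cached(float *a, float *b, float *c, int i)
{
    float ci = c[i];
    a[i] = a[i] + ci;
    b[i] = b[i] + ci;
}
```

With a and c pointing at the same one-element array holding 1 and b holding 10, `example1` leaves b at 12 (it sees the updated value 2), whereas `example1_cached` leaves b at 11. When the arrays are disjoint the two functions agree.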

This is really cool stuff about GPUs.
Thank you very much.

This is really interesting and helpful. Thank you so much.