CUDA Pro Tip: Optimize for Pointer Aliasing

jwitsoe · August 8, 2014, 1:31am

Originally published at: https://developer.nvidia.com/blog/cuda-pro-tip-optimize-pointer-aliasing/

Often cited as the main reason that naïve C/C++ code cannot match FORTRAN performance, pointer aliasing is an important topic to understand when considering optimizations for your C/C++ code. In this tip I will describe what pointer aliasing is and a simple way to alter your code so that it does not harm your application…

klaus.leppkes · August 23, 2014, 9:05am

Hi Jeremy,

If I use plain (restrict-annotated) pointer arguments for a kernel directly, the behavior is what I expect and what you described. However, if I use restrict-annotated pointers inside a struct, it compiles, but the compiler doesn't seem to take advantage of the annotation.
See "__restrict__ seems to be ignored for base pointers in structs. having base pointers with restrict as kernel arguments directly works as expected" for tiny examples.

In short: is there a trick for specifying such an annotation on a struct of arrays in a way that nvcc takes advantage of it?

Thank you and kind regards,
Klaus

PS: It's not only nvcc; clang behaves very similarly. So it might be a front-end problem in clang that nvcc inherits, but I am just speculating here.
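One workaround consistent with the observation above is to unpack the struct and pass its pointers as separate restrict-qualified arguments to a helper, so the no-alias promise sits on plain function parameters. A minimal sketch in plain C follows (in CUDA the spelling would be `__restrict__`). The names `Arrays`, `add_kernel` and `add_arrays` are hypothetical, and whether a given compiler actually exploits the qualifier here is an assumption that should be checked in the generated assembly.

```c
#include <stddef.h>

/* Hypothetical struct of arrays; the members are plain pointers here. */
struct Arrays {
    float *a;
    float *b;
    const float *c;
};

/* The no-alias promise is made on ordinary function parameters. */
static void add_kernel(float *restrict a, float *restrict b,
                       const float *restrict c, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        a[i] += c[i];
        b[i] += c[i];
    }
}

/* Unpack the struct at the call site. The caller must still guarantee
   that the three arrays do not overlap. */
void add_arrays(const struct Arrays *s, size_t n)
{
    add_kernel(s->a, s->b, s->c, n);
}
```

The design choice is simply to move the qualifier from the struct members to parameters, which is the case reported above as working for kernels.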

anon66475735 · August 8, 2014, 3:46pm

I do love the restrict keyword (and const, despite being told in a previous life that you didn't really need the const keyword). The Restrict Contract is an amusing way of putting down what the keyword means. A revised version was on my door for a while: http://cellperformance.beyo...

anon48548833 · August 22, 2014, 7:27am

Why use pointers anyway if L2 is big enough?

anon17946776 · August 22, 2014, 9:41am

Is there a pragma to specify a loop has no loop-dependencies instead?

anon39484864 · August 22, 2014, 1:46pm

When you write

"By giving a pointer the restrict property, the programmer is promising the compiler that any data accessed through that pointer is not accessed in any other way. In other words, the compiler doesn’t have to worry about aliasing when using a pointer with the restrict property."

This is not true.

Here is a trivial example:

void example1(float *restrict a, float *restrict b, float *c, int i) {
    c[i] = a[i] + b[i];
}

In this example, a and b may alias (a variant of this example is given in the C99 standard to show how two restricted pointers may still alias).

C99/C11 require that the restricted storage be modified in the restrict block for any guarantee to hold.
In fact, there are all sorts of messy and strange requirements on restrict that make it almost impossible for users to reason about whether their pointers will still be treated as aliasing. This is one reason that C++ does not have restrict in the standard yet. After conforming support was implemented in GCC/LLVM, the number of cases where the compiler could not do what the user wanted, and the user became seriously confused as to why two restricted pointers were still considered aliasing, made the implementers sit down and think about whether it was really the right model.
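The read-only case can be sketched in C as follows. This is an illustration with hypothetical function names (`sum_into`, `add_in_place`), not text from the standard: passing the same array for two restrict-qualified parameters is well defined as long as nothing is modified through either of them, and becomes undefined once one of them is used to write.

```c
/* a and b are only read, so they may legally refer to the same array;
   the compiler therefore cannot assume that a and b are distinct. */
void sum_into(const float *restrict a, const float *restrict b,
              float *c, int i)
{
    c[i] = a[i] + b[i];
}

/* Here a[i] is modified, so calling add_in_place(x, x, i) would be
   undefined behavior: the object written through a is also read through b. */
void add_in_place(float *restrict a, const float *restrict b, int i)
{
    a[i] = a[i] + b[i];
}
```

A call such as `sum_into(x, x, out, i)` with `out` distinct from `x` is therefore valid, which is why the qualifier alone does not let the compiler conclude that `a` and `b` differ.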

anon53219476 · August 23, 2014, 2:30pm

I've been trying to understand pointer aliasing and performance, and am glad I came across this article. However, I tried compiling example1 with and without __restrict in Visual Studio (Microsoft Optimizing Compiler Version 18.00.30723.0). Admittedly I'm very new to reading x86 assembly, but here's what I get (both are built in the release configuration, with optimization).

In both cases we have:
_TEXT SEGMENT
_a$ = 8 ; size = 4
_b$ = 12 ; size = 4
_c$ = 16 ; size = 4
_i$ = 20 ; size = 4

Original
; 3 : void example1(float* a, float* b, float* c, int i) {

push ebp
mov ebp, esp

; 4 : a[i] = a[i] + c[i];

mov edx, DWORD PTR _i$[ebp]
mov ecx, DWORD PTR _c$[ebp]
mov eax, DWORD PTR _a$[ebp]
movss xmm0, DWORD PTR [ecx+edx*4]
addss xmm0, DWORD PTR [eax+edx*4]
movss DWORD PTR [eax+edx*4], xmm0

; 5 : b[i] = b[i] + c[i];

mov eax, DWORD PTR _b$[ebp]
movss xmm0, DWORD PTR [ecx+edx*4]
addss xmm0, DWORD PTR [eax+edx*4]
movss DWORD PTR [eax+edx*4], xmm0

; 6 : }

pop ebp
ret 0

__restrict
; 3 : void example1(float* __restrict a, float* __restrict b, float* __restrict c, int i) {

push ebp
mov ebp, esp

; 4 : a[i] = a[i] + c[i];

mov edx, DWORD PTR _i$[ebp]
mov eax, DWORD PTR _c$[ebp]
mov ecx, DWORD PTR _a$[ebp]
movss xmm1, DWORD PTR [eax+edx*4]

; 5 : b[i] = b[i] + c[i];

mov eax, DWORD PTR _b$[ebp]
movss xmm0, DWORD PTR [ecx+edx*4]
addss xmm0, xmm1
addss xmm1, DWORD PTR [eax+edx*4]
movss DWORD PTR [ecx+edx*4], xmm0
movss DWORD PTR [eax+edx*4], xmm1

; 6 : }

pop ebp
ret 0

In either case there are the same total operations (1 push, 1 pop, 5 mov, 4 movss, 2 addss), just in a slightly different order. Not obvious to me how one is advantageous compared to the other.
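Counting memory reads rather than instructions may help here: in the listings above, the version without `__restrict` reads `c[i]` from memory twice, while the `__restrict` version reads it once into `xmm1` and reuses the register. The second read is needed because the store to `a[i]` could change `c[i]` if the two pointers alias. A small C sketch of that situation, using the same function body as the listings:

```c
/* Same body as example1 in the listings above, without restrict.
   The compiler must assume that the store to a[i] may change c[i]. */
void example1(float *a, float *b, float *c, int i)
{
    a[i] = a[i] + c[i];
    b[i] = b[i] + c[i];   /* c[i] has to be read again here */
}
```

If `a` and `c` point at the same float holding 1.0 and `b[i]` holds 10.0, the call leaves 12.0 in `b[i]`; with distinct arrays holding the same values it leaves 11.0. Reusing the first value of `c[i]` is only correct in the second case, which is what the restrict promise allows the compiler to assume.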

anon95180265 · August 24, 2014, 10:53pm

I've updated those two sentences to hopefully be true. The fact is that using __restrict does help performance in both the CPU and GPU examples demonstrated here, on two different compilers. Unlike the example you give, our examples all write to a restricted pointer.

anon95180265 · August 24, 2014, 10:54pm

Pointer aliasing and loop dependencies are two different, but related issues. Even if there are no loop dependencies, pointers may alias, and vice versa.
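The distinction can be illustrated with two small C loops. This is only a sketch with hypothetical function names (`prefix_sum`, `scale`): the first has a loop-carried dependency even though its pointers cannot alias, and the second has independent iterations only if the caller passes arrays that do not overlap.

```c
/* No aliasing (restrict), but each iteration depends on the previous one. */
void prefix_sum(float *restrict out, const float *restrict in, int n)
{
    if (n <= 0)
        return;
    out[0] = in[0];
    for (int i = 1; i < n; ++i)
        out[i] = out[i - 1] + in[i];
}

/* Each iteration is independent when dst and src are distinct arrays,
   but if the caller passes overlapping arrays (e.g. dst == src + 1),
   an iteration reads a value written by an earlier one. */
void scale(float *dst, const float *src, int n)
{
    for (int i = 0; i < n; ++i)
        dst[i] = 2.0f * src[i];
}
```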

anon95180265 · August 24, 2014, 10:56pm

Did you compare wall clock run time?

anon53219476 · August 26, 2014, 2:05am

I did not. The blog post implies that pointer aliasing will result in inefficient machine code because additional load operations are necessary (one additional operation in example1, if I read it correctly). However, the assembly code I'm seeing doesn't seem to support that. That's what I'm really trying to understand: why there are or aren't additional loads, rather than whether a reordering of the same operations makes any difference.

In any event it would be really insightful if this blog entry included some assembly language examples to really drive the point home.

Thanks for the reply.

anon60850134 · September 8, 2014, 9:50am

It's worth remembering that the compiler is free to optimize however it likes, and additional information may not be used. The actual instructions generated are going to depend on both your compiler and your target architecture.

For example: if I compile the first example for a compute capability 3.5 GPU using nvcc on linux, I get the following SASS assembler (viewed using: cuobjdump --dump-sass):

Without restrict:

MOV R1, c[0x0][0x44];
MOV R9, c[0x0][0x158];
MOV32I R10, 0x4;
IMAD.U32.U32 R4.CC, R9, R10, c[0x0][0x140];
IMAD.HI.X R5, R9, R10, c[0x0][0x144];
IMAD.U32.U32 R6.CC, R9, R10, c[0x0][0x150];
LD.E R3, [R4];

IMAD.HI.X R7, R9, R10, c[0x0][0x154];
LD.E R0, [R6];
IMAD.U32.U32 R8.CC, R9, R10, c[0x0][0x148];
IMAD.HI.X R9, R9, R10, c[0x0][0x14c];
FADD R2, R3, R0;
ST.E [R4], R2;
LD.E R0, [R6];

LD.E R3, [R8];
FADD R0, R3, R0;
ST.E [R8], R0;

--

With restrict:

MOV R1, c[0x0][0x44];
MOV R7, c[0x0][0x158];
MOV32I R8, 0x4;
IMAD.U32.U32 R2.CC, R7, R8, c[0x0][0x150];
IMAD.HI.X R3, R7, R8, c[0x0][0x154];
LDG.E R3, [R2];
IMAD.U32.U32 R4.CC, R7, R8, c[0x0][0x140];

IMAD.HI.X R5, R7, R8, c[0x0][0x144];
LD.E R0, [R4];
IMAD.U32.U32 R6.CC, R7, R8, c[0x0][0x148];
IMAD.HI.X R7, R7, R8, c[0x0][0x14c];
TEXDEPBAR 0x0;
FADD R2, R0, R3;
ST.E [R4], R2;

LD.E R0, [R6];
FADD R0, R0, R3;
ST.E [R6], R0;

-----

In this case the compiler generates four loads in the non-restrict version, and two normal loads and one texture cached load in the restrict version. The latter is clearly preferred.

anon99633812 · May 27, 2016, 6:29am

This is really cool stuff about GPUs.
Thank you very much.

anon70852881 · April 11, 2019, 5:34am

This is really interesting and helpful. Thank you so much.
