Difference between raw pointer and reference

  1. Is there any significant overhead to using a pointer vs. a reference in __global__ and __device__ code?

  2. Can references benefit from __restrict__ as raw pointers do?

__device__ void foo(int& a, int& b) {
    ...
}

You generally can’t use a reference as a kernel argument, unless you are using Unified Memory: a kernel argument is copied to the device at launch, and a reference to an ordinary host variable would refer to memory that device code cannot access.
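
As a minimal sketch of the Unified Memory case (the kernel and the __managed__ variable below are made up for illustration, assuming the toolchain accepts a reference kernel parameter as described above), a reference bound to managed data remains usable inside the kernel, because a managed variable has a single address that is valid on both host and device:

#include <cstdio>

__managed__ int counter = 0;      // Unified Memory: one address, valid on host and device

__global__ void bump(int& value)  // reference kernel parameter
{
    value += 1;                   // writes straight through to the managed variable
}

int main()
{
    bump<<<1, 1>>>(counter);      // the reference binds to device-accessible memory
    cudaDeviceSynchronize();
    printf("counter = %d\n", counter);  // expected to print: counter = 1
    return 0;
}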

TensorFlow seems to use references heavily in __device__ and __global__ kernels. Could you point me to a good resource on how and when to use references? I could only find very brief mentions of references in the CUDA C Programming Guide.

References are a concept that is not specific to CUDA. They are part of C++, like many other things, and CUDA mostly adheres to C++.

You’re not going to find a tutorial on references in CUDA just like you’re not going to find a tutorial on function overloading in CUDA. It’s part of C++. It should behave approximately the same way.

The general usage of references in CUDA C++, and the motivating reasons to use them, aren’t going to be any different than what they are in C++.

I’m not aware of any claims that CUDA makes about using __restrict__ with references, or whether it is even allowed, or whether it makes sense.

A __restrict__ decorator is a contract that you make with the compiler, essentially saying that you won’t alias pointers. But a reference is an alias for the variable it refers to, so I’m not sure using the two together makes sense.
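
To make the contract concrete, here is a small made-up sketch using ordinary pointers: because of __restrict__, the compiler may assume the store through result does not modify *a or *b, so it can load a[0] and b[0] once and reuse the values. Calling it as add3(p, p, p) would break the promise.

__device__ void add3(const int* __restrict__ a,
                     const int* __restrict__ b,
                     int*       __restrict__ result)
{
    result[0] = a[0] + b[0];
    result[1] = a[0] - b[0];   // a[0] and b[0] may be reused from registers;
                               // no reload is required after the store above
}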

Since pass-by-value makes a copy of the variable, one of the important benefits of using references is that you avoid unnecessary copies. References also sort out some vexing use cases where you would like the function to modify the contents of an object or variable in the calling environment, without having to resort to passing a pointer (or a pointer-to-pointer when the variable being modified is itself a pointer), which is the C-style approach to the same problem.
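
A tiny illustration of both points (the function and variable names here are invented): the by-value version only changes its local copy, while the by-reference version changes the caller’s variable, with no pointer syntax at the call site.

#include <cstdio>

__device__ void bump_copy(int v) { v += 1; }   // operates on a local copy only
__device__ void bump_ref(int& v) { v += 1; }   // operates on the caller's variable

__global__ void demo()
{
    int x = 0;
    bump_copy(x);                 // x is still 0
    bump_ref(x);                  // x is now 1
    printf("x = %d\n", x);
}

int main()
{
    demo<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}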

This will work on Grace Hopper, where the GPU can access system memory directly.

As the double-underscore prefix to the symbol name indicates, __restrict__ is a vendor-specific extension to C++ and CUDA, inspired by restrict in the ISO-C99 standard. The semantics of __restrict__ are roughly the same with different toolchain vendors.

Whether CUDA allows __restrict__ for references is therefore a question of the compiler, not the target architecture. I did a quick spot check with nvcc 11.8 and restricted references are accepted. See simple example program below.

As has been stated, __restrict__ is simply a promise the programmer makes to the compiler that there is no local aliasing, which may allow the compiler to pursue more aggressive optimizations, in particular moving load instructions to an earlier point than would otherwise be possible. Whether the promise holds true is not checked by the compiler; for example, in the program below one can change the function call to fn (&a, b, b), and no error will be generated.

Since restricted references are supported by the CUDA toolchain, using restricted references wherever one would otherwise use restricted pointers seems fine to me. One caveat: for __host__ __device__ code, one would have to check whether the host toolchain also supports restricted references.
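
For what it’s worth, a sketch of that __host__ __device__ case might look like the following (the function is made up for illustration); whether it compiles on the host side depends on the host compiler (gcc and clang accept __restrict__ on references), so this should be verified for the toolchain actually in use.

// Promise: src and dst never name the same object, so the compiler is free
// to keep the value of src in a register across the two statements below.
__host__ __device__ void accumulate_twice(const float &__restrict__ src,
                                          float &__restrict__ dst)
{
    dst += src;
    dst += src;   // without __restrict__, src would have to be reloaded here,
                  // since the write to dst above could have changed it
}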

Generally speaking, references are implemented using pointers “under the hood”, but other than providing syntactic sugar (no explicit de-referencing) they are always associated with an existing data object, thus cannot be null or uninitialized. And while a pointer can point to different data objects during its lifetime, a reference is simply a different name for the one specific data object it was originally bound to. This makes references safer but less versatile than pointers (one cannot perform arithmetic on a reference, unlike a pointer), and in C++, the use of references is often preferred over the use of pointers.

#include <stdio.h>
#include <stdlib.h>

// Device function mixing a restricted pointer with two restricted references.
// The __restrict__ qualifiers promise the compiler that rptr, rref_1, and
// rref_2 all refer to distinct objects.
__device__ void fn (int *__restrict__ rptr,
                    int &__restrict__ rref_1,
                    int &__restrict__ rref_2)
{
    rref_2 = -1;   // write through the reference; visible in the caller
    printf ("fn: *rptr=%d rref_1=%d rref_2=%d\n", *rptr, rref_1, rref_2);
}

__global__ void kernel (int a, int b, int c)
{
    printf ("kernel in: a=%d b=%d c=%d\n", a, b, c);
    fn (&a, b, c);   // a passed by pointer, b and c by (restricted) reference
    printf ("kernel out: a=%d b=%d c=%d\n", a, b, c);
}

int main (void)
{
    int a = 47, b = 11, c = 0;
    printf ("main: a=%d b=%d c=%d\n", a, b, c);
    kernel<<<1,1>>>(a, b, c);
    cudaDeviceSynchronize();   // wait for the kernel so its printf output appears
    return EXIT_SUCCESS;
}