__restrict__ usage with fully overlapping element-by-element processing

Hello,

I generally understand __restrict__ usage, but the documentation examples do not cover a very basic use case.
Let’s assume we want to perform element-by-element processing, like a[i] = b[i]+c[i], without reusing a[i], b[i] and c[i] in the remaining kernel instructions, and that
-a could be an alias of b
-a could be an alias of c
-b could be an alias of c
-by “alias” I mean exactly the same pointer, not a shifted, partially overlapping version

In this case, I don’t understand why __restrict__ could not be used: I can’t imagine a scenario where optimized cache usage or instruction reordering would produce a wrong result, since all elements are independent.

Same question for a processing like a[i] = sqrt(b[i]).
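
For concreteness, this is the kind of kernel and (aliased) call I have in mind; all names are just illustrative:

__global__ void addKernel(float* __restrict__ a,
                          const float* __restrict__ b,
                          const float* __restrict__ c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = b[i] + c[i];   // each element is read once and written once
}

// the “exact alias” case I am asking about: passing the very same pointer, e.g.
// addKernel<<<grid, block>>>(d_buf, d_buf, d_buf, n);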

__restrict__ is simply a promise the programmer makes to the compiler: the __restrict__-qualified pointer is the only way the data object it points to is accessed within the scope of that pointer. A programmer is free to break that promise; in that case the code may work as intended or it may not. The CUDA semantics are consistent with how restrict was first defined in ISO-C99. Consistency is a good thing: it exemplifies the design principle of “least surprise”.
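
For contrast, here is the textbook case where the promise does matter, i.e. where a location is read again after a store (a minimal sketch of my own, not from the docs):

__device__ int set_and_read(int* __restrict__ a, int* __restrict__ b)
{
    *a = 1;
    *b = 2;
    return *a;   // under the no-alias promise this may be folded to 'return 1';
                 // if a and b actually alias, the stored value is 2
}

In the element-wise idioms above there is no such re-read after a store, which is why it is hard to construct a failure; but the language rule is stated for the general case.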

So what about allowing more complicated semantics in which code would also be well defined for certain exceptional cases of __restrict__-qualified pointers to objects that are in fact aliased? What practical advantage would this provide over simply not using __restrict__ (as currently defined) in those cases?

I understand that breaking the promise leads to undefined behaviour. But in the case I quoted: what could go wrong in instruction scheduling?
__restrict__ is an opportunity for optimizations, so I would like to use it in that case, and I don’t understand why I should refrain.
If my “add” kernel is (float* dst, const float* __restrict__ src1, const float* __restrict__ src2), the __restrict__ on src1 and src2 in CUDA allows reads through a highly optimized read-only cache, minimizing memory read time.
But the doc says I should not use it if the parameters can be aliased.

That’s why I want to understand what could happen; I see no possible failure here.
The answer is certainly buried deep in hardware details, but that is exactly what I want to know.

Undefined behavior means exactly that. It does not mean the code will definitely be “broken”, i.e. behave contrary to expectations. It means the code may be broken or it may not. You identified a subset of special cases where the code would not be broken when violating the __restrict__ promise. I also cannot think of a way this would cause these particular idioms to misbehave (it’s possible I am missing something, haven’t had my first coffee yet :-).

But in those cases leaving off the __restrict__ also incurs zero performance penalty. I think. Have you been able to demonstrate that there actually is a difference in performance? I did not build the GPU hardware, so I am no help in a “very technical in the deep HW” discussion, I am afraid.

[later:] As a quick experiment, I compiled the following two kernels with CUDA 9.2 for sm_35. The generated code, inspected via cuobjdump --dump-sass, shows that bar uses LDG.E instructions to read, while foo uses LD.E instructions to read. Given that this is a case of streaming access with no data re-use, the HW read path should not make a difference, as the performance will be limited by memory throughput. Confirming or refuting that assumption is left as an exercise to the reader.

__global__ void foo (float *a, const float *b, const float *c, int len)
{
    int stride = gridDim.x * blockDim.x;
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    for (int i = tid; i < len; i += stride) {
        a[i] = b[i] + c[i];
    }
}
__global__ void bar (float * __restrict__ a, const float * __restrict__ b, const float * __restrict__ c, int len)
{
    int stride = gridDim.x * blockDim.x;
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    for (int i = tid; i < len; i += stride) {
        a[i] = b[i] + c[i];
    }
}
   code for sm_35
           Function : _Z3barPfPKfS1_i
   .headerflags    @"EF_CUDA_SM35 EF_CUDA_PTX_SM(EF_CUDA_SM35)"
                                                                           /* 0x0880b8b0a0a08cc0 */
   /*0008*/                   MOV R1, c[0x0][0x44];                        /* 0x64c03c00089c0006 */
   /*0010*/                   S2R R0, SR_CTAID.X;                          /* 0x86400000129c0002 */
   /*0018*/                   S2R R3, SR_TID.X;                            /* 0x86400000109c000e */
   /*0020*/                   IMAD R0, R0, c[0x0][0x28], R3;               /* 0x51080c00051c0002 */
   /*0028*/                   ISETP.GE.AND P0, PT, R0, c[0x0][0x158], PT;  /* 0x5b681c002b1c001e */
   /*0030*/               @P0 EXIT;                                        /* 0x180000000000003c */
   /*0038*/                   ISCADD R2.CC, R0, c[0x0][0x150], 0x2;        /* 0x60c408002a1c000a */
                                                                           /* 0x0880c48080b080ac */
   /*0048*/                   MOV32I R7, 0x4;                              /* 0x74000000021fc01e */
   /*0050*/                   IMAD.HI.X R3, R0, R7, c[0x0][0x154];         /* 0x93181c002a9c000e */
   /*0058*/                   ISCADD R4.CC, R0, c[0x0][0x148], 0x2;        /* 0x60c40800291c0012 */
   /*0060*/                   IMAD.HI.X R5, R0, R7, c[0x0][0x14c];         /* 0x93181c00299c0016 */
   /*0068*/                   LDG.E.T R6, [R2];                            /* 0x600210857f9c0819 */
   /*0070*/                   LDG.E.P R5, [R4];                            /* 0x600210867f9c1015 */
   /*0078*/                   ISCADD R2.CC, R0, c[0x0][0x140], 0x2;        /* 0x60c40800281c000a */
                                                                           /* 0x0880a30880a08cac */
   /*0088*/                   MOV R4, c[0x0][0x28];                        /* 0x64c03c00051c0012 */
   /*0090*/                   IMAD.HI.X R3, R0, R7, c[0x0][0x144];         /* 0x93181c00289c000e */
   /*0098*/                   IMAD R0, R4, c[0x0][0x34], R0;               /* 0x51080000069c1002 */
   /*00a0*/                   ISETP.LT.AND P0, PT, R0, c[0x0][0x158], PT;  /* 0x5b181c002b1c001e */
   /*00a8*/                   TEXDEPBAR 0x0;                               /* 0x77000000001c003e */
   /*00b0*/                   FADD R5, R6, R5;                             /* 0xe2c00000029c1816 */
   /*00b8*/                   ST.E [R2], R5;                               /* 0xe4800000001c0814 */
                                                                           /* 0x0800000000bc10b8 */
   /*00c8*/               @P0 BRA 0x38;                                    /* 0x12007fffb400003c */
   /*00d0*/                   MOV RZ, RZ;                                  /* 0xe4c03c007f9c03fe */
   /*00d8*/                   EXIT;                                        /* 0x18000000001c003c */
   /*00e0*/                   BRA 0xe0;                                    /* 0x12007ffffc1c003c */
   /*00e8*/                   NOP;                                         /* 0x85800000001c3c02 */
   /*00f0*/                   NOP;                                         /* 0x85800000001c3c02 */
   /*00f8*/                   NOP;                                         /* 0x85800000001c3c02 */
           ..........................


           Function : _Z3fooPfPKfS1_i
   .headerflags    @"EF_CUDA_SM35 EF_CUDA_PTX_SM(EF_CUDA_SM35)"
                                                                           /* 0x0880b8b0a0a08cc0 */
   /*0008*/                   MOV R1, c[0x0][0x44];                        /* 0x64c03c00089c0006 */
   /*0010*/                   S2R R0, SR_CTAID.X;                          /* 0x86400000129c0002 */
   /*0018*/                   S2R R3, SR_TID.X;                            /* 0x86400000109c000e */
   /*0020*/                   IMAD R0, R0, c[0x0][0x28], R3;               /* 0x51080c00051c0002 */
   /*0028*/                   ISETP.GE.AND P0, PT, R0, c[0x0][0x158], PT;  /* 0x5b681c002b1c001e */
   /*0030*/               @P0 EXIT;                                        /* 0x180000000000003c */
   /*0038*/                   ISCADD R4.CC, R0, c[0x0][0x148], 0x2;        /* 0x60c40800291c0012 */
                                                                           /* 0x0880a0b010a0ac10 */
   /*0048*/                   MOV32I R7, 0x4;                              /* 0x74000000021fc01e */
   /*0050*/                   MOV R9, c[0x0][0x28];                        /* 0x64c03c00051c0026 */
   /*0058*/                   IMAD.HI.X R5, R0, R7, c[0x0][0x14c];         /* 0x93181c00299c0016 */
   /*0060*/                   ISCADD R2.CC, R0, c[0x0][0x150], 0x2;        /* 0x60c408002a1c000a */
   /*0068*/                   LD.E R4, [R4];                               /* 0xc4800000001c1010 */
   /*0070*/                   IMAD.HI.X R3, R0, R7, c[0x0][0x154];         /* 0x93181c002a9c000e */
   /*0078*/                   LD.E R2, [R2];                               /* 0xc4800000001c0808 */
                                                                           /* 0x08b810a0fca08cb0 */
   /*0088*/                   ISCADD R6.CC, R0, c[0x0][0x140], 0x2;        /* 0x60c40800281c001a */
   /*0090*/                   IMAD.HI.X R7, R0, R7, c[0x0][0x144];         /* 0x93181c00289c001e */
   /*0098*/                   IMAD R0, R9, c[0x0][0x34], R0;               /* 0x51080000069c2402 */
   /*00a0*/                   ISETP.LT.AND P0, PT, R0, c[0x0][0x158], PT;  /* 0x5b181c002b1c001e */
   /*00a8*/                   FADD R8, R2, R4;                             /* 0xe2c00000021c0822 */
   /*00b0*/                   ST.E [R6], R8;                               /* 0xe4800000001c1820 */
   /*00b8*/               @P0 BRA 0x38;                                    /* 0x12007fffbc00003c */
                                                                           /* 0x080000000000bc10 */
   /*00c8*/                   MOV RZ, RZ;                                  /* 0xe4c03c007f9c03fe */
   /*00d0*/                   EXIT;                                        /* 0x18000000001c003c */
   /*00d8*/                   BRA 0xd8;                                    /* 0x12007ffffc1c003c */
   /*00e0*/                   NOP;                                         /* 0x85800000001c3c02 */
   /*00e8*/                   NOP;                                         /* 0x85800000001c3c02 */
   /*00f0*/                   NOP;                                         /* 0x85800000001c3c02 */
   /*00f8*/                   NOP;                                         /* 0x85800000001c3c02 */
           ..........................
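
The SASS above was produced roughly as follows (file names are placeholders):

nvcc -arch=sm_35 -cubin -o kernels.cubin kernels.cu
cuobjdump --dump-sass kernels.cubin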

I was inspired by https://stackoverflow.com/questions/53149193/share-large-constant-data-among-cuda-threads

And the quote:

with const ... __restrict__ . This will allow the compiler to generate appropriate LDG instructions for access to constant data, pulling it through the read-only cache mechanism.
[edit] found “official” reference here : https://on-demand.gputechconf.com/gtc/2012/presentations/S0642-GTC2012-Inside-Kepler.pdf

So I understand that the benefits are not really visible in the compiled kernel code but in the memory manager.
That’s why it is hard to get technical details.

The statement from the documentation is correct, and as you can see from the code I just posted, the compiler does use instructions for different read paths depending on whether __restrict__ is used or not. So yes, the effect is visible in the machine code for the kernels.

But: Your examples all use streaming idioms with no data re-use, so which caches are used for reading global memory should make no difference in performance. By all means, try it both ways and let us know what you find.
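
As an aside: the read-only data path can also be requested per load with the __ldg() intrinsic (compute capability 3.5 and higher), without qualifying the whole pointer. Minimal sketch of what that would look like for the same kernel; note that the data must still not be written by any thread for the duration of the kernel, so this does not sidestep the aliasing question:

__global__ void baz (float *a, const float *b, const float *c, int len)
{
    int stride = gridDim.x * blockDim.x;
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    for (int i = tid; i < len; i += stride) {
        a[i] = __ldg(&b[i]) + __ldg(&c[i]);   // explicit read-only cache loads
    }
}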

I would like to agree with you, but the problem with the docs for __restrict__ is that they do not cover those simplest cases.
Even if we can reasonably assume that, if src1 and src2 come through the cache thanks to const __restrict__, it does not matter whether there are aliases or not (but the docs do not really say that), there is still at least one tricky question.
What if:
-I declare dst as __restrict__ as well (not const), for the purpose of “giving the compiler an optimization opportunity” (no matter whether it is real or not, just according to the docs)
-src1 is given the exact same pointer as dst at run time (which should be forbidden by my explicit __restrict__ usage)

In my opinion, src1 is loaded into a cache and it will work correctly.
But since it is not stated explicitly, I could imagine a really twisted scenario where the compiler does not perform any write because the underlying address is attached to some “read-only” area during kernel execution.

My problem is that testing is not an answer: it might behave correctly now and become wrong in the future as the runtime or compiler evolves, since I am not complying with the __restrict__ requirement.

When I suggested experiments, I was suggesting doing that for the purpose of measuring performance, not “proving” correctness.

Your point, as I understood it, was: The description of __restrict__ in the documentation is unnecessarily restrictive because it does not take into account that in certain special cases aliasing is harmless even when I told the compiler there will be no aliasing. Following the documentation causes my CUDA code to leave performance on the table.

My point is: Yes, in certain instances aliasing via __restrict__-qualified pointers seems to be harmless. However, in those situations there is also no performance benefit from using __restrict__-qualified pointers. If so, unnecessarily complicating the documentation and deviating from the simple semantics introduced by ISO-C99 provides no benefit to anyone and may actually be harmful by confusing programmers.

Here is my experiment: vector addition with and without __restrict__-qualified pointers. No performance difference outside measurement noise.

GPU1:

vecadd [foo]: operating on vectors of 50000000 float (= 2.000e+008 bytes)
vecadd [foo]: using 128 threads per block, 65520 blocks
vecadd [foo]: mintime = 24.639 msec  throughput = 24.35 GB/sec

vecadd [bar]: operating on vectors of 50000000 float (= 2.000e+008 bytes)
vecadd [bar]: using 128 threads per block, 65520 blocks
vecadd [bar]: mintime = 24.645 msec  throughput = 24.35 GB/sec

GPU2:

vecadd [foo]: operating on vectors of 50000000 float (= 2.000e+08 bytes)
vecadd [foo]: using 128 threads per block, 65520 blocks
vecadd [foo]: mintime = 1.604 msec  throughput = 373.97 GB/sec

vecadd [bar]: operating on vectors of 50000000 float (= 2.000e+08 bytes)
vecadd [bar]: using 128 threads per block, 65520 blocks
vecadd [bar]: mintime = 1.602 msec  throughput = 374.41 GB/sec

If I change the kernel calls from {foo|bar}<<<dimGrid,dimBlock>>>(d_a, d_b, d_c, opts.len) to {foo|bar}<<<dimGrid,dimBlock>>>(d_a, d_b, d_b, opts.len) to introduce aliasing, I get this on GPU2:

vecadd [foo]: operating on vectors of 50000000 float (= 2.000e+08 bytes)
vecadd [foo]: using 128 threads per block, 65520 blocks
vecadd [foo]: mintime = 1.145 msec  throughput = 524.15 GB/sec

vecadd [bar]: operating on vectors of 50000000 float (= 2.000e+08 bytes)
vecadd [bar]: using 128 threads per block, 65520 blocks
vecadd [bar]: mintime = 1.142 msec  throughput = 525.39 GB/sec

No difference in performance (note that the GB/sec numbers are off because the calculation assumes the code is operating on three vectors, while we are only using two in this case; the true throughput is therefore about two thirds of the reported figure, i.e. roughly 350 GB/sec rather than 524 GB/sec).

The program:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>  // memset
#include <math.h>    // fabs, log

#define VECADD_THREADS  128
#define VECADD_DEFLEN   50000000
#define VECADD_ITER     10           // as in STREAM benchmark

// Macro to catch CUDA errors in CUDA runtime calls
#define CUDA_SAFE_CALL(call)                                          \
do {                                                                  \
    cudaError_t err = call;                                           \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)

// Macro to catch CUDA errors in kernel launches
#define CHECK_LAUNCH_ERROR()                                          \
do {                                                                  \
    /* Check synchronous errors, i.e. pre-launch */                   \
    cudaError_t err = cudaGetLastError();                             \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
    /* Check asynchronous errors, i.e. kernel failed (ULF) */         \
    err = cudaDeviceSynchronize();                                    \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString( err) );      \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)

// A routine to give access to a high precision timer on most systems.
#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
    LARGE_INTEGER t;
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;

    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    }
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() * 1.0e-3;
    }
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif


__global__ void foo (float *a, const float *b, const float *c, int len)
{
    int stride = gridDim.x * blockDim.x;
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    for (int i = tid; i < len; i += stride) {
        a[i] = b[i] + c[i];
    }
}
__global__ void bar (float * __restrict__ a, const float * __restrict__ b, const float * __restrict__ c, int len)
{
    int stride = gridDim.x * blockDim.x;
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    for (int i = tid; i < len; i += stride) {
        a[i] = b[i] + c[i];
    }
}

struct vecaddOpts {
    int len;
};

static int processArgs (int argc, char *argv[], struct vecaddOpts *opts)
{
    int error = 0;
    memset (opts, 0, sizeof(*opts));
    while (argc) {
        if (*argv[0] == '-') {
            switch (*(argv[0]+1)) {
            case 'n':
                opts->len = atol(argv[0]+2);
                break;
            default:
                fprintf (stderr, "Unknown switch '%c%s'\n", '-', argv[0]+1);
                error++;
                break;
            }
        }
        argc--;
        argv++;
    }
    return error;
}

int main (int argc, char *argv[])
{
    double start, stop, elapsed, mintime;
    float *d_a = 0, *d_b = 0, *d_c = 0;
    int errors;
    struct vecaddOpts opts;

    errors = processArgs (argc, argv, &opts);
    if (errors) {
        return EXIT_FAILURE;
    }
    opts.len = (opts.len) ? opts.len : VECADD_DEFLEN;

    /* Allocate memory on device */
    CUDA_SAFE_CALL (cudaMalloc((void**)&d_a, sizeof(d_a[0]) * opts.len));
    CUDA_SAFE_CALL (cudaMalloc((void**)&d_b, sizeof(d_b[0]) * opts.len));
    CUDA_SAFE_CALL (cudaMalloc((void**)&d_c, sizeof(d_c[0]) * opts.len));
    
    /* Initialize device memory */
    CUDA_SAFE_CALL (cudaMemset(d_a, 0xff, sizeof(d_a[0]) * opts.len)); // NAN
    CUDA_SAFE_CALL (cudaMemset(d_b, 0x00, sizeof(d_b[0]) * opts.len)); // zero
    CUDA_SAFE_CALL (cudaMemset(d_c, 0x00, sizeof(d_c[0]) * opts.len)); // zero

    /* Compute execution configuration */
    dim3 dimBlock(VECADD_THREADS);
    int threadBlocks = (opts.len + (dimBlock.x - 1)) / dimBlock.x;
    if (threadBlocks > 65520) threadBlocks = 65520;
    dim3 dimGrid(threadBlocks);
    
    printf ("vecadd [foo]: operating on vectors of %d float (= %.3e bytes)\n", 
            opts.len, (double)sizeof(d_a[0]) * opts.len);
    printf ("vecadd [foo]: using %d threads per block, %d blocks\n", 
            dimBlock.x, dimGrid.x);

    mintime = fabs(log(0.0));
    for (int k = 0; k < VECADD_ITER; k++) {
        start = second();
        foo<<<dimGrid,dimBlock>>>(d_a, d_b, d_c, opts.len);
        CHECK_LAUNCH_ERROR();
        stop = second();
        elapsed = stop - start;
        if (elapsed < mintime) mintime = elapsed;
    }

    printf ("vecadd [foo]: mintime = %.3f msec  throughput = %.2f GB/sec\n",
            1.0e3 * mintime, (3 * 1.e-9 * sizeof(d_a[0]) * opts.len) / mintime);

    printf ("\n");

    printf ("vecadd [bar]: operating on vectors of %d float (= %.3e bytes)\n", 
            opts.len, (double)sizeof(d_a[0]) * opts.len);
    printf ("vecadd [bar]: using %d threads per block, %d blocks\n", 
            dimBlock.x, dimGrid.x);


    mintime = fabs(log(0.0));
    for (int k = 0; k < VECADD_ITER; k++) {
        start = second();
        bar<<<dimGrid,dimBlock>>>(d_a, d_b, d_c, opts.len);
        CHECK_LAUNCH_ERROR();
        stop = second();
        elapsed = stop - start;
        if (elapsed < mintime) mintime = elapsed;
    }
    printf ("vecadd [bar]: mintime = %.3f msec  throughput = %.2f GB/sec\n",
            1.0e3 * mintime, (3 * 1.e-9 * sizeof(d_a[0]) * opts.len) / mintime);

    CUDA_SAFE_CALL (cudaFree(d_a));
    CUDA_SAFE_CALL (cudaFree(d_b));
    CUDA_SAFE_CALL (cudaFree(d_c));

    return EXIT_SUCCESS;
}
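
For reference, built and run roughly like this (the file name is mine; note that the -n switch expects its value attached directly):

nvcc -arch=sm_35 -o vecadd vecadd.cu
./vecadd -n50000000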

The description of __restrict__ in the documentation is unnecessarily restrictive because it does not take into account that in certain special cases aliasing is harmless

Indeed, this is what I think

If so, unnecessarily complicating the documentation and deviating from the simple semantics introduced by ISO-C99 provides no benefit to anyone and may actually be harmful by confusing programmers.

I think the doc is indeed kept sparse for clarity, but for me these are just holes that I need to fill in order to fully master the guarantees.

Yes, in certain instances aliasing via __restrict__-qualified pointers seems to be harmless. However, in those situations there is also no performance benefit from using __restrict__-qualified pointers

That’s what your experiment shows, but only because the compiler is smart enough in such simple cases. Let’s be cheeky:

if (syracuse_algorithm(src1[i]) > 1) // https://en.wikipedia.org/wiki/Collatz_conjecture
  dst[i] = 0; // never taken at run time, but would be a problem if dst were an alias of a source
dst[i] += src1[i] + src2[i];

in that case, the compiler will certainly not be able to guess, so the developer has to specify __restrict__ to let the “const cache” optimization occur for src1 and src2. And therefore, the wise developer (me :-) ) will wonder whether aliasing src1, src2 and dst could then lead to undefined behaviour, even though it seems very unlikely.

(of course, the use of the syracuse algorithm is just a simple, unrealistic example; it should be read as “any code where the compiler cannot tell whether aliasing is a problem”)
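
To make that concrete, here is roughly the kernel I have in mind (the stopping function and all names are just illustrative):

// runs the Collatz iteration to completion; the result is conjectured to always
// be 1, but the compiler has no way of knowing that
__device__ int syracuse_algorithm(int n)
{
    while (n > 1)
        n = (n & 1) ? (3 * n + 1) : (n / 2);
    return n;
}

__global__ void cheekyAdd(int* dst, const int* src1, const int* src2, int len)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < len) {
        if (syracuse_algorithm(src1[i]) > 1)
            dst[i] = 0;   // never taken at run time (per the conjecture), but the
                          // compiler must assume the store may happen
        dst[i] += src1[i] + src2[i];
    }
}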

It has nothing to do with compiler smarts. Note that the kernels foo and bar in my example actually use different load paths, based on the promise given by the programmer with __restrict__. The compiler simply uses that information, no smarts or guessing involved. So, for example on sm_70 the kernel variant with __restrict__ pointers uses the LDG.E.CONSTANT.SYS instruction to read the source data, while the kernel without restrict uses the LDG.E.SYS instruction to read the source data.

There is a difference between [1] “x invokes undefined behavior but in situation y it happens to work as desired”, and [2] “x invokes undefined behavior, except in situation y, where the behavior is actually well defined”. The current state of affairs in CUDA is [1]. What I perceive you would like it to be is [2].

My point is: While choosing the semantics of [2] is theoretically possible, it comes with no potential upside (in particular, no performance benefit) but with potential downside (programmer confusion due to more complicated semantics). Choosing [2] is therefore not advised.

Right, but my initial idea in opening this thread is that it might not be undefined behaviour in the case I initially described!

The problem is that we lack the full specification to determine that.
The overall idea is “when data reuse is involved, aliasing __restrict__ pointers is undefined”: no problem with that.
But what the doc says is slightly different: “__restrict__ is good for performance, but when pointers overlap, this is a problem”.
And my concern is: “OK, but in the special case of overlapping with no data reuse, is it really undefined behaviour?”

I don’t think this is what the CUDA docs tell us. It seems to me they tell us:

[1] Use of __restrict__ can be beneficial for performance in some cases, but it may not make a difference in others.
[2] Use of __restrict__ invokes undefined behavior in the presence of aliasing (thus violating the promise given by using __restrict__).

Undefined behavior occurs wherever the designers of a programming language define it to be. This is, to first order, orthogonal to the capabilities of specific hardware. Generally this is done to provide for ease of compiler implementation, improved code performance, or to provide flexibility in the face of differences in hardware behavior.

For example, in C and C++ (and thus CUDA) overflow in signed integer arithmetic computation invokes undefined behavior. If one wants integer overflow with well-defined behavior, e.g. wrap-around modulo 2^n, one must use unsigned integer computation. Note that most hardware (all I have ever used since 1981) actually does have well-defined behavior for overflow in signed integer computation. But guaranteeing a desired wrap-around behavior across all supported hardware can at times impose overhead, i.e. a performance penalty. Therefore it is a best practice to make all integers int, unless there is a compelling reason for them to be some other type.
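
For illustration, a minimal sketch (assuming 32-bit int and unsigned int):

unsigned int u = 4294967295u;   // UINT_MAX
u = u + 1;                      // well defined: wraps to 0, i.e. modulo 2^32

int s = 2147483647;             // INT_MAX
// s = s + 1;                   // undefined behavior: signed integer overflow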

As we have pretty much worked out in this thread, the cases where we can reason that violating the __restrict__ promise is actually harmless on current GPUs are also exactly those where the use of __restrict__ does not provide a performance benefit. Worked counterexamples (i.e. a program that demonstrates otherwise) are welcome.

From [1] and [2] follows the best practice: If the programmer can guarantee no aliasing to occur, use __restrict__, as this may help performance. FWIW, the use of the read-only read path is not necessarily the primary benefit of using __restrict__. In fact there is no promise that a GPU actually provides such a special read-only read path. A common benefit from using __restrict__ is that it gives the compiler much greater freedom in scheduling loads relative to stores, thus often improving the latency tolerance of the code.
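
A minimal sketch of that scheduling effect (my own example, not from the docs): without __restrict__ the compiler must assume the store to out[0] could modify in[], so it cannot hoist the remaining loads above that store; with __restrict__ it may issue all loads up front and overlap their latencies.

__global__ void scale3(float* __restrict__ out, const float* __restrict__ in, float s)
{
    out[0] = s * in[0];
    out[1] = s * in[1];   // with __restrict__, this load may be scheduled before the store to out[0]
    out[2] = s * in[2];
}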

Thanks. If I come up with a counter example some day, I will add to this thread.

[edit]
I think I came up with an idea. I will test next week.
Imagine a kernel(dst, src1, src2) working on ints (to avoid infinity and NaN issues).
The code:
for each dst(x, y): dst(x,y) = min(dst(x, y), local neighborhood of src1(x, y))
syncthreads()
for each dst(x, y): dst(x,y) = min(dst(x, y), local neighborhood of src2(x, y))
(more precisely, “for each (x,y)” should read “for each center (x,y) of non-overlapping neighborhoods”)

this will involve a lot of fetches from src1 and src2, so they should go through the read-only cache for performance. But because of the potential side effects of writing to dst, which might be an alias of src1 or src2, the compiler will not do that on its own.

However, we also understand that if dst is an exact alias of src1 or src2, the numerical result must be the same.

And in that case:
-__restrict__ would give a performance advantage
-but according to the doc, it would lead to undefined behaviour when aliasing
-we hope that such undefined behaviour would not occur, since mathematically there would be no problem

I really have to test ! Too bad I don’t have a CUDA GPU at home, I have to wait a few days.

I confirm what I said. With the following kernel, we are exactly in the case where __restrict__ gives a visible performance boost (2x on my test machine), while the lack of specification forbids its use in the case of exact aliasing (dst == src1 and/or dst == src2 and/or src1 == src2), even though it would theoretically give the same numerical result.

In each non-overlapping tile of size (neighborhoodSize x neighborhoodSize), with (x,y) being the single pixel at the center of the tile, dst(x,y) is replaced by the min of dst(x,y) and the min() of the matching region in src1 and src2.

//please note that imSize.z is the stride in number of elements, not bytes
__global__ void testDefaultKernel(pixel_t* dst, const pixel_t* src1, const pixel_t* src2, int3 imSize, int neighborhoodRadius)
{
  const int neighborhoodSize = 2*neighborhoodRadius+1;
  const int start_x = (blockIdx.x*blockDim.x+threadIdx.x)*neighborhoodSize;
  const int start_y = (blockIdx.y*blockDim.y+threadIdx.y)*neighborhoodSize;
  const int x = start_x+neighborhoodRadius;
  const int y = start_y+neighborhoodRadius;
  if ((x<imSize.x) && (y<imSize.y))
  {
    pixel_t inf = dst[y*imSize.z+x];

    for(int ty = start_y ; ty<start_y+neighborhoodSize ; ++ty)
      for(int tx = start_x ; tx<start_x+neighborhoodSize ; ++tx)
        if ((tx>=0) && (tx<imSize.x) && (ty >= 0) && (ty<imSize.y))
          inf = min(inf, src1[ty*imSize.z+tx]);

    __syncthreads(); // note: this barrier sits inside a divergent branch; since no shared memory is involved it is arguably unnecessary

    for(int ty = start_y ; ty<start_y+neighborhoodSize ; ++ty)
      for(int tx = start_x ; tx<start_x+neighborhoodSize ; ++tx)
        if ((tx>=0) && (tx<imSize.x) && (ty >= 0) && (ty<imSize.y))
          inf = min(inf, src2[ty*imSize.z+tx]);

    dst[y*imSize.z+x] = inf;
  }//end if ((x<imSize.x) && (y<imSize.y))
}
//end testDefaultKernel()

And to call it:

    const int neighborhoodSize = 2*neighborhoodRadius+1;
    dim3 dimBlock = dim3(16, 16);
    dim3 dimGrid = dim3(iDivUp(imSize.x, (int)dimBlock.x*neighborhoodSize), iDivUp(imSize.y, (int)dimBlock.y*neighborhoodSize));

    testDefaultKernel<<<dimGrid, dimBlock, 0, stream>>>(dst, src1, src2, imSize, neighborhoodRadius);
    checkLastCudaError(true);
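
The __restrict__ variant I compared against is identical except for the qualifiers; here is a sketch of its signature:

__global__ void testRestrictKernel(pixel_t* __restrict__ dst,
                                   const pixel_t* __restrict__ src1,
                                   const pixel_t* __restrict__ src2,
                                   int3 imSize, int neighborhoodRadius);
// same body as testDefaultKernel() above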

There is no forbidden usage of __restrict__; it is simply a question of what the compiler folks are willing to guarantee. If you would like to see a change in documentation or product specifications, you can file a well-reasoned proposal as an enhancement request with NVIDIA. You can use the bug reporting form (see sticky notice at the top of the forum) for that and prefix the synopsis with “RFE:” to mark it as an enhancement request (as opposed to a functional bug).

I followed your advice. I was using the forum because I didn’t know that the “bug” form could be used for RFEs.