How to eliminate address computation instructions in the SASS code

I am wondering if there is a way to truly eliminate address computation instructions when performing pointer-chasing benchmarks.
For example, consider the following kernel:

__global__ void myKernel(
    float a,
    float b,
    float *array,
    unsigned int *startTime,
    unsigned int *endTime,
    unsigned int *sm_id
    )
{
    __shared__ float svalue;
    b = b - 1;
    int iter = 1000;
    unsigned int start, end;
    start=clock();
    for (int i = 0; i < iter; i++) {
        a = array[__float_as_uint(a)];  // dependent load: the next index comes from the value just loaded
        repeat30(a = a + b;)            // 30 dependent FADDs between consecutive loads
    }
    end=clock();

    if (threadIdx.x == 0) svalue = a;   // keep 'a' live so the compiler does not eliminate the loop
    int pid = threadIdx.x + blockIdx.x*blockDim.x;
    startTime[pid]=start;
    endTime[pid]=end;
    sm_id[pid] = get_smid();
}

The repeat30 macro is defined in repeat.h, which can be found with the microbenchmark code accompanying "Demystifying GPU Microarchitecture through Microbenchmarking" (stuffedcow).
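
get_smid() is the usual inline-PTX wrapper around the %smid special register, along the lines of:

__device__ __forceinline__ unsigned int get_smid (void)
{
    unsigned int sm;
    asm ("mov.u32 %0, %%smid;" : "=r"(sm));  // read the SM id special register
    return sm;
}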

In the compiled SASS code (nvcc 10.2, -arch=sm_35), there are two instructions before every memory load instruction, as follows:

ISCADD R8.CC, R2, c[0x0][0x148], 0x2;
IMAD.U32.U32.HI.X R9, R2, R3, c[0x0][0x14c];
LD.E R8, [R8];    

I understand that every time the array index changes, the address needs to be recomputed as index*4 + &array[0], which is done by the two instructions ISCADD and IMAD above.
My question is whether we can:

  1. initialize the array elements so that array[i] = &array[j], rather than array[i] = j as in the traditional pointer-chasing benchmark, and
  2. eliminate the two address computation instructions before every memory load instruction in the SASS code.

Are you looking at code from a debug build, or a release build? For any analysis related to performance, you would want to be looking at a release build.

Generally speaking, the CUDA compiler knows how to apply strength reduction to address computations in loops and how to create induction variables for that; it has applied this optimization at least since CUDA 2.0. There may be cases where the compiler does not use the optimization because register pressure is high, but that is unlikely to be the case here.
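
To illustrate the transformation in C terms, here is a conceptual sketch of what the compiler does internally (not actual compiler output):

/* conceptual illustration of strength reduction for address computation */
float sum_before (const float *array, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        sum += array[i];      /* address base + i*4 recomputed each iteration */
    }
    return sum;
}

float sum_after (const float *array, int n)
{
    float sum = 0.0f;
    /* the pointer itself is the induction variable; it is merely incremented */
    for (const float *p = array; p != array + n; p++) {
        sum += *p;
    }
    return sum;
}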

There are also situations where optimizations are inhibited by potential aliasing, so I would suggest using __restrict__ with pointer arguments to functions as appropriate. Again, for a simple code like this I wouldn’t anticipate issues, but it is a good habit to get into.
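
Applied to the kernel shown above, that would look like:

__global__ void myKernel(
    float a,
    float b,
    float * __restrict__ array,
    unsigned int * __restrict__ startTime,
    unsigned int * __restrict__ endTime,
    unsigned int * __restrict__ sm_id
    )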

Lastly, the compiler may find that creating an induction variable is not worth it from a performance perspective. GPUs are 32-bit machines, so 64-bit pointer arithmetic must be emulated. Incrementing a pointer (induction variable) takes two instructions, as does the on-the-fly recomputation shown above.
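
In C terms, the emulated 64-bit increment is an add-with-carry pair, something like this sketch:

#include <stdint.h>

/* conceptual sketch: a 64-bit pointer increment performed with 32-bit
   operations, one add producing a carry and one add consuming it */
static uint64_t add64_via_32 (uint32_t lo, uint32_t hi, uint32_t inc)
{
    uint32_t lo_new = lo + inc;            /* low-word add; may wrap around */
    uint32_t hi_new = hi + (lo_new < lo);  /* propagate carry into high word */
    return ((uint64_t)hi_new << 32) | lo_new;
}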

[Later:] Looking at the code more closely, the array indexing is not simply dependent on the loop index, but based on the array data being read, so I do not see how an induction variable could be derived and strength reduction applied. Under this constraint the code looks optimal to me. On sm_3x, two instructions are necessary to compute an address from the array index read from array[].

If your goal is to eliminate address computation and actually create a pointer-chasing code, simply use an array of 64-bit pointers. That would give you a sequence of LD.E R8, [R8]; LD.E R8, [R8]; ...; LD.E R8, [R8];.

Here is some host code that shows how to implement that approach. It uses an LFSR to fill a pointer array of size 2^N (here: N=8) and then we chase the pointers, covering the entire array “randomly” in the process, except array element 0, as that is not reachable by a full-length LFSR.

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>

int main (void) 
{
    uintptr_t ptr_array [256];
    int mask, state, new_state;

    /* use 8-bit LFSR to initialize array */
    mask = 0x8e;
    state = 1;
    do {
        new_state = (state & 1) ? ((state >> 1) ^ mask) : (state >> 1);
        ptr_array [state] = (uintptr_t)&ptr_array [new_state];
        state = new_state;
    } while (state != 1);

    /* chase the pointers */
    uintptr_t *addr = &ptr_array[1];
    for (int i = 1; i < 256; i++) {
        printf ("[%3d] addr = %p\n", i, addr);
        addr = (uintptr_t *)(*addr);
    }
    return EXIT_SUCCESS;
}

Thanks for the reply.
Then how can we initialize the array if it resides in the global memory space?
Is there a way to obtain the address of the array on the host?

You could initialize the array in the same kernel where you do the pointer chasing, or in a separate kernel that you run beforehand. You could even prepare the array contents on the host and copy them over to the device.

To fill in the pointer array, you only need one piece of information: the starting address of the array. cudaMalloc() provides you with that.

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>

__global__ void chase_pointers (uintptr_t *ptr_array)
{
    uintptr_t *addr = &ptr_array[1];
    for (int i = 1; i < 256; i++) {
        printf ("GPU: [%3d] addr = %p\n", i, addr);
        addr = (uintptr_t *)(*addr);
    }
}

int main (void) 
{
    uintptr_t ptr_array [256];
    uintptr_t *ptr_array_d = 0;
    int mask, state, new_state;

    cudaMalloc ((void **)&ptr_array_d, sizeof ptr_array);
    printf ("ptr_array_d = %p\n", ptr_array_d);

    /* use 8-bit LFSR to initialize array */
    mask = 0x8e;
    state = 1;
    do {
        new_state = (state & 1) ? ((state >> 1) ^ mask) : (state >> 1);
        ptr_array [state] = (uintptr_t)&ptr_array_d [new_state];
        state = new_state;
    } while (state != 1);

    cudaMemcpy (ptr_array_d, ptr_array, sizeof ptr_array, cudaMemcpyHostToDevice);
    chase_pointers<<<1,1>>>(ptr_array_d);
    cudaFree (ptr_array_d);
    cudaDeviceSynchronize();
    return EXIT_SUCCESS;
}
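
If you would rather build the chain on the device, the same LFSR logic can go into a small initialization kernel. A minimal sketch (the kernel name init_chain is my choice):

__global__ void init_chain (uintptr_t *ptr_array)
{
    /* a single thread builds the chain using the same 8-bit LFSR */
    int mask = 0x8e;
    int state = 1;
    do {
        int new_state = (state & 1) ? ((state >> 1) ^ mask) : (state >> 1);
        ptr_array[state] = (uintptr_t)&ptr_array[new_state];
        state = new_state;
    } while (state != 1);
}

Launched as init_chain<<<1,1>>>(ptr_array_d); ahead of chase_pointers, this replaces the host-side fill and the cudaMemcpy().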

I tried your implementation and managed to get the pointer chasing to work. I also tried adding some FADD instructions between two LD.E instructions, using a union. The kernel now looks like this:

__global__ void chase_pointers (uintptr_t *ptr_array, int *output, float b)
{
    b=b-1;
    float a;
    union                   // overlay the pointer with two 32-bit ints
    {
        uintptr_t *addr;
        struct {            // little-endian split: 'low' holds the low 32 bits of the pointer
            int low;
            int high;
        } myInt;
    } myUnion;
    myUnion.addr = &ptr_array[0];
    #pragma unroll
    for (int i = 1; i < 10; i++) {
        myUnion.addr = (uintptr_t *)(*myUnion.addr);
        a=__int_as_float(myUnion.myInt.low);
        repeat10(a=a+b;);
        myUnion.myInt.low = __float_as_int(a);
    }

    output[0] = myUnion.myInt.low;
    output[1] = myUnion.myInt.high;
}

My intention is to reproduce the experiment used in Volkov’s dissertation (https://escholarship.org/content/qt1wb7f3h4/qt1wb7f3h4_noSplash_1e32f64125997ee6afa303a150338054.pdf), which uses a synthesized kernel consisting of groups of α FADD instructions and 1 LD instruction.
In the above kernel, b is 0, so addr does not change. What I expected is that the pointer chasing is not affected by the FADD instructions. However, the generated SASS code (nvcc 10.2 on Ubuntu 18.04) has an extra MOV instruction before each LD instruction, as follows:

        /*00a0*/                   LD.E.64 R4, [R2];       /* 0xc5800000001c0810 */
        /*00a8*/                   FADD R4, R0, R4;        /* 0xe2c00000021c0012 */
        /*00b0*/                   MOV R7, R5;             /* 0xe4c03c00029c001e */
        /*00b8*/                   FADD R4, R0, R4;        /* 0xe2c00000021c0012 */
                                                           /* 0x08a0a0a0a0a0a0a0 */
        /*00c8*/                   FADD R4, R0, R4;        /* 0xe2c00000021c0012 */
        /*00d0*/                   FADD R4, R0, R4;        /* 0xe2c00000021c0012 */
        /*00d8*/                   FADD R4, R0, R4;        /* 0xe2c00000021c0012 */
        /*00e0*/                   FADD R4, R0, R4;        /* 0xe2c00000021c0012 */
        /*00e8*/                   FADD R4, R0, R4;        /* 0xe2c00000021c0012 */
        /*00f0*/                   FADD R4, R0, R4;        /* 0xe2c00000021c0012 */
        /*00f8*/                   FADD R4, R0, R4;        /* 0xe2c00000021c0012 */
                                                           /* 0x08a0a0a0a010fca0 */
        /*0108*/                   FADD R6, R0, R4;        /* 0xe2c00000021c001a */
        /*0110*/                   LD.E.64 R4, [R6];       /* 0xc5800000001c1810 */

Nine of the FADD instructions have the form FADD R4, R0, R4, but the last one becomes FADD R6, R0, R4. Since the register pair R6:R7 is then used as the 64-bit address of the LD.E.64, R7 must be set correctly as well, which is why there is a MOV R7, R5 instruction in the middle.
My question is: why does nvcc change the last FADD instruction to use R6 as the destination register instead of R4? If R4 were used, there would be no need for a MOV instruction, since R5 already contains the high 32 bits of the address. And is there a way to get rid of the MOV instruction?

Register allocation is something the backend of the compiler does on its own, without any programmer control. That is typical of compilers; e.g., you can observe the same effect of extraneous MOVs when compiling with gcc for x86.

The likely trigger for the register “switching” is the use of the union with the struct. I have no idea what you are trying to accomplish by modifying portions of the 64-bit pointer via floating-point additions, which is what necessitates the struct. You might want to try full (instead of partial) updates of the pointer via mov.b64 with some inline PTX.
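
A sketch of that suggestion, replacing the union with explicit pack/unpack helpers (the names pack_ptr and unpack_ptr are mine; whether this actually removes the MOV is something to verify in the generated SASS):

__device__ __forceinline__ void unpack_ptr (uintptr_t p, unsigned int &lo, unsigned int &hi)
{
    /* split the 64-bit pointer into two 32-bit halves */
    asm ("mov.b64 {%0, %1}, %2;" : "=r"(lo), "=r"(hi) : "l"(p));
}

__device__ __forceinline__ uintptr_t pack_ptr (unsigned int lo, unsigned int hi)
{
    /* rebuild the complete 64-bit pointer in a single mov.b64 */
    uintptr_t p;
    asm ("mov.b64 %0, {%1, %2};" : "=l"(p) : "r"(lo), "r"(hi));
    return p;
}

__global__ void chase_pointers (uintptr_t *ptr_array, int *output, float b)
{
    b = b - 1;
    unsigned int lo, hi;
    uintptr_t addr = (uintptr_t)&ptr_array[0];
    #pragma unroll
    for (int i = 1; i < 10; i++) {
        addr = *(uintptr_t *)addr;          /* chase one link */
        unpack_ptr (addr, lo, hi);
        float a = __uint_as_float (lo);
        repeat10(a = a + b;);               /* FADD chain on the low word */
        addr = pack_ptr (__float_as_uint (a), hi);  /* full 64-bit update */
    }
    output[0] = (int)(addr & 0xffffffffu);
    output[1] = (int)(addr >> 32);
}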