How is elementwise operator fusion done by the compiler?

Here’s how you do kernel fusion. You discover in your code you have this:

__global__ void k1(int *d, int *r, int N){
    int idx = threadIdx.x+blockDim.x*blockIdx.x;
    if (idx < N) r[idx] = d[idx]*2;
}

__global__ void k2(int *d, int *r, int N){
    int idx = threadIdx.x+blockDim.x*blockIdx.x;
    if (idx < N) r[idx] = d[idx]+2;
}

int main(){

  ...
  k1<<<grid, block>>>(d1, r1, N);
  k2<<<grid, block>>>(r1, r1, N);
  ...
}

And so what you do as a programmer is rewrite it like this:

__global__ void k12(int *d, int *r, int N){
    int idx = threadIdx.x+blockDim.x*blockIdx.x;
    if (idx < N) r[idx] = d[idx]*2+2;
}

int main(){

  ...
  k12<<<grid, block>>>(d1, r1, N);
  ...
}

And you save one store operation and one load operation (per element), because the intermediate result never makes a round trip through global memory. You also save one kernel launch. You do that by refactoring the code. The compiler doesn’t do it “for you”, doesn’t “help”, and there is currently no way to coax this kind of transformation out of the nvcc compiler toolchain. You do it. And it doesn’t require the programmer to know about or control the register-level behavior of the C++ compiler.
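If you want to convince yourself that the fused kernel really is a drop-in replacement, here is a rough sketch that runs both versions on the same input and compares the results on the host. The helper name `count_mismatches`, the second output buffer `r2`, the block size of 256, and the test data are my own choices for illustration and are not part of the code above.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Unfused version: two kernels, each making a full pass over global memory.
__global__ void k1(int *d, int *r, int N){
    int idx = threadIdx.x+blockDim.x*blockIdx.x;
    if (idx < N) r[idx] = d[idx]*2;
}

__global__ void k2(int *d, int *r, int N){
    int idx = threadIdx.x+blockDim.x*blockIdx.x;
    if (idx < N) r[idx] = d[idx]+2;
}

// Fused version: one load and one store per element, intermediate stays in a register.
__global__ void k12(int *d, int *r, int N){
    int idx = threadIdx.x+blockDim.x*blockIdx.x;
    if (idx < N) r[idx] = d[idx]*2+2;
}

// Hypothetical helper (not from the original post): runs both versions on the
// same input and returns the number of differing elements, or -1 if a CUDA
// call fails. Assumes N > 0.
int count_mismatches(int N){
    size_t bytes = size_t(N) * sizeof(int);
    std::vector<int> h_in(N), h_unfused(N), h_fused(N);
    for (int i = 0; i < N; i++) h_in[i] = i % 1000;  // small values, no overflow

    int *d1 = nullptr, *r1 = nullptr, *r2 = nullptr;
    bool ok = cudaMalloc(&d1, bytes) == cudaSuccess
           && cudaMalloc(&r1, bytes) == cudaSuccess
           && cudaMalloc(&r2, bytes) == cudaSuccess
           && cudaMemcpy(d1, h_in.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        int block = 256;                      // assumed block size
        int grid = (N + block - 1) / block;   // enough blocks to cover N
        k1<<<grid, block>>>(d1, r1, N);       // unfused: r1 = d1*2
        k2<<<grid, block>>>(r1, r1, N);       // unfused: r1 = r1+2 (in place)
        k12<<<grid, block>>>(d1, r2, N);      // fused:   r2 = d1*2+2
        // cudaMemcpy on the default stream waits for the kernels to finish.
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(h_unfused.data(), r1, bytes, cudaMemcpyDeviceToHost) == cudaSuccess
          && cudaMemcpy(h_fused.data(), r2, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d1); cudaFree(r1); cudaFree(r2);  // cudaFree(nullptr) is a no-op
    if (!ok) return -1;

    int mismatches = 0;
    for (int i = 0; i < N; i++)
        if (h_unfused[i] != h_fused[i]) mismatches++;
    return mismatches;
}
```

Call `count_mismatches(1 << 20)` from your `main` and compile with nvcc. On a working GPU it should report zero mismatches. If you then profile the two paths, the fused one should show a single kernel launch and roughly half the global memory traffic of the two-kernel sequence.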