How is elementwise operator fusion done by the compiler?

Robert_Crovella · February 9, 2023, 1:21am

Here’s how you do kernel fusion. You discover in your code you have this:

__global__ void k1(int *d, int *r, int N){
    int idx = threadIdx.x+blockDim.x*blockIdx.x;
    if (idx < N) r[idx] = d[idx]*2;
}

__global__ void k2(int *d, int *r, int N){
    int idx = threadIdx.x+blockDim.x*blockIdx.x;
    if (idx < N) r[idx] = d[idx]+2;
}

int main(){

  ...
  k1<<<grid, block>>>(d1, r1, N);
  k2<<<grid, block>>>(r1, r1, N);
  ...
}

And so what you do as a programmer is rewrite it like this:

__global__ void k12(int *d, int *r, int N){
    int idx = threadIdx.x+blockDim.x*blockIdx.x;
    if (idx < N) r[idx] = d[idx]*2+2;
}

int main(){

  ...
  k12<<<grid, block>>>(d1, r1, N);
  ...
}
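
If you want to try this yourself, a minimal self-contained sketch that runs both versions and checks that they agree might look like the following. The kernels are the ones above; everything else is an assumption made for illustration: the problem size, the launch configuration, the input values, the `check` helper, and the buffer names `r_two` and `r_fused`.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Unfused pair from the post: r = d*2, then (in a second launch) r = r+2.
__global__ void k1(int *d, int *r, int N){
    int idx = threadIdx.x+blockDim.x*blockIdx.x;
    if (idx < N) r[idx] = d[idx]*2;
}

__global__ void k2(int *d, int *r, int N){
    int idx = threadIdx.x+blockDim.x*blockIdx.x;
    if (idx < N) r[idx] = d[idx]+2;
}

// Hand-fused version: the intermediate value stays in a register.
__global__ void k12(int *d, int *r, int N){
    int idx = threadIdx.x+blockDim.x*blockIdx.x;
    if (idx < N) r[idx] = d[idx]*2+2;
}

// Hypothetical helper (not from the post): abort on any CUDA runtime error.
static void check(cudaError_t e, const char *what){
    if (e != cudaSuccess){
        fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(e));
        exit(1);
    }
}

int main(){
    const int N = 1 << 20;                    // assumed problem size
    const int block = 256;                    // assumed block size
    const int grid = (N + block - 1) / block; // enough blocks to cover N
    const size_t bytes = N * sizeof(int);

    std::vector<int> h(N), out_two(N), out_fused(N);
    for (int i = 0; i < N; i++) h[i] = i % 1000;  // small values, no overflow

    int *d1, *r_two, *r_fused;
    check(cudaMalloc(&d1, bytes), "cudaMalloc d1");
    check(cudaMalloc(&r_two, bytes), "cudaMalloc r_two");
    check(cudaMalloc(&r_fused, bytes), "cudaMalloc r_fused");
    check(cudaMemcpy(d1, h.data(), bytes, cudaMemcpyHostToDevice), "copy in");

    // Two launches: the intermediate result goes out to global memory and back.
    k1<<<grid, block>>>(d1, r_two, N);
    k2<<<grid, block>>>(r_two, r_two, N);
    // One launch: same arithmetic, one load and one store per element.
    k12<<<grid, block>>>(d1, r_fused, N);
    check(cudaGetLastError(), "kernel launch");
    check(cudaDeviceSynchronize(), "kernel execution");

    check(cudaMemcpy(out_two.data(), r_two, bytes, cudaMemcpyDeviceToHost), "copy out");
    check(cudaMemcpy(out_fused.data(), r_fused, bytes, cudaMemcpyDeviceToHost), "copy out");

    // Compare both results against each other and against a host reference.
    int mismatches = 0;
    for (int i = 0; i < N; i++)
        if (out_two[i] != out_fused[i] || out_fused[i] != h[i]*2+2) mismatches++;
    printf("mismatches: %d\n", mismatches);

    cudaFree(d1); cudaFree(r_two); cudaFree(r_fused);
    return mismatches != 0;
}
```

Something like `nvcc -o fuse fuse.cu` should build it (the file name is arbitrary). The program only checks that the two approaches produce identical results; to see the performance difference you would time the launches with a profiler such as Nsight Systems.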

And you save one store operation and one load operation per element (the intermediate result never makes the round trip through global memory), plus one kernel launch. You do that by refactoring the code. The compiler doesn’t do it “for you”, doesn’t “help”, and there is currently no way to coax this kind of transformation out of the nvcc compiler toolchain. You do it. And it doesn’t require the programmer to know or control the register-level behavior of the C++ compiler.
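
To put rough numbers on the saving, here is the per-element global-memory accounting. This is a back-of-the-envelope sketch that assumes a 4-byte `int` and ignores any cache hits between the two launches:

```latex
\begin{aligned}
\text{unfused } (k1 \text{ then } k2):\quad
  & \underbrace{1\ \text{load} + 1\ \text{store}}_{k1}
  + \underbrace{1\ \text{load} + 1\ \text{store}}_{k2}
  = 4 \times 4\,\text{B} = 16\,\text{B per element} \\
\text{fused } (k12):\quad
  & 1\ \text{load} + 1\ \text{store}
  = 2 \times 4\,\text{B} = 8\,\text{B per element}
\end{aligned}
```

So the fused version moves about half the bytes. For a memory-bound elementwise operation like this one, that is where most of the benefit would be expected to come from, although the actual speedup depends on N, the GPU, and launch overhead.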
