Hi, lately I have been diving into CUDA C++ programming. I want to develop an intuition for when I will gain a benefit from moving my code to the GPU and when it is a waste of time. My conclusions are very short, but I wanted to highlight what I see as the essence of GPU programming:
- Using only a large number of threads won't speed up my code. Example: simple array addition:
__global__
void add(float* A, float* B, float* C, int N){
    int idx = threadIdx.x + blockIdx.x*blockDim.x;   // global element index
    if(idx < N)
        C[idx] = A[idx] + B[idx];
}
A typical run takes about the same amount of time as on the CPU: the kernel does a single addition per element, so it is memory-bound, and the host-to-device and device-to-host copies dominate the total time.
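For context, this is roughly how I launch it on the host (my own reconstruction of a typical setup, not the exact benchmark code; hA, hB, hC stand for the host arrays):

// Host-side sketch (my reconstruction, not the original benchmark code).
// hA, hB, hC are plain host arrays of length N.
void add_on_gpu(const float* hA, const float* hB, float* hC, int N){
    size_t bytes = N * sizeof(float);
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);

    // Moving 3*N floats across PCIe for one addition per element:
    // these copies usually cost more than the whole CPU loop.
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks  = (N + threads - 1) / threads;
    add<<<blocks, threads>>>(dA, dB, dC, N);         // the kernel itself is memory-bound

    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}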
- Using few threads plus shared memory also isn't beneficial. Example: neural network calculations:
__global__
void GPU_calc_nn(float* output, float* inputs, float* wages, float* bias, int col, int row){  // wages = network weights
    extern __shared__ float tab[];
    // Copy the network input into shared memory
    tab[threadIdx.x] = inputs[threadIdx.x];
    // Write the biases to shared memory (they become the starting value of each accumulation)
    for(int i = 1; i < row + 1; ++i)
        tab[col*i + threadIdx.x] = bias[threadIdx.x*row + i];
    __syncthreads();
    // Propagate through the layers: each thread accumulates one neuron per column
    for(int i = 0; i < row; ++i){
        for(int j = 0; j < col; ++j)
            tab[col*(i+1) + threadIdx.x] += wages[threadIdx.x*row*col + i*col + j] * tab[col*i + j];
        __syncthreads();
    }
    // Write the output values back to global memory
    output[threadIdx.x] = tab[col*row + threadIdx.x];
}
This algorithm takes about twice as long on the GPU as on the CPU, because the limited number of threads (one per column of neurons) leaves most of the device idle.
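For reference, I launch it with a single block and one thread per column, which is exactly what starves the GPU (this is a reconstruction of the host side; the d_* pointers are the device copies of the arguments and the shared-memory size is my reading of the indexing above):

// Launch sketch (my reconstruction, not the original host code).
// One block, `col` threads: most of the GPU sits idle no matter how fast the kernel is.
size_t shmem = (size_t)col * (row + 1) * sizeof(float);   // (row+1) layers of `col` floats
GPU_calc_nn<<<1, col, shmem>>>(d_output, d_inputs, d_wages, d_bias, col, row);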
- Using shared memory together with many threads can give real acceleration, but only if the problem is large (once the launch used more than about 1000 threads I measured roughly a 40x speed-up). Example: adding the neighbouring values to every pixel (interpret the neighbourhood as a 3x3 square); a launch sketch follows after the CPU version below:
__global__
void distances(float* a, float* result, int BLOCK, int delay, int window, int W, int H) {
    extern __shared__ float tab[];
    const int width = BLOCK + 2*delay;              // width of one shared-memory row (tile + halo)

    // Preload the first window-1 image rows into shared memory
    for(int i = threadIdx.x; i < width; i += blockDim.x)
        for(int j = window-1; j > 0; j--)
            if(i + blockDim.x*blockIdx.x < W)
                tab[i + j*width] = a[i + blockDim.x*blockIdx.x + (j-1)*W];
    __syncthreads();

    float dist = 0.0f;
    for(int i = delay; i < H - delay; ++i){
        // Shift the row window up by one...
        for(int k = threadIdx.x; k < width; k += blockDim.x)
            for(int j = 0; j < window-1; j++)
                if(k + blockDim.x*blockIdx.x < W)
                    tab[k + j*width] = tab[k + (j+1)*width];
        __syncthreads();
        // ...and load the next image row into the last slot
        for(int k = threadIdx.x; k < width; k += blockDim.x)
            if(k + blockDim.x*blockIdx.x < W)
                tab[k + (window-1)*width] = a[k + blockDim.x*blockIdx.x + (i+delay)*W];
        __syncthreads();

        // Sum the neighbourhood of this thread's pixel and write the result
        if(delay + threadIdx.x + blockIdx.x*blockDim.x < W - delay){
            dist = 0.0f;
            for(int x = -delay; x <= delay; ++x)
                for(int y = 0; y < window; ++y)
                    dist += tab[delay + threadIdx.x + x + y*width];
            result[delay + threadIdx.x + blockIdx.x*blockDim.x + i*W] = dist;
        }
        __syncthreads();
    }
}
The CPU equivalent would look like this:
for(int i = delay; i < W - delay; ++i)
    for(int j = delay; j < H - delay; ++j){
        double dist = 0;
        for(int x = -delay; x <= delay; ++x)
            for(int y = -delay; y <= delay; ++y)
                dist += a[(i+x) + (j+y)*W];
        result[i + j*W] = dist;
    }
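And the launch I use for the GPU version, as mentioned above (again a reconstruction; the concrete BLOCK, delay and window values are just my example numbers):

// Launch sketch (my reconstruction; BLOCK, delay and window values are example assumptions).
// For W and H in the thousands the grid easily exceeds 1000 threads in flight,
// which is the regime where the ~40x speed-up appeared.
int BLOCK  = 256;                 // output columns handled per block, one thread each
int delay  = 1;                   // 3x3 neighbourhood -> delay = 1
int window = 2*delay + 1;         // image rows kept in shared memory at a time
dim3 grid((W + BLOCK - 1) / BLOCK);
size_t shmem = (size_t)window * (BLOCK + 2*delay) * sizeof(float);
distances<<<grid, BLOCK, shmem>>>(d_a, d_result, BLOCK, delay, window, W, H);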
- Use streams when loading a large amount of data to the GPU only if your GPU has the right architecture (a dedicated copy engine, so transfers can overlap with kernel execution). Using them for less than about 100 MB is pointless (no noticeable acceleration).
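A minimal sketch of what I mean, assuming pinned host buffers (hA, hB, hC) and the add kernel from the first example; nothing here is benchmarked:

// Stream sketch (my own illustration, not measured): split the arrays into chunks,
// so the copy of chunk s+1 can overlap with the kernel working on chunk s.
// Requires pinned host memory (cudaMallocHost) and a GPU with an async copy engine,
// otherwise everything serializes and streams buy nothing.
const int nStreams = 4;
cudaStream_t streams[nStreams];
for(int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

int chunk = N / nStreams;                           // assume N divisible by nStreams
for(int s = 0; s < nStreams; ++s){
    int off = s * chunk;
    cudaMemcpyAsync(dA + off, hA + off, chunk*sizeof(float), cudaMemcpyHostToDevice, streams[s]);
    cudaMemcpyAsync(dB + off, hB + off, chunk*sizeof(float), cudaMemcpyHostToDevice, streams[s]);
    add<<<(chunk + 255)/256, 256, 0, streams[s]>>>(dA + off, dB + off, dC + off, chunk);
    cudaMemcpyAsync(hC + off, dC + off, chunk*sizeof(float), cudaMemcpyDeviceToHost, streams[s]);
}
for(int s = 0; s < nStreams; ++s) cudaStreamSynchronize(streams[s]);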
- Bother to move your calculations to the GPU only if each thread's work is simple and there is a lot of it (a single GPU thread is far slower than a CPU thread).
- Branch divergence only hurts when there is code in the else branch and the if condition does not hold for all threads of a warp; with an empty else, the threads that fail the condition are simply masked off while the rest do the work.
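A toy illustration of the difference (my own example, not from any benchmark):

// Both branches contain work: a warp whose threads disagree on the condition
// executes the if-body and the else-body one after the other (real divergence cost).
__global__ void diverging(float* v, int N){
    int i = threadIdx.x + blockIdx.x*blockDim.x;
    if(i >= N) return;
    if(v[i] > 0.0f)
        v[i] = sqrtf(v[i]);       // part of the warp runs this...
    else
        v[i] = -sqrtf(-v[i]);     // ...then the rest runs this
}

// Empty else: threads that fail the condition are just masked off while the
// others do the work, which costs next to nothing here.
__global__ void harmless(float* v, int N){
    int i = threadIdx.x + blockIdx.x*blockDim.x;
    if(i < N && v[i] > 0.0f)
        v[i] = sqrtf(v[i]);
}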
I'm open to discussion. If you disagree, please share your thoughts.