Hi, lately I have been diving into CUDA C++ programming. I want to develop an intuition for when I will gain a benefit from moving my code to the GPU and when it is a waste of time. My conclusions are very short, but I wanted to highlight what I see as the essence of GPU programming:
- Using only a large number of threads won't speed up my code. Example: simple array addition:
__global__
void add(float* A, float* B, float* C, int N){
    int idx = threadIdx.x + blockIdx.x*blockDim.x;   // global element index
    if(idx < N)
        C[idx] = A[idx] + B[idx];
}
A typical run takes about the same amount of time as on the CPU: the kernel does a single addition per element, so it is memory-bound, and the host-to-device and device-to-host copies dominate the total time.
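For context, this is roughly how I launch it on the host (my own reconstruction of a typical setup, not the exact benchmark code; hA, hB, hC stand for the host arrays):

// Host-side sketch (my reconstruction, not the original benchmark code).
// hA, hB, hC are plain host arrays of length N.
void add_on_gpu(const float* hA, const float* hB, float* hC, int N){
    size_t bytes = N * sizeof(float);
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);

    // Moving 3*N floats across PCIe for one addition per element:
    // these copies usually cost more than the whole CPU loop.
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks  = (N + threads - 1) / threads;
    add<<<blocks, threads>>>(dA, dB, dC, N);         // the kernel itself is memory-bound

    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}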
- Using few threads plus shared memory also isn't beneficial. Example: neural network calculations:
__global__
void GPU_calc_nn(float* output, float* inputs, float* wages, float* bias, int col, int row){  // wages = network weights
    extern __shared__ float tab[];
    // Copy the network input into shared memory
    tab[threadIdx.x] = inputs[threadIdx.x];
    // Write the biases to shared memory (they become the starting value of each accumulation)
    for(int i = 1; i < row + 1; ++i)
        tab[col*i + threadIdx.x] = bias[threadIdx.x*row + i];
    __syncthreads();
    // Propagate through the layers: each thread accumulates one neuron per column
    for(int i = 0; i < row; ++i){
        for(int j = 0; j < col; ++j)
            tab[col*(i+1) + threadIdx.x] += wages[threadIdx.x*row*col + i*col + j] * tab[col*i + j];
        __syncthreads();
    }
    // Write the output values back to global memory
    output[threadIdx.x] = tab[col*row + threadIdx.x];
}
This algorithm takes about twice as long on the GPU as on the CPU, because the limited number of threads (one per column of neurons) leaves most of the device idle.
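For reference, I launch it with a single block and one thread per column, which is exactly what starves the GPU (this is a reconstruction of the host side; the d_* pointers are the device copies of the arguments and the shared-memory size is my reading of the indexing above):

// Launch sketch (my reconstruction, not the original host code).
// One block, `col` threads: most of the GPU sits idle no matter how fast the kernel is.
size_t shmem = (size_t)col * (row + 1) * sizeof(float);   // (row+1) layers of `col` floats
GPU_calc_nn<<<1, col, shmem>>>(d_output, d_inputs, d_wages, d_bias, col, row);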
- Using shared memory together with many threads can give real acceleration, but only if the problem is large (once the launch used more than about 1000 threads I measured roughly a 40x speed-up). Example: adding the neighbouring values to every pixel (interpret the neighbourhood as a 3x3 square); a launch sketch follows after the CPU version below:
__global__
void distances(float* a, float* result, int BLOCK, int delay, int window, int W, int H) {
    extern __shared__ float tab[];
    const int width = BLOCK + 2*delay;              // width of one shared-memory row (tile + halo)

    // Preload the first window-1 image rows into shared memory
    for(int i = threadIdx.x; i < width; i += blockDim.x)
        for(int j = window-1; j > 0; j--)
            if(i + blockDim.x*blockIdx.x < W)
                tab[i + j*width] = a[i + blockDim.x*blockIdx.x + (j-1)*W];
    __syncthreads();

    float dist = 0.0f;
    for(int i = delay; i < H - delay; ++i){
        // Shift the row window up by one...
        for(int k = threadIdx.x; k < width; k += blockDim.x)
            for(int j = 0; j < window-1; j++)
                if(k + blockDim.x*blockIdx.x < W)
                    tab[k + j*width] = tab[k + (j+1)*width];
        __syncthreads();
        // ...and load the next image row into the last slot
        for(int k = threadIdx.x; k < width; k += blockDim.x)
            if(k + blockDim.x*blockIdx.x < W)
                tab[k + (window-1)*width] = a[k + blockDim.x*blockIdx.x + (i+delay)*W];
        __syncthreads();

        // Sum the neighbourhood of this thread's pixel and write the result
        if(delay + threadIdx.x + blockIdx.x*blockDim.x < W - delay){
            dist = 0.0f;
            for(int x = -delay; x <= delay; ++x)
                for(int y = 0; y < window; ++y)
                    dist += tab[delay + threadIdx.x + x + y*width];
            result[delay + threadIdx.x + blockIdx.x*blockDim.x + i*W] = dist;
        }
        __syncthreads();
    }
}
The CPU equivalent would look like this:
for(int i = delay; i < W - delay; ++i)
    for(int j = delay; j < H - delay; ++j){
        double dist = 0;
        for(int x = -delay; x <= delay; ++x)
            for(int y = -delay; y <= delay; ++y)
                dist += a[(i+x) + (j+y)*W];
        result[i + j*W] = dist;
    }
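And the launch I use for the GPU version, as mentioned above (again a reconstruction; the concrete BLOCK, delay and window values are just my example numbers):

// Launch sketch (my reconstruction; BLOCK, delay and window values are example assumptions).
// For W and H in the thousands the grid easily exceeds 1000 threads in flight,
// which is the regime where the ~40x speed-up appeared.
int BLOCK  = 256;                 // output columns handled per block, one thread each
int delay  = 1;                   // 3x3 neighbourhood -> delay = 1
int window = 2*delay + 1;         // image rows kept in shared memory at a time
dim3 grid((W + BLOCK - 1) / BLOCK);
size_t shmem = (size_t)window * (BLOCK + 2*delay) * sizeof(float);
distances<<<grid, BLOCK, shmem>>>(d_a, d_result, BLOCK, delay, window, W, H);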
- Use streams when loading a large amount of data to the GPU only if your GPU has the right architecture (a dedicated copy engine, so transfers can overlap with kernel execution). Using them for less than about 100 MB is pointless (no noticeable acceleration).
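A minimal sketch of what I mean, assuming pinned host buffers (hA, hB, hC) and the add kernel from the first example; nothing here is benchmarked:

// Stream sketch (my own illustration, not measured): split the arrays into chunks,
// so the copy of chunk s+1 can overlap with the kernel working on chunk s.
// Requires pinned host memory (cudaMallocHost) and a GPU with an async copy engine,
// otherwise everything serializes and streams buy nothing.
const int nStreams = 4;
cudaStream_t streams[nStreams];
for(int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

int chunk = N / nStreams;                           // assume N divisible by nStreams
for(int s = 0; s < nStreams; ++s){
    int off = s * chunk;
    cudaMemcpyAsync(dA + off, hA + off, chunk*sizeof(float), cudaMemcpyHostToDevice, streams[s]);
    cudaMemcpyAsync(dB + off, hB + off, chunk*sizeof(float), cudaMemcpyHostToDevice, streams[s]);
    add<<<(chunk + 255)/256, 256, 0, streams[s]>>>(dA + off, dB + off, dC + off, chunk);
    cudaMemcpyAsync(hC + off, dC + off, chunk*sizeof(float), cudaMemcpyDeviceToHost, streams[s]);
}
for(int s = 0; s < nStreams; ++s) cudaStreamSynchronize(streams[s]);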
- Bother to move your calculations to the GPU only if each thread's work is simple and there is a lot of it (a single GPU thread is far slower than a CPU thread).
- Branch divergence only hurts when there is code in the else branch and the if condition does not hold for all threads of a warp; with an empty else, the threads that fail the condition are simply masked off while the rest do the work.
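A toy illustration of the difference (my own example, not from any benchmark):

// Both branches contain work: a warp whose threads disagree on the condition
// executes the if-body and the else-body one after the other (real divergence cost).
__global__ void diverging(float* v, int N){
    int i = threadIdx.x + blockIdx.x*blockDim.x;
    if(i >= N) return;
    if(v[i] > 0.0f)
        v[i] = sqrtf(v[i]);       // part of the warp runs this...
    else
        v[i] = -sqrtf(-v[i]);     // ...then the rest runs this
}

// Empty else: threads that fail the condition are just masked off while the
// others do the work, which costs next to nothing here.
__global__ void harmless(float* v, int N){
    int i = threadIdx.x + blockIdx.x*blockDim.x;
    if(i < N && v[i] > 0.0f)
        v[i] = sqrtf(v[i]);
}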
I'm open to discussion. If you disagree, please share your thoughts.