How to further speed up CUDA code for a Convolutional Neural Network?

I have a problem with the speed of my CUDA code.
I'm programming a Convolutional Neural Network for image detection / classification in C++ (Visual Studio).

The classification and detection work, but I have speed problems, and only in the dense layer.

I wrote the same algorithm in CPU code to check the speedup. On the CPU it takes 45 seconds, on the GPU 6 seconds, for an image of 750 x 500 pixels. So it's an improvement, but I know my code is far from efficient.

First off, the minimized kernel:

__global__ void denseKernel(
        float *denseMap,        // output image
        float *inputImg_0,      // input image
        int   *pixel_location,  // pixels to process
        float *w,               // weights (multiplied and accumulated)
        int   *featuresize,     // filter size (featuresize x featuresize)
        int   *sizex,           // image width (x-direction)
        int   *offset,          // offset for the correct vector multiplication
        bool  *flag,            // flag for bias and threshold
        float *bias,            // bias for the addition
        int   *amount_pixel     // number of pixels to process
        )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int z = 0;

    if (i < *amount_pixel)
    {
        for (int x = 0; x < *featuresize; x++)
            for (int y = 0; y < *featuresize; y++)
                denseMap[pixel_location[i]] += w[z++]
                    * inputImg_0[pixel_location[i] + *offset + x * *sizex + y];

        if (*flag) // bias and threshold only in the last channel
        {
            // addition of the bias
            denseMap[pixel_location[i]] += *bias;

            // tangens hyperbolicus as threshold
            denseMap[pixel_location[i]] = tanh(denseMap[pixel_location[i]]);
        }
    }
}

This kernel is called 1024 times in this case. In the host code I allocate all necessary buffers in global memory via cudaMalloc() and copy the data with cudaMemcpy().

These allocations take ~50 ms and are done only once, so they are not the bottleneck.
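For reference, the host-side setup follows the usual cudaMalloc()/cudaMemcpy() pattern, roughly like this (the host_* names and buffer sizes are only placeholders, error checking omitted):

int numPixels = 750 * 500;

float *dev_inputImg_0     = nullptr;
float *dev_denseMap       = nullptr;
int   *dev_pixel_location = nullptr;

// allocate global memory on the device
cudaMalloc(&dev_inputImg_0,     numPixels * sizeof(float));
cudaMalloc(&dev_denseMap,       numPixels * sizeof(float));
cudaMalloc(&dev_pixel_location, numPixels * sizeof(int));

// copy the input data once; the dense map starts at zero
cudaMemcpy(dev_inputImg_0,     host_inputImg_0,     numPixels * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(dev_pixel_location, host_pixel_location, numPixels * sizeof(int),   cudaMemcpyHostToDevice);
cudaMemset(dev_denseMap, 0, numPixels * sizeof(float));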

The kernel is called like:

int sizeN    = amountPixel; // 750 x 500 pixels
int warpsize = 256;

dim3 blocksPerGrid( ( ( sizeN / warpsize ) + 200 ), 1, 1 );
dim3 threadsPerBlock( warpsize, 1, 1 );

// Launch a kernel on the GPU with one thread for each element.
denseKernel <<< blocksPerGrid, threadsPerBlock >>>
(
    dev_denseMap,
    dev_inputImg_0,
    dev_pixel_location,
    dev_w,
    dev_featuresize,
    dev_sizex,
    dev_offset,
    dev_flag,
    dev_bias,
    dev_amountPixel
);

The actual parallelization is that each GPU thread computes one of the required pixels.

I trained the weights with Theano / Lasagne (CUDA / cuDNN backend) and then used these weights to optimize the CNN for my specific needs. In Theano a pixelwise classification took ~40 s.

I know that the bottleneck in my denseKernel() code is the two for() loops, but I do not know how to parallelize these loops on top of the existing one-thread-per-pixel (SIMT) parallelization in the same kernel.

I posted this question first on Stack Overflow, but was told it's not the right forum for my question.
https://stackoverflow.com/questions/46910155/how-to-further-speedup-a-cuda-code-for-a-convolution-neural-network/46917294#46917294

I hope you understand my question and thank you in advance!

Why didn't you use cuDNN directly? It's an optimized deep-learning library from NVIDIA and would be an efficient way to construct your CNN model.

i is constant per thread, and so is pixel_location[i].

I suggest keeping the value of denseMap[pixel_location[i]] in a register during the summation and only writing it out to global memory after the summation.
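Roughly along these lines (an untested sketch of the kernel body with a local accumulator, keeping the semantics of the original):

int i = blockIdx.x * blockDim.x + threadIdx.x;
int z = 0;

if (i < *amount_pixel)
{
    const int px  = pixel_location[i];   // constant per thread
    float     acc = denseMap[px];        // carry over the sum from previous channels

    for (int x = 0; x < *featuresize; x++)
        for (int y = 0; y < *featuresize; y++)
            acc += w[z++] * inputImg_0[px + *offset + x * *sizex + y];

    if (*flag)                           // bias and threshold in the last channel
        acc = tanh(acc + *bias);

    denseMap[px] = acc;                  // single write to global memory
}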

Use const __restrict__ qualifiers on input arrays that won't be modified, such as inputImg_0 and pixel_location. This helps the GPU load the data through its read-only caches.
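For example, a declaration along these lines (everything except denseMap is read-only in the kernel):

__global__ void denseKernel(float       * __restrict__ denseMap,
                            const float * __restrict__ inputImg_0,
                            const int   * __restrict__ pixel_location,
                            const float * __restrict__ w,
                            const int   * __restrict__ featuresize,
                            const int   * __restrict__ sizex,
                            const int   * __restrict__ offset,
                            const bool  * __restrict__ flag,
                            const float * __restrict__ bias,
                            const int   * __restrict__ amount_pixel);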

How big is *featuresize, and can it be assumed to be constant? If so, consider unrolling the loops to the known feature size with a #pragma unroll statement.
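For illustration only, assuming the feature size were fixed at 3 and reusing the acc/px/z variables from the sketch above:

#define FEATURESIZE 3   // assumed compile-time constant

#pragma unroll
for (int x = 0; x < FEATURESIZE; x++)
{
    #pragma unroll
    for (int y = 0; y < FEATURESIZE; y++)
    {
        acc += w[z++] * inputImg_0[px + *offset + x * *sizex + y];
    }
}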

Ask Scott Gray for help. Whether that is a feasible approach I don't know, but it certainly would be a very effective one.
Short of that, check out what he has written on the subject.