Neural Network Code Optimization

Hi,
I need to optimize the running time of a simple neural network (an MLP with 13 layers).
The code does multiplication, addition, and ReLU operations: I multiply weights by the input, add the bias, and then apply a ReLU function.

    __global__ void kernel_applyweightandbias(float *in, float *weights, float *bias, float *out, const int input_size, const int output_size){
        int tid = threadIdx.x + blockIdx.x * blockDim.x;
        float sum = 0;

        if (tid < output_size) {
            sum = bias[tid];
            for (int i = 0; i < input_size; i++) {
                sum += in[i] * weights[tid*input_size+i];
            }
            out[tid] = sum;
            if (out[tid] <= 0) {
                out[tid] = 0;
            }
        }
    }
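For context, each layer is launched with one thread per output neuron, along these lines (the names and block size here are illustrative):

    // illustrative launch configuration (block size chosen arbitrarily)
    int block = 256;
    int grid = (output_size + block - 1) / block;
    kernel_applyweightandbias<<<grid, block>>>(d_in, d_weights, d_bias, d_out, input_size, output_size);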

I tried using half-precision floats, but performance did not improve.

    __global__ void kernel_applyweightandbias(half *in, half *weights, half *bias, half *out, const int input_size, const int output_size){
        int tid = threadIdx.x + blockIdx.x * blockDim.x;
        half sum;

        if (tid < output_size) {
            sum = bias[tid];
            for (int i = 0; i < input_size; i++) {
                sum = __hfma(in[i], weights[tid*input_size+i], sum);
            }
            if (__hle(sum, __float2half(0.0f)))
                sum = __float2half(0.0f);
            out[tid] = sum;
        }
    }

Any ideas on how to optimize this code?
Please note that the target hardware is an NVIDIA Jetson Nano (compute architecture sm_53).

  1. This is not how you do code formatting. One approach is to paste your code into the edit window, select the code, and click the </> button at the top of the window.

  2. There is not much to this code. The most obvious suggestion I can make is to rearrange your weights storage so that adjacent threads access adjacent weights. This probably means making the weights for a particular neuron occupy a column in memory, rather than a row. Your objective is to end up with a line of code something like this:

    sum += in[i]*weights[tid+output_size*i];
    

In addition, change this:

    out[tid]=sum;
    if (out[tid] <= 0) {
        out[tid] = 0;
    }

To this:

    if (sum < 0) sum = 0;
    out[tid]=sum;

This keeps the clamping in a register rather than reading the value back from global memory.
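Putting those two changes together, a sketch of the revised kernel might look like this (assuming your host code is also changed to store the weights transposed, i.e. each input's weights for all neurons are contiguous):

    __global__ void kernel_applyweightandbias(float *in, float *weights, float *bias, float *out, const int input_size, const int output_size){
        int tid = threadIdx.x + blockIdx.x * blockDim.x;

        if (tid < output_size) {
            float sum = bias[tid];
            for (int i = 0; i < input_size; i++) {
                // transposed layout: adjacent threads (tid, tid+1, ...) read
                // adjacent weights, giving coalesced loads
                sum += in[i] * weights[tid + output_size*i];
            }
            if (sum < 0) sum = 0;
            out[tid] = sum;
        }
    }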

An MLP layer update can usually be expressed as a matrix-vector multiply, and that may end up being more performant (e.g. using a cuBLAS call) than writing the code yourself. You would still have to do the ReLU op as a separate step. For best performance, you should probably consider using a library like cuDNN, or higher-level techniques such as a framework like PyTorch.
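For illustration, one layer via cuBLAS might look like the following sketch (untested; it keeps the weights in your original row-major [output_size x input_size] layout, and layer_forward/relu_inplace are made-up names):

    #include <cublas_v2.h>

    // simple elementwise ReLU, applied as a separate step after the GEMV
    __global__ void relu_inplace(float *v, int n)
    {
        int tid = threadIdx.x + blockIdx.x * blockDim.x;
        if (tid < n) v[tid] = fmaxf(v[tid], 0.0f);
    }

    // one layer: out = relu(W * in + bias); handle is assumed to have been
    // created with cublasCreate() at startup
    void layer_forward(cublasHandle_t handle, const float *d_in,
                       const float *d_weights, const float *d_bias,
                       float *d_out, int input_size, int output_size)
    {
        const float alpha = 1.0f, beta = 1.0f;
        // seed the output with the bias so the GEMV accumulates into it:
        // out = 1.0 * W * in + 1.0 * bias
        cudaMemcpy(d_out, d_bias, output_size * sizeof(float),
                   cudaMemcpyDeviceToDevice);
        // cuBLAS is column-major, so the row-major [output_size x input_size]
        // weight matrix is seen as column-major [input_size x output_size];
        // CUBLAS_OP_T then yields the product W * in
        cublasSgemv(handle, CUBLAS_OP_T, input_size, output_size,
                    &alpha, d_weights, input_size, d_in, 1, &beta, d_out, 1);
        int block = 256, grid = (output_size + block - 1) / block;
        relu_inplace<<<grid, block>>>(d_out, output_size);
    }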

What Robert Crovella said. This code, as shown, seems to be performance limited by memory throughput. I would suggest use of the CUDA profiler to confirm.
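For example, running the application under nvprof shows per-kernel timings, and its --metrics mode can report achieved memory throughput:

    nvprof ./mlp_app          # mlp_app = your application binary (placeholder)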

For clamping floating-point data to a particular range I would suggest use of the standard math functions fmax and fmin, which map to machine instructions. Here we might use:

out[tid] = fmaxf (sum, 0.0f);