Neural Network Code Optimization

Hi,
I need to optimize the running time of a simple neural network (an MLP with 13 layers).
The code does multiplication, addition, and ReLU operations: I multiply weights by the input, add the bias, and then apply a ReLU function.

    __global__ void kernel_applyweightandbias(float *in, float *weights, float *bias, float *out, const int input_size, const int output_size){
        int tid = threadIdx.x + blockIdx.x * blockDim.x;
        float sum = 0;

        if (tid < output_size) {
            sum = bias[tid];
            for (int i = 0; i < input_size; i++) {
                sum += in[i] * weights[tid*input_size+i];
            }
            out[tid] = sum;
            if (out[tid] <= 0) {
                out[tid] = 0;
            }
        }
    }
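For context, each layer is launched with one thread per output neuron, along these lines (the names and block size here are illustrative):

    // illustrative launch configuration (block size chosen arbitrarily)
    int block = 256;
    int grid = (output_size + block - 1) / block;
    kernel_applyweightandbias<<<grid, block>>>(d_in, d_weights, d_bias, d_out, input_size, output_size);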

I tried using half-precision floats, but performance did not improve.

    __global__ void kernel_applyweightandbias(half *in, half *weights, half *bias, half *out, const int input_size, const int output_size){
        int tid = threadIdx.x + blockIdx.x * blockDim.x;
        half sum;

        if (tid < output_size) {
            sum = bias[tid];
            for (int i = 0; i < input_size; i++) {
                sum = __hfma(in[i], weights[tid*input_size+i], sum);
            }
            if (__hle(sum, __float2half(0.0f)))
                sum = __float2half(0.0f);
            out[tid] = sum;
        }
    }

Any ideas on how to optimize this code?
Please note that the target hardware is an NVIDIA Jetson Nano (compute architecture sm_53).

  1. This is not how you do code formatting. One approach is to paste your code into the edit window, select the code, and click the </> button at the top of the window.

  2. There is not much to this code. The most obvious suggestion I can make is to rearrange your weights storage so that adjacent threads access adjacent weights. This probably means making the weights for a particular neuron occupy a column in memory, rather than a row. Your objective is to end up with a line of code something like this:

    sum += in[i]*weights[tid+output_size*i];
    

In addition, change this:

    out[tid]=sum;
    if (out[tid] <= 0) {
        out[tid] = 0;
    }

To this:

    if (sum < 0) sum = 0;
    out[tid]=sum;

This keeps the clamping in a register rather than reading the value back from global memory.
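Putting those two changes together, a sketch of the revised kernel might look like this (assuming your host code is also changed to store the weights transposed, i.e. each input's weights for all neurons are contiguous):

    __global__ void kernel_applyweightandbias(float *in, float *weights, float *bias, float *out, const int input_size, const int output_size){
        int tid = threadIdx.x + blockIdx.x * blockDim.x;

        if (tid < output_size) {
            float sum = bias[tid];
            for (int i = 0; i < input_size; i++) {
                // transposed layout: adjacent threads (tid, tid+1, ...) read
                // adjacent weights, giving coalesced loads
                sum += in[i] * weights[tid + output_size*i];
            }
            if (sum < 0) sum = 0;
            out[tid] = sum;
        }
    }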

An MLP layer update can usually be expressed as a matrix-vector multiply, and that may end up being more performant (e.g. using a cuBLAS call) than writing the code yourself. You would still have to do the ReLU op as a separate step. For best performance, you should probably consider using a library like cuDNN, or higher-level techniques such as a framework like PyTorch.
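For illustration, one layer via cuBLAS might look like the following sketch (untested; it keeps the weights in your original row-major [output_size x input_size] layout, and layer_forward/relu_inplace are made-up names):

    #include <cublas_v2.h>

    // simple elementwise ReLU, applied as a separate step after the GEMV
    __global__ void relu_inplace(float *v, int n)
    {
        int tid = threadIdx.x + blockIdx.x * blockDim.x;
        if (tid < n) v[tid] = fmaxf(v[tid], 0.0f);
    }

    // one layer: out = relu(W * in + bias); handle is assumed to have been
    // created with cublasCreate() at startup
    void layer_forward(cublasHandle_t handle, const float *d_in,
                       const float *d_weights, const float *d_bias,
                       float *d_out, int input_size, int output_size)
    {
        const float alpha = 1.0f, beta = 1.0f;
        // seed the output with the bias so the GEMV accumulates into it:
        // out = 1.0 * W * in + 1.0 * bias
        cudaMemcpy(d_out, d_bias, output_size * sizeof(float),
                   cudaMemcpyDeviceToDevice);
        // cuBLAS is column-major, so the row-major [output_size x input_size]
        // weight matrix is seen as column-major [input_size x output_size];
        // CUBLAS_OP_T then yields the product W * in
        cublasSgemv(handle, CUBLAS_OP_T, input_size, output_size,
                    &alpha, d_weights, input_size, d_in, 1, &beta, d_out, 1);
        int block = 256, grid = (output_size + block - 1) / block;
        relu_inplace<<<grid, block>>>(d_out, output_size);
    }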

What Robert Crovella said. This code, as shown, seems to be performance limited by memory throughput. I would suggest use of the CUDA profiler to confirm.
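For example, running the application under nvprof shows per-kernel timings, and its --metrics mode can report achieved memory throughput:

    nvprof ./mlp_app          # mlp_app = your application binary (placeholder)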

For clamping floating-point data to a particular range I would suggest use of the standard math functions fmax and fmin, which map to machine instructions. Here we might use:

out[tid] = fmaxf (sum, 0.0f);