Looking for help optimising the run time of a kernel

Hello,

I have been working on a project for university which involves using a pre-provided library (a link to which is included below) to do some real-time deep learning. At the moment, however, the program is too slow.

I’ve managed to fix a few issues to speed up the library, but the function found in the layer class (Layer::calcOutputs) is still too slow for my application. This function calls another GPU function which generates a block with a single thread for each neuron in a given layer; it then calls a function in the Neuron class (Neuron::device_dotProduct) in which each of these blocks calculates the dot product.

My question is whether there is any way to speed this up. I have looked into cuBLAS for this, but that also appears too slow for the application, as I need the network to work with a layer size of around 3-5k and run around 15k times per second.

My hardware is a 2 GB Jetson Nano Developer Kit, which has a 128-core GPU, an 8-core CPU, and 1024 threads available.

GitHub for the library: CLDL-CUDA/lib at main · L-A-F-987/CLDL-CUDA · GitHub

I would calculate the minimum number of operations needed and the speed of the GPU to get the maximum theoretical possible speed.
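
For example (I do not know your exact topology, so take this only as a rough estimate): a fully connected layer with 5000 neurons and 5000 inputs has 25 million weights, i.e. about 50 MFLOP per forward pass. At 15,000 passes per second that is around 750 GFLOP/s, already above the Nano's FP32 peak of roughly 236 GFLOP/s, and its FP64 rate is far lower still. Memory is even tighter: streaming 25 million weights of 4 bytes each is ~100 MB per pass, or ~1.5 TB/s at 15k passes per second, against roughly 25 GB/s of memory bandwidth. A quick calculation like this tells you whether the target is reachable at all.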

Then I would use Nsight Compute to see how many operations the kernel runs and whether there are any inefficiencies.

Usually uncoalesced memory accesses are the limiting factor for kernels that are not perfectly optimized.

I saw that your kernels use double-precision calculations, which are very slow on non-datacenter GPUs (their FP64 units are cut down to save silicon space and for product differentiation).
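
For illustration only (this is hypothetical code, not taken from CLDL-CUDA): on Maxwell-class GPUs like the Nano's, the FP64 units run at a small fraction of the FP32 rate, so a single double variable or plain 1.0 literal in a hot loop is enough to slow everything down:

```cpp
// Hypothetical dot-product loop, not the library's code.
__device__ float dot_fp64(const float* w, const float* in, int n)
{
    double sum = 0.0;             // forces every addition onto the slow FP64 units
    for (int i = 0; i < n; ++i)
        sum += w[i] * in[i];      // float product gets promoted to double
    return (float)sum;
}

__device__ float dot_fp32(const float* w, const float* in, int n)
{
    float sum = 0.0f;             // stays entirely in FP32
    for (int i = 0; i < n; ++i)
        sum += w[i] * in[i];
    return sum;
}
```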

You should also optimize any operation that is not an addition/subtraction/multiplication. Division, exponentials, and trigonometric operations are slow. Use cached look-up tables or better approximation formulas. You can also activate the fast-math option for some speed-up. Perhaps you can replace some of the more complicated activation functions with very similar ones that are faster to compute.
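
As an example of that last point (again just a sketch, not your library's code): the exact logistic sigmoid needs an exponential per call, while the well-known "fast sigmoid" x / (1 + |x|) has the same general S-shape but only needs an absolute value, an addition and a division:

```cpp
// Exact logistic sigmoid: one expf() per call.
__device__ float sigmoid_exact(float x)
{
    return 1.0f / (1.0f + expf(-x));
}

// "Fast sigmoid" approximation, rescaled from (-1, 1) to (0, 1).
// Similar shape and range, but no call into the special function units.
__device__ float sigmoid_fast(float x)
{
    return 0.5f * (x / (1.0f + fabsf(x))) + 0.5f;
}
```

Compiling with nvcc --use_fast_math additionally replaces divisions, expf and friends with faster but less precise hardware approximations.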

Also try to avoid any indirect memory accesses. You can use indices, but avoid linked lists and similar indirections: the threads in a warp (32 threads) should, as often as possible, access 32 contiguous memory addresses.
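
A concrete example of what that means for a layer's weights (again a sketch with made-up names, not identifiers from the library): if each thread handles one output neuron, store the weight matrix so that neighbouring threads read neighbouring addresses in every iteration of the loop:

```cpp
// Uncoalesced: thread t reads w[t*numInputs + i], so the 32 threads of a
// warp touch addresses that lie numInputs*sizeof(float) bytes apart.
__global__ void forward_strided(const float* w, const float* in, float* out,
                                int numNeurons, int numInputs)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= numNeurons) return;
    float sum = 0.0f;
    for (int i = 0; i < numInputs; ++i)
        sum += w[t * numInputs + i] * in[i];
    out[t] = sum;
}

// Coalesced: store the weights transposed as w[i*numNeurons + t], so the 32
// threads of a warp read 32 consecutive floats in each iteration.
__global__ void forward_coalesced(const float* w, const float* in, float* out,
                                  int numNeurons, int numInputs)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= numNeurons) return;
    float sum = 0.0f;
    for (int i = 0; i < numInputs; ++i)
        sum += w[i * numNeurons + t] * in[i];
    out[t] = sum;
}
```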

Also, hopefully those of your kernels that run a single block with a single thread are just for debugging purposes?


@Curefab provided good advice. By all means run the numbers, but I would not set my hopes too high. Compared to an RTX 4090, the Jetson Nano provides 0.3% of the computational throughput and 2.5% of the memory bandwidth. I would think the Jetson Nano is primarily intended as an affordable learning tool.


@Curefab Thanks for this, I’ll look into these. I hadn’t thought to try looking at the activation functions but I will do that too. I am relatively new to GPU programming and CUDA so the advice is really appreciated.

@njuffa Thanks, I have been working on the project for a while, so I had to temper my expectations a while ago. If it doesn’t work I’ll just have to accept it; I just want to make sure I haven’t missed something obvious in the optimisation that would cause me to blame the hardware when the problem was actually my own code :).

Thanks again.

You are welcome. Nsight Compute will tell you which instruction types slow down the kernels; you would see whether they are the arithmetic double computations or the transcendental functions (done by the special function units/multi-function units).

You can often optimize away divisions by calculating the reciprocal of the denominator once globally (e.g. on the CPU) and then using a multiplication.
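
A minimal sketch of that idea (illustrative names, not from the library): if every thread divides by the same denominator, e.g. when normalising outputs, compute the reciprocal once on the host and pass it in:

```cpp
// Before: every thread pays for a division.
__global__ void scale_div(float* data, int n, float denom)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] / denom;
}

// After: one division on the CPU, the threads only multiply.
__global__ void scale_mul(float* data, int n, float invDenom)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * invDenom;
}

// Host side:
// float invDenom = 1.0f / denom;                      // calculated once
// scale_mul<<<blocks, threads>>>(d_data, n, invDenom);
```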