Neural Network Optimisation

So in learning CUDA I’m trying to optimise a simple neural network algorithm. The task is to multiply a matrix (weights) by a vector (activity) and then pass the resulting vector through a sigmoid function ( f(a) = 1/(1 + e^(-a)) ).
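To make the task concrete, here is a minimal host-side C++ sketch of the computation described above, the kind of thing that could serve as a CPU reference when checking GPU output. The function names (sigmoid, layer_forward_reference) and the row-major layout of the weights are assumptions made for illustration, not taken from the attached code.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Logistic sigmoid: f(a) = 1 / (1 + exp(-a)).
inline float sigmoid(float a) {
    return 1.0f / (1.0f + std::exp(-a));
}

// Hypothetical reference helper (not from the original post):
//   out[r] = sigmoid( sum over c of W[r * cols + c] * x[c] )
// W is ASSUMED to be stored row-major with `rows` rows and `cols` columns.
std::vector<float> layer_forward_reference(const std::vector<float>& W,
                                           const std::vector<float>& x,
                                           std::size_t rows,
                                           std::size_t cols) {
    std::vector<float> out(rows);
    for (std::size_t r = 0; r < rows; ++r) {
        double acc = 0.0;  // accumulate in double to limit rounding error
        for (std::size_t c = 0; c < cols; ++c) {
            acc += static_cast<double>(W[r * cols + c]) * x[c];
        }
        out[r] = sigmoid(static_cast<float>(acc));
    }
    return out;
}
```

Comparing a kernel's output against a reference like this with a small tolerance (float summation order differs on the GPU) is one way to see at which matrix size the results start to diverge.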
BUT the matrix is small, typically around 400x400 in my case, and I find a naive approach faster than CUBLAS, which is optimised for large matrices. So what I’ve done is adapt the reduction example so that, instead of loading each element directly, it loads the product of a matrix element and the corresponding vector element. This is also now done on a 2D grid, so that one grid dimension indexes the columns (or rows) of the matrix to be summed.

On the whole it works, but larger sizes (where the reduction sum is split across multiple kernel calls) go wrong.
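As a point of comparison for the multi-call case, here is a host-side C++ model of what a multi-pass reduction has to compute for one matrix row. It is only a sketch under assumptions: reduce_pass and row_dot_multipass are hypothetical names, each call to reduce_pass stands in for one kernel launch, and block_size stands in for the number of elements one thread block sums. It is not the attached code.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Models one "kernel launch": every block of up to block_size inputs is
// summed into one partial result. The last block may be short, so its end
// index is clamped to n instead of reading past the data.
std::vector<float> reduce_pass(const std::vector<float>& in,
                               std::size_t block_size) {
    const std::size_t n = in.size();
    const std::size_t num_blocks = (n + block_size - 1) / block_size;  // ceiling division
    std::vector<float> partial(num_blocks, 0.0f);
    for (std::size_t b = 0; b < num_blocks; ++b) {
        const std::size_t begin = b * block_size;
        const std::size_t end = std::min(begin + block_size, n);
        for (std::size_t i = begin; i < end; ++i) {
            partial[b] += in[i];
        }
    }
    return partial;
}

// Dot product of one matrix row with x using repeated passes.
// The multiplication happens ONCE, before the first pass; later passes
// only add up partial sums.
float row_dot_multipass(const std::vector<float>& row,
                        const std::vector<float>& x,
                        std::size_t block_size) {
    if (block_size < 2) block_size = 2;  // a block of 1 would never shrink the data
    const std::size_t n = std::min(row.size(), x.size());
    std::vector<float> buf(n);
    for (std::size_t i = 0; i < n; ++i) {
        buf[i] = row[i] * x[i];
    }
    while (buf.size() > 1) {
        buf = reduce_pass(buf, block_size);
    }
    return buf.empty() ? 0.0f : buf[0];
}
```

Two things this model makes explicit, either of which might be worth checking in a real multi-launch version: the element count shrinks to the number of blocks after each pass and is generally not a multiple of the block size, and the vector multiplication belongs only in the first pass.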
Also, while this is faster than a naive approach, it’s not a huge improvement, so any and all advice on improving this is most definitely welcome. As I said, I’m still learning.

I expect the grid dimensions could be better. Anyway, here is the code… (8.12 KB)