Matrix multiplication woes: large inner, small outer dimensions

This is very interesting, excellent work!

At first I wanted to start a separate thread, but then I changed my mind, since I think this is the same problem.

Intro:

Recently I've been trying to implement a GPU-accelerated version of my fancy genetic algorithm that optimizes neural networks (i.e. perceptrons). A perceptron is just a geeky word for multiple matrix multiplications: its inner signal "S" is nothing but a matrix, and the hidden layers L[i], i = 0…n-1, are matrices too. To run a neural network (perceptron) you compute F(F(F(S * L[0]) * L[1]) * L[2])…, where F is just a function (for example cos(x)) that is applied to each element of a matrix.

The common approach to optimizing neural networks with a genetic algorithm is to start with small perceptron layer dimensions, since small perceptrons are easier to optimize. Once a perceptron has been optimized to its limits, you expand its layer dimensions in some random manner, which extends the capabilities of the neural network you are optimizing.
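To make the chain concrete, here is a minimal CPU-side sketch; the helper names and the fixed square dimension d are my own simplifications for illustration (my real layers have varying dimensions, as described below):

[codebox]
#include <math.h>

/* C(d x d) = A(d x d) * B(d x d), naive triple loop */
void matmul(const float* A, const float* B, float* C, int d)
{
    for (int i = 0; i < d; ++i)
        for (int j = 0; j < d; ++j) {
            float acc = 0.0f;
            for (int k = 0; k < d; ++k)
                acc += A[i * d + k] * B[k * d + j];
            C[i * d + j] = acc;
        }
}

/* Evaluates F(...F(F(S * L[0]) * L[1])... * L[n-1]) in place in S,
   with F = cos applied to every element after each multiplication. */
void forward(float* S, float* const* L, int n, int d, float* tmp)
{
    for (int l = 0; l < n; ++l) {
        matmul(S, L[l], tmp, d);
        for (int i = 0; i < d * d; ++i)
            S[i] = cosf(tmp[i]);
    }
}
[/codebox]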

Specific:

To compute a pass from one layer to the next, I don't use the standard formula

Input * Layer + Bias = Result

Instead, my layers have two matrices, "A" and "B", so the formula looks like:

Layer.A * Input * Layer.B + Bias = Result

This formula allows cleaner mathematical notation and causes no problems with dimensions: when you have an input of dimensions (M x N) and you need an output of dimensions (A x B), you just do:

(A x M) * (M x N) * (N x B) + (A x B) = (A x B)
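In code, one layer pass under this scheme looks roughly like the sketch below (matmulRect and layerPass are hypothetical helper names, not from my actual code):

[codebox]
/* C(rows x cols) = A(rows x inner) * B(inner x cols) */
void matmulRect(const float* A, const float* B, float* C,
                int rows, int inner, int cols)
{
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < cols; ++j) {
            float acc = 0.0f;
            for (int k = 0; k < inner; ++k)
                acc += A[i * inner + k] * B[k * cols + j];
            C[i * cols + j] = acc;
        }
}

/* Result(A x B) = LA(A x M) * Input(M x N) * LB(N x B) + Bias(A x B);
   tmp must hold A * N floats for the intermediate product. */
void layerPass(const float* LA, const float* Input, const float* LB,
               const float* Bias, float* Result, float* tmp,
               int A, int M, int N, int B)
{
    matmulRect(LA, Input, tmp, A, M, N);   /* (A x M) * (M x N) -> (A x N) */
    matmulRect(tmp, LB, Result, A, N, B);  /* (A x N) * (N x B) -> (A x B) */
    for (int i = 0; i < A * B; ++i)
        Result[i] += Bias[i];              /* + Bias(A x B) */
}
[/codebox]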

The problem:

Let's say the input matrix size is (16 x 16) and the required output size is (10 x 2).

When I start my optimizations, I'm required to do more than a thousand evaluations of the following chain:

[codebox]

(2 x 16) * (16 x 16) * (16 x 2) = (2 x 2) // first layer: Layer.A (2 x 16), input (16 x 16), Layer.B (16 x 2), output (2 x 2)

(2 x 2) * (2 x 2) * (2 x 2) = (2 x 2) // second layer: input (2 x 2), output (2 x 2)

(10 x 2) * (2 x 2) * (2 x 2) = (10 x 2) // third and final layer: input (2 x 2), output (10 x 2)

[/codebox]

If I try to do this with sequential kernel calls that each just multiply two matrices, the GPU will not be used optimally due to the small matrix sizes. The only strategy that solves the issue is to load all the matrices to the device and do all the multiplications within one or a few kernel calls. Currently I'm trying to do that with the help of the "matrixMul" function that I found in the SDK, which I'm modifying in my own way. (Unfortunately I'm not able to provide a working piece of code right now; I will post it a little later.)
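To make the strategy concrete, here is a rough sketch of the kind of batched kernel I have in mind; this is not my modified matrixMul, just an illustration (the name batchedSmallMatMul and the one-block-per-product layout are my own assumptions):

[codebox]
#define MAX_DIM 16   /* each operand must fit in MAX_DIM * MAX_DIM floats */

/* Each block computes one product C[b](rows x cols) = A[b](rows x inner) * B[b](inner x cols),
   so a whole batch of small multiplications runs in a single launch. */
__global__ void batchedSmallMatMul(const float* A, const float* B, float* C,
                                   int rows, int inner, int cols)
{
    __shared__ float sA[MAX_DIM * MAX_DIM];
    __shared__ float sB[MAX_DIM * MAX_DIM];

    int b = blockIdx.x;                       /* one block per matrix product */
    const float* Ab = A + b * rows * inner;
    const float* Bb = B + b * inner * cols;
    float*       Cb = C + b * rows * cols;

    /* cooperatively stage both operands into shared memory */
    for (int i = threadIdx.x; i < rows * inner; i += blockDim.x)
        sA[i] = Ab[i];
    for (int i = threadIdx.x; i < inner * cols; i += blockDim.x)
        sB[i] = Bb[i];
    __syncthreads();

    /* one thread per output element (assumes blockDim.x >= rows * cols) */
    if (threadIdx.x < rows * cols) {
        int r = threadIdx.x / cols;
        int c = threadIdx.x % cols;
        float acc = 0.0f;
        for (int k = 0; k < inner; ++k)
            acc += sA[r * inner + k] * sB[k * cols + c];
        Cb[r * cols + c] = acc;
    }
}
[/codebox]

For example, batchedSmallMatMul<<<populationSize, 256>>>(dLeft, dInputs, dTmp, 2, 16, 16); would compute all the (2 x 16) * (16 x 16) products for a whole population in one launch; a second launch (or a fused kernel) would handle the multiplication by Layer.B.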

The problem is that matrixMul has issues of its own and won't handle, say, (3 x 3) matrices efficiently. So I would like to draw your attention, guru(s), to the following problem: not only might you need to multiply several large matrices efficiently, you might also need to multiply many small ones no less efficiently than the large ones. This problem seems like the counterpart of the one you are discussing right now. Have you found any clever solution to it?