Multiplication of several matrices

Hello
I need to implement the multiplication of a chain of matrices.
For example: (210 x 210) * (210 x 150) * (150 x 30). The number of matrices can be arbitrary.
All of these matrices are passed to the function as a dynamically allocated 2D array (float **W):
W[0] - 210 x 210 elements
W[1] - 210 x 150 elements
W[2] - 150 x 30 elements
Their dimensions are passed to the function as well.
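
To make the expected result concrete, here is a minimal CPU reference sketch in plain C that a GPU version could be checked against. It is not a CUDA kernel. It assumes that each W[i] is stored contiguously in row-major order and that the product is evaluated left to right. The names chain_multiply, rows and cols are hypothetical and only serve the illustration.

```c
#include <stdlib.h>
#include <string.h>

/* C = A * B, all row-major: A is m x k, B is k x n, C is m x n. */
static void matmul(const float *A, const float *B, float *C,
                   int m, int k, int n)
{
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (int p = 0; p < k; ++p)
                sum += A[i * k + p] * B[p * n + j];
            C[i * n + j] = sum;
        }
    }
}

/* Hypothetical helper: computes W[0] * W[1] * ... * W[count-1], left to right.
 * W[i] has rows[i] x cols[i] elements; cols[i] must equal rows[i+1].
 * Returns a malloc'ed rows[0] x cols[count-1] matrix (caller frees it),
 * or NULL on a dimension mismatch or allocation failure. */
float *chain_multiply(float **W, const int *rows, const int *cols, int count)
{
    if (count < 1)
        return NULL;

    int m = rows[0];                       /* row count never changes */
    size_t bytes = (size_t)m * cols[0] * sizeof(float);
    float *acc = malloc(bytes);
    if (!acc)
        return NULL;
    memcpy(acc, W[0], bytes);              /* running product starts as W[0] */

    for (int i = 1; i < count; ++i) {
        if (cols[i - 1] != rows[i]) {      /* inner dimensions must agree */
            free(acc);
            return NULL;
        }
        float *next = malloc((size_t)m * cols[i] * sizeof(float));
        if (!next) {
            free(acc);
            return NULL;
        }
        matmul(acc, W[i], next, m, rows[i], cols[i]);
        free(acc);
        acc = next;
    }
    return acc;
}
```

For the example above, rows would be {210, 210, 150} and cols would be {210, 150, 30}, which gives a 210 x 30 result. Nothing in this sketch requires the dimensions to be multiples of any block size.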

I looked at the matrix multiplication example in the CUDA SDK, but it only works for matrices whose width and height are multiples of 16 (BLOCK_SIZE).
My own algorithm does not work (see attachment).