About using batched matrix multiplication

I have an NVIDIA Quadro K4000M GPU with 4 GB of GDDR5 memory, and I'd like to run batched matrix multiplications in parallel. I know about cublasSgemmStridedBatched(), but everywhere it is discussed, the discussion is about small matrices. Given my GPU, how can I properly calculate the optimal batch size and matrix sizes? At the moment I have, for example, batch = 16, where each matrix A is 4x512 and the second matrix B is 512x512. The result C is 16 matrices of size 4x512. Maybe there is another, more appropriate way to do this kind of calculation. I tried streamed matrix multiplication before, but it didn't give me any speedup.
I have been working on my own neural network framework in C++; maybe somebody would be interested in developing it with me. The main aim is research and publishing Scopus-indexed papers in the field of object detection. All calculations are currently done with Eigen matrices.