I am trying to do many small (256x256) matrix-matrix multiplications with the cuBLAS *gemm function.

Because the matrices are so small, the performance is far from optimal.

I tried using several streams (up to 100), but that doesn’t improve the performance. It stays almost the same.

I use the streams in the following way:

```
// One stream per independent multiplication
cudaStream_t stream[streams_number];
for (int i = 0; i < streams_number; i++) {
    cudaStreamCreate(&stream[i]);
}
for (int j = 0; j < nIter; j++) {
    for (int i = 0; i < streams_number; i++) {
        // Route the next cuBLAS call to stream i
        cublasSetStream(handle, stream[i]);
        // Multiply the i-th pair of matrices; results go to the i-th slot of d_C
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, WB, HA, WA,
                    &alpha, d_B + size_B * i, WB, d_A + size_A * i, WA,
                    &beta, d_C + size_C * i, WA);
    }
}
```

Am I using the streams correctly?

Is there a better way to speed up the multiplication of small matrices?

Thank you,

Kirill