Use streams to batch the execution of CUBLAS kernels.

I am trying to do many small (256x256) matrix-matrix multiplications with the CUBLAS *gemm function.

Because of the small size of the matrices, the performance is far from optimal.

I tried to use several streams (up to 100) but it doesn’t improve the performance. It stays almost the same.

I use streams the following way:

cudaStream_t stream[streams_number];

for (int i = 0; i < streams_number; i++) {

      cudaStreamCreate( &stream[i] );

}

for (int j = 0; j < nIter; j++) {

     for (int i = 0; i < streams_number; i++) {

	cublasSetStream( handle, stream[i] );

	cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, WB, HA, WA, &alpha, d_B + size_B*i, WB, d_A + size_A*i, WA, &beta, d_C + size_C*i, WA);

     }

}

Did I do it correctly?

Is there a better way to speed up the multiplication of small matrices?

Thank you,

Kirill

I believe you need an array of handles like you need an array of streams.
