I am trying to do many small (256x256) matrix-matrix multiplications with the cuBLAS *gemm functions.
Because the matrices are small, performance is far from optimal.
I tried using several streams (up to 100), but it doesn't improve performance; it stays almost the same.
I use the streams the following way:
cudaStream_t stream[streams_number];
for (int i = 0; i < streams_number; i++) {
    cudaStreamCreate(&stream[i]);
}
for (int j = 0; j < nIter; j++) {
    for (int i = 0; i < streams_number; i++) {
        cublasSetStream(handle, stream[i]);
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, WB, HA, WA,
                    &alpha, d_B + size_B*i, WB,
                            d_A + size_A*i, WA,
                    &beta,  d_C + size_C*i, WB);  // ldc must be the leading dimension of C, i.e. WB
    }
}
Did I do this correctly?
Is there a better way to speed up the multiplication of many small matrices?
Thank you,
Kirill
I believe you need an array of handles, just as you have an array of streams.
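A minimal sketch of that idea: one cuBLAS handle per stream, each bound to its stream once at setup, with every SGEMM working on its own slice of the device buffers. The stream count, matrix size, and buffer names follow the original post; everything else (the allocation and cleanup scaffolding) is illustrative, not a drop-in replacement for the poster's code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int N = 256;             // matrix dimension from the post
    const int streams_number = 8;  // illustrative count
    const size_t elems = (size_t)N * N * streams_number;

    // One contiguous buffer per operand; stream i uses slice i.
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, elems * sizeof(float));
    cudaMalloc(&d_B, elems * sizeof(float));
    cudaMalloc(&d_C, elems * sizeof(float));
    cudaMemset(d_A, 0, elems * sizeof(float));
    cudaMemset(d_B, 0, elems * sizeof(float));

    cudaStream_t   stream[streams_number];
    cublasHandle_t handle[streams_number];
    for (int i = 0; i < streams_number; i++) {
        cudaStreamCreate(&stream[i]);
        cublasCreate(&handle[i]);
        cublasSetStream(handle[i], stream[i]);  // bind each handle to its stream once
    }

    const float alpha = 1.0f, beta = 0.0f;
    for (int i = 0; i < streams_number; i++) {
        const size_t off = (size_t)N * N * i;
        // Each SGEMM is issued on its own handle/stream pair, so the
        // small kernels can overlap on the device.
        cublasSgemm(handle[i], CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                    &alpha, d_B + off, N,
                            d_A + off, N,
                    &beta,  d_C + off, N);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < streams_number; i++) {
        cublasDestroy(handle[i]);
        cudaStreamDestroy(stream[i]);
    }
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    printf("done\n");
    return 0;
}
```

For many same-sized small GEMMs, also worth a look: cublasSgemmBatched, which takes arrays of matrix pointers and launches the whole batch in one call, avoiding the per-call launch overhead entirely.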