Using CUDA Streams and cublasDsyrk to replace cublasDgemmBatched call results in very low performance


We have to multiply many small matrices with themselves. Currently we are using cublasDgemmBatched, which gives us reasonable performance.
However, because each matrix is multiplied with itself, we want to improve the performance using cublasDsyrk.

We tried to write our own batchedDsyrk using CUDA streams and the cublasDsyrk function, but we observed extremely low performance with this approach.
With the cublasDgemmBatched version the program takes several seconds to finish; with our custom batchedDsyrk it suddenly needs several minutes.
We think we are doing something wrong, but we are not able to find our mistake. The matrices have a size of 32x16 (the output matrices are square, of size s1). We are using a P100 GPU.

The cublasDgemmBatched call looks like this:

cublasDgemmBatched(cublas_handle, CUBLAS_OP_T, CUBLAS_OP_N, s1, s1, batch_size,
                                   &bgemm_alpha[0], (const double**) d_map, vec_size,
                                                    (const double**) d_map, vec_size,
                                   &bgemm_beta[0] ,                  d_mcp, s1, nm[0]);

Our custom batchedDsyrk looks like this (the streams are created only once, before the main loop, and then reused several times):

const int stream_n = 128;
cudaStream_t *streams = (cudaStream_t*)malloc(stream_n*sizeof(cudaStream_t));
for(int i=0;i<stream_n;i++)
    CHECK(cudaStreamCreateWithFlags(&streams[i], cudaStreamNonBlocking));

for(int i=0;i<nm[0];i++) {
    cublasSetStream(cublas_handle, streams[i % stream_n]);
    cublasDsyrk(cublas_handle, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_T, s1, batch_size,
                &bgemm_alpha[0], map[i], vec_size, &bgemm_beta[0], mcp[i], s1);
}

Thank you for your help.

Use the CUDA profiler to guide you. *SYRK is a rank-k update as I recall, and as such is likely limited by memory throughput, whereas *GEMM is limited by the throughput of the computational units. Converting a compute-limited problem into a bandwidth-limited problem is usually not a good idea.

Thank you for your answer.

I checked the profiler. The whole program spends 99.5 percent of its time in the kernel “maxwell_dgemm_64x64_lower_tn”; this shouldn’t be the case, so there must be something wrong with how we are setting it up.
The timeline is full of events like “cudaSetupArgument”, each consuming several ms.
When I replace cublasDsyrk with cublasDgemm inside the loop I get the same low performance. Something is definitely wrong.

By the way, the MAGMA library has a batched syrk function, so changing from gemm to syrk can’t be such a bad idea.

I did not intend to be misleading; it has been almost ten years since I last dealt with an *SYRK function.

The CUDA profiler is a powerful tool that enables CUDA programmers to zero in on the most important performance bottlenecks rather quickly. It can easily tell you whether a kernel is compute or memory limited. It is worth spending some quality time with to get the hang of it.

I doubt that attempting to do individual syrk operations in cublas and then using streams for parallelism would be anywhere near as efficient as batched operations on small matrices. That’s not been my experience, anyway.

There’s no doubt that a batched syrk might be a good idea. However, doing operations in separate streams is not an equivalent replacement for batching in cublas. If it were, there would be no reason to provide batched versions of operations like gemm. The simple fact that cublas and MAGMA provide batched operations suggests that attempting something similar with streams and non-batched calls probably doesn’t work (well) in many situations.
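To make the distinction concrete, here is a minimal sketch contrasting the two patterns. All variable names (d_Aarray, d_Carray, n_mats, etc.) are my own illustrative assumptions, not the original poster's code; the handle and streams are assumed to exist already.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Sketch only: one batched launch for all matrices versus one launch per matrix.
// d_Aarray/d_Carray are device arrays of device pointers, as cublasDgemmBatched
// expects; d_A[i]/d_C[i] are host-side arrays of individual device pointers.
void batched_vs_streamed(cublasHandle_t handle,
                         const double **d_Aarray, double **d_Carray,
                         double **d_A, double **d_C,
                         int s1, int k, int lda, int n_mats,
                         cudaStream_t *streams, int stream_n)
{
    const double alpha = 1.0, beta = 0.0;

    // Pattern 1: one API call; cuBLAS launches a small number of kernels
    // that together cover ALL n_mats products.
    cublasDgemmBatched(handle, CUBLAS_OP_T, CUBLAS_OP_N, s1, s1, k,
                       &alpha, d_Aarray, lda, d_Aarray, lda,
                       &beta, d_Carray, s1, n_mats);

    // Pattern 2: n_mats API calls and n_mats kernel launches. Each launch
    // carries fixed CPU-side overhead, and each tiny kernel may underutilize
    // the GPU even when launches overlap across streams.
    for (int i = 0; i < n_mats; ++i) {
        cublasSetStream(handle, streams[i % stream_n]);
        cublasDsyrk(handle, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_T, s1, k,
                    &alpha, d_A[i], lda, &beta, d_C[i], s1);
    }
}
```

With many thousands of small outputs, pattern 2 pays the per-call launch overhead thousands of times, which is consistent with the timeline full of cudaSetupArgument entries described above.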

It’s not immediately obvious to me that there is anything wrong with your code/attempt. Since it’s not a complete test case that I can try, that’s as far as I can go. Even if you provided a complete test case, I’m not sure I would have the time to spend analyzing it.

In my experience, this is not a fruitful path.

If you have a specific use case that you think is worthwhile, one possible option is to file an enhancement request (bug report) with NVIDIA. You can describe the batched operation or API that you would like to see; your use case, along with your experiments so far, would lend credence to the proposal.

Batching operations on small matrices is crucially necessary to take advantage of the massive parallelism of GPUs, because small matrices processed individually offer no way of exploiting that parallelism. I can say so with certainty because I was one of the first to research such batched operations when I worked for NVIDIA, and subsequently suggested adding support for batched operations to the CUBLAS team (around 2011, if memory serves).

Thank you for your reply and clarification.

What exactly do you mean by “separate streams are not an equivalent replacement for batching”? If that is the case, then I have a big misunderstanding of how batching works internally.

We followed the NVIDIA example “batchCUBLAS” in the samples directory, in which multiple streams are used to batch several gemm calls. I thought batchedGemm did the same thing internally but freed the user from stream management. If that is not the case, it may explain the huge performance degradation.

I was referring specifically to the use of cublasDgemmBatched to deal with small matrices, both of which you mentioned at the beginning of this thread.

Attempting to call, for example, cublasDgemm in multiple streams, passing a small matrix to each call, is not equivalent performance-wise to a single cublasDgemmBatched call that is passed all of the matrices at once. You should be able to prove this to yourself with a simple test.
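A sketch of such a test might look like the following. All names are hypothetical, the matrices are assumed to be already allocated and filled on the device, and cudaEvent timing is used; it is an illustration of the comparison, not a polished benchmark (error checking omitted).

```cuda
#include <cstdio>
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Hypothetical micro-benchmark: time one cublasDgemmBatched call against a
// streamed loop of individual cublasDgemm calls on the same n x n matrices.
// d_Aarray/d_Barray/d_Carray are device arrays of device pointers;
// d_A/d_B/d_C are host arrays of device pointers; streams are pre-created.
void compare(cublasHandle_t handle,
             const double **d_Aarray, const double **d_Barray, double **d_Carray,
             double **d_A, double **d_B, double **d_C,
             int n, int n_mats, cudaStream_t *streams, int stream_n)
{
    const double alpha = 1.0, beta = 0.0;
    float ms = 0.0f;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // One batched call covering all n_mats products.
    cudaEventRecord(start);
    cublasDgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                       &alpha, d_Aarray, n, d_Barray, n,
                       &beta, d_Carray, n, n_mats);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("batched:  %f ms\n", ms);

    // n_mats individual calls spread across streams.
    cudaEventRecord(start);
    for (int i = 0; i < n_mats; ++i) {
        cublasSetStream(handle, streams[i % stream_n]);
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, d_A[i], n, d_B[i], n, &beta, d_C[i], n);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("streamed: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```

Note that the second timing deliberately records the stop event in the default stream after all launches have been issued, so it captures both the CPU-side launch overhead and the GPU execution across streams.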

cublasDgemmBatched does not do the work internally in multiple streams.