I am integrating CUDA streams into my application. As an experiment, I am trying to execute all kernels in a stream other than the NULL (default) stream. However, I noticed that a call to cublasSgemm sometimes launches an instance of “gemm_kernel_1x1_core” in a different stream than the one specified. Here’s a screenshot of NVVP: http://chri.stophr.be/nvvp-cuBlas.png
This is the only occurrence of a kernel running in a different stream. All other CUDA kernels run in the correct stream.
As mentioned before, this does not always happen, but it definitely does when the matrices become large. I have written unit tests, and there the problem does not seem to occur. The result of this behaviour is that I get garbage values further down my pipeline when running on a larger dataset.
The code that produces this is pretty vanilla (the scalars are stored on the device and the pointer mode is set accordingly):
// stream is properly instantiated earlier in the code.
CheckErrors(cublasSetStream(handle, stream));

// m = number of rows of first_op
// n = number of rows of second_op
// k = number of cols of first_op and second_op
// dst becomes an m-by-n matrix
CheckErrors(cublasSgemm(handle,
                        CUBLAS_OP_N, CUBLAS_OP_T,
                        m, n, k,
                        &alpha,
                        first_op, m,
                        second_op, n,
                        &beta,
                        dst, m));

CheckErrors(cublasSetStream(handle, NULL));
Even if I drop the last line, which resets the stream to NULL, the problem persists.
To me this looks like a bug in cuBLAS. Does anyone know how I can solve this?