cublasSgemm launches kernel in wrong stream


I am integrating CUDA streams into my application. As an experiment I am trying to execute all kernels in a stream other than the NULL (default) stream. However, I noticed that a call to cublasSgemm sometimes launches an instance of “gemm_kernel_1x1_core” in a different stream than the one specified. Here’s a screenshot of NVVP:

This is the only occurrence of a kernel running in a different stream. All other CUDA kernels run in the correct stream.

As mentioned before, this does not always happen, but it reliably does when the matrices become large. I have written unit tests, and there the problem does not seem to occur. The result of this behaviour is that I’m getting garbage values further down my pipeline when running on a larger dataset.

The code that produces this is pretty vanilla (the scalars are stored on the device, and the pointer mode is set accordingly):

    // stream is properly instantiated earlier in the code.
    CheckErrors(cublasSetStream(handle, stream));

    // m = number of rows of first_op
    // n = number of rows of second_op
    // k = number of cols of first_op and second_op
    // dst becomes an m-by-n matrix
    CheckErrors(cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T,
        m, n, k,
        alpha, first_op, m,
        second, n,
        beta, dst, m));

    CheckErrors(cublasSetStream(handle, NULL));

Even if I drop the last line, which sets the stream back to NULL, the problem persists.
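For completeness, the device pointer-mode setup looks roughly like this (a sketch; the variable names and scalar values are illustrative, not taken from my actual code):

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Sketch: store alpha/beta on the device and tell cuBLAS to read
// the gemm scalars from device memory rather than host memory.
void SetupDeviceScalars(cublasHandle_t handle, float **alpha, float **beta)
{
    const float h_alpha = 1.0f, h_beta = 0.0f;  // illustrative values
    cudaMalloc(alpha, sizeof(float));
    cudaMalloc(beta, sizeof(float));
    cudaMemcpy(*alpha, &h_alpha, sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(*beta, &h_beta, sizeof(float), cudaMemcpyHostToDevice);
    // With CUBLAS_POINTER_MODE_DEVICE, cublasSgemm dereferences the
    // alpha/beta arguments as device pointers.
    cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);
}
```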

To me this seems like a bug in cuBLAS. Does anyone know how I can solve this?


Without minimal but complete (buildable and runnable) repro code, it is highly unlikely that anybody will be able to reproduce this issue and diagnose it as either a bug in your code or a bug in cuBLAS.

If you believe that you have done sufficient due diligence to establish that you are encountering a bug in cuBLAS, you may want to consider filing a bug report with NVIDIA using the bug reporting form linked from the CUDA registered developer website.
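A minimal repro along these lines would help (a sketch under assumptions: one non-default stream, device pointer mode, and placeholder matrix sizes; your actual dimensions and transpose flags may differ). Profile it with NVVP and check which stream the gemm kernel lands in:

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

#define CHECK_CUDA(x) do { cudaError_t e = (x); if (e != cudaSuccess) { \
    printf("CUDA error %d at line %d\n", (int)e, __LINE__); exit(1); } } while (0)
#define CHECK_CUBLAS(x) do { cublasStatus_t s = (x); if (s != CUBLAS_STATUS_SUCCESS) { \
    printf("cuBLAS error %d at line %d\n", (int)s, __LINE__); exit(1); } } while (0)

int main()
{
    const int m = 2048, n = 2048, k = 2048;  // placeholder sizes

    // Device buffers for the operands and the device-resident scalars.
    float *A, *B, *C, *alpha, *beta;
    CHECK_CUDA(cudaMalloc(&A, sizeof(float) * m * k));
    CHECK_CUDA(cudaMalloc(&B, sizeof(float) * n * k));
    CHECK_CUDA(cudaMalloc(&C, sizeof(float) * m * n));
    CHECK_CUDA(cudaMalloc(&alpha, sizeof(float)));
    CHECK_CUDA(cudaMalloc(&beta, sizeof(float)));
    const float one = 1.0f, zero = 0.0f;
    CHECK_CUDA(cudaMemcpy(alpha, &one, sizeof(float), cudaMemcpyHostToDevice));
    CHECK_CUDA(cudaMemcpy(beta, &zero, sizeof(float), cudaMemcpyHostToDevice));

    cublasHandle_t handle;
    CHECK_CUBLAS(cublasCreate(&handle));
    CHECK_CUBLAS(cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE));

    // Bind the handle to a non-default stream, as in the original post.
    cudaStream_t stream;
    CHECK_CUDA(cudaStreamCreate(&stream));
    CHECK_CUBLAS(cublasSetStream(handle, stream));

    CHECK_CUBLAS(cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T,
        m, n, k, alpha, A, m, B, n, beta, C, m));

    CHECK_CUDA(cudaStreamSynchronize(stream));
    printf("done\n");

    cublasDestroy(handle);
    cudaStreamDestroy(stream);
    cudaFree(A); cudaFree(B); cudaFree(C); cudaFree(alpha); cudaFree(beta);
    return 0;
}
```

Build with something like `nvcc -o repro repro.cu -lcublas`, then run it under the profiler and inspect the stream assignment of each kernel.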