Does cublasSgemm execute on the default stream?

Here is my code:

cv::Mat A, B, C;
float* device_A, device_B, device_C;
// Initilize host memory and allocate device memory here
// ...


// Copy data from host to device
cudaMemcpy(device_A, A.data, sizeof(float) * A.rows * A.cols, cudaMemcpyHostToDevice);
cudaMemcpy(device_B, B.data, sizeof(float) * B.rows * B.cols, cudaMemcpyHostToDevice);

// Matrix multiplication
float alpha = 1.0f, beta = 0.0f;
cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_T, A.rows, B.cols, A.cols, &alpha, device_A, A.cols, device_B, B.cols, &beta, device_C, A.rows);

// Copy data from device to host asynchronously on a pre-defined stream.
cudaMemcpyAsync(C.data, device_C, sizeof(float) * C.rows * C.cols, cudaMemcpyDeviceToHost, stream);

I find this code produces wrong data. I guess that's because the cudaMemcpyAsync copies the data before cublasSgemm has finished.

But I learned that a kernel running on the default stream is implicitly synchronized with all other streams. Shouldn't cublasSgemm run on the default stream? Why do the wrong results occur?

This doesn’t look right to me:

float* device_A, device_B, device_C;

That is declaring device_A as a float pointer and device_B and device_C as ordinary float variables. If that is your real code, I’m not sure how you get that to compile.

I assume somewhere you have replaced the default allocator for cv::Mat with a pinned allocator, otherwise your observation should not occur, because the cudaMemcpyAsync operation becomes synchronous when transferring to a host pageable memory location.
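For reference, one way to do that replacement (a sketch, assuming OpenCV was built with CUDA support) is to install the pinned-memory allocator from cv::cuda::HostMem as the default cv::Mat allocator:

```cpp
#include <opencv2/core.hpp>
#include <opencv2/core/cuda.hpp>

int main() {
    // Make all subsequent cv::Mat allocations use page-locked (pinned)
    // host memory, so cudaMemcpyAsync to/from them can be truly
    // asynchronous with respect to the host thread.
    cv::Mat::setDefaultAllocator(
        cv::cuda::HostMem::getAllocator(cv::cuda::HostMem::PAGE_LOCKED));

    cv::Mat A(1024, 1024, CV_32F);  // now backed by pinned memory
    // ... pass A.ptr<float>() to cudaMemcpy/cudaMemcpyAsync as usual ...
    return 0;
}
```

If you have not done something like this, cv::Mat data is ordinary pageable memory and the cudaMemcpyAsync falls back to synchronous behavior.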

Anyway, with the above provisos, the problem is not that the cudaMemcpyAsync operation is copying the wrong data, the problem is that your host code that is looking at the data to determine whether it is correct is not waiting for the cudaMemcpyAsync operation to complete. Remember, it is asynchronous (with the above provisos), so it is issued by the CPU thread, and the CPU thread immediately moves on, without waiting for the cudaMemcpyAsync to complete.

To prove this, put a cudaDeviceSynchronize or cudaStreamSynchronize after the cudaMemcpyAsync, before you look at the data in host code.

Here’s an example demonstrating these ideas:

$ cat
#include <cublas_v2.h>
#include <iostream>

const int ds = 1024;

int main(){

  float *d_A, *d_B, *d_C, *h_A, *h_B, *h_C;
  cudaMalloc(&d_A, ds*ds*sizeof(float));
  cudaMalloc(&d_B, ds*ds*sizeof(float));
  cudaMalloc(&d_C, ds*ds*sizeof(float));
#ifdef USE_PINNED
  cudaHostAlloc(&h_A, ds*ds*sizeof(float), cudaHostAllocDefault);
  cudaHostAlloc(&h_B, ds*ds*sizeof(float), cudaHostAllocDefault);
  cudaHostAlloc(&h_C, ds*ds*sizeof(float), cudaHostAllocDefault);
#else
  h_A = (float *)malloc(ds*ds*sizeof(float));
  h_B = (float *)malloc(ds*ds*sizeof(float));
  h_C = (float *)malloc(ds*ds*sizeof(float));
#endif
  cublasHandle_t h;
  cublasCreate(&h);
  cudaStream_t s;
  cudaStreamCreate(&s);
  for (int i = 0; i < ds*ds; i++){
    h_A[i] = 1.0f;
    h_B[i] = 1.0f;
    h_C[i] = 0.0f;}
  // Copy data from host to device
  cudaMemcpy(d_A, h_A, sizeof(float) * ds * ds, cudaMemcpyHostToDevice);
  cudaMemcpy(d_B, h_B, sizeof(float) * ds * ds, cudaMemcpyHostToDevice);

  // Matrix multiplication
  float alpha = 1.0f, beta = 0.0f;
  cublasSgemm(h, CUBLAS_OP_T, CUBLAS_OP_T, ds, ds, ds, &alpha, d_A, ds, d_B, ds, &beta, d_C, ds);

  // Copy data from device to host asynchronously on a pre-defined stream.
  cudaMemcpyAsync(h_C, d_C, sizeof(float) * ds * ds, cudaMemcpyDeviceToHost, s);
#ifdef USE_WAIT
  cudaDeviceSynchronize();
#endif
  for (int i = 0; i < ds*ds; i++)
    if (h_C[i] != (float)ds) {std::cout << "oops: " << i << " : " << h_C[i] << std::endl; return 0;}
  return 0;
}
$ nvcc -o t42 -lcublas
$ ./t42
$ nvcc -o t42 -lcublas -DUSE_PINNED
$ ./t42
oops: 0 : 0
$ nvcc -o t42 -lcublas -DUSE_PINNED -DUSE_WAIT
$ ./t42

Yes, cublas in this case runs on the default stream. This is easy to demonstrate with a profiler. Run the above code with

nvprof --print-gpu-trace …

and it will be evident.
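If you would rather have the GEMM run on your own stream (so that a following cudaMemcpyAsync on that stream is automatically ordered after it), you can attach the stream to the handle with cublasSetStream. A sketch under those assumptions (the function and parameter names here are illustrative, not from the code above):

```cpp
#include <cublas_v2.h>

// Run the GEMM and the following async copy on the same user-created
// stream. Stream ordering then guarantees the copy starts only after
// the GEMM finishes, without relying on default-stream semantics.
void gemm_on_stream(cublasHandle_t handle, cudaStream_t stream,
                    const float *d_A, const float *d_B, float *d_C,
                    float *h_C, int n) {
    float alpha = 1.0f, beta = 0.0f;
    cublasSetStream(handle, stream);  // subsequent cublas calls on this handle use `stream`
    cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_T, n, n, n,
                &alpha, d_A, n, d_B, n, &beta, d_C, n);
    cudaMemcpyAsync(h_C, d_C, sizeof(float) * n * n,
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);    // still required before the host reads h_C
}
```

Note that even then, the host must synchronize on the stream before inspecting h_C, for the same reason discussed above.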