This doesn’t look right to me:
float* device_A, device_B, device_C;
That declares device_A as a pointer to float, but device_B and device_C as ordinary float variables. If that is your real code, I’m not sure how you get it to compile.
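For three pointers, each name needs its own asterisk:

float *device_A, *device_B, *device_C;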
I assume that somewhere you have replaced the default allocator for cv::Mat with a pinned allocator; otherwise your observation should not occur, because a cudaMemcpyAsync operation becomes synchronous when the host side of the transfer is pageable memory.
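For reference, one way to do that is shown in the sketch below. This assumes OpenCV was built with CUDA support and that your Mat objects are created after this call; it is not taken from your code.

#include <opencv2/core.hpp>
#include <opencv2/core/cuda.hpp>

int main(){
  // Make subsequently-created cv::Mat objects use page-locked (pinned) host memory,
  // so that cudaMemcpyAsync to/from Mat::data can actually be asynchronous.
  cv::Mat::setDefaultAllocator(cv::cuda::HostMem::getAllocator(cv::cuda::HostMem::PAGE_LOCKED));
  cv::Mat C(1024, 1024, CV_32F);   // C.data now points to pinned memory
  return 0;
}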
Anyway, with the above provisos, the problem is not that the cudaMemcpyAsync operation is copying the wrong data; the problem is that the host code that inspects the data to determine whether it is correct is not waiting for the cudaMemcpyAsync operation to complete. Remember, the copy is asynchronous (with the above provisos): the CPU thread issues it and immediately moves on, without waiting for the cudaMemcpyAsync to complete.
To prove this, put a cudaDeviceSynchronize or cudaStreamSynchronize after the cudaMemcpyAsync, before you look at the data in C.data in host code.
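In terms of your (assumed) variable names, the fix might look like this; stream and num_bytes are placeholders for whatever you actually use:

cudaMemcpyAsync(C.data, device_C, num_bytes, cudaMemcpyDeviceToHost, stream);
cudaStreamSynchronize(stream);   // don't read C.data on the host before this returns
// now it is safe to inspect C.data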
Here’s an example demonstrating these ideas:
$ cat t42.cu
#include <cublas_v2.h>
#include <iostream>
const int ds = 1024;
int main(){
  float *d_A, *d_B, *d_C, *h_A, *h_B, *h_C;
  cudaMalloc(&d_A, ds*ds*sizeof(float));
  cudaMalloc(&d_B, ds*ds*sizeof(float));
  cudaMalloc(&d_C, ds*ds*sizeof(float));
#ifdef USE_PINNED
  cudaHostAlloc(&h_A, ds*ds*sizeof(float), cudaHostAllocDefault);
  cudaHostAlloc(&h_B, ds*ds*sizeof(float), cudaHostAllocDefault);
  cudaHostAlloc(&h_C, ds*ds*sizeof(float), cudaHostAllocDefault);
#else
  h_A = (float *)malloc(ds*ds*sizeof(float));
  h_B = (float *)malloc(ds*ds*sizeof(float));
  h_C = (float *)malloc(ds*ds*sizeof(float));
#endif
  cublasHandle_t h;
  cudaStream_t s;
  cublasCreate_v2(&h);
  for (int i = 0; i < ds*ds; i++){
    h_A[i] = 1.0f;
    h_B[i] = 1.0f;
    h_C[i] = 0.0f;}
  cudaStreamCreate(&s);
  // Copy data from host to device
  cudaMemcpy(d_A, h_A, sizeof(float) * ds * ds, cudaMemcpyHostToDevice);
  cudaMemcpy(d_B, h_B, sizeof(float) * ds * ds, cudaMemcpyHostToDevice);
  // Matrix multiplication
  float alpha = 1.0f, beta = 0.0f;
  cublasSgemm(h, CUBLAS_OP_T, CUBLAS_OP_T, ds, ds, ds, &alpha, d_A, ds, d_B, ds, &beta, d_C, ds);
  // Copy data from device to host asynchronously on a pre-defined stream.
  cudaMemcpyAsync(h_C, d_C, sizeof(float) * ds * ds, cudaMemcpyDeviceToHost, s);
#ifdef USE_WAIT
  cudaStreamSynchronize(s);
#endif
  for (int i = 0; i < ds*ds; i++)
    if (h_C[i] != (float)ds) {std::cout << "oops: " << i << " : " << h_C[i] << std::endl; return 0;}
  return 0;
}
$ nvcc -o t42 t42.cu -lcublas
$ ./t42
$ nvcc -o t42 t42.cu -lcublas -DUSE_PINNED
$ ./t42
oops: 0 : 0
$ nvcc -o t42 t42.cu -lcublas -DUSE_PINNED -DUSE_WAIT
$ ./t42
$
Yes, cublas in this case runs on the default stream. This is easy to demonstrate with a profiler. Run the above code with
nvprof --print-gpu-trace …
and it will be evident.
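As an aside, if you wanted the cublas work to be issued into your non-default stream instead, you would associate the handle with the stream before the gemm call. A sketch, using the variable names from my example above:

cublasSetStream(h, s);   // subsequent cublas calls on handle h are issued into stream s
cublasSgemm(h, CUBLAS_OP_T, CUBLAS_OP_T, ds, ds, ds, &alpha, d_A, ds, d_B, ds, &beta, d_C, ds);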