Asynchronous performance between CPU and GPU

Hi all,

I’m currently working on the performance of an asynchronous mechanism between 1 GPU and 1 CPU.

The goal is to improve overall execution time by keeping the CPU busy while the GPU is working.

Here is how it works:
* the CPU sends data_1 for a kernel to the GPU
* the CPU asks the GPU to execute this kernel
* while the kernel is processing data_1, the CPU sends data_2 to be processed by the kernel.

The problem is that I get low performance.
Indeed, the total execution time is almost the same as the data-transfer time + the kernel-execution time.

(If data transfer and kernel execution really overlapped, shouldn’t the total time be shorter?)

Do you have any suggestions for achieving truly asynchronous execution?
Has anyone run tests in this scope (1 GPU / 1 CPU)? Did you come to the same results?
Is this conclusion correct?


Could you post some code?

Currently, to implement my asynchronous program, I initialize two arrays (h_A and h_B), each containing N matrices (of size MxM).

After that, I build an asynchronous (overlapping) system which:

  • loads data from CPU to GPU: only one matrix A (MxM) (a piece of h_A) and one matrix B (MxM) (a piece of h_B)

  • executes a kernel on the GPU (for those matrices)

  • returns the result to the CPU

I execute this system “M” times.

Here is my code:

// setup execution parameters
dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
dim3 grid(N / threads.x, N / threads.y);

cudaStream_t stream[nb_iter];
cudaEvent_t start;
cudaEvent_t stop;
float elapsed;

for (int i = 0; i < nb_iter; i++) {
    cudaStreamCreate(&stream[i]);
}
cudaEventCreate(&start);
cudaEventCreate(&stop);

// allocate host memory
// NOTE: for cudaMemcpyAsync to overlap with kernel execution, these
// buffers must be page-locked (cudaMallocHost), not plain malloc'd
float* h_A = tabMatrixAlloc( N, nb_iter );
float* h_B = tabMatrixAlloc( N, nb_iter );
float* h_C = tabMatrixAlloc( N, nb_iter );

// allocate device memory
float *d_A, *d_B, *d_C;
cudaMalloc((void**) &d_A, getMatrixSize(N)*nb_iter);
cudaMalloc((void**) &d_B, getMatrixSize(N)*nb_iter);
cudaMalloc((void**) &d_C, getMatrixSize(N)*nb_iter);

tabMatrixInit(h_A, N, nb_iter);
tabMatrixInit(h_B, N, nb_iter);

cudaEventRecord(start, 0);

// breadth-first issuing: all H2D copies, then all kernels, then all D2H copies
for (int var = 0; var < nb_iter; ++var) {
    cudaMemcpyAsync(getMatrix( d_A, N, var), getMatrix( h_A, N, var), getMatrixSize(N), cudaMemcpyHostToDevice, stream[var]);
    cudaMemcpyAsync(getMatrix( d_B, N, var), getMatrix( h_B, N, var), getMatrixSize(N), cudaMemcpyHostToDevice, stream[var]);
}

for (int var = 0; var < nb_iter; ++var) {
    matrixMul<<<grid, threads, 0, stream[var]>>>(getMatrix( d_C, N, var), getMatrix( d_A, N, var), getMatrix( d_B, N, var), N);
}

for (int var = 0; var < nb_iter; ++var) {
    cudaMemcpyAsync(getMatrix( h_C, N, var), getMatrix( d_C, N, var), getMatrixSize(N), cudaMemcpyDeviceToHost, stream[var]);
}

cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsed, start, stop);

// clean up memory
cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
cudaEventDestroy(start); cudaEventDestroy(stop);

// destroy streams
for (int i = 0; i < nb_iter; i++) { cudaStreamDestroy(stream[i]); }

// reset the GPU
cudaDeviceReset();

return elapsed;


getMatrix(): returns one matrix (a slice) from the big array

getMatrixSize(): returns the size of one matrix
check out slides 21 and 22…
If you are working on a Fermi device, this could be your problem. It is easy to test by changing to depth-first issuing. Streams and performance optimization can be a tricky topic…
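For anyone reading along: the posted code issues breadth-first (a loop of all H2D copies, then a loop of all kernels, then a loop of all D2H copies). Depth-first issuing instead queues the whole copy → kernel → copy chain for one stream before moving on to the next. A sketch, reusing the names from the code above:

```cuda
// Depth-first issuing: one stream's entire pipeline is enqueued
// before the next stream's work. On Fermi, the two orders can
// behave very differently because all work funnels through a
// single hardware queue, creating false dependencies.
for (int var = 0; var < nb_iter; ++var) {
    cudaMemcpyAsync(getMatrix(d_A, N, var), getMatrix(h_A, N, var),
                    getMatrixSize(N), cudaMemcpyHostToDevice, stream[var]);
    cudaMemcpyAsync(getMatrix(d_B, N, var), getMatrix(h_B, N, var),
                    getMatrixSize(N), cudaMemcpyHostToDevice, stream[var]);
    matrixMul<<<grid, threads, 0, stream[var]>>>(
        getMatrix(d_C, N, var), getMatrix(d_A, N, var),
        getMatrix(d_B, N, var), N);
    cudaMemcpyAsync(getMatrix(h_C, N, var), getMatrix(d_C, N, var),
                    getMatrixSize(N), cudaMemcpyDeviceToHost, stream[var]);
}
```

Timing both orders with the same cudaEvent setup is a quick way to see which one your device prefers.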