Asynchronous performance between CPU and GPU

Hi all,

I’m currently working on the performance of the asynchronous mechanism between 1 GPU and 1 CPU.

The goal is to reduce total execution time by keeping the CPU busy (preparing and sending data) while the GPU is working.

Here is how it works:
* The CPU sends data_1 for a kernel to the GPU.
* The CPU asks the GPU to execute this kernel.
* While the kernel is processing data_1, the CPU sends data_2 so it can be processed by the kernel next (see the sketch after this list).
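
Schematically, here is the pattern I am aiming for, with placeholder names (kernel, h_data1/h_data2, n_elems and the launch configuration are just stand-ins, not my real code):

#include <cuda_runtime.h>

// Placeholder kernel standing in for the real one.
__global__ void kernel(float *data) { /* ... process one batch ... */ }

void two_batch_overlap(float *h_data1, float *h_data2, size_t n_elems)
{
    size_t bytes = n_elems * sizeof(float);
    float *d_data1, *d_data2;
    cudaMalloc(&d_data1, bytes);
    cudaMalloc(&d_data2, bytes);

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // While the kernel works on data_1 in stream s0, the copy of data_2 is
    // issued in stream s1, so the copy engine and the SMs run at the same time.
    cudaMemcpyAsync(d_data1, h_data1, bytes, cudaMemcpyHostToDevice, s0); // send data_1
    kernel<<<(n_elems + 255) / 256, 256, 0, s0>>>(d_data1);               // GPU processes data_1
    cudaMemcpyAsync(d_data2, h_data2, bytes, cudaMemcpyHostToDevice, s1); // meanwhile, send data_2
    kernel<<<(n_elems + 255) / 256, 256, 0, s1>>>(d_data2);               // then GPU processes data_2

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(d_data1);
    cudaFree(d_data2);
}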

The problem is that I get poor performance.
Indeed, the total execution time is almost the same as the data transfer time plus the kernel execution time.

(If the data transfers and the kernel execution were truly overlapped, the total time should be shorter, shouldn't it?)

Do you have any suggestions for achieving real asynchronous execution?
Has anyone run tests with this setup (1 GPU / 1 CPU)? Did you get the same results?
Is this conclusion correct?

Thanks

Could you post some code?

Currently, to build my asynchronous program, I initialize two arrays (h_A and h_B), each holding nb_iter matrices of size N×N.

After that, I set up an asynchronous (overlapping) pipeline which:

  • copies data from CPU to GPU: one matrix A (N×N, a slice of h_A) and one matrix B (N×N, a slice of h_B)

  • runs a kernel on the GPU for that pair of matrices

  • copies the result back to the CPU

I run this pipeline nb_iter times.

Here is my code:

// setup execution parameters
dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
dim3 grid(N / threads.x, N / threads.y);

cudaStream_t stream[nb_iter];
cudaEvent_t start;
cudaEvent_t stop;
float elapsed;

// one stream per matrix pair, plus events for timing
for (int i = 0; i < nb_iter; i++) {
    cudaStreamCreate(&stream[i]);
}
cudaEventCreate(&start);
cudaEventCreate(&stop);

// allocate host memory
float *h_A = tabMatrixAlloc(N, nb_iter);
float *h_B = tabMatrixAlloc(N, nb_iter);
float *h_C = tabMatrixAlloc(N, nb_iter);

// allocate device memory
float *d_A, *d_B, *d_C;
cudaMalloc((void**) &d_A, getMatrixSize(N) * nb_iter);
cudaMalloc((void**) &d_B, getMatrixSize(N) * nb_iter);
cudaMalloc((void**) &d_C, getMatrixSize(N) * nb_iter);

tabMatrixInit(h_A, N, nb_iter);
tabMatrixInit(h_B, N, nb_iter);

cudaEventRecord(start);

// breadth-first issuing: all H2D copies, then all kernels, then all D2H copies
for (int var = 0; var < nb_iter; ++var) {
    cudaMemcpyAsync(getMatrix(d_A, N, var), getMatrix(h_A, N, var), getMatrixSize(N), cudaMemcpyHostToDevice, stream[var]);
    cudaMemcpyAsync(getMatrix(d_B, N, var), getMatrix(h_B, N, var), getMatrixSize(N), cudaMemcpyHostToDevice, stream[var]);
}

for (int var = 0; var < nb_iter; ++var) {
    matrixMul<<<grid, threads, 0, stream[var]>>>(getMatrix(d_C, N, var), getMatrix(d_A, N, var), getMatrix(d_B, N, var), N);
}

for (int var = 0; var < nb_iter; ++var) {
    cudaMemcpyAsync(getMatrix(h_C, N, var), getMatrix(d_C, N, var), getMatrixSize(N), cudaMemcpyDeviceToHost, stream[var]);
}

cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsed, start, stop);

// clean up memory
free(h_A); free(h_B); free(h_C);
cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
cudaEventDestroy(start); cudaEventDestroy(stop);

// destroy streams
for (int i = 0; i < nb_iter; i++) { cudaStreamDestroy(stream[i]); }

// reset the GPU
cudaDeviceReset();

return elapsed;

FYI:

getMatrix(): returns a pointer to the var-th N×N matrix inside one of the flat arrays

getMatrixSize(): returns the size in bytes of one N×N matrix
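
In essence they just compute pointer offsets and byte sizes for the flat arrays; simplified versions would look something like this (illustrative only, the real implementations may differ):

// Simplified/illustrative versions of the helpers used above.
size_t getMatrixSize(int n) {
    // byte size of one n x n matrix of floats
    return (size_t)n * n * sizeof(float);
}

float* getMatrix(float *base, int n, int var) {
    // pointer to the var-th n x n matrix inside a flat array of matrices
    return base + (size_t)var * n * n;
}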

Check out slides 21 and 22…
If you are working on a Fermi device, this could be your problem. It is easy to test by changing to depth-first issuing. Streams and performance optimization can be a tricky topic…
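
For instance, depth-first issuing means issuing the whole copy-in / kernel / copy-out sequence for one stream before moving on to the next stream. A sketch of that change, reusing the arrays, streams and helpers from the code above (an illustration of the suggestion, not tested code):

// Depth-first issuing: per stream, issue the H2D copies, the kernel and the
// D2H copy back-to-back before touching the next stream.
for (int var = 0; var < nb_iter; ++var) {
    cudaMemcpyAsync(getMatrix(d_A, N, var), getMatrix(h_A, N, var), getMatrixSize(N), cudaMemcpyHostToDevice, stream[var]);
    cudaMemcpyAsync(getMatrix(d_B, N, var), getMatrix(h_B, N, var), getMatrixSize(N), cudaMemcpyHostToDevice, stream[var]);
    matrixMul<<<grid, threads, 0, stream[var]>>>(getMatrix(d_C, N, var), getMatrix(d_A, N, var), getMatrix(d_B, N, var), N);
    cudaMemcpyAsync(getMatrix(h_C, N, var), getMatrix(d_C, N, var), getMatrixSize(N), cudaMemcpyDeviceToHost, stream[var]);
}

Also worth checking: cudaMemcpyAsync can only overlap with kernel execution when the host buffers are page-locked (allocated with cudaMallocHost or cudaHostAlloc rather than plain malloc), so it may be worth verifying what tabMatrixAlloc does.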