I want to use concurrent execution to calculate C = A + B.
Here A is calculated on the GPU, B for some reason can only be calculated on the CPU, and C = A + B is computed on the GPU.
My code looks like this:
Use cudaHostAlloc() with cudaHostAllocMapped to allocate the memory for B;
A <<< >>> ( … );
C = A + B;
I found that the time difference between time-stamp1 and time-stamp2 is the sum of the execution times of A and B. Have I misunderstood how concurrent execution works?
I have checked asyncAPI, but it does not seem to help.
The video card is a GTX 560 Ti, and the development environment is VS2008 + CUDA 5.0.
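To make the intended ordering concrete, here is a minimal sketch of the pattern described above, not my actual code. The kernels computeA and addAB, the host function computeB, and the array size are hypothetical placeholders. The cudaStreamQuery(0) call is an assumption on my part: on Windows with a WDDM driver, kernel launches may be queued by the driver, and querying the stream is a commonly suggested way to get the queued work submitted before the CPU part starts.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical stand-in for the GPU kernel that produces A.
__global__ void computeA(float *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = 2.0f * i;
}

// C = A + B on the GPU; b is the device-side alias of the mapped host buffer.
__global__ void addAB(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Hypothetical stand-in for the CPU-only computation of B.
void computeB(float *b, int n)
{
    for (int i = 0; i < n; ++i) b[i] = 1.0f;
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    const int threads = 256, blocks = (n + threads - 1) / threads;

    // Set before the context is created so that mapped host memory is usable.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    float *dA = 0, *dC = 0, *hB = 0, *dB = 0;
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dC, bytes);
    cudaHostAlloc((void **)&hB, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&dB, hB, 0);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);                  // time-stamp1
    computeA<<<blocks, threads>>>(dA, n);       // asynchronous launch, control returns to the CPU
    cudaStreamQuery(0);                         // assumption: pushes queued work to the GPU under WDDM
    computeB(hB, n);                            // CPU work that is meant to overlap with computeA
    addAB<<<blocks, threads>>>(dA, dB, dC, n);  // same stream, so it starts after computeA finishes
    cudaEventRecord(stop, 0);                   // time-stamp2
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("time-stamp2 - time-stamp1: %.3f ms\n", ms);

    // Copy C back and check it against the values the placeholders should produce.
    float *hC = (float *)malloc(bytes);
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    bool ok = true;
    for (int i = 0; i < n; ++i)
        if (hC[i] != 2.0f * i + 1.0f) { ok = false; break; }
    printf("result %s\n", ok ? "ok" : "wrong");

    free(hC);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFreeHost(hB);
    cudaFree(dA);
    cudaFree(dC);
    return ok ? 0 : 1;
}
```

If the overlap works, I would expect the measured interval to be close to the longer of the two computations rather than their sum. With placeholders this small the difference may be hard to see, so the workloads would need to be scaled up to something comparable to the real A and B.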