Hello,
Assume I have N streams.
I launch N kernels using those N streams and they write to device memory.
// d_Counter is an array of N device pointers.
for (int i = 0; i < N; i++)
    Kernel_1<<<dimGrid, dimBlock, 0, stream[i]>>>(d_Counter[i]);
I then copy the written values back to the host with cudaMemcpyAsync, using the same N streams.
// h_Counter is page-locked (pinned) host memory.
for (int i = 0; i < N; i++)
    cudaMemcpyAsync((void*)(h_Counter + i), d_Counter[i], sizeof(unsigned int), cudaMemcpyDeviceToHost, stream[i]);
Based on the results, I launch N more kernels using the same N streams.
for (int i = 0; i < N; i++)
    Kernel_2<<<h_Counter[i], dimBlock2, 0, stream[i]>>>();
Then I call a CPU function.
cpuFUNC();
I have two questions.
- Am I right in assuming that Kernel_2 will be launched only after h_Counter[i] holds the value copied from device memory, since operations within a stream are supposed to be synchronous (run in order)? I am confused because I am using cudaMemcpyAsync. Do I need to call cudaStreamSynchronize(stream[i])?
- If I want to overlap some CPU code, cpuFUNC(), with the GPU work, will it overlap only with Kernel_2, or with Kernel_1, cudaMemcpyAsync, and Kernel_2?
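To make both questions concrete, here is a minimal self-contained sketch of the whole sequence as I understand it. The kernel bodies, the value of N, the launch dimensions, and the extra `value` argument to Kernel_1 are placeholders I made up for illustration. The cudaStreamSynchronize call marks the spot the first question is about: the host reads h_Counter[i] when it builds the launch configuration, so my assumption is that the copy has to be finished by then.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 4  // number of streams (assumed value for illustration)

// Placeholder kernel: one thread writes the block count for the next stage.
__global__ void Kernel_1(unsigned int *counter, unsigned int value) {
    if (blockIdx.x == 0 && threadIdx.x == 0) *counter = value;
}

// Placeholder kernel: does no work.
__global__ void Kernel_2() {}

// Placeholder for the host-side work.
void cpuFUNC() { printf("cpuFUNC running on the host\n"); }

int main() {
    cudaStream_t stream[N];
    unsigned int *d_Counter[N];  // array of N device pointers
    unsigned int *h_Counter;     // page-locked host buffer

    cudaMallocHost((void **)&h_Counter, N * sizeof(unsigned int));
    for (int i = 0; i < N; i++) {
        cudaStreamCreate(&stream[i]);
        cudaMalloc((void **)&d_Counter[i], sizeof(unsigned int));
    }

    dim3 dimGrid(1), dimBlock(32), dimBlock2(32);

    // Stage 1: one kernel per stream, each writing its own counter.
    for (int i = 0; i < N; i++)
        Kernel_1<<<dimGrid, dimBlock, 0, stream[i]>>>(d_Counter[i], i + 1);

    // Stage 2: asynchronous copy of each counter, queued behind Kernel_1
    // in the same stream.
    for (int i = 0; i < N; i++)
        cudaMemcpyAsync(h_Counter + i, d_Counter[i], sizeof(unsigned int),
                        cudaMemcpyDeviceToHost, stream[i]);

    // Stage 3: the grid size is read on the host at launch time, so the
    // host waits for stream i before touching h_Counter[i].
    for (int i = 0; i < N; i++) {
        cudaStreamSynchronize(stream[i]);
        Kernel_2<<<h_Counter[i], dimBlock2, 0, stream[i]>>>();
    }

    // By this point the host has already waited for every Kernel_1 and
    // every copy, so only Kernel_2 work can still be in flight here.
    cpuFUNC();

    cudaDeviceSynchronize();  // wait for the remaining GPU work

    for (int i = 0; i < N; i++) {
        cudaFree(d_Counter[i]);
        cudaStreamDestroy(stream[i]);
    }
    cudaFreeHost(h_Counter);
    return 0;
}
```

Error checking is left out to keep the sketch short; in real code each CUDA call's return value should be checked.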
Thank you very much.