Running streams in parallel with host functions

I have written a program which has two streams. Both streams operate on some data and write their results, in the form of flags, to host memory.
Here is the generic structure of how I am doing this:

loop {
    AsyncCpy(....HostToDevice,Stream1);
    AsyncCpy(....HostToDevice,Stream2);

    Kernel<<<...,Stream1>>>
    Kernel<<<...,Stream2>>>

    /* Write the results on the host memory */
    AsyncCpy(....DeviceToHost,Stream1);
    AsyncCpy(....DeviceToHost,Stream2);
}

I want to do some work on the CPU once I know that StreamX has finished copying its results back to host memory. At the same time, I don’t want to stop the loop from issuing async operations (memcpy or kernel execution).

Suppose I insert my host functions, say host_ftn1(…) and host_ftn2(…), like this:

loop {
    AsyncCpy(....HostToDevice,Stream1);
    AsyncCpy(....HostToDevice,Stream2);

    Kernel<<<...,Stream1>>>
    Kernel<<<...,Stream2>>>

    /* Write the results on the host memory to be processed by host_ftn1(..) */
    AsyncCpy(....DeviceToHost,Stream1);
    /* Write the results on the host memory to be processed by host_ftn2(..) */
    AsyncCpy(....DeviceToHost,Stream2);

    if(Stream1 results are copied to host)
        host_ftn1(..);
    if(Stream2 results are copied to host)
        host_ftn2(..);
}
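
One way to express the "results are copied to host" test without blocking is to record an event after each device-to-host copy and poll it with cudaEventQuery. The following is only a minimal sketch under assumptions: the kernel setFlags, the buffer names and the sizes are hypothetical stand-ins for the elided parts of the loop, and error checking is mostly omitted. The host work still runs on the thread that issues the GPU work, so this illustrates the check itself rather than a way around the stall.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for Kernel<<<...>>>: flags even inputs.
__global__ void setFlags(const int *in, int *flags, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) flags[i] = (in[i] % 2 == 0) ? 1 : 0;
}

int main() {
    const int n = 1 << 16;
    const int nStreams = 2;
    const size_t bytes = n * sizeof(int);
    int *hIn[nStreams], *hFlags[nStreams], *dIn[nStreams], *dFlags[nStreams];
    cudaStream_t stream[nStreams];
    cudaEvent_t copied[nStreams];

    for (int s = 0; s < nStreams; ++s) {
        cudaStreamCreate(&stream[s]);
        cudaEventCreateWithFlags(&copied[s], cudaEventDisableTiming);
        // Pinned host memory, so the async copies can overlap with other work.
        cudaMallocHost((void **)&hIn[s], bytes);
        cudaMallocHost((void **)&hFlags[s], bytes);
        cudaMalloc((void **)&dIn[s], bytes);
        cudaMalloc((void **)&dFlags[s], bytes);
        for (int i = 0; i < n; ++i) hIn[s][i] = i + s;
    }

    // Enqueue one round of work per stream, then mark the end of the copy back.
    for (int s = 0; s < nStreams; ++s) {
        cudaMemcpyAsync(dIn[s], hIn[s], bytes, cudaMemcpyHostToDevice, stream[s]);
        setFlags<<<(n + 255) / 256, 256, 0, stream[s]>>>(dIn[s], dFlags[s], n);
        cudaMemcpyAsync(hFlags[s], dFlags[s], bytes, cudaMemcpyDeviceToHost, stream[s]);
        cudaEventRecord(copied[s], stream[s]);
    }

    // Poll the events; cudaEventQuery returns immediately in every case.
    bool done[nStreams] = {false, false};
    int remaining = nStreams;
    while (remaining > 0) {
        for (int s = 0; s < nStreams; ++s) {
            if (done[s]) continue;
            cudaError_t st = cudaEventQuery(copied[s]);
            if (st == cudaErrorNotReady) continue;  // copy still in flight
            if (st == cudaSuccess)
                printf("stream %d: results on host, flags[0] = %d\n", s, hFlags[s][0]);
            else
                printf("stream %d: %s\n", s, cudaGetErrorString(st));
            done[s] = true;
            --remaining;
        }
    }

    for (int s = 0; s < nStreams; ++s) {
        cudaFree(dIn[s]);
        cudaFree(dFlags[s]);
        cudaFreeHost(hIn[s]);
        cudaFreeHost(hFlags[s]);
        cudaEventDestroy(copied[s]);
        cudaStreamDestroy(stream[s]);
    }
    return 0;
}
```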

With this structure, the loop stalls until host_ftn1 and host_ftn2 have finished executing. I don’t want the GPU work, i.e. AsyncCpy(…) and Kernel<<<…,StreamX>>>, to stop being issued while the CPU is executing the host functions.

Is there any solution or approach to this problem?

Investigate stream callbacks.

They are documented in the CUDA C Programming Guide, and there is sample code as well.
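
To make the suggestion concrete, here is a minimal sketch of the loop with a callback enqueued after each device-to-host copy. It is an illustration under assumptions, not code from the guide or the samples: setFlags, onResultsCopied, the buffer names, the sizes and the iteration count are all hypothetical, and error checking is omitted. The callback receives the host flag buffer of its stream through userData.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for Kernel<<<...>>>: flags even inputs.
__global__ void setFlags(const int *in, int *flags, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) flags[i] = (in[i] % 2 == 0) ? 1 : 0;
}

// Hypothetical stand-in for host_ftn1/host_ftn2. It runs on the host once all
// work enqueued before it in the stream has completed, and it must not call
// any CUDA API function.
void CUDART_CB onResultsCopied(cudaStream_t stream, cudaError_t status, void *userData) {
    (void)stream;
    if (status != cudaSuccess) return;
    const int *flags = static_cast<const int *>(userData);
    printf("callback: flags[0] = %d\n", flags[0]);
}

int main() {
    const int n = 1 << 16;
    const int nStreams = 2;
    const size_t bytes = n * sizeof(int);
    int *hIn[nStreams], *hFlags[nStreams], *dIn[nStreams], *dFlags[nStreams];
    cudaStream_t stream[nStreams];

    for (int s = 0; s < nStreams; ++s) {
        cudaStreamCreate(&stream[s]);
        cudaMallocHost((void **)&hIn[s], bytes);     // pinned host buffers
        cudaMallocHost((void **)&hFlags[s], bytes);
        cudaMalloc((void **)&dIn[s], bytes);
        cudaMalloc((void **)&dFlags[s], bytes);
        for (int i = 0; i < n; ++i) hIn[s][i] = i + s;
    }

    // The issuing thread only enqueues work and never waits inside the loop.
    for (int iter = 0; iter < 3; ++iter) {
        for (int s = 0; s < nStreams; ++s) {
            cudaMemcpyAsync(dIn[s], hIn[s], bytes, cudaMemcpyHostToDevice, stream[s]);
            setFlags<<<(n + 255) / 256, 256, 0, stream[s]>>>(dIn[s], dFlags[s], n);
            cudaMemcpyAsync(hFlags[s], dFlags[s], bytes, cudaMemcpyDeviceToHost, stream[s]);
            // The last argument (flags) must be 0.
            cudaStreamAddCallback(stream[s], onResultsCopied, hFlags[s], 0);
        }
    }

    cudaDeviceSynchronize();  // wait for all streams before cleaning up

    for (int s = 0; s < nStreams; ++s) {
        cudaFree(dIn[s]);
        cudaFree(dFlags[s]);
        cudaFreeHost(hIn[s]);
        cudaFreeHost(hFlags[s]);
        cudaStreamDestroy(stream[s]);
    }
    return 0;
}
```

A callback holds back later work in its own stream until it returns, so a long host function delays that stream while the other stream keeps running. Newer toolkits also provide cudaLaunchHostFunc, which takes a function of the form void fn(void *userData).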

I have read about stream callbacks, and it seems they will work in my case. But there is one problem.

The cudaStreamAddCallback signature lets me pass only a single data pointer:

__host__ cudaError_t cudaStreamAddCallback ( cudaStream_t stream, cudaStreamCallback_t callback, void* userData, unsigned int flags )

But in my case there are various host variables which I want to pass to the callback function. One solution could be to declare all those variables global, but that would make my code messy.

Is there any solution?

userData could be a pointer to a struct of pointers to whatever data you want.

Study some of the CUDA sample codes for callbacks, or just study some code that uses pthreads; pthread_create passes its argument to the thread function through a single void* in the same way.
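
As a rough illustration of the pointer-to-struct idea, the sketch below bundles several host variables into one struct per stream and passes its address as userData. Everything named here (CallbackArgs, countFlags, setFlags, the fields and sizes) is hypothetical and chosen only for the example. Error checking is omitted, and each struct has to stay alive until its callback has run.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: flags even inputs.
__global__ void setFlags(const int *in, int *flags, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) flags[i] = (in[i] % 2 == 0) ? 1 : 0;
}

// Hypothetical bundle of everything the host function needs.
struct CallbackArgs {
    const int *flags;  // host buffer filled by the device-to-host copy
    int n;             // number of flags
    int streamId;      // which stream this belongs to
    int *result;       // where the callback stores its output
};

// Counts the set flags; must not call any CUDA API function.
void CUDART_CB countFlags(cudaStream_t stream, cudaError_t status, void *userData) {
    (void)stream;
    CallbackArgs *args = static_cast<CallbackArgs *>(userData);
    if (status != cudaSuccess) return;
    int count = 0;
    for (int i = 0; i < args->n; ++i) count += args->flags[i];
    *args->result = count;
}

int main() {
    const int n = 1 << 16;
    const int nStreams = 2;
    const size_t bytes = n * sizeof(int);
    int *hIn[nStreams], *hFlags[nStreams], *dIn[nStreams], *dFlags[nStreams];
    int results[nStreams] = {0, 0};
    CallbackArgs args[nStreams];  // outlives the callbacks (see the sync below)
    cudaStream_t stream[nStreams];

    for (int s = 0; s < nStreams; ++s) {
        cudaStreamCreate(&stream[s]);
        cudaMallocHost((void **)&hIn[s], bytes);
        cudaMallocHost((void **)&hFlags[s], bytes);
        cudaMalloc((void **)&dIn[s], bytes);
        cudaMalloc((void **)&dFlags[s], bytes);
        for (int i = 0; i < n; ++i) hIn[s][i] = i + s;
        args[s] = CallbackArgs{hFlags[s], n, s, &results[s]};
    }

    for (int s = 0; s < nStreams; ++s) {
        cudaMemcpyAsync(dIn[s], hIn[s], bytes, cudaMemcpyHostToDevice, stream[s]);
        setFlags<<<(n + 255) / 256, 256, 0, stream[s]>>>(dIn[s], dFlags[s], n);
        cudaMemcpyAsync(hFlags[s], dFlags[s], bytes, cudaMemcpyDeviceToHost, stream[s]);
        cudaStreamAddCallback(stream[s], countFlags, &args[s], 0);
    }

    cudaDeviceSynchronize();  // all copies and callbacks have finished here
    for (int s = 0; s < nStreams; ++s)
        printf("stream %d: %d flags set\n", args[s].streamId, results[s]);

    for (int s = 0; s < nStreams; ++s) {
        cudaFree(dIn[s]);
        cudaFree(dFlags[s]);
        cudaFreeHost(hIn[s]);
        cudaFreeHost(hFlags[s]);
        cudaStreamDestroy(stream[s]);
    }
    return 0;
}
```

Keeping one struct per stream avoids globals and lets the same callback function serve both streams.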

Pointer-to-struct will work.

However, the callbacks section of the CUDA C Programming Guide is sparsely documented, and only one sample code is given.

I have tested stream callbacks for two streams in my program and it worked.
Thank you :)