What is the recommended way to signal secondary threads?

Suppose I have a thread that is executing a 10-stage CUDA pipeline. After each stage I want to copy some memory from the GPU to the CPU. My original intent was to use a secondary thread for the copies: after each stage, a CUDA event would be triggered to let the secondary thread know that it was safe to copy the data back. But it appears that events don’t work that way (if I am understanding them correctly, at least).

My second idea was to use cudaLaunchHostFunc() from the main GPU thread to submit a function to a secondary thread that copies the data back instead. But the CUDA Programming Guide tells you not to call any CUDA APIs from within that function, or you risk it waiting on itself (deadlock).

So now I am at a bit of a loss. What is the best way to trigger CPU code that needs to use the API for things like copying while leaving the GPU thread to execute as fast as it can?

Thanks

cudaEvents and/or host functions are the way to go. At which point do you face difficulties? What does your current code look like?

I can’t share actual code unfortunately since it’s code for my employer and confidential (which makes getting help generally a lot harder). But I can try some pseudo code. Each stage follows the same structure of:

func pipeline_stage():
    pre_process()
    process()
    post_process()

Where the process function is doing all the heavy lifting with regards to CUDA, and pre/post_process are doing most of the copying to and from the pipeline stage. Right now we do standard synchronous memory transfers, which I know we need to change to cudaMemcpyAsync at some point. But since we also do CPU work with the data, especially in the post_process() function, we need some way to trigger CPU code as well as the memory transfer.

That was why I was hoping to have a post_process() function that we could run on a secondary thread via cudaLaunchHostFunc(), one that performs the memory transfer and operates on the data before returning it to the customer. But that would require calling cudaMemcpyAsync() from within the cudaLaunchHostFunc() callback, which the guide tells you not to do.

With pinned memory and asynchronous memory copies, you can start the transfers from the “main” process thread. Then record an event after the copy, or launch a host function. It depends on whether you want to do the actual post-processing work in the internal host-function thread.

Here is some pseudo code to give you some idea:

int* h_ptr; int* d_ptr; size_t bytes;   // h_ptr must be pinned (cudaMallocHost)
kernel<<<grid, block, 0, stream>>>(d_ptr);
cudaMemcpyAsync(h_ptr, d_ptr, bytes, cudaMemcpyDeviceToHost, stream);

// option 1
cudaEventRecord(event, stream);          // event created earlier with cudaEventCreate
// post-processing thread can do cudaEventSynchronize(event) to wait until the transfer to host is complete

// option 2
void CUDART_CB callbackfunc(void* args) { /* ... */ }
cudaLaunchHostFunc(stream, callbackfunc, args);
// host func will begin after the transfer to host is complete. Either use the callback to
// notify the post-processing thread, or do the post-processing within the callback
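
To make the option 1 side concrete, the waiting thread could look something like this (a minimal sketch; do_cpu_work is a hypothetical placeholder for your actual post-processing):

// post-processing thread, option 1 (sketch)
void post_process_worker(cudaEvent_t event, int* h_ptr, size_t n)
{
    cudaEventSynchronize(event);   // blocks this thread until the D2H copy has finished
    do_cpu_work(h_ptr, n);         // h_ptr now holds valid data (do_cpu_work is hypothetical)
}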

Thanks for the examples. Option 2 immediately stands out as the easier of the two options to manage thread-wise. But I have a question about option 1. The chances are that the post-processing thread will already exist and be waiting to execute, and the event variable will already exist, probably as a member variable of that pipeline stage. That means there is a chance that it reaches cudaEventSynchronize(event) before the process function reaches cudaEventRecord(event, stream). Will the post-processing thread just sit idle until cudaEventRecord(event, stream) has been hit? Might be a dumb question, but I just want to make sure.

No, if the second thread reaches cudaEventSynchronize before the event is recorded, the old status of the event will be used.

You can use option 2 to notify an already existing thread. For example, let the second thread wait on some condition variable (or busy-wait), then, in the host function callback, send a signal to the second thread to stop waiting.
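
A minimal sketch of that pattern (names are illustrative; the callback makes no CUDA API calls, so it stays within the Programming Guide’s restriction):

#include <cuda_runtime.h>
#include <condition_variable>
#include <mutex>

std::mutex m;
std::condition_variable cv;
bool data_ready = false;

// registered via cudaLaunchHostFunc; runs on CUDA's internal thread
// and only signals the worker -- no CUDA API calls in here
void CUDART_CB ready_callback(void*)
{
    { std::lock_guard<std::mutex> lock(m); data_ready = true; }
    cv.notify_one();
}

// pre-existing post-processing thread: sleeps until signalled
void worker()
{
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, []{ return data_ready; });
    // safe to touch the host buffer here; the D2H copy completed before the callback ran
}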

I see. Is there a way to prime an event with a “wait here” condition? Based on my current understanding at least, if I was going to use option 1, I think it would be best if every pipeline_stage started with a “wait here” event, since nothing outside the pipeline_stage should need access to it. If not, it’s not the end of the world. There’s always option 2.

A job queue was actually my stage 2 optimisation for post_process: the post_process function would just submit the work to a job queue, and a thread pool would handle processing it all.

One difference between the options is that cudaEventRecord does not automatically block the stream in which the event was recorded.

Whereas cudaLaunchHostFunc will block the stream until the callback returns.
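
For example (slow_callback and kernel_B are just illustrative names), work enqueued after the host function waits behind it:

cudaLaunchHostFunc(stream, slow_callback, nullptr);
kernel_B<<<grid, block, 0, stream>>>(d_ptr);  // will not start until slow_callback has returned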

However, both options are flexible:

  • Option 1: You could use a second event, which you use in the opposite direction to signal to the stream that it can continue (the stream would have to explicitly wait for this other event); see the sketch after this list.
  • Option 2: As striker mentioned, you can signal from within the callback to a waiting thread and return from the callback at once, before the second thread finishes.
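
A sketch of the second-event idea (all names here are assumptions; note that the host code must ensure cudaEventRecord has already happened before cudaStreamWaitEvent is enqueued, which is the same ordering pitfall mentioned below):

// worker thread, once post-processing is done:
cudaEventRecord(done_event, signal_stream);   // signal_stream is otherwise idle, so the event completes immediately
record_issued.store(true);                    // hypothetical std::atomic<bool> host-side flag

// main thread, before enqueueing the next stage:
while (!record_issued.load()) {}              // make sure the record was issued first
cudaStreamWaitEvent(stream, done_event, 0);   // the stream now waits for the worker
next_stage_kernel<<<grid, block, 0, stream>>>(d_ptr);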

There isn’t. The warning given by striker159 needs to be kept in mind: if you attempt an event query or synchronization on an event not yet recorded, you’re not likely to get expected behavior.

The event will be recorded “somewhere”, i.e. in some thread. Have that thread use whatever inter-thread communication you would like to implement the “wait here” in host code; that’s not a CUDA question. After the “wait here” elapses, you can go on to sampling the event status with cudaEventQuery in a polling loop, if you find that attractive.

Initialization:
std::atomic<int> wait_here{1};

Recording thread:
cudaEventRecord(event, stream);
wait_here = 0;

Polling thread:
while (wait_here) {};
while (cudaEventQuery(event) != cudaSuccess) {};
do_post_processing_work();

That additional context is very useful. With that I think I can make a proper design decision. Thanks :)