Multi-GPU multithreading OK. Putting the data back together, NOT SO MUCH.

Community,

I am running a system that processes a stack of images.
I crop each frame as it loads inside ROUTINE and assign a portion of each image to each of the processors.
I have threaded the system so that each GPU is tied to a single CPU thread.

I start and end the threads as in this example:
for( int device = 0; device < num_gpus; device++ )
{
    // one CPU thread per GPU, each running ROUTINE
    threadID[device] = cutStartThread( ROUTINE );
}
// blocks until every ROUTINE has returned
cutWaitForThreads(threadID, num_gpus);

How can I send the data back to the CPU without exiting ROUTINE?
Is there a way to set a flag once “cudaMemcpy” has completed in a thread, so that the CPU thread knows the data is ready?

I have no clue how to do this.

I don't want to restart the threads and call ROUTINE again for each frame, because it does many memory allocations, which bogs down the run time.
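
To make the question concrete, here is a rough sketch of the kind of pattern being asked about: long-lived worker threads that signal the main thread after every frame instead of exiting. It is only an illustration under assumptions. It uses plain C++ (std::thread, std::mutex, std::condition_variable) instead of cutil, it has no CUDA calls at all, and the names FrameSync, worker and run_stack are made up for this example. The comments mark where the one-time allocations and the blocking cudaMemcpy back to the host would presumably go.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Shared state between the main (CPU) thread and the per-GPU worker threads.
struct FrameSync {
    std::mutex m;
    std::condition_variable cv;
    int frame_ready = -1;  // index of the latest frame published by main
    int slices_done = 0;   // workers that have written their slice back
    bool quit = false;     // set by main once the stack is exhausted
};

// Hypothetical stand-in for ROUTINE: one long-lived thread per GPU.
void worker(int device, int num_gpus, FrameSync& s,
            const std::vector<int>& input, std::vector<int>& output) {
    // One-time setup would go here (select the device, allocate buffers).
    const std::size_t slice = input.size() / static_cast<std::size_t>(num_gpus);
    const std::size_t begin = static_cast<std::size_t>(device) * slice;
    int next_frame = 0;
    for (;;) {
        {   // Sleep until main publishes the next frame or asks us to stop.
            std::unique_lock<std::mutex> lock(s.m);
            s.cv.wait(lock, [&] { return s.quit || s.frame_ready >= next_frame; });
            if (s.quit) break;
        }
        // Stand-in for: copy slice to GPU, run kernel, blocking cudaMemcpy back.
        for (std::size_t i = begin; i < begin + slice; ++i)
            output[i] = input[i] * 2;
        {   // This is the "flag": tell main that this slice is in host memory.
            std::lock_guard<std::mutex> lock(s.m);
            ++s.slices_done;
        }
        s.cv.notify_all();
        ++next_frame;
    }
    // One-time teardown (freeing buffers) would go here.
}

// Main-thread side: start the workers once, then feed them frame after frame.
// Assumes frame_size is divisible by num_gpus. Returns one checksum per frame.
std::vector<long> run_stack(int num_gpus, int num_frames, std::size_t frame_size) {
    FrameSync s;
    std::vector<int> input(frame_size), output(frame_size);
    std::vector<long> frame_sums;
    std::vector<std::thread> threads;
    for (int d = 0; d < num_gpus; ++d)
        threads.emplace_back(worker, d, num_gpus, std::ref(s),
                             std::cref(input), std::ref(output));

    for (int f = 0; f < num_frames; ++f) {
        // "Load" a frame; safe because every worker is waiting at this point.
        for (std::size_t i = 0; i < frame_size; ++i)
            input[i] = f + static_cast<int>(i);
        {
            std::lock_guard<std::mutex> lock(s.m);
            s.frame_ready = f;
            s.slices_done = 0;
        }
        s.cv.notify_all();

        // Wait until every worker has reported its slice for this frame.
        std::unique_lock<std::mutex> lock(s.m);
        s.cv.wait(lock, [&] { return s.slices_done == num_gpus; });
        long sum = 0;  // "put the data back together"
        for (int v : output) sum += v;
        frame_sums.push_back(sum);
    }

    {
        std::lock_guard<std::mutex> lock(s.m);
        s.quit = true;
    }
    s.cv.notify_all();
    for (std::thread& t : threads) t.join();
    return frame_sums;
}
```

Calling run_stack(2, 3, 8) would start two workers once and push three 8-element frames through them. The threads are created and joined a single time, so any per-thread allocations would happen only once, and the main thread learns that a frame is complete through the slices_done counter rather than by waiting for the threads to exit.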

HELP!

Thanks.

BTW, I'm a novice at multithreading.