CUDA GPU and CPU synchronisation - how? How to make the CPU wait for all GPU threads without ending the kernel


I could use some help …

I have a simple CUDA ray-tracing kernel producing graphical output. The engine runs this kernel once per rendered frame. Each kernel thread (~1 million threads total) reuses some data between kernel executions, so at the end of ray tracing each frame, every thread has to store a significant amount of data in global memory, which it reads back when rendering the next frame. This creates huge traffic to and from global memory between every two frames, and I would like to avoid it. Additionally, after each frame the CPU needs to update some data in the device's global memory that is used by all threads.

Is there a way to implement my ray tracing algorithm in a manner similar to the one below?

// GPU:

__global__ void ray_trace_kernel()
{
    while (rendering) {
        // ... trace rays for the current frame; per-thread state stays
        // in registers / local memory instead of round-tripping through
        // global memory ...

        barrier_1(); // notify the CPU that it is safe to update data in global memory
        barrier_2(); // wait for the CPU to finish uploading new data to global memory
    }
}

// CPU (once per frame):

barrier_1(); // wait for ALL GPU THREADS to hit barrier_1
update_data_in_global_memory(); // update data in the GPU's global memory
barrier_2(); // release ALL GPU THREADS waiting at barrier_2

How can I implement the barriers? Or maybe someone could advise a different way to tackle this problem? Is it doable at all?

I hope the description is clear enough. I use the CUDA driver API, but a solution using the runtime API would also be helpful. I work with Compute Capability 2.1 (Fermi, GF104).

ANY tips will be much appreciated.