Is the driver API thread safe? Specifically, the cuStreamSynchronize and cuEventSynchronize functions

This really seems like a chicken-and-egg problem.

As long as you only have one memcpy or kernel running at any given time, everything is simple, right? You can create a worker thread with an asynchronous job queue for each GPU in the system and just post jobs to it that run synchronously.
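Here is roughly what I mean for the simple case, as a rough sketch assuming pthreads: each worker owns its GPU's context and drains a plain job queue, running one job at a time. (job_t, job_queue_t and gpu_worker are just names I made up, and cuInit(0) is assumed to have been called already.)

```c
#include <cuda.h>
#include <pthread.h>
#include <stdlib.h>

typedef struct job {
    void (*run)(CUcontext ctx, void *arg);   /* synchronous GPU work */
    void *arg;
    struct job *next;
} job_t;

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  nonempty;
    job_t *head, *tail;
    int device;                              /* which GPU this worker owns */
} job_queue_t;

static void *gpu_worker(void *p)
{
    job_queue_t *q = p;
    CUdevice  dev;
    CUcontext ctx;

    cuDeviceGet(&dev, q->device);
    cuCtxCreate(&ctx, 0, dev);               /* context is current on this thread only */

    for (;;) {
        pthread_mutex_lock(&q->lock);
        while (q->head == NULL)
            pthread_cond_wait(&q->nonempty, &q->lock);
        job_t *j = q->head;                   /* jobs are malloc'ed by the posting thread */
        q->head = j->next;
        if (q->head == NULL)
            q->tail = NULL;
        pthread_mutex_unlock(&q->lock);

        j->run(ctx, j->arg);                  /* memcpy / launch / sync, one at a time */
        free(j);
    }
    return NULL;
}
```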

What if you want to monitor for the completion of a stream, an event, or a synchronous GPU call (say, because you want to call a CPU function immediately after the GPU is done with the data), while retaining the ability to start uploading new data to the GPU (say, new data arriving from the network/hard drive/etc. that wasn't available before) or to queue up the next kernel launch before the previously queued request completes? Even ignoring overhead for the moment, think about it: is this even possible?
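To make that concrete, here is the single-threaded version of one batch as I picture it (a sketch only; cpu_postprocess, the buffers and the kernel are placeholders set up elsewhere). The cuEventSynchronize that lets me run the CPU function right after the GPU finishes is also what stops the only thread that could be queuing the next upload or launch:

```c
#include <cuda.h>
#include <stddef.h>

void cpu_postprocess(void *h_out);   /* hypothetical CPU follow-up work */

/* Context is assumed current on the calling thread; kernel parameters are
 * assumed to have been set with cuFuncSetBlockShape/cuParamSet* already. */
void process_one_batch(CUfunction kernel, CUdeviceptr d_in,
                       const void *h_in, void *h_out, size_t bytes)
{
    CUstream stream;
    CUevent  done;

    cuStreamCreate(&stream, 0);
    cuEventCreate(&done, CU_EVENT_DEFAULT);

    cuMemcpyHtoDAsync(d_in, h_in, bytes, stream);   /* upload batch N  */
    cuLaunchGridAsync(kernel, 1, 1, stream);        /* process batch N */
    cuEventRecord(done, stream);

    /* While this blocks, nothing can start uploading batch N+1 or queue the
     * next kernel launch. That is the problem. */
    cuEventSynchronize(done);
    cpu_postprocess(h_out);

    cuEventDestroy(done);
    cuStreamDestroy(stream);
}
```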

I have a lot of work that needs to be done on the CPU, so I think any kind of polling algorithm would either vastly increase latency or burn too many CPU cycles, especially with all the user <-> kernel context switches involved. You would have to sleep 100-1000 times per second to give the CPU time back without resorting to some icky non-preemptive multitasking scheme, and all of that is multiplied by up to 4 GPUs per system.
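For reference, the kind of polling loop I am trying to avoid would look roughly like this (wait_by_polling is just an illustrative name):

```c
#include <cuda.h>
#include <unistd.h>

/* Probe the event at roughly 1 kHz; every cuEventQuery is a user <-> kernel
 * round trip, and the sleep adds up to a millisecond of latency per wait,
 * multiplied by every GPU being watched. */
static void wait_by_polling(CUevent done)
{
    while (cuEventQuery(done) == CUDA_ERROR_NOT_READY)
        usleep(1000);
}
```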

Would it be possible for one of my CUDA threads to obtain a CUevent or CUstream object and let another thread call cuXxxSynchronize() on it? What if no changes are made to the CUevent/CUstream after it is passed to the other thread? I know the same context cannot be current on two threads at the same time (i.e. you can't do cuCtxCreate, cuCtxAttach, cuCtxPopCurrent, hand the CUcontext off to another thread, and cuCtxPushCurrent, because the ref count is not 1 when you cuCtxPopCurrent). A priori I can't imagine this would cause any race conditions, and it would allow one thread to queue CUDA calls while other threads wait for completion.
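In other words, the pattern I am hoping is legal looks something like the sketch below (waiter and queue_and_hand_off are just names I made up; whether a thread that does not have the context current may call cuEventSynchronize is exactly what I am asking):

```c
#include <cuda.h>
#include <pthread.h>

/* Thread B: never touches the context, only waits on the event it was handed
 * and then runs the CPU follow-up work. */
static void *waiter(void *p)
{
    CUevent done = (CUevent)p;
    cuEventSynchronize(done);   /* legal from a thread without the context current? */
    /* ... CPU work that must run right after the GPU finishes ... */
    return NULL;
}

/* Thread A: has the context current, queues GPU work, records the event,
 * hands it off, and is then free to keep queuing uploads and launches. */
static void queue_and_hand_off(CUfunction kernel, CUstream stream)
{
    CUevent   done;
    pthread_t tid;

    cuEventCreate(&done, CU_EVENT_DEFAULT);
    cuLaunchGridAsync(kernel, 1, 1, stream);            /* kernel params set elsewhere */
    cuEventRecord(done, stream);

    pthread_create(&tid, NULL, waiter, (void *)done);   /* no further changes to the event */
    pthread_detach(tid);
}
```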

Creating 2+ contexts and worker threads per GPU isn't really viable either, since a pretty big block of device memory would need to be visible to both contexts. Plus, having to load modules twice and track multiple CUfunction pointers sounds pretty messy.

I’m just having a lot of trouble architecting my application and figured someone must have run into this problem before. Any ideas are very much appreciated!