Hmm. Maybe I am going about this the wrong way then.
I have a bunch of host threads that require work done. Each thread represents a separate client.
My plan was to also have a separate thread for each CUDA device. For each client I would then select a device and route its CUDA calls through that device's thread; this way clients can switch devices as needed to balance the load. But it seems there is no way to block the client thread until its processing is complete without also blocking the thread that issues the CUDA calls to the device.
I want multiple client threads to be able to make optimal use of the devices, with load balancing, and without stalling the threads that feed the devices. Each originating thread also needs to know when its own compute job is done. What is the best way to achieve this?
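To make the shape of what I'm after concrete, here is a rough sketch of the kind of per-device worker I have in mind (the names, the dummy kernel, and the use of C++11 threads with std::promise/std::future are just placeholders on my part, not a claim about the right API): the worker launches each job asynchronously on a stream, records an event, and polls it between dequeues, so it never blocks on the GPU, while the client thread blocks on a future until its job's promise is fulfilled.

```cpp
#include <cuda_runtime.h>
#include <chrono>
#include <condition_variable>
#include <future>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Placeholder kernel standing in for a client's real compute job.
__global__ void dummyKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

struct Job {
    float *devPtr;             // device buffer the client wants processed
    int n;
    std::promise<void> done;   // fulfilled when the GPU work has finished
};

class DeviceWorker {
public:
    explicit DeviceWorker(int device)
        : device_(device), stop_(false), thread_([this] { run(); }) {}

    ~DeviceWorker() {
        { std::lock_guard<std::mutex> lk(m_); stop_ = true; }
        cv_.notify_one();
        thread_.join();
    }

    // Called from a client thread: enqueue work, get a future to block on.
    std::future<void> submit(float *devPtr, int n) {
        Job job{devPtr, n, {}};
        std::future<void> f = job.done.get_future();
        { std::lock_guard<std::mutex> lk(m_); queue_.push(std::move(job)); }
        cv_.notify_one();
        return f;
    }

private:
    struct InFlight { cudaEvent_t event; std::promise<void> done; };

    void run() {
        cudaSetDevice(device_);
        cudaStream_t stream;
        cudaStreamCreate(&stream);
        std::vector<InFlight> inFlight;
        for (;;) {
            Job job;
            bool haveJob = false;
            {
                std::unique_lock<std::mutex> lk(m_);
                // Sleep only when nothing is queued and nothing is in flight.
                if (queue_.empty() && inFlight.empty() && !stop_)
                    cv_.wait(lk, [this] { return stop_ || !queue_.empty(); });
                if (stop_ && queue_.empty() && inFlight.empty()) break;
                if (!queue_.empty()) {
                    job = std::move(queue_.front());
                    queue_.pop();
                    haveJob = true;
                }
            }
            if (haveJob) {
                // Launch asynchronously and record an event; do not wait here.
                dummyKernel<<<(job.n + 255) / 256, 256, 0, stream>>>(job.devPtr, job.n);
                InFlight f{nullptr, std::move(job.done)};
                cudaEventCreateWithFlags(&f.event, cudaEventDisableTiming);
                cudaEventRecord(f.event, stream);
                inFlight.push_back(std::move(f));
            }
            // Retire finished jobs without ever blocking on the GPU.
            for (size_t i = 0; i < inFlight.size();) {
                if (cudaEventQuery(inFlight[i].event) == cudaSuccess) {
                    cudaEventDestroy(inFlight[i].event);
                    inFlight[i].done.set_value();   // wakes that client thread
                    inFlight.erase(inFlight.begin() + i);
                } else {
                    ++i;
                }
            }
            // Crude throttle so the event-polling loop does not spin a core.
            if (!inFlight.empty())
                std::this_thread::sleep_for(std::chrono::microseconds(100));
        }
        cudaStreamDestroy(stream);
    }

    int device_;
    bool stop_;
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<Job> queue_;
    std::thread thread_;
};
```

Each client thread would then call submit() and wait on the returned future, waking up when its job finishes while the worker keeps feeding the device. Is this roughly the right direction, or is there a cleaner mechanism for signalling the originating thread that its work is done?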