Hmm. Maybe I am going about this the wrong way then.
I have a bunch of host threads that require work done. Each thread represents a separate client.
My plan was to also have a separate thread for each CUDA device. For each client I would then select a device and route its CUDA calls through that device's thread; this way clients can switch devices as needed to balance the load. But it seems there is no way to block the client thread until its processing is complete without also blocking the thread that issues the CUDA calls to the device.
I want multiple client threads to be able to make optimal use of the devices, with load balancing, and without stalling the threads that feed the devices. Each originating thread also needs to know when its own compute job is done. What is the best way to achieve this?
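To make the shape of what I'm after concrete, here is a rough sketch of the kind of per-device worker I have in mind (the names, the dummy kernel, and the use of C++11 threads with std::promise/std::future are just placeholders on my part, not a claim about the right API): the worker launches each job asynchronously on a stream, records an event, and polls it between dequeues, so it never blocks on the GPU, while the client thread blocks on a future until its job's promise is fulfilled.

```cpp
#include <cuda_runtime.h>
#include <chrono>
#include <condition_variable>
#include <future>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Placeholder kernel standing in for a client's real compute job.
__global__ void dummyKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

struct Job {
    float *devPtr;             // device buffer the client wants processed
    int n;
    std::promise<void> done;   // fulfilled when the GPU work has finished
};

class DeviceWorker {
public:
    explicit DeviceWorker(int device)
        : device_(device), stop_(false), thread_([this] { run(); }) {}

    ~DeviceWorker() {
        { std::lock_guard<std::mutex> lk(m_); stop_ = true; }
        cv_.notify_one();
        thread_.join();
    }

    // Called from a client thread: enqueue work, get a future to block on.
    std::future<void> submit(float *devPtr, int n) {
        Job job{devPtr, n, {}};
        std::future<void> f = job.done.get_future();
        { std::lock_guard<std::mutex> lk(m_); queue_.push(std::move(job)); }
        cv_.notify_one();
        return f;
    }

private:
    struct InFlight { cudaEvent_t event; std::promise<void> done; };

    void run() {
        cudaSetDevice(device_);
        cudaStream_t stream;
        cudaStreamCreate(&stream);
        std::vector<InFlight> inFlight;
        for (;;) {
            Job job;
            bool haveJob = false;
            {
                std::unique_lock<std::mutex> lk(m_);
                // Sleep only when nothing is queued and nothing is in flight.
                if (queue_.empty() && inFlight.empty() && !stop_)
                    cv_.wait(lk, [this] { return stop_ || !queue_.empty(); });
                if (stop_ && queue_.empty() && inFlight.empty()) break;
                if (!queue_.empty()) {
                    job = std::move(queue_.front());
                    queue_.pop();
                    haveJob = true;
                }
            }
            if (haveJob) {
                // Launch asynchronously and record an event; do not wait here.
                dummyKernel<<<(job.n + 255) / 256, 256, 0, stream>>>(job.devPtr, job.n);
                InFlight f{nullptr, std::move(job.done)};
                cudaEventCreateWithFlags(&f.event, cudaEventDisableTiming);
                cudaEventRecord(f.event, stream);
                inFlight.push_back(std::move(f));
            }
            // Retire finished jobs without ever blocking on the GPU.
            for (size_t i = 0; i < inFlight.size();) {
                if (cudaEventQuery(inFlight[i].event) == cudaSuccess) {
                    cudaEventDestroy(inFlight[i].event);
                    inFlight[i].done.set_value();   // wakes that client thread
                    inFlight.erase(inFlight.begin() + i);
                } else {
                    ++i;
                }
            }
            // Crude throttle so the event-polling loop does not spin a core.
            if (!inFlight.empty())
                std::this_thread::sleep_for(std::chrono::microseconds(100));
        }
        cudaStreamDestroy(stream);
    }

    int device_;
    bool stop_;
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<Job> queue_;
    std::thread thread_;
};
```

Each client thread would then call submit() and wait on the returned future, waking up when its job finishes while the worker keeps feeding the device. Is this roughly the right direction, or is there a cleaner mechanism for signalling the originating thread that its work is done?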