I’ve been attempting to get multiple devices working well together, but it has turned into an all-out war against the Windows thread scheduler, particularly since I have more devices than CPU cores.
I’ve managed to get fairly high efficiency by using cudaDeviceScheduleYield and juggling thread priorities, but it is still a battle to get Windows to schedule the right threads rather than constantly rescheduling the thread that just yielded. The result is starved GPUs whenever the associated control thread is not scheduled in a timely fashion.
This brings me to my question: does a Win32 fiber retain enough context to maintain a CUDA device context? Alternatively, is there a way to control thread scheduling in a cooperative fashion? A 30 ms latency for threads to be scheduled just doesn’t cut it…