Best practice: 1 host thread per device?

Assuming that you want to do complex device-to-device communication and synchronization, is it generally better to have 1 host thread running per device, or to manage all of the devices from a single host thread?

Is the additional complexity of having host non-determinism buying you anything?

things i would keep in mind when faced with “host non-determinism”

a) all host threads would be subject to the OS; too many host threads may then be a bad thing

b) the amount of work done by each host thread - the host may be too slow for the device(s) and unable to keep up; or too many host threads may sleep most of the time

c) the level of synchronization between work done by the device(s) - tasks in different streams or on different devices can be closely linked, or hardly linked at all; closely linked tasks may show little benefit from multiple host threads, whereas loosely linked tasks may very well benefit from them
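To make the "closely linked" case in (c) concrete, here is a minimal sketch of a single host thread driving every device, with each device's stream waiting on an event recorded on the previous device. The kernel `scale`, the buffer size and the chained dependency are illustrative assumptions and not something described in this thread; only the CUDA runtime calls are real API.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Report a failed runtime call and bail out of the enclosing function.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            return 1;                                                 \
        }                                                             \
    } while (0)

// Hypothetical placeholder kernel; stands in for the real per-device work.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// One host thread issues asynchronous work to all devices.
int run_on_all_devices() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::puts("no CUDA device available, nothing to do");
        return 0;
    }
    const int n = 1 << 20;
    std::vector<cudaStream_t> streams(count);
    std::vector<cudaEvent_t> done(count);
    std::vector<float*> buffers(count, nullptr);

    for (int d = 0; d < count; ++d) {
        CHECK(cudaSetDevice(d));
        CHECK(cudaStreamCreate(&streams[d]));
        CHECK(cudaEventCreateWithFlags(&done[d], cudaEventDisableTiming));
        CHECK(cudaMalloc(&buffers[d], n * sizeof(float)));
        CHECK(cudaMemsetAsync(buffers[d], 0, n * sizeof(float), streams[d]));
    }

    for (int d = 0; d < count; ++d) {
        CHECK(cudaSetDevice(d));
        // Assumed dependency: device d starts only after device d-1 finished.
        // The wait is enqueued on the device side; the host does not block.
        if (d > 0) CHECK(cudaStreamWaitEvent(streams[d], done[d - 1], 0));
        scale<<<(n + 255) / 256, 256, 0, streams[d]>>>(buffers[d], n, 2.0f);
        CHECK(cudaGetLastError());
        CHECK(cudaEventRecord(done[d], streams[d]));
    }

    // The host blocks only here, once per device, then cleans up.
    for (int d = 0; d < count; ++d) {
        CHECK(cudaSetDevice(d));
        CHECK(cudaStreamSynchronize(streams[d]));
        CHECK(cudaFree(buffers[d]));
        CHECK(cudaEventDestroy(done[d]));
        CHECK(cudaStreamDestroy(streams[d]));
    }
    return 0;
}

int main() { return run_on_all_devices(); }
```

Because the ordering between devices is expressed with events rather than host-side waits, extra host threads would add little in a setup like this one. If the per-device work were independent and needed host-side post-processing, one thread per device could look more attractive.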

This part is unclear to me, actually. If I launch N host threads, one per device, and those threads do fairly little themselves (they launch kernels, start async memcpy, or synchronize events), htop always shows the process as consuming N * 100% of CPU time.

This suggests that the CPU resources are important. However, if for the same task, I launch 1 host thread, I end up using 1/Nth as much CPU time, but the overall wall time to run the program is largely unaffected.

So, how much work do the host threads actually do?

the point regarding the amount of work done by a host thread is about what the thread has to do when it is not simply waiting

you mentioned 1 extreme: “threads do fairly little themselves (they launch kernels, start async memcpy, or synchronize events)”

the other extreme is where the host thread(s) need to do significant post-processing of the results produced by device kernels, while also servicing numerous devices

the former may call for fewer host threads, and the latter more threads, conditional on some of the other factors mentioned

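some of the synchronization apis have flags that determine whether synchronization results in a busy wait or not; for example, cudaSetDeviceFlags accepts cudaDeviceScheduleSpin, cudaDeviceScheduleYield and cudaDeviceScheduleBlockingSync, and events can be created with cudaEventBlockingSync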

“N * 100% of CPU time” may be due to busy waiting, or to the monitoring tool’s inability to properly track child threads of a process; cross-check with a system-level monitoring tool (psutil, etc.), and/or alternatives to htop (atop, vtop, etc.)
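One way to check whether busy waiting explains the CPU usage is to request blocking synchronization and compare process CPU time against wall time across a single wait. The sketch below is only an illustration: `busy_kernel` and its cycle count are made-up placeholders, `std::clock` reports process CPU time on POSIX systems (not on Windows), and the numbers will vary with driver and hardware.

```cuda
#include <chrono>
#include <cstdio>
#include <ctime>
#include <cuda_runtime.h>

// Report a failed runtime call and bail out of the enclosing function.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            return 1;                                                 \
        }                                                             \
    } while (0)

// Hypothetical kernel that keeps the GPU occupied for roughly `cycles` clocks.
__global__ void busy_kernel(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int measure_blocking_wait() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::puts("no CUDA device available, nothing to measure");
        return 0;
    }
    // Ask the runtime to block the host thread instead of spinning. This
    // applies to the current device (device 0 by default) and is set here
    // before any call that initializes the device.
    CHECK(cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync));

    // Events have their own flag that makes cudaEventSynchronize block.
    cudaEvent_t done;
    CHECK(cudaEventCreateWithFlags(
        &done, cudaEventBlockingSync | cudaEventDisableTiming));

    std::clock_t cpu0 = std::clock();
    auto wall0 = std::chrono::steady_clock::now();

    busy_kernel<<<1, 1>>>(500000000LL);
    CHECK(cudaGetLastError());
    CHECK(cudaEventRecord(done, 0));
    CHECK(cudaEventSynchronize(done));  // host thread should sleep here

    double cpu_s = double(std::clock() - cpu0) / CLOCKS_PER_SEC;
    double wall_s = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - wall0).count();
    std::printf("wall %.3f s, cpu %.3f s\n", wall_s, cpu_s);

    CHECK(cudaEventDestroy(done));
    return 0;
}

int main() { return measure_blocking_wait(); }
```

If CPU time stays well below wall time with these flags but matches it when cudaDeviceScheduleSpin is used instead, that would suggest the N * 100% figure comes from spinning in the synchronization calls rather than from useful host work.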

Thanks! (I decided to start a new thread as a follow-up, for clarity)