Documentation/example on new (non-threaded) method for multiple devices?

It used to be that to use multiple GPUs you had to run a separate host thread for each one, which was a bit of a pain. Thankfully, recent versions of the CUDA toolkit let a single host thread drive multiple devices. But I cannot find any thorough documentation or examples on how to do this. Before I release such code publicly, I’d like to be reasonably sure I have not committed any subtle but dangerous sins. Is more information available anywhere? Thanks!

The programming guide discusses considerations for handling a system with multiple devices:

If you use CUDA dynamic parallelism or Unified Memory, the corresponding sections of the programming guide also discuss multi-GPU considerations.

There are also a number of CUDA sample codes that demonstrate how to use multiple devices, such as the CUDA simple multi-GPU sample:

Finally, there are a number of presentations recorded by NVIDIA at GTC and other events that serve as tutorials on multi-GPU usage, such as this one:
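
As a rough illustration of the single-thread pattern those resources cover: the host thread calls `cudaSetDevice` to select the current device, and the allocations, streams, and kernel launches that follow apply to that device. The sketch below is not taken from the linked material. The kernel `scale`, the helper `scaleOnAllDevices`, and the `CHECK` macro are hypothetical names chosen for illustration, and it assumes pinned host buffers so that the asynchronous copies do not block the host loop.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Illustrative kernel: multiply every element by a constant.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Scales n floats on every visible device from ONE host thread.
// Returns the number of devices that produced a wrong result (0 = success).
int scaleOnAllDevices(int n, float factor)
{
    int deviceCount = 0;
    CHECK(cudaGetDeviceCount(&deviceCount));

    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    std::vector<float *> hBuf(deviceCount, nullptr);   // pinned host buffers
    std::vector<float *> dBuf(deviceCount, nullptr);   // device buffers
    std::vector<cudaStream_t> stream(deviceCount);

    // Phase 1: queue work on every device without waiting for any of them.
    for (int d = 0; d < deviceCount; ++d) {
        CHECK(cudaSetDevice(d));                 // later calls target device d
        CHECK(cudaStreamCreate(&stream[d]));     // stream belongs to device d
        CHECK(cudaMallocHost((void **)&hBuf[d], bytes));
        CHECK(cudaMalloc((void **)&dBuf[d], bytes));
        for (int i = 0; i < n; ++i) hBuf[d][i] = 1.0f;

        CHECK(cudaMemcpyAsync(dBuf[d], hBuf[d], bytes,
                              cudaMemcpyHostToDevice, stream[d]));
        scale<<<(n + 255) / 256, 256, 0, stream[d]>>>(dBuf[d], n, factor);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(hBuf[d], dBuf[d], bytes,
                              cudaMemcpyDeviceToHost, stream[d]));
    }

    // Phase 2: wait for each device, verify its result, and clean up.
    int badDevices = 0;
    for (int d = 0; d < deviceCount; ++d) {
        CHECK(cudaSetDevice(d));
        CHECK(cudaStreamSynchronize(stream[d]));
        bool ok = true;
        for (int i = 0; i < n; ++i) {
            if (hBuf[d][i] != 1.0f * factor) { ok = false; break; }
        }
        if (!ok) ++badDevices;
        CHECK(cudaFree(dBuf[d]));
        CHECK(cudaFreeHost(hBuf[d]));
        CHECK(cudaStreamDestroy(stream[d]));
    }
    return badDevices;
}

int main()
{
    int bad = scaleOnAllDevices(1 << 20, 2.0f);
    std::printf("devices with wrong results: %d\n", bad);
    return bad == 0 ? EXIT_SUCCESS : EXIT_FAILURE;
}
```

Queuing work on every device before synchronizing any of them is what lets the GPUs run concurrently; synchronizing inside the first loop would serialize them. A stream is tied to the device that was current when it was created, so launching a kernel into it while a different device is current is an error.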

txbob - Thank you for all those links! That last one, the presentation, was particularly useful.