Documentation/example on new (non-threaded) method for multiple devices?

It used to be that in order to use multiple GPUs you had to run a separate host thread for each, a bit of a pain. Thankfully, recent versions of the CUDA toolkit let a single host thread drive multiple devices. But I cannot find any thorough documentation or examples on how to do this. Before I put such code out in public, I’d like to be reasonably sure I have not committed any subtle but dangerous sins. Is more information available anywhere? Thanks!

The programming guide discusses considerations for handling a system with multiple devices:

[url]http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#multi-device-system[/url]
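To make the pattern the guide describes concrete, here is a minimal sketch (my own illustrative code, not taken from the guide) of one host thread driving every visible device. The key point is that cudaSetDevice() only changes which device subsequent runtime calls target and does not block, so a single thread can queue asynchronous work on each device in turn and synchronize afterwards. The kernel, the buffer size, and the 8-device cap are all made up for the example:

[code]
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *x, int n, float a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main()
{
    const int N = 1 << 20;
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev > 8) ndev = 8;    // arbitrary cap for this sketch

    float        *d_x[8];
    cudaStream_t  stream[8];

    // First pass: queue asynchronous work on each device without waiting.
    for (int d = 0; d < ndev; ++d) {
        cudaSetDevice(d);                       // retarget subsequent calls
        cudaStreamCreate(&stream[d]);
        cudaMalloc(&d_x[d], N * sizeof(float));
        cudaMemsetAsync(d_x[d], 0, N * sizeof(float), stream[d]);
        scale<<<(N + 255) / 256, 256, 0, stream[d]>>>(d_x[d], N, 2.0f);
    }

    // Second pass: synchronize and clean up each device.
    for (int d = 0; d < ndev; ++d) {
        cudaSetDevice(d);
        cudaStreamSynchronize(stream[d]);
        cudaFree(d_x[d]);
        cudaStreamDestroy(stream[d]);
    }

    printf("ran on %d device(s)\n", ndev);
    return 0;
}
[/code]

Because the launches in the first loop are asynchronous, all the devices work concurrently even though a single thread issued everything.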

If you use CUDA dynamic parallelism or Unified Memory, the corresponding sections of the programming guide also discuss multi-GPU considerations.
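As a rough illustration of the Unified Memory case: a cudaMallocManaged() allocation is visible to the host and to kernels on any device, so one host thread can hand different pieces of the same buffer to different GPUs without explicit copies. The sketch below (assuming at least two GPUs; the kernel and sizes are illustrative) keeps the launches sequential, which is the conservative choice on pre-Pascal hardware where concurrent access to managed memory from multiple devices is restricted:

[code]
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fill(int *x, int n, int v)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = v;
}

int main()
{
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev < 2) { printf("need at least 2 GPUs\n"); return 0; }

    const int N = 1 << 20;
    int *data = nullptr;

    // One managed allocation, visible to the host and to every device.
    cudaMallocManaged(&data, 2 * N * sizeof(int));

    // Device 0 fills the first half...
    cudaSetDevice(0);
    fill<<<(N + 255) / 256, 256>>>(data, N, 0);
    cudaDeviceSynchronize();   // conservative: avoid concurrent managed access

    // ...then device 1 fills the second half.
    cudaSetDevice(1);
    fill<<<(N + 255) / 256, 256>>>(data + N, N, 1);
    cudaDeviceSynchronize();

    // The host can read the results directly; no explicit copies needed.
    printf("data[0]=%d data[%d]=%d\n", data[0], N, data[N]);

    cudaFree(data);
    return 0;
}
[/code]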

And a number of CUDA sample codes demonstrate how to use multiple devices, such as simpleMultiGPU:

[url]http://docs.nvidia.com/cuda/cuda-samples/index.html#simple-multi-gpu[/url]

Finally, NVIDIA has recorded a number of presentations at GTC and other events that serve as tutorials on multi-GPU usage, such as this one:

[url]http://on-demand.gputechconf.com/gtc/2013/presentations/S3465-Multi-GPU-Programming.pdf[/url]

txbob - Thank you for all those links! That last one, the presentation, was particularly useful.

Tim