Global memory space and most of the CUDA API state are encapsulated within CUDA contexts (i.e. a global-memory address is only meaningful inside the context that allocated it). Currently there is no way to do a device-to-device transfer between CUDA contexts other than a round trip through host (CPU) memory.
The correspondence between CUDA contexts and host (CPU) threads is one-to-one.
Currently each context is bound to the thread that created it, but a context migration feature is expected to be available soon.
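
The host round trip mentioned above can be sketched with the driver API roughly as follows. This is only an illustration under stated assumptions, not the one right way to do it: it assumes a toolkit that provides `cuCtxSetCurrent` (CUDA 4.0 or later, so newer than the thread-binding behaviour described here) and the four-argument-free `cuCtxCreate(&ctx, flags, dev)` signature used through CUDA 12.x. The names `ctxA`, `ctxB`, `srcA`, `dstB` and the `CHECK` macro are made up for the example. On a toolkit where a context really is tied to its creator thread, the two halves would have to run in two threads that share the `staging` buffer.

```c
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: abort on any driver API error. */
#define CHECK(call)                                                  \
    do {                                                             \
        CUresult r_ = (call);                                        \
        if (r_ != CUDA_SUCCESS) {                                    \
            fprintf(stderr, "%s failed: %d\n", #call, (int)r_);      \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

int main(void)
{
    const size_t n = 1024;
    const size_t bytes = n * sizeof(float);
    CUdevice dev;
    CUcontext ctxA, ctxB;      /* two independent contexts (same device here) */
    CUdeviceptr srcA, dstB;    /* each pointer belongs to one context only */

    float *staging = (float *)malloc(bytes);  /* host buffer for the round trip */
    float *check   = (float *)malloc(bytes);
    if (!staging || !check) return EXIT_FAILURE;
    for (size_t i = 0; i < n; ++i) staging[i] = (float)i;

    CHECK(cuInit(0));
    CHECK(cuDeviceGet(&dev, 0));
    CHECK(cuCtxCreate(&ctxA, 0, dev));
    CHECK(cuCtxCreate(&ctxB, 0, dev));

    /* Context A: allocate, fill, then copy device -> host. */
    CHECK(cuCtxSetCurrent(ctxA));
    CHECK(cuMemAlloc(&srcA, bytes));
    CHECK(cuMemcpyHtoD(srcA, staging, bytes));
    memset(staging, 0, bytes);
    CHECK(cuMemcpyDtoH(staging, srcA, bytes));

    /* Context B: copy host -> device into memory owned by B. */
    CHECK(cuCtxSetCurrent(ctxB));
    CHECK(cuMemAlloc(&dstB, bytes));
    CHECK(cuMemcpyHtoD(dstB, staging, bytes));

    /* Read back from B to confirm the data arrived. */
    CHECK(cuMemcpyDtoH(check, dstB, bytes));
    int ok = 1;
    for (size_t i = 0; i < n; ++i)
        if (check[i] != (float)i) { ok = 0; break; }
    printf("%s\n", ok ? "transfer ok" : "transfer mismatch");

    /* Free each allocation while its owning context is current. */
    CHECK(cuMemFree(dstB));
    CHECK(cuCtxSetCurrent(ctxA));
    CHECK(cuMemFree(srcA));
    CHECK(cuCtxDestroy(ctxB));
    CHECK(cuCtxDestroy(ctxA));
    free(staging);
    free(check);
    return ok ? EXIT_SUCCESS : EXIT_FAILURE;
}
```

Built with something like `nvcc roundtrip.c -lcuda` (file name assumed), the point to notice is that `srcA` is never used while `ctxB` is current: the only thing the two contexts share is the host buffer.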