I am trying to use multiple GPU devices (say device IDs 0 and 1) from a single host thread. At startup I create two CUcontexts, one for each device. Wherever the runtime API is called (cudaMalloc, cudaMemcpy, kernel launches, etc.), I try to switch to the CUcontext that belongs to the target device, using cuCtxPushCurrent and cuCtxPopCurrent around one call or a group of runtime API calls. However, the switch does not seem to take effect: I cannot change between the device contexts. I have written the code by hand so that the two devices operate on separate sets of memory, so there is no sharing between them. What I am experimenting with is simply driving multiple devices from a single host thread.
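For reference, here is a minimal sketch of the push/pop pattern described above, not my actual code. It makes several assumptions: it uses only the driver API for the memory operations (cuMemAlloc, cuMemcpyHtoD, cuMemcpyDtoH) instead of the runtime calls, it retains each device's primary context with cuDevicePrimaryCtxRetain rather than creating new contexts, and the CHECK macro, buffer size, and build command are made up for illustration.

```cuda
// Sketch: one host thread switching between two device contexts.
// Assumed build command: nvcc multi_ctx.cu -lcuda
#include <cuda.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Hypothetical helper: abort with a message on any driver API error.
#define CHECK(call)                                                   \
    do {                                                              \
        CUresult r_ = (call);                                         \
        if (r_ != CUDA_SUCCESS) {                                     \
            const char* msg_ = nullptr;                               \
            cuGetErrorString(r_, &msg_);                              \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    msg_ ? msg_ : "unknown error");                   \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

int main() {
    CHECK(cuInit(0));

    int count = 0;
    CHECK(cuDeviceGetCount(&count));
    if (count < 2) {
        printf("This sketch needs at least 2 devices, found %d\n", count);
        return 0;
    }

    const int kDevices = 2;
    const size_t n = 1024;                 // arbitrary element count
    const size_t bytes = n * sizeof(int);
    CUdevice dev[kDevices];
    CUcontext ctx[kDevices];

    // Retaining a primary context does NOT make it current,
    // so the thread's context stack is still empty after this loop.
    for (int i = 0; i < kDevices; ++i) {
        CHECK(cuDeviceGet(&dev[i], i));
        CHECK(cuDevicePrimaryCtxRetain(&ctx[i], dev[i]));
    }

    for (int i = 0; i < kDevices; ++i) {
        std::vector<int> src(n, i + 1), dst(n, 0);
        CUdeviceptr buf = 0;

        CHECK(cuCtxPushCurrent(ctx[i]));   // device i becomes current
        CHECK(cuMemAlloc(&buf, bytes));
        CHECK(cuMemcpyHtoD(buf, src.data(), bytes));
        CHECK(cuMemcpyDtoH(dst.data(), buf, bytes));
        CHECK(cuMemFree(buf));

        CUcontext popped = nullptr;
        CHECK(cuCtxPopCurrent(&popped));   // leave the stack empty again

        printf("device %d round trip %s\n", i,
               dst == src ? "ok" : "MISMATCH");
    }

    for (int i = 0; i < kDevices; ++i) {
        CHECK(cuDevicePrimaryCtxRelease(dev[i]));
    }
    return 0;
}
```

One thing worth checking in code that creates its contexts with cuCtxCreate instead: that call also makes the new context current on the calling thread, so after creating two contexts both are on the stack with the second one on top. A cuCtxPopCurrent after each creation leaves the stack empty, so that a later push selects exactly the intended device. If the driver API is not a requirement, the runtime API (since CUDA 4.0) also lets a single host thread switch devices with cudaSetDevice.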
If you have experimented with this, please share your ideas.