I am trying to let a single host process control multiple devices simultaneously using the driver-level API. I need to decide whether this requires multiple threads or whether I can do the same thing from a single thread.
Let’s say that I launch a kernel asynchronously, then immediately perform a context switch via cuCtxPopCurrent and cuCtxPushCurrent and launch another kernel asynchronously on a different GPU. Can these two kernels run concurrently?
I’m having a lot of problems running my GPU code under Windows 7. Under Linux I can easily get 45 GFLOPS on my Tesla (which is good for this application…), but on Windows I’m stuck at 7 GFLOPS.
I use a lot of small kernels (I know this is bad…), and I read something about WDDM, which increases the latency of each kernel call (I read 40 µs instead of 3 µs!).
Any clues? Is there anything I can do to speed up execution on Windows?
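As a rough sanity check on the launch-latency explanation, here is a back-of-the-envelope sketch in C. It assumes a very simple model that is my own guess, not a measurement: every kernel call costs a fixed launch latency plus a fixed compute time, and nothing overlaps. The function names are made up for illustration, and the 3 µs and 40 µs figures are only the latencies I read about.

```c
/* Hypothetical model (an assumption, not measured behaviour):
 *   time per kernel call = launch latency + compute time
 *   throughput           = work per kernel / time per kernel call
 * Rates are in GFLOPS, times in microseconds. */

/* Solve for the per-kernel compute time that makes two observed
 * throughputs consistent with two launch latencies. Derived from
 *   rate_fast * (t + lat_fast_us) == rate_slow * (t + lat_slow_us)
 * since the work per kernel is the same on both systems. */
double implied_compute_us(double rate_fast, double lat_fast_us,
                          double rate_slow, double lat_slow_us)
{
    return (rate_slow * lat_slow_us - rate_fast * lat_fast_us)
         / (rate_fast - rate_slow);
}

/* Predict the throughput at a different launch latency, given a
 * reference throughput, its launch latency, and the compute time. */
double rate_at_latency(double rate_ref, double lat_ref_us,
                       double compute_us, double lat_new_us)
{
    /* Work per kernel stays fixed; only the per-call overhead changes. */
    return rate_ref * (compute_us + lat_ref_us)
                    / (compute_us + lat_new_us);
}
```

Calling `implied_compute_us(45.0, 3.0, 7.0, 40.0)` gives 145/38, roughly 3.8 µs of compute per kernel. Under this model my numbers would fit kernels that run for only a few microseconds each, so a 40 µs launch cost would dominate the run time on Windows.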