My application consists of 16 CPU process threads operating independently on a server with 2 GPUs. All processes need to do some of their processing on a GPU. What would be the best way to manage this? I’ve looked at the documentation for streams, threads (CUTThread), contexts, and atomics but I’m not sure what would best meet my needs. Any advice would be appreciated. Thanks.
You should have two threads that control the two GPUs. Each of those threads should service a queue of GPU operations that the other threads post to. For many operations, this will be faster than migrating contexts between threads (using cuCtxPopCurrent/cuCtxPushCurrent, etc.).
Personally, I have a virtual ‘CudaTask’ class that you inherit from, implementing your initialization and kernel calls. The task is then automatically scheduled to a GPU and posted on its queue, at which point the GPU thread dequeues and executes it. The GPU thread also maintains ‘running’ and ‘complete’ queues for synchronizing on and/or querying the status of a task.
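A minimal interface along those lines might look like the following. This is a hypothetical reconstruction of the ‘CudaTask’ idea, not the poster's actual code; here a `std::promise`/`std::future` pair stands in for the ‘complete’ queue as the synchronization point.

```cpp
#include <future>

// Hypothetical task interface: derive from it, put your setup and kernel
// launches in Initialize()/Execute(), and let the GPU thread call Run().
class CudaTask {
public:
    virtual ~CudaTask() = default;
    virtual void Initialize(int device_id) = 0;  // allocations, uploads
    virtual void Execute() = 0;                  // kernel calls, downloads
    // Posting threads block on this future to wait for completion.
    std::future<void> GetFuture() { return done_.get_future(); }
    // Invoked on the GPU thread after the task is dequeued.
    void Run(int device_id) {
        Initialize(device_id);
        Execute();
        done_.set_value();  // marks the task ‘complete’
    }
private:
    std::promise<void> done_;
};
```

The design benefit is that the posting thread never issues CUDA calls itself; it only constructs a task, hands it off, and waits on the future if it needs the result.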
You can also take a look at the GPUWorker class that originated in HOOMD (although it is no longer used by them) and has been incorporated into other projects:
GPUWorker.cc and .hh in that directory are what you want. The code depends on Boost and is unsupported, so you should read through it and make sure you understand what it is doing.