I have an application that runs on multiple cores utilizing MPI. I have a CUDA kernel function that any given core needs to be able to call at an unspecified time.
Basically, the GPU needs to act as a service accessible at any time by any user (host CPU core). If the GPU is already in use, a requesting core will obviously have to pause and wait for its turn.
Is this functionality built into CUDA or will I need to use something like a semaphore to make it work?
Also, can anyone point me to any academic papers or whitepapers that describe using the GPU in this fashion? What is the relationship of each core to the GPU? Is one the master, or does each have equal access?
This isn’t the usual multithread/process model for CUDA. Usually there is a 1:1 correspondence between host threads/processes and active GPUs. I don’t believe you are going to be able to do this just using the CUDA driver or runtime API.
Presumably the kernel launch is associated with other operations, like data transfers from MPI processes to the GPU and back again, which complicates the situation considerably. The best solution is probably to have a “compute” process holding a GPU context and listening for requests on the MPI communicator. It could then receive data, launch the kernel, and return results to the calling MPI process. I haven’t seen anything like this reported in the literature or discussed here.
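To make the idea concrete, here is an untested sketch of what I mean by a dedicated compute rank. Rank 0 as the GPU owner, the fixed-size float buffer, the tag constants, and my_kernel are all placeholders for illustration, not anything from a real codebase:

/* Dedicated "compute" rank (rank 0, an assumption) owns the GPU context and
 * services kernel requests from the other MPI ranks, one at a time. */
#include <mpi.h>
#include <cuda_runtime.h>

#define BUF_ELEMS   1024
#define TAG_REQUEST 1
#define TAG_RESULT  2

__global__ void my_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;              /* stand-in for the real computation */
}

static void serve_requests(MPI_Comm comm, int nranks)
{
    float host_buf[BUF_ELEMS];
    float *dev_buf;
    cudaMalloc(&dev_buf, BUF_ELEMS * sizeof(float));

    int remaining = nranks - 1;       /* one request per worker rank, for the sketch */
    while (remaining-- > 0) {
        MPI_Status st;
        /* Wait for any rank to send work; requests are serialized here, so the
         * GPU is naturally used by one client at a time. */
        MPI_Recv(host_buf, BUF_ELEMS, MPI_FLOAT, MPI_ANY_SOURCE,
                 TAG_REQUEST, comm, &st);

        cudaMemcpy(dev_buf, host_buf, BUF_ELEMS * sizeof(float),
                   cudaMemcpyHostToDevice);
        my_kernel<<<(BUF_ELEMS + 255) / 256, 256>>>(dev_buf, BUF_ELEMS);
        cudaMemcpy(host_buf, dev_buf, BUF_ELEMS * sizeof(float),
                   cudaMemcpyDeviceToHost);

        /* Return the result to whichever rank asked for it. */
        MPI_Send(host_buf, BUF_ELEMS, MPI_FLOAT, st.MPI_SOURCE,
                 TAG_RESULT, comm);
    }
    cudaFree(dev_buf);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    if (rank == 0) {
        serve_requests(MPI_COMM_WORLD, nranks);
    } else {
        float buf[BUF_ELEMS] = {0};
        MPI_Send(buf, BUF_ELEMS, MPI_FLOAT, 0, TAG_REQUEST, MPI_COMM_WORLD);
        MPI_Recv(buf, BUF_ELEMS, MPI_FLOAT, 0, TAG_RESULT, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    return 0;
}

The key design point is that only the compute rank ever touches the GPU, so the queuing/“semaphore” behaviour falls out of the serialized MPI_Recv loop rather than anything CUDA-specific.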
The short answer is that the functionality is built into CUDA. I do not know exactly how the driver schedules kernels from different host applications, but I believe that kernels execute to completion without preemption, meaning that it is possible for a long-running kernel to hog the GPU for a significant amount of time. I have not seen any results that indicate whether the CUDA runtime schedules kernels from different applications using a fair algorithm or just launches them in a first-come, first-served manner…
Note that my answer assumes each host thread runs as a separate process. It is possible to achieve the same functionality using a threading library such as pthreads, but it requires some care with device selection. As long as you make sure that you do not call any CUDA function from the main thread, and that each worker thread calls cudaSetDevice() before doing any other CUDA work, you should be fine.
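For the pthreads variant, a minimal sketch of what I mean (the thread count and the round-robin device assignment are just placeholders):

/* No CUDA calls are made from the main thread; each worker binds itself to a
 * device with cudaSetDevice() before doing any other CUDA work. */
#include <pthread.h>
#include <cuda_runtime.h>
#include <stdio.h>

#define NUM_THREADS 4

static void *worker(void *arg)
{
    int tid = (int)(size_t)arg;
    int ndev = 0;

    cudaGetDeviceCount(&ndev);        /* first CUDA calls happen in this thread */
    if (ndev == 0)
        return NULL;
    cudaSetDevice(tid % ndev);        /* bind this thread to a device */

    /* ... allocate memory, copy data, and launch kernels for this thread here ... */
    printf("thread %d bound to device %d\n", tid, tid % ndev);
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_THREADS];

    /* Deliberately no CUDA calls here before the workers start. */
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, worker, (void *)(size_t)i);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);
    return 0;
}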
One thing to note is that switching the GPU between different CUDA contexts is not a lightweight operation; you will see a performance penalty on the GPU side compared to having a single MPI client handle all interaction with a particular GPU. Also, since blocking sync is available, you could have N MPI processes on an N-core machine where one of the processes has an extra thread to interact with the GPU. Even though you’re oversubscribing the CPU, it won’t really matter, because that extra thread will be asleep the vast majority of the time.
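To illustrate the blocking-sync point, roughly (the function names and device numbering are hypothetical; cudaDeviceScheduleBlockingSync is the relevant runtime flag):

#include <cuda_runtime.h>

/* Placeholder kernel standing in for whatever the GPU service actually runs. */
__global__ void service_kernel(float *p) { if (p) p[0] += 1.0f; }

/* Run once in the dedicated GPU thread, before any context-creating CUDA call. */
void gpu_thread_init(int device)
{
    cudaSetDevice(device);
    /* Make synchronization sleep the thread instead of spin-waiting, so the
     * extra thread costs almost no CPU time while it waits on the GPU. */
    cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
}

void service_one_request(float *dev_ptr)
{
    service_kernel<<<1, 1>>>(dev_ptr);
    /* With the flag above, this blocks (sleeps) until the kernel finishes. */
    cudaDeviceSynchronize();
}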