I have several CPU threads that need to execute CUDA kernels, and I have several CUDA devices.
I'd like to know how to manage this: what tools does CUDA provide, how do I choose which device each thread executes its kernel on, is there a way to know whether a device is currently in use, etc.?
CUDA doesn't provide anything specific to handle multithreading. The traditional model is that each host thread explicitly creates a context on its chosen device; all subsequent CUDA calls from that thread then run within that context without further changes.
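As a minimal sketch of that model: each worker thread binds itself to one device at startup and keeps that binding for its lifetime. The round-robin mapping and `worker_main`/`launch_workers` names are illustrative, and the CUDA calls are left as comments since they require a GPU to run.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Round-robin mapping from worker index to device ordinal (illustrative).
int device_for_worker(int worker, int num_devices) {
    return worker % num_devices;
}

// Each worker binds itself to one device for its whole lifetime,
// then launches kernels without touching the binding again.
void worker_main(int worker, int num_devices) {
    int dev = device_for_worker(worker, num_devices);
    // cudaSetDevice(dev);   // runtime API: bind this thread to `dev`
    // ... cudaMalloc / kernel<<<...>>> / cudaMemcpy now all target `dev` ...
    (void)dev;
}

void launch_workers(int num_workers, int num_devices) {
    std::vector<std::thread> pool;
    for (int w = 0; w < num_workers; ++w)
        pool.emplace_back(worker_main, w, num_devices);
    for (auto& t : pool)
        t.join();
}
```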
Hi, thank you for your answer. So I guess I'll have to make some kind of "routing" class to share the threads between the devices.
Do you know if there is a way to monitor device activity, so the next free device can be assigned to the next idle thread?
Prior to CUDA 4.0 it isn't really possible to do what you are thinking. The host thread–context–device affinity needs to be static, so you establish it at the beginning of the execution and keep it until the end. This was the main source of pain in using OpenMP with CUDA for multi-GPU: most OpenMP runtimes keep a pool of threads and just pick the next free host thread to perform an action, with no guarantee that the physical host thread (and hence device context) would be constant throughout the life of the application. Ideally you want to have "consumer" threads, each holding a statically assigned GPU, and one or more "producer" threads generating work for those threads. There is also a context migration API in CUDA, which allows a context to be migrated from one thread to another.
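The producer/consumer layout above can be sketched with a shared work queue: each consumer thread holds one device for its entire lifetime, and the producer just pushes jobs. This is a hedged illustration of the threading pattern only; the `Job`, `JobQueue`, and `run` names are made up for the example, and the CUDA calls are commented out since they need a GPU.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// A toy job; in a real application this would carry kernel arguments.
struct Job { int payload; };

// Shared queue the producer fills; a negative payload signals shutdown.
class JobQueue {
    std::queue<Job> q_;
    std::mutex m_;
    std::condition_variable cv_;
public:
    void push(Job j) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(j); }
        cv_.notify_one();
    }
    Job pop() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty(); });
        Job j = q_.front(); q_.pop();
        return j;
    }
};

// Each consumer owns one device for its whole lifetime (static affinity).
void consumer(int device, JobQueue& q, std::atomic<int>& done) {
    // cudaSetDevice(device);  // bind once, never changed afterwards
    for (;;) {
        Job j = q.pop();
        if (j.payload < 0) break;   // shutdown sentinel
        // launch the kernel for job `j` on `device` here
        done.fetch_add(1);
    }
}

int run(int num_devices, int num_jobs) {
    JobQueue q;
    std::atomic<int> done{0};
    std::vector<std::thread> consumers;
    for (int d = 0; d < num_devices; ++d)
        consumers.emplace_back(consumer, d, std::ref(q), std::ref(done));
    for (int j = 0; j < num_jobs; ++j)      // producer side
        q.push(Job{j});
    for (int d = 0; d < num_devices; ++d)   // one sentinel per consumer
        q.push(Job{-1});
    for (auto& t : consumers)
        t.join();
    return done.load();
}
```

The point of the design is that the queue, not the threads, moves between devices: host-thread-to-device affinity stays fixed, which is what pre-4.0 CUDA requires.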
In CUDA 4.0, the approach can be different, because it is now possible for a single host thread to establish multiple contexts and work with multiple devices directly. Although I haven't really played with threaded code in CUDA 4.0 yet, the approach there might be to have a parent thread establish a context on each GPU, then pass devices to threads as required. But I haven't yet started migrating any threaded multi-GPU code to CUDA 4.0, so I can't really offer any specific advice on that.
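Under CUDA 4.0 the single-thread version looks roughly like this: one host thread loops over the devices, switching the current device before each batch of launches. The `dispatch_order` helper is invented for the sketch; the CUDA calls are commented since they need a GPU.

```cpp
#include <cassert>
#include <vector>

// One host thread drives every device by switching the current device
// before each batch of calls (possible from CUDA 4.0 onward).
std::vector<int> dispatch_order(int num_devices, int rounds) {
    std::vector<int> order;
    for (int r = 0; r < rounds; ++r) {
        for (int d = 0; d < num_devices; ++d) {
            // cudaSetDevice(d);                    // switch this thread's current device
            // kernel<<<grid, block>>>(/* args */); // async launch lands on device d
            order.push_back(d);                     // record the dispatch for inspection
        }
    }
    // per-device cudaDeviceSynchronize() calls would follow here
    return order;
}
```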
It would be worth looking into the context management part of the CUDA driver API. This allows you to have several threads work on the same or different devices.
From personal experience I have found it easier to use worker threads that each maintain their own CUDA context, then divide the jobs among these threads and combine the results at the end. This all depends on your problem, of course, but I had to switch contexts so often that it was more trouble than it was worth.