I have two questions about CUDA streams on GTX 480 and OpenMP
Can I transfer data to and from the GPU using OpenMP threads and CUDA streams in parallel? Is it also possible to use OpenMP to launch concurrent kernels on different streams? If those two options are not allowed, is it because GPU/CPU communication can only be done through thread 0?
I have a square SIMD algorithm, that is, a SIMD program (SIMDB, B for Big) that calls another SIMD program (SIMDS, S for Small). I want to know whether it is more efficient to use one big loop (the SIMDB) that launches the kernel performing SIMDS, or whether I should shorten that loop and launch concurrent kernels on different streams instead. In my opinion it depends on the memory size, but I am not sure how.
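For instance, the second option would look something like the sketch below (I am not sure this is the right way, hence the question). Here simds_kernel, d_data, CHUNK and the launch configuration stand in for my real code:

#define NSTREAMS 4

cudaStream_t streams[NSTREAMS];
for (int s = 0; s < NSTREAMS; ++s)
    cudaStreamCreate(&streams[s]);      // a small pool of streams

for (int i = 0; i < n; ++i)
{
    // distribute the SIMDB iterations round-robin over the streams,
    // so the SIMDS kernels can overlap on the GPU
    simds_kernel<<<64, 128, 0, streams[i % NSTREAMS]>>>(d_data + i * CHUNK);
}

cudaThreadSynchronize();                // wait for all streams to finish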
By OpenMP threads, do you mean the threads implicitly managed by the OpenMP compiler and runtime system? I have never tried that, but as long as every thread acquires/releases the context (like a critical section), it should be OK.
Something like this (sorry, I don’t quite remember the OpenMP pragmas):

#pragma omp parallel for
for (int i = 0; i < n; ++i)
{
    // acquire the context and a stream for this thread
    kernel<<<…, mystream>>>();
    // release the context
}
I would say it depends on which card you have. If you have a Fermi card, it is beneficial to run on different streams, because they can be executed concurrently.
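To flesh out the sketch a little: assuming the streams are created up front on the main thread, and with kernel, n and the launch configuration as placeholders (the context handling is left aside here; see the discussion below), it might look like this:

#include <omp.h>

#define MAX_THREADS 8                     // upper bound on OpenMP threads

cudaStream_t streams[MAX_THREADS];
for (int s = 0; s < omp_get_max_threads(); ++s)
    cudaStreamCreate(&streams[s]);        // one stream per OpenMP thread

#pragma omp parallel for
for (int i = 0; i < n; ++i)
{
    // each thread launches into its own stream
    kernel<<<64, 128, 0, streams[omp_get_thread_num()]>>>(/* ... */);
}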
But I thought that for concurrent computation on different streams, the kernels must come from the same context. If that is the case, how can every thread acquire/release the context (like a critical section) and the kernels still be executed concurrently on the GTX 480?
Using the context migration mechanism available in the CUDA driver API. I really would recommend against using OpenMP for this. OpenMP thread affinity is normally opaque, and it becomes very difficult to manage everything correctly with lots of threads, because typical OpenMP runtimes keep a pool of threads and just grab whatever is free to service a parallel section. There is no guarantee that the same code will get the same thread twice. You would be better off using an explicit threading API, such as Boost threads or native threads, with persistent threads.
Even if I use native threads, I still don’t know the answer to my previous question: how can every thread acquire/release the context (like a critical section) and the kernels be executed concurrently on the GTX 480?
I hope that someone from the NVIDIA team can provide an answer to my last question. This is really important for the speedup comparison between CPUs and GPUs (if I cannot launch different kernels from different threads, this will significantly reduce the speedup of GPUs over CPUs for square SIMD programs):
But I thought that for concurrent computation on different streams, the kernels must come from the same context. If that is the case, how can every thread acquire/release the context (like a critical section) and the kernels still be executed concurrently on the GTX 480?
OK, so you confirm that we cannot launch concurrent kernels from different threads, and that we can only launch concurrent kernels from thread 0 of the CPU.
Now, the way I see it, handling a multi-threaded CUDA app means you need the context management functions to push and pop the context between threads. One thread pushes the context, queues its async calls into its stream, and pops it, perhaps with a mutex to prevent race conditions. Not that hard. It also makes multi-GPU solutions simple to implement by having two threads per GPU, though that needs a little more magic to keep the contexts separate per GPU.
The problem, though, seems to be that context management is a driver API only thing, at least according to the documentation. The programming guide states that you should use either the Runtime API or the Driver API, not both. But the link above seems to suggest it is possible, and advisable, to use the driver functions cuCtxPopCurrent and cuCtxPushCurrent while otherwise primarily using the Runtime API.
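A minimal sketch of that pattern, assuming a single shared context created up front on the main thread and one stream per worker thread (my_kernel, the buffer names and the launch configuration are placeholders, not tested code):

#include <cuda.h>
#include <cuda_runtime.h>
#include <pthread.h>

__global__ void my_kernel(float *p) { /* ... */ }   // placeholder kernel

pthread_mutex_t ctx_lock = PTHREAD_MUTEX_INITIALIZER;
CUcontext ctx;                  // created once on the main thread
cudaStream_t streams[2];        // one stream per worker thread
float *d_buf[2];                // per-worker device buffers

void *worker(void *arg)
{
    int id = *(int *)arg;

    // critical section: only one host thread may hold the context at a time
    pthread_mutex_lock(&ctx_lock);
    cuCtxPushCurrent(ctx);                               // attach shared context
    my_kernel<<<64, 128, 0, streams[id]>>>(d_buf[id]);   // queue async launch
    cuCtxPopCurrent(&ctx);                               // detach it again
    pthread_mutex_unlock(&ctx_lock);

    return NULL;
}

Because kernel launches are asynchronous, each thread holds the lock only long enough to queue its work; and since everything lives in the same context, just on different streams, the GPU should still be free to overlap the kernels.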
Another alternative is to fully use the Driver API instead of the Runtime API.
As I said before, I have never actually programmed a multi-threaded CUDA app, but from what I can see, it is definitely supported, even in the Runtime API.