Unanswered question: CUDA and OpenMP?

Hi everybody,

I have two questions about CUDA streams on GTX 480 and OpenMP

  1. Can I transfer data to/from the GPU in parallel using OpenMP threads and CUDA streams? Is it also possible to use OpenMP threads to launch concurrent kernels on different streams? If those two options are not allowed, is it because GPU/CPU communication can only go through thread 0?

  2. I have a square SIMD algorithm, that is to say a SIMD (SIMDB, B for Big) that calls another SIMD (SIMDS, S for Small). I want to know whether it is better to use one big loop (the SIMDB) that launches the kernel performing SIMDS, or to shorten that loop and launch concurrent kernels on different streams. In my opinion it depends on the memory size, but I am not sure how.
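
For concreteness, here is roughly what I mean by the two options (simdS_kernel, grid, block, data, CHUNK, N_BIG, streams, and nStreams are made-up names for illustration):

// Option 1: one big loop (the SIMDB) launching the SIMDS kernel
// back-to-back into the default stream, so the launches are serialized.
for (int i = 0; i < N_BIG; ++i)
    simdS_kernel<<<grid, block>>>(data + i * CHUNK);

// Option 2: the same work spread over several streams, so that
// independent SIMDS kernels may overlap on the device.
for (int i = 0; i < N_BIG; ++i)
    simdS_kernel<<<grid, block, 0, streams[i % nStreams]>>>(data + i * CHUNK);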

Thank you for your response

  1. By OpenMP threads, do you mean the threads implicitly managed by the OpenMP compiler and runtime system? I have never tried that, but as long as every thread acquires/releases the context (like a critical section), it should be OK.

Something like this (sorry, I don't quite remember the OpenMP pragmas):

#pragma omp parallel for
for (int i = 0; i < nStreams; ++i)
{
    // acquire the context and a stream for this thread
    kernel<<<grid, block, 0, mystream>>>();
    // release the context
}

I would say it depends on which card you have. If you have a Fermi card, it is beneficial to run on different streams because the kernels can be executed concurrently.
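
If it helps, a slightly fuller sketch (assuming a toolkit where all host threads share the runtime context; NUM_STREAMS, grid, block, and the kernel are placeholders):

cudaStream_t streams[NUM_STREAMS];
for (int i = 0; i < NUM_STREAMS; ++i)
    cudaStreamCreate(&streams[i]);             // one stream per loop iteration

#pragma omp parallel for
for (int i = 0; i < NUM_STREAMS; ++i)
    kernel<<<grid, block, 0, streams[i]>>>();  // each thread launches into its own stream

for (int i = 0; i < NUM_STREAMS; ++i)
{
    cudaStreamSynchronize(streams[i]);         // wait for the work queued in each stream
    cudaStreamDestroy(streams[i]);
}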

Hope it helps.

But I thought that for concurrent computation on different streams, the kernels must come from the same context. If that is the case, how can every thread acquire/release the context (like a critical section) and the kernels still be executed concurrently on the GTX 480?

Thank you again for your answer

Sorry for this reply, if it posts… I'm trying to make a new thread, but whenever I do so, it just sends me to the root of the NVIDIA forums… so I want to see if I can reply on a thread…

Using the context migration mechanism available in the CUDA driver API. I really would recommend against using OpenMP for this. OpenMP thread affinity is normally opaque, and it becomes very difficult to manage everything correctly with lots of threads, because typical OpenMP runtimes keep a pool of threads and just grab whatever is free to service a parallel code section. There is no guarantee that the same code will get the same thread twice. You would be better off using an explicit threading API, like Boost threads or native threads, with persistent threads.
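
For example, a minimal sketch with persistent native threads (C++ std::thread just for illustration; it assumes a runtime where all host threads share the device context, and the kernel is a placeholder):

#include <cuda_runtime.h>
#include <thread>
#include <vector>

__global__ void kernel() { }                // placeholder kernel

static void worker()
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);              // this thread owns one stream
    kernel<<<1, 64, 0, stream>>>();         // queue work into the private stream
    cudaStreamSynchronize(stream);          // wait for this stream's work
    cudaStreamDestroy(stream);
}

int main()
{
    std::vector<std::thread> pool;
    for (int i = 0; i < 4; ++i)
        pool.emplace_back(worker);          // one persistent thread per stream
    for (std::thread& t : pool)
        t.join();
    return 0;
}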

Even if I use native threads, I still don't know the answer to my previous question: how can every thread acquire/release the context (like a critical section) and the kernels still be executed concurrently on the GTX 480?

Thank you again

I hope that someone from the NVIDIA team can provide an answer to my last question; it is really important for the speedup comparison between CPUs and GPUs (if I cannot launch different kernels from different threads, this will significantly reduce the speedup of using GPUs instead of CPUs for square SIMD programs):

But I thought that for concurrent computation on different streams, the kernels must come from the same context. If that is the case, how can every thread acquire/release the context (like a critical section) and the kernels still be executed concurrently on the GTX 480?

Every kernel launch is asynchronous on the CPU side, which means it returns immediately on the host while the kernel has not actually executed yet.

If the host is fast enough and feeds another kernel into a different stream, you may get the chance to run those two kernels concurrently.

But that only happens if they are in different streams.

I think there is a subsection about streams in the latest programming guide.
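
For example (kernelA, kernelB, grid, and block are placeholders; whether the two kernels actually overlap depends on the card and on how many resources each one needs):

cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);

kernelA<<<grid, block, 0, s1>>>();   // returns immediately on the host
kernelB<<<grid, block, 0, s2>>>();   // may overlap with kernelA on a Fermi card

cudaDeviceSynchronize();             // block the host until both kernels finish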

OK, so you confirm that we cannot launch concurrent kernels from different threads, and that we can only launch concurrent kernels from thread 0 of the CPU.

Hmm, I have not tried it yet, but… this link makes me think that perhaps it is possible.

http://developer.download.nvidia.com/compu…RT__DRIVER.html

Now, the way I see it, to handle a multi-threaded CUDA app you need the Context Management functions in order to push and pop the context between threads. One thread pushes the context, queues its async calls into its stream, and pops it, perhaps with a mutex to prevent race conditions. Not that hard. It also makes it simple to implement multi-GPU solutions by having two threads per GPU, but that needs a little more magic to keep the contexts separate per GPU.

The problem, though, seems to be that Context Management is a driver-API-only thing, at least according to the documentation. The programming guide states that you should use either the runtime API or the driver API, not both. But the link above seems to suggest that it is possible, and even advisable, to use the driver functions cuCtxPopCurrent and cuCtxPushCurrent while otherwise primarily using the runtime API.
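
Something like this, roughly (untested; ctx, myStream, and the kernel are placeholders; the mutex serializes ownership of the floating context, but because the launch is asynchronous the lock is held only briefly, so kernels in different streams can still overlap on the device):

#include <cuda.h>
#include <mutex>

CUcontext ctx;                 // the one floating context shared by all threads
std::mutex ctxLock;            // only one thread may hold the context at a time

__global__ void kernel() { }   // placeholder kernel

void launchInStream(cudaStream_t myStream)
{
    std::lock_guard<std::mutex> guard(ctxLock);
    cuCtxPushCurrent(ctx);                  // attach the floating context to this thread
    kernel<<<1, 64, 0, myStream>>>();       // asynchronous: queues the launch and returns
    cuCtxPopCurrent(&ctx);                  // detach it so another thread can push it
}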

Another alternative is to fully use the Driver API instead of the Runtime API.

As I said before, I have never actually programmed a multi-threaded CUDA app, but from what I see it is definitely supported, even with the runtime API.