Unanswered question: CUDA and OpenMP?

Hi everybody,

I have two questions about CUDA streams on GTX 480 and OpenMP

  1. Can I transfer data to/from the GPU in parallel using OpenMP threads and CUDA streams? Is it also possible to use OpenMP threads to launch concurrent kernels on different streams? If those two options are not allowed, is it because GPU/CPU communication can only go through thread 0?

  2. I have a square SIMD algorithm, that is to say a SIMD (SIMDB, B for Big) that calls another SIMD (SIMDS, S for Small). I want to know whether it is better to use one big loop (the SIMDB) that launches the kernel performing SIMDS, or to shorten that loop and launch concurrent kernels on different streams. In my opinion it depends on the memory size, but I am not sure how.
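
For concreteness, here is roughly what I mean by the two options (simdS_kernel, grid, block, data, CHUNK, N_BIG, streams, and nStreams are made-up names for illustration):

// Option 1: one big loop (the SIMDB) launching the SIMDS kernel
// back-to-back into the default stream, so the launches are serialized.
for (int i = 0; i < N_BIG; ++i)
    simdS_kernel<<<grid, block>>>(data + i * CHUNK);

// Option 2: the same work spread over several streams, so that
// independent SIMDS kernels may overlap on the device.
for (int i = 0; i < N_BIG; ++i)
    simdS_kernel<<<grid, block, 0, streams[i % nStreams]>>>(data + i * CHUNK);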

Thank you for your response

  1. By OpenMP threads, do you mean the threads implicitly managed by the OpenMP compiler and runtime system? I have never tried that, but as long as every thread acquires/releases the context (like a critical section), it should be OK.

Something like this (sorry, I don't quite remember the OpenMP pragmas):

#pragma omp parallel for
for (int i = 0; i < nStreams; ++i)
{
    // acquire the context and a stream for this thread
    kernel<<<grid, block, 0, mystream>>>();
    // release the context
}

I would say it depends on which card you have. If you have a Fermi card, it is beneficial to run on different streams because the kernels can be executed concurrently.
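
If it helps, a slightly fuller sketch (assuming a toolkit where all host threads share the runtime context; NUM_STREAMS, grid, block, and the kernel are placeholders):

cudaStream_t streams[NUM_STREAMS];
for (int i = 0; i < NUM_STREAMS; ++i)
    cudaStreamCreate(&streams[i]);             // one stream per loop iteration

#pragma omp parallel for
for (int i = 0; i < NUM_STREAMS; ++i)
    kernel<<<grid, block, 0, streams[i]>>>();  // each thread launches into its own stream

for (int i = 0; i < NUM_STREAMS; ++i)
{
    cudaStreamSynchronize(streams[i]);         // wait for the work queued in each stream
    cudaStreamDestroy(streams[i]);
}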

Hope it helps.

But I thought that for concurrent computation on different streams, the kernels must come from the same context. If that is the case, how can every thread acquire/release the context (like a critical section) and the kernels still be executed concurrently on the GTX 480?

Thank you again for your answer

Sorry for this reply, if it posts… I'm trying to make a new thread, but whenever I do so, it just sends me to the root of the NVIDIA forums… so I want to see if I can reply on a thread…

Using the context migration mechanism available in the CUDA driver API. I really would recommend against using OpenMP for this. OpenMP thread affinity is normally opaque, and it becomes very difficult to manage everything correctly with lots of threads, because typical OpenMP runtimes keep a pool of threads and just grab whatever is free to service a parallel code section. There is no guarantee that the same code will get the same thread twice. You would be better off using an explicit threading API, like Boost threads or native threads, with persistent threads.
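
For example, a minimal sketch with persistent native threads (C++ std::thread just for illustration; it assumes a runtime where all host threads share the device context, and the kernel is a placeholder):

#include <cuda_runtime.h>
#include <thread>
#include <vector>

__global__ void kernel() { }                // placeholder kernel

static void worker()
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);              // this thread owns one stream
    kernel<<<1, 64, 0, stream>>>();         // queue work into the private stream
    cudaStreamSynchronize(stream);          // wait for this stream's work
    cudaStreamDestroy(stream);
}

int main()
{
    std::vector<std::thread> pool;
    for (int i = 0; i < 4; ++i)
        pool.emplace_back(worker);          // one persistent thread per stream
    for (std::thread& t : pool)
        t.join();
    return 0;
}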

Even if I use native threads, I still don't know the answer to my previous question: how can every thread acquire/release the context (like a critical section) and the kernels still be executed concurrently on the GTX 480?

Thank you again

I hope that someone from the NVIDIA team can provide an answer to my last question; it is really important for the speedup comparison between CPUs and GPUs (if I cannot launch different kernels from different threads, this will significantly reduce the speedup of using GPUs instead of CPUs for square SIMD programs):

But I thought that for concurrent computation on different streams, the kernels must come from the same context. If that is the case, how can every thread acquire/release the context (like a critical section) and the kernels still be executed concurrently on the GTX 480?

Every kernel launch is asynchronous on the CPU side, which means it returns immediately on the host while the kernel has not actually executed yet.

If the host is fast enough and feeds another kernel into a different stream, you may get the chance to run those two kernels concurrently.

But that only happens if they are in different streams.

I think there is a subsection about streams in the latest programming guide.
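
For example (kernelA, kernelB, grid, and block are placeholders; whether the two kernels actually overlap depends on the card and on how many resources each one needs):

cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);

kernelA<<<grid, block, 0, s1>>>();   // returns immediately on the host
kernelB<<<grid, block, 0, s2>>>();   // may overlap with kernelA on a Fermi card

cudaDeviceSynchronize();             // block the host until both kernels finish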

OK, so you confirm that we cannot launch concurrent kernels from different threads, and that we can only launch concurrent kernels from thread 0 of the CPU.

Hmm, I have not tried it yet, but… this link makes me think that perhaps it is possible.

http://developer.download.nvidia.com/compu…RT__DRIVER.html

Now, the way I see it, to handle a multi-threaded CUDA app you need the Context Management functions in order to push and pop the context between threads. One thread pushes the context, queues its async calls into its stream, and pops it, perhaps with a mutex to prevent race conditions. Not that hard. It also makes it simple to implement multi-GPU solutions by having two threads per GPU, but that needs a little more magic to keep the contexts separate per GPU.

The problem, though, seems to be that Context Management is a driver-API-only thing, at least according to the documentation. The programming guide states that you should use either the runtime API or the driver API, not both. But the link above seems to suggest that it is possible, and even advisable, to use the driver functions cuCtxPopCurrent and cuCtxPushCurrent while otherwise primarily using the runtime API.
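
Something like this, roughly (untested; ctx, myStream, and the kernel are placeholders; the mutex serializes ownership of the floating context, but because the launch is asynchronous the lock is held only briefly, so kernels in different streams can still overlap on the device):

#include <cuda.h>
#include <mutex>

CUcontext ctx;                 // the one floating context shared by all threads
std::mutex ctxLock;            // only one thread may hold the context at a time

__global__ void kernel() { }   // placeholder kernel

void launchInStream(cudaStream_t myStream)
{
    std::lock_guard<std::mutex> guard(ctxLock);
    cuCtxPushCurrent(ctx);                  // attach the floating context to this thread
    kernel<<<1, 64, 0, myStream>>>();       // asynchronous: queues the launch and returns
    cuCtxPopCurrent(&ctx);                  // detach it so another thread can push it
}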

Another alternative is to fully use the Driver API instead of the Runtime API.

As I said before, I have never actually programmed a multi-threaded CUDA app, but from what I see it is definitely supported, even with the runtime API.