Cuda with openMP

Hi there,

I’m fairly new to Cuda. I’ve read Kirk and Hwu’s book and played around a little but I’ve not done anything substantial with it.

I was wondering about the possibility of launching multiple kernal functions concurrently.

For example I have an algorithm that requires a three separate stages of calculation, lets call them pre-processing, fft and post-processing.

The FFT can’t start until the pre-processing has completed and the post-processing can’t start until the FFT is finished but if I need to run the algorithm on multiple frames of data then the operations could be pipelined so that while frame 2 is being transformed by the FFT frame 1 is in post processing and frame 3 is in pre-processing.

As I understand it if I have three kernal functions, one for each stage Cuda on it’s own does not allow me to execute all three concurrently. Is that correct?

Assuming that the GPU has enough resources to carry out all three at once I could launch each stage from a separate thread but I’m not sure how a single GPU would handle that.

Is this an approach that would work? Has it been done before?

*Edit * It seems that what I’m trying to do is not possible. I would still be interested in any comments on this sort of problem.

Any comments are much appreciated.

What you want to do is possible on compute 2.0 GPUs. Just execute each of your independant pipelines in different streams and the driver/GPU will overlap them.

Thanks Mr Anderson,

I’m sorry I don’t know what you mean by compute 2.0. Did a quick search but didn’t find anything.

The GPUs I have access to currently are all Quadros.

Different NVIDIA GPUs have different CUDA features. These feature sets are given numbers, called “compute capabilities.” They are defined in one of the Appendices in the CUDA Programming Guide. Briefly:

compute capability 1.0: Original CUDA architecture of the 8800 GTX.

compute capability 1.1: Added atomic operations in global memory

compute capability 1.2: Added much improved memory controller for better coalescing behavior, atomic operations in shared memory, warp voting

compute capability 1.3: Added double precision

compute capability 2.0: Fermi architecture, adds lots of new features described in the whitepaper, including concurrent execution of kernels on different streams.

Appendix A of the CUDA programming guide lists the compute capability of all GPUs (as of when the document was written, anyway).

Ah Ok. Thanks for that. It seems that my GPUs won’t do the job. It’s good to know that it’s supported though if I can get my hands on a Fermi chip.

(Warning: commercial software required: the Kappa library which is available at

I wrote the Kappa library specifically for this. It uses a Producer/Consumer framework to allow specifying the steps of a task. If the steps are independent, it will automatically try to schedule them in parallel. Its actual ability to execute CUDA kernels in parallel depend on the underlying GPU being FERMI as mentioned previously. It is still able to overlap CPU (including OpenMP) and GPU operations automatically even without FERMI–either CPU calculations and/or memory transfers can overlap GPU calculations. Also, you can write your software now using non-FERMI Quadros and not change your software if/when you get FERMI class GPUs (Quadros or other GPUs such as the C2060).

Another also, the Kappa library (or the CUDA API) supports multiple GPUs so that even with non-FERMI Quadros you can get concurrent execution (but all scheduling by the Kappa library is done separately per GPU–you would have to handle the scheduling dependencies between GPUs yourself). My recommendation if you try to go the mulit-GPU route would be to use separate GPUs for separate task streams without trying to coordinate between GPUs.

Sorry for the commercial O:)

for (i = 0; i < n_sframes; i++) {



for (i = 0; i < n_sframes; i++) {

cudaStreamSynchronize(stream[i]) //GPU has to finish Preprocess for sream[i] and executes freely then the MainKernel for stream[i]


for (i = 0; i < n_sframes; i++) {

cudaStreamSynchronize(stream[i]) //GPU has to finish Mainprocess for sream[i] (preprocess is already finished) and executes freely then the PostKernel for stream[i]


It is compute capability 2.0 sorry about 1.1 I said previously, :-)

A nice asynchronous memory transfer and all set and go.

I think this should work. Is there sth wrong with this anyone?

You should break the video in chunks for memory requirements and create a nice asychronous cycle of input output feedance of frames from disk->host->device->host->disk.
Hope this helps and I am correct with the synchronize usage. Again correct me if I am wrong, always a learmer.
If the compiler sees successive cudaStreamSynchronize calls linear this should work for sure if not forget it, it is wrong. Can someone answer this question please? I am also interested in the correctness of this code.
It is easy to check this in device emulation mode with printf commands just display the index of the stream. I will check it myself also in the future.
This is interesting because you can fully grasp the scheduling in the queue of concurrent kernel execution in the Fermi architecture. Anyway.


What you (AlexanderAgathos) show would work–but it would not be as efficient as it could be. The stream synchronize blocks if the stream operations are not complete (that is the point of it after all). This means that, if the host for-loops are serial iterations through the loops, that possibly independent stream operations are blocking on the host serial for loop. It gets complicated to have non-serial (parallel) host for-loops with CUDA since CUDA has a GPU context per host thread --there is extra work involved in context switching in order to support multiple host threads and I have not seen it done yet (with OpenMP or otherwise–OpenMP is usually used for multi-GPU–not concurrent kernel launch).

On further thought, even if CUDA allowed a GPU context to span multiple host threads so that the for-loops were made parallel (say via OpenMP), then it would not still be as efficient as it could be–it (OpenMP at least) still waits for each for-loop step to completely finish before starting the next step. Only when CUDA supports as many host threads per GPU context and as many concurrent GPU kernel executions as you could ever possibly want would this be truly efficient. My understanding of the current state of CUDA 3.0 with FERMI GPU is one host thread per GPU context and 4 concurrent kernel executions.

Assuming CUDA allow(ed) as many host threads per GPU context as you wanted, the efficient way to code this would be (using OpenMP–the following is pseudo-code where inum_threads is some integer number):


#pragma omp parallel num_threads(inum_threads)

int i = omp_get_thread_num()





BTW, Kappa handles this by making the kernel launches (and memory transfers) asynchronous and having a scheduler to sort things out (hint: the scheduler is not just a serial for loop iteration across streams). The way that Kappa handles this scales properly from 1.0 GPUs up to FERMI (or beyond) very efficiently, does it today, and stays efficient even if the steps in question become subcomponents of some future component (OpenMP will force a synchronization at each subcomponent step/end of parallel regions and will therefore lose scaling efficiency).

Thanks for the reply. Though I am not yet convinced about the serialization in the for loops and the scheduling that CUDA does by itself.

For instance take a look at this thread :

If it was done serial then all the kernels would execute first and then the memory transfer would occur.

So there is a need to have a full documentation of what is going on with streams and what scheduling is done when the compiler comes accross a synchronization of a stream since remember a stream synchronization is taking place with the asychronous memory transfer this is why they have a feed for the stream in the arguments so that it can be queried and used by the device to coordinate a memory transfer.

So in this for loop

it is actually a

for (i){


transfer the memory back to host asynchronously


so here comes exactly what I said.



According to the NVIDIA documentation, operations associated with a stream execute in the order they are invoked (on the host thread) and operations associated with separate streams can execute in any order (and probably will). Also according to the NVIDIA documentation, invoking stream synchronization causes the host thread to block until all operations associated with the stream prior to the synchronization event on the stream have registered as completed (note that completed and registered as completed are two slightly different things–completion is necessary but not sufficient for registration). (Alternatively, you may put an event on the stream and ask, without blocking, if all prior stream associated operations are registered as completed–this is implied because the event has registered as completed and it was invoked after the other operations.) This is all of the information needed–assuming that more than one host thread does not use a given stream so that the order of invocation of operations associated with a stream is determined (and deterministic).

You can either: associate all dependent operations on the same stream in the order that they need to complete such that memory copies necessarily are executed in the correct order with respect to kernel execution (and then use blocking synchronization or asynchronous checking of events on the stream to determine that execution completes up to certain points prior to starting other operations) or (as Kappa does) put operations on different streams and have a scheduler that uses the stream event completion statuses and other information to ensure proper ordering of operations–even operations that are not CUDA related. With Kappa, you can use the first method, stream association, to ensure that certain operations proceed efficiently in the correct order but Kappa does not rely solely on this mechanism.

Indeed the second is faster. Thanks for the info.