Parallelize across CPU and GPU cores simultaneously


I would like to speed up my code using a Tesla card. So far I use OpenMP in Fortran to parallelize my code across N CPU cores. In addition, I would like to add an extra layer of parallelism using one Tesla graphics card. In particular, I would like to offload one function to the GPU from within the loop that is parallelized over the CPU cores. Does that sound feasible? My understanding is that I can either parallelize on the CPU and then, after those computations, call one kernel that parallelizes on the GPU. I am therefore not sure whether, using multiple CPU cores and one Tesla card, I can run code on the CPU and GPU in parallel (rather than sequentially). Or does the compiler efficiently spread computations across CPU and GPU cores? I'd also be very grateful if you could point me to an example (if any).

Thank you very much.

Hi Pete,

It’s certainly feasible to have OpenACC within OpenMP parallel regions. The tricky part is that with OpenACC you still have to manage the discrete data on the target device (in your case, the Tesla GPU) for each OpenMP thread. This can be a bit unnatural for OpenMP, given its assumption that all threads can access the same memory. Generally I recommend MPI+OpenACC for multi-GPU programming, since with MPI the data is already discrete, though OpenMP+OpenACC is still possible with a bit more effort.

The easiest thing to do is to put the OpenACC code in a separate function that each of the OpenMP threads calls. So long as you’re passing in each thread’s individual data, it’s straightforward. The caveat is that you’ll need to move the data back and forth between the CPU and GPU every time you enter the OpenACC region, and thus lose some performance. If you assume that your program will only use a single device, then you can move the OpenACC data region higher up in your program, since all the OpenMP threads will share the same device context and can thus share data on the device.
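To illustrate the first approach, here’s a minimal sketch (names, array sizes, and the computation are placeholders, not from your program): each OpenMP thread calls a routine containing an OpenACC compute region, with the data clauses on the region itself, so arrays move between host and device on every call.

```fortran
! Hypothetical sketch: OpenMP threads each calling an OpenACC routine
! with their own private data.  Requires an OpenACC compiler (e.g.
! nvfortran) and a GPU; x, y, and the loop body are placeholders.
program omp_acc_sketch
  use omp_lib
  implicit none
  integer, parameter :: n = 100000
  real(8) :: x(n), y(n)
  !$omp parallel private(x, y)
  x = real(omp_get_thread_num(), 8)
  call gpu_work(x, y, n)          ! each thread passes its own data
  !$omp end parallel
contains
  subroutine gpu_work(x, y, n)
    integer, intent(in)  :: n
    real(8), intent(in)  :: x(n)
    real(8), intent(out) :: y(n)
    integer :: i
    ! Data is copied to the device on entry and back on exit,
    ! once per call -- this is the per-call transfer cost.
    !$acc parallel loop copyin(x) copyout(y)
    do i = 1, n
       y(i) = 2.0d0*x(i) + 1.0d0  ! placeholder computation
    end do
  end subroutine gpu_work
end program omp_acc_sketch
```

Hoisting an `!$acc data` region above the `!$omp parallel` region (the second approach) would keep shared, read-only data resident on the device across all the threads’ calls.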

With every thread using the same device you will have contention. If a kernel fully saturates the GPU, then each CPU thread will need to wait for the others to complete before being able to use the device. You’ll get some overlap as one kernel releases multiprocessors and the next kernel begins to use them. If a kernel uses few resources, then multiple kernels can run at the same time.

  • Mat

Hi Mat,

Thank you very much, this is very useful.

Following up on your last paragraph. Is there an OpenACC option to specify that, with N parallel CPU threads, each call to the device only uses 1/N of the GPU’s processors, to avoid overlap or waiting times?

I intend to perform linear interpolation on a multidimensional grid on the device. Each CPU thread interpolates the same function, so the multidimensional grid data is identical across all CPU threads and should therefore be shared among all CUDA threads. In addition, each CPU thread will need to pass a specific grid-point combination to the device and receive the interpolated value back for further computations within that thread. Does this make sense? Is there a way to avoid overlap/waiting times on the device?


Is there an OpenACC option to specify that, with N parallel CPU threads, each call to the device only uses 1/N of the GPU’s processors, to avoid overlap or waiting times?

There’s no direct way of stating that a kernel should use only 1/Nth of the multiprocessors on a GPU. You can set the number of OpenACC gangs to use, but you’d need to figure out how many gangs (CUDA blocks) can run on each multiprocessor (SM); it can range from 1 to 16, depending on the device, the number of threads per block, and the amount of resources, such as registers and shared memory, each thread uses. You could simply assume one gang per SM, but you would most likely be losing performance since you wouldn’t fully utilize the device.
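As a sketch of what setting the gang count looks like (`nsm` and `nthreads` are placeholder variables you’d have to fill in yourself, e.g. by querying the device), assuming one gang per SM split evenly across the CPU threads:

```fortran
! Hedged sketch: capping the gang (CUDA block) count on a compute
! construct.  Assumes one gang per SM, which, as noted above, will
! likely under-utilize the device.  nsm = number of SMs on the GPU,
! nthreads = number of OpenMP threads sharing it (placeholders).
ngangs = max(1, nsm / nthreads)
!$acc parallel loop num_gangs(ngangs) copyin(x) copyout(y)
do i = 1, n
   y(i) = 2.0d0*x(i)   ! placeholder computation
end do
```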

Is there a way to avoid overlap/waiting times on the device?

Assuming you have enough work to saturate the GPU, it shouldn’t matter much whether you do all the work in a single kernel launch or divide it across multiple concurrent kernel launches; the same amount of work is performed. There’s some difference in overhead cost, but I doubt it would have much overall impact. Of course this is speculative, especially without knowing anything about your program, so I’d suggest experimenting to see what’s best.

Now, if each OpenMP thread only has a small amount of work for the GPU, then it makes sense to have multiple CPU threads use the same GPU to increase its utilization.
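One way to help small kernels from different threads overlap on the device is to give each OpenMP thread its own OpenACC async queue. A speculative sketch (the thread-ID-to-queue mapping is just one possible convention):

```fortran
! Speculative sketch: per-thread async queues so that small kernels
! launched by different OpenMP threads can overlap on the GPU.
! tid comes from omp_get_thread_num(); x, y are thread-private.
tid = omp_get_thread_num()
!$acc parallel loop async(tid+1) copyin(x) copyout(y)
do i = 1, n
   y(i) = 2.0d0*x(i)   ! placeholder computation
end do
!$acc wait(tid+1)      ! each thread waits only on its own queue
```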

  • Mat