OpenMP to CUDA

I have an OpenMP code (C++) that goes something like this:

#pragma omp parallel for
for (int i = 0; i < NumCalculations; i++)
{
    int iThread = omp_get_thread_num();
    d[i] = foo_collection[iThread].DoSomething(a[i], b[i], c);
}

foo_collection is defined as:

std::vector<foo> foo_collection;

The class “foo” is old legacy FORTRAN 77 code reincarnated in its C++ avatar and performs thermodynamic calculations. The class has about a dozen methods to do different types of calculations. Sometimes one or more input parameters to DoSomething() can be an array (std::vector), and the output is usually a double or an array. An instance of class “foo”, when initialized, holds some data (critical properties, molecular weights, etc.). The storage is negligible (about 200-300 KB max).

I have some nested loops, but the above form is the most common. As far as hardware goes, I have a Quadro M1000M.
Any suggestions on how to convert this to CUDA, especially the nested loops?

Thanks in advance

It is difficult to provide specific advice for vague questions with few details. What have you tried so far? What bottlenecks did you find?

I guess the first question to ask is whether you have sufficient parallelism to make it worthwhile to port this to the GPU. With OpenMP one typically has tens of threads. With GPUs, you would want tens of thousands of threads. Also, if the working set is tiny (300 KB fits into the L2 cache of modern CPUs), a CPU may be the best platform to use, especially if you use SIMD computation aggressively.

A straightforward way to parallelize is that thread i computes d[i], i.e. each thread is responsible for computing one or several pieces of destination data, and gathers its source data accordingly. Refine this to regularize read access patterns as much as possible (GPU memory throughput suffers dramatically with random access patterns).
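A minimal sketch of that pattern might look like the following. Note the assumptions: DoSomething() is refactored into a __device__ function, and FooParams / do_something are hypothetical stand-ins for the data and method of your “foo” class. The per-thread foo instances disappear, since every GPU thread only needs read-only access to the same small parameter set.

// Minimal sketch, not the poster's actual code: FooParams and do_something
// are hypothetical stand-ins for the "foo" class and its DoSomething() method.
#include <cstdio>
#include <cuda_runtime.h>

struct FooParams {                 // stand-in for critical properties, molecular weights, ...
    double mol_weight;
};

__device__ double do_something(double a, double b, double c, const FooParams &p)
{
    return (a + b) * c / p.mol_weight;   // placeholder for the real thermodynamic calculation
}

__global__ void do_something_kernel(const double *a, const double *b, double c,
                                    const FooParams *p, double *d, int n)
{
    // grid-stride loop: thread i computes d[i], d[i + stride], ...
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x) {
        d[i] = do_something(a[i], b[i], c, *p);
    }
}

int main()
{
    const int n = 100000;                // number of calls per timestep
    double *a, *b, *d;
    FooParams *p;
    cudaMallocManaged(&a, n * sizeof(double));
    cudaMallocManaged(&b, n * sizeof(double));
    cudaMallocManaged(&d, n * sizeof(double));
    cudaMallocManaged(&p, sizeof(FooParams));
    for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2.0 * i; }
    p->mol_weight = 18.015;

    int block = 256;
    int grid  = (n + block - 1) / block;
    do_something_kernel<<<grid, block>>>(a, b, 1.0, p, d, n);
    cudaDeviceSynchronize();
    printf("d[0] = %f, d[%d] = %f\n", d[0], n - 1, d[n - 1]);

    cudaFree(a); cudaFree(b); cudaFree(d); cudaFree(p);
    return 0;
}

Managed memory keeps the sketch short and works on a Maxwell-class part like the Quadro M1000M; explicit cudaMemcpy transfers and attention to coalesced access to a, b, and d would be the obvious refinements.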

Thanks for your response.
I haven’t tried anything yet. I just finished a basic tutorial on CUDA.

Right now there’s no urgent need for moving to CUDA; however, we will need to move beyond OpenMP if models keep getting larger and larger. If the trend continues, we could be looking at upwards of 100K calls to DoSomething() per iteration/timestep. There’s a good chance old-timers on this forum have encountered situations like this, and I would like to hear about their experiences and suggestions.