OpenMP to CUDA

I have an OpenMP code (C++) that goes something like this:

#pragma omp parallel for
for (int i = 0; i < NumCalculations; i++)
{
    int iThread = omp_get_thread_num();
    d[i] = foo_collection[iThread].DoSomething(a[i], b[i], c);
}


foo_collection is defined as:

std::vector<foo> foo_collection;

The class “foo” is old legacy FORTRAN 77 code reincarnated in its C++ avatar and performs thermodynamic calculations. The class has about a dozen methods for different types of calculations. Sometimes one or more of the input parameters to DoSomething() can be an array (std::vector), and the output is usually a double or an array. An instance of class “foo”, when initialized, holds some data (critical properties, molecular weights, etc.). The storage is negligible (about 200–300 KB max).

I have some nested loops, but the above form is the most common. As far as hardware goes, I have a Quadro M1000M.
Any suggestions on how to convert this to CUDA, especially the nested loops?

Thanks in advance

It is difficult to provide specific advice for vague questions with few details. What have you tried so far? What bottlenecks did you find?

I guess the first question to ask is whether you have sufficient parallelism to make it worthwhile to port this to the GPU. With OpenMP one typically has tens of threads. With GPUs, you would want tens of thousands of threads. Also, if the working set is tiny (300 KB fits into the L2 cache of modern CPUs), a CPU may be the best platform to use, especially if you use SIMD computation aggressively.

A straightforward way to parallelize is that thread i computes d[i], i.e. each thread is responsible for computing one or several pieces of destination data, and gathers its source data accordingly. Refine this to regularize read access patterns as much as possible (GPU memory throughput suffers dramatically with random access patterns).
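The "thread i computes d[i]" mapping could look something like the sketch below. This is only an illustration under some assumptions: foo's per-instance data would have to be flattened into plain device arrays, DoSomething() rewritten as a __device__ function, and the names used here (do_something_dev, props, do_something_kernel) are all hypothetical, not from the original code:

```cuda
// Hypothetical device-side version of foo::DoSomething(). The placeholder
// body stands in for the real thermodynamic calculation.
__device__ double do_something_dev(double a, double b, double c,
                                   const double* props)
{
    return a * b + c * props[0];   // placeholder computation
}

// Each thread computes one (or several) elements of d, mirroring the
// OpenMP loop. A grid-stride loop lets any launch size cover n elements.
__global__ void do_something_kernel(const double* a, const double* b,
                                    double c, const double* props,
                                    double* d, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x)
    {
        d[i] = do_something_dev(a[i], b[i], c, props);
    }
}

// For a nested loop over (i, j) with bounds N x M, a common approach is
// to flatten the index inside the kernel:
//   int idx = blockIdx.x * blockDim.x + threadIdx.x;   // 0 .. N*M-1
//   int i   = idx / M;
//   int j   = idx % M;
// and launch enough threads to cover N * M elements. Alternatively, use a
// 2D grid (blockIdx.y for i, blockIdx.x/threadIdx.x for j).

// Host-side launch (error checking omitted; d_a, d_b, d_props, d_d are
// device pointers):
//   int threads = 256;
//   int blocks  = (n + threads - 1) / threads;
//   do_something_kernel<<<blocks, threads>>>(d_a, d_b, c, d_props, d_d, n);
```

Note that, unlike the OpenMP version, there is no per-thread foo instance here: since the instance data is read-only during the calculation, all threads can share a single copy in device memory, which also helps keep read accesses regular.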

Thanks for your response.
I haven’t tried anything yet. I just finished a basic tutorial on CUDA.

Right now there’s no urgent need to move to CUDA; however, we will need to move beyond OpenMP if models keep getting larger. If the trend continues, we could be looking at upwards of 100K calls to DoSomething() per iteration/timestep. There’s a good chance old-timers on this forum have encountered situations like this, and I would like to hear about their experiences and suggestions.