Can I use OpenACC to parallelize a code with function calls?

I’m trying to parallelize my sequential C code and offload it to an NVIDIA GPU with OpenACC (PGI compiler).

My code is written sequentially, and it frequently calls very long functions, like below.

int main()
{
   // blah blah...
   for (int i = 0; i < 10; i++)
   {
      for (int j = 0; j < 20; j++)
      {
          big_function(a, b, c);
      }
   }
   // blah blah...
}

int big_function(int a, int b, int c)
{
   small_function_1(a);
   small_function_2_with_data_dependencies(b);
   return 0;
}

In that kind of case, can big_function() be parallelized and run on the GPU?

I declared the whole for loop as a parallelized region using #pragma acc kernels, like below.

#pragma acc routine
int big_function(int a, int b, int c);
#pragma acc routine
int small_function_1(int a);
#pragma acc routine
int small_function_2_with_data_dependencies(int b);

int main()
{
   // blah blah...
   #pragma acc data ~~~~
   #pragma acc kernels
   for (int i = 0; i < 10; i++)
   {
      for (int j = 0; j < 20; j++)
      {
          big_function(a, b, c);
      }
   }
   // blah blah...
}

int big_function(int a, int b, int c)
{
   small_function_1(a);
   small_function_2_with_data_dependencies(b);
   return 0;
}

But the compiled program takes a very long time to finish, and the result was not correct.

Can I use OpenACC to parallelize sequential code which uses many function calls?

Or do I have to break big_function() into smaller parts?


I’m using
Ubuntu 14.04 LTS
PGI 15.7
CUDA 6.5
NVIDIA GeForce GTX 960 (Maxwell architecture)

Can I use OpenACC to parallelize sequential code which uses many function calls?

Yes, and you have the right idea on how to do this. The only thing I would add is “seq” after “#pragma acc routine” to indicate that these are sequential routines. I would also add “#pragma acc routine seq” just before each function definition.

Or do I have to break big_function() into smaller parts?

You do not have to. Larger kernels do typically use more resources, so they may not be as performant as smaller kernels. However, this needs to be weighed against the added cost of additional kernel launches. Since this is more of an optimization step, I wouldn’t worry about it now.

But the compiled file takes very long time to finish.

Try adding the flag “-ta=tesla:cc50”. By default we generate binaries that will run on four different target compute capabilities. This helps portability but can lead to longer compile times, especially for large pieces of code. “cc50” says to compile only for the Maxwell architecture, which should help reduce the compile time.
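For example (the source file name “main.c” here is just a stand-in for yours):

```shell
# Target only compute capability 5.0 (Maxwell) instead of the four default targets;
# -Minfo=accel prints what the compiler parallelized and what data it moves.
pgcc -acc -ta=tesla:cc50 -Minfo=accel main.c -o main
```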

small_function_2_with_data_dependencies

This concerns me. What are the dependencies? If they are only within the function, it shouldn’t matter, since the function will be run sequentially. However, if the dependencies are across the outer parallel loops, they may be the source of your wrong answers.

And the result was not correct

Without a reproducing example, I can’t tell why you are getting incorrect answers.

Hope this helps,
Mat