Porting if-else statements and function evaluation to GPU

I am trying to port the following code to the GPU, keeping all data in GPU memory space:

if (function(array,a) > x){
	out = y;
} else if (function(array,b) > x){
	out = z;
}

To keep the results of function(array,a) and function(array,b) on the GPU, I must port the entire block to the GPU. function() loops over the array elements and manipulates them using the integer a; it has a #pragma acc loop directive to parallelize that loop. If I compile function() for the GPU using the OpenACC routine directive (with no gang, worker or vector clause), I can do the following:

#pragma acc parallel copyin(array,a,b,x,y,z) copyout(out) num_gangs(1) vector_length(1)
{
... same code as on the CPU ...
}

But this code is really slow. I believe this is because the num_gangs(1) and vector_length(1) clauses also apply inside the device routine function(), which is therefore executed sequentially by a single thread. The following is much faster:

#pragma acc enter data copyin(array,a,b,x,y,z) create(s1,s2,out)

#pragma acc parallel present(array,a,b,s1,s2)
{
	s1 = function(array,a);
	s2 = function(array,b);
}

#pragma acc parallel present(s1,s2,x,y,z,out)
{
	if (s1 > x){
		out = y;
	} else if (s2 > x){
		out = z;
	}
}

#pragma acc exit data delete(array,a,b,x,y,z,s1,s2) copyout(out)

However, this requires evaluating function(array,b) even when it is not necessary (i.e. when function(array,a) > x already holds). How can I best port my if-else statement and the function() evaluations to the GPU?

Hi LO_UZH,

This would be a case to use nested parallelism, but I question if it’s necessary to move the serial code over to the GPU to begin with.

Typically, you would run the serial code on the CPU and only manage the device data at this level. The compute regions would most likely be in your functions placed around your parallel loops. Putting scalar code on the accelerator should be avoided unless there’s a compelling reason to do so.

Something more like:

int function (float * restrict arr, float val) {
...
#pragma acc kernels loop present(arr)
for (int i=0; i < size; ++i) {
    ... some parallel code ...
}
....
return something;
}
...
#pragma acc enter data copyin(array) 
...
if (function(array,a) > x){ 
   out = y; 
} else if (function(array,b) > x){ 
   out = z; 
}
...
#pragma acc exit data delete(array)
  • Mat

Hi Mat,

Thanks for the reply!

This example doesn’t properly illustrate what’s really happening in one of my programs. In reality, the array is generated on the device, and function() contains a parallel loop that sums all array elements greater than a certain value (a or b in my example). This is done thousands of times during the execution of the program, so I want to avoid all data transfers except for out at each iteration, since they really hurt performance (small transfers but large latency).

What works best is the following:

void function( const int *array, int &a, int &out){
	int sum;

	#pragma acc enter data create(sum)

    #pragma acc parallel present(sum) num_gangs(1) vector_length(1)
    sum = 0;
	
	#pragma acc parallel loop present(sum,array,a[:1]) reduction(+:sum)
	for (int i=0; i<1000000; ++i){
		if (array[i] > a){
			sum += array[i];
		}
	}

	#pragma acc parallel present(out[:1],sum) num_gangs(1) vector_length(1)
	out = sum;

	#pragma acc exit data delete(sum)
}

And in main routine:

int main(){

... array is generated, a, b, x, y, z are already on the device

int s1, s2;

#pragma acc enter data create(s1,s2)
function(array,a,s1); 
function(array,b,s2);

// Now s1 and s2 are on the device and contain the evaluation of function

#pragma acc parallel present(s1,s2,x,y,z,out) num_gangs(1) vector_length(1)
{
   if (s1 > x){
      out = y;
   } else if (s2 > x){
      out = z;
   }
}

#pragma acc exit data copyout(out) delete(...)
}

This works well but forces the evaluation of both function(array,a) and function(array,b) before the output can be returned, as I said in my previous post.

How could I use nested parallelism here, while having no data movements to/from device except for out?

Hi LO_UZH,

I think you’re making this more complex than it needs to be. There’s no need to manage the scalars here, and the only transfer would be the reduction variable (which you said was ok). By doing it your way, you’ve added several sequential kernel launches which will most likely add more overhead than transferring the one reduction variable.

// Does "a" need to be passed by reference?  I'm assuming not
// Also, I'm changing the function to return an int instead of using "out"
int function( const int *array, int a){ 
   int sum; 

// no need to use these pragmas, only put the reduction variable in a 
// data region if you don't want to use it on the host and/or if it's used in 
// subsequent compute regions. 
// Also, while you might reduce the data transfer, you're adding the overhead
//  of launching a kernel.
// #pragma acc enter data create(sum) 
// #pragma acc parallel present(sum) num_gangs(1) vector_length(1) 
    sum = 0; 
    
// Note that "a" will be passed in as an argument to the kernel and be 
// initialized to a local variable.  That variable will most likely be put in 
// a register for fast access.  The way you had it, "a" lives in device
// global memory, so instead of just passing in the value, you're passing
// the kernel a pointer to "a".
// Hence, no savings in kernel arguments plus you force every thread to 
// go get a's value from global memory.

   #pragma acc parallel loop present(array) reduction(+:sum)
   for (int i=0; i<1000000; ++i){
      if (array[i] > a){
           sum += array[i];
      }
   }

// You're not saving anything here.  You still transfer back the "out"
// variable (why not just transfer back "sum"?), plus you've added an
// additional kernel launch.
//   #pragma acc parallel present(out[:1],sum) num_gangs(1) vector_length(1) 
//   out = sum; 

//   #pragma acc exit data delete(sum) 
     return sum;
} 

....

int main(){ 

... array is generated, a, b, x, y, z are already on the device 

   if ( function(array,a) > x){ 
      out = y; 
   } else if (function(array,b) > x){ 
      out = z; 
   } 
}
  • Mat

Hi Mat,

The problem with what you’re suggesting is that “function” returns “sum” to the host, and the if-else statement is evaluated on the host. It needs to be done on the device.
The output of “function” should therefore stay on the device since 1) I do not need or want it on the host, and 2) it is used for the if-else statement on the device later on in the program. Remember that my goal is to port the entire code to the device (function evaluations + if-else statement). Nothing should be done on the host.

Basically, what I would like is:

// copyin all the necessary data to the device
// No more D2H or H2D data transfer, function() executed on device
if (function(array,a) > x){
   out = y;
} else if (function(array,b) > x){
   out = z;  
}
// Now copyout out in a D2H transfer

We should therefore have “array” copied in (and maybe a, b, x, y and z, but I’m not sure), and then only “out” copied out. The results of function(array,a) and function(array,b) are unknown to the host.

Many thanks for your help.

Hi LO_UZH,

Although for this code there is no reason to put this IF statement on the device, I assume in your real application there is.

Again, the best solution here would be nested parallelism. It would allow you to launch a sequential kernel that then launches the parallel kernel within the function. But given we don’t have it implemented yet, it’s not something I can offer you at this time.
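Just for reference, it would look roughly like the sketch below (purely hypothetical, since we haven’t implemented nested parallelism; the exact syntax and clauses are assumptions): a sequential outer kernel evaluates the if-else on the device, and each call to function() launches its own parallel reduction kernel from device code.

#pragma acc routine seq
int function( const int *array, int a){
   int sum = 0;
   // nested compute construct, launched from device code (not supported yet)
   #pragma acc parallel loop reduction(+:sum)
   for (int i=0; i<1000000; ++i){
      if (array[i] > a){
         sum += array[i];
      }
   }
   return sum;
}

...

#pragma acc parallel num_gangs(1) vector_length(1) present(array)
{
   if ( function(array,a) > x){
      out = y;
   } else if (function(array,b) > x){
      out = z;
   }
}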

Instead, let’s use “routine” to move the function over to the device (alternatively, you could inline the function). The caveat here is that reductions in a “routine” won’t be available until the 15.x release, will only work on CC 3.5 or later devices, and will be limited to “vector” or “worker” routines.

#pragma acc routine vector
int function( const int *array, int a){
   int sum = 0;
   #pragma acc loop reduction(+:sum)
   for (int i=0; i<1000000; ++i){
      if (array[i] > a){
           sum += array[i];
      }
   }
   return sum;
}

.... 

int main(){ 

... array is generated, a, b, x, y, z are already on the device 

#pragma acc parallel 
{
   if ( function(array,a) > x){ 
      out = y; 
   } else if (function(array,b) > x){ 
      out = z; 
   } 
}
}

Also, you’ll be limited to a single gang here so it will be detrimental to performance unless you have more parallelism outside this IF statement.
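For example, if your real code evaluates many independent cases, something like the following sketch could recover gang-level parallelism (the names ncases, n, sub and the out array are made up for illustration; function() here is the “routine vector” version above, so each call uses the vector lanes of its gang):

#pragma acc parallel loop gang present(array,out)
for (int c = 0; c < ncases; ++c){
   const int *sub = &array[c*n];   // c-th independent subarray of length n
   if ( function(sub,a) > x){
      out[c] = y;
   } else if (function(sub,b) > x){
      out[c] = z;
   }
}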

  • Mat

Hi Mat,

I wasn’t aware that nested parallelism wasn’t supported by the PGI compiler! I’m not quite sure what you mean when you say that:

The caveat here is that reductions in “routine” wont be available until the 15.x release, only work on CC3.5 or later devices, and limited to use in “vector” or “worker” routines

Is the reduction clause supported or not at the moment (albeit with vector or worker clauses only)?

Regardless, it seems to me that the best way to fully use the device’s computing power is to evaluate “function” outside of the if-else statement for both “a” and “b”, as suggested in my post of Nov 26 (i.e. a parallel loop directive within function, with the output stored on the device via a variable passed by reference).

You’re right though, I have to launch FOUR kernels per “function” execution to do this:

  1. Set “sum” to zero
  2. Perform the parallel loop
  3. Reduction of “sum”
  4. Store “sum” in the output variable (“out” in the code above) on the device.

This is very expensive of course.
Is there no way to at least get rid of the kernel launches 1 and 4? Or is the only way to do so to copyin “sum” set to zero on the host, and copyout “sum” to the host to use it there? Is this different in CUDA?

Is the reduction clause supported or not at the moment (albeit with vector or worker clauses only)?

In general the “reduction” clause is supported.

However, reductions on gang loops are typically implemented by creating a partial reduction per gang followed by launching a second kernel to perform the final reduction. This is not technically possible to do when the reduction is contained within a “routine gang”.

For “vector” and “worker” reductions, the outer “gang” loop needs to perform synchronization and the final reduction. However, when calling a “routine vector” or “routine worker”, the outer “gang” doesn’t know it needs to do the final reduction. Instead, the reduction needs to be self-contained in the routine. To accomplish this, either we’d need to limit the vector length to 32 (no synchronization is required within a warp) or use hardware instructions only available on NVIDIA devices with compute capability 3.5 or greater. The latter is the support being added to the PGI 15.x compilers.

Is there no way to at least get rid of the kernel launches 1 and 4? Or is the only way to do so to copyin “sum” set to zero on the host, and copyout “sum” to the host to use it there? Is this different in CUDA?

For #1, I’d set it to zero before the data region and then do a copyin, or set it to zero on the host and use the “update” directive.
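If you do keep “sum” in a data region, the second option is just a sketch like this:

   sum = 0;                        // set on the host
   #pragma acc update device(sum)  // push the value to the existing device copy

This replaces the first kernel launch with a tiny host-to-device transfer.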

For #4, currently you would need to put this in a kernel. Note that I made a proposal to the OpenACC committee to add a device-to-device copy runtime library routine which you could use once/if it becomes part of the standard.
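If/when such a routine is adopted, the usage might look something like the sketch below (the name acc_memcpy_device is purely illustrative here, not something you can call today):

   // copy the device copy of "sum" into the device copy of "out", no kernel launch
   acc_memcpy_device( acc_deviceptr(&out), acc_deviceptr(&sum), sizeof(int) );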

  • Mat