I am trying to port the following code to the GPU, keeping all data in GPU memory space:
if (function(array,a) > x){
    out = y;
} else if (function(array,b) > x){
    out = z;
}
To keep the results of function(array,a) and function(array,b) on the GPU, I must port the entire block to the GPU. function() loops over the array elements and manipulates them with its integer argument (a or b). The loop is parallelized with a #pragma acc loop directive. If I compile function() for the GPU using the OpenACC routine directive (with no gang, worker or vector clause), I could do the following:
#pragma acc parallel copyin(array,a,b,x,y,z) copyout(out) num_gangs(1) vector_length(1)
{
    ... same code as on the CPU ...
}
But this code is really slow. I believe this is because the num_gangs(1) and vector_length(1) clauses also apply inside the device routine function(), so the loop in it is executed sequentially by a single thread. The following is much faster:
#pragma acc enter data copyin(array,a,b,x,y,z) create(s1,s2,out)
#pragma acc parallel present(array,a,b,s1,s2)
{
    s1 = function(array,a);
    s2 = function(array,b);
}
#pragma acc parallel present(s1,s2,x,y,z,out)
{
    if (s1 > x){
        out = y;
    } else if (s2 > x){
        out = z;
    }
}
#pragma acc exit data delete(array,a,b,x,y,z,s1,s2) copyout(out)
However, this always evaluates function(array,b), even when the first condition is already true and its result is not needed. How can I best port my if-else statement and the function() evaluations to the GPU?
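One direction that might address both problems is sketched below, as an untested idea rather than a known solution. It assumes function() can be declared with `#pragma acc routine vector`, so that its loop is spread across the vector lanes of a single gang, and that the calling region uses num_gangs(1) without vector_length(1). The body of function() (a weighted sum), the extra length argument n, the double element type, the wrapper select_out() and the fallback value are all hypothetical stand-ins, not my real code.

```c
/* Hypothetical stand-in for function(): sum of array[i] * a.
 * "routine vector" lets the loop inside use vector parallelism
 * when the routine is called from a compute region. */
#pragma acc routine vector
double function(const double *array, int n, int a)
{
    double sum = 0.0;
    #pragma acc loop vector reduction(+:sum)
    for (int i = 0; i < n; ++i)
        sum += array[i] * a;
    return sum;
}

/* Hypothetical wrapper: the if/else runs once on the device (one gang),
 * and function(array, n, b) is only called when the first test fails.
 * "fallback" is an assumed value for the case where neither test passes. */
double select_out(const double *array, int n, int a, int b,
                  double x, double y, double z, double fallback)
{
    double out = fallback;
    #pragma acc parallel num_gangs(1) copyin(array[0:n]) copy(out)
    {
        if (function(array, n, a) > x) {
            out = y;
        } else if (function(array, n, b) > x) {
            out = z;
        }
    }
    return out;
}
```

In my real code the array would already be on the device, so copyin(array[0:n]) would become present(array). I do not know whether a single gang gives enough parallelism for large arrays, or how well a given compiler handles the reduction inside a vector routine.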