Hybrid runs on CPU and GPU - OpenACC

Hi,

I am looking to set up hybrid runs of an OpenACC + OpenMPI code, where I want to run the pure CPU version of the code on some nodes and the OpenACC version on the GPUs of other nodes.

From the discussion in the link below, it looks like ACC_DEVICE_TYPE determines whether the OpenACC version of the code runs on the CPU or the GPU; please correct me if that is not the case. If so, is it possible to disable the OpenACC pragmas at run time?

Thanks and regards

Nitya

Hi Nitya,

To toggle the OpenACC offload on or off during runtime, the easiest thing to do is use the “if” clause with a boolean value. Something like:

!$acc parallel loop if(usegpu)

Then set “usegpu” to true or false depending on whether you want the loop to run on the GPU or serially on the host. It does mean that you need to carry this variable around in your code, but hopefully that’s not much of an issue.

If you want to switch the target device so the loop instead runs in parallel using host threads, use the API call “acc_set_device_type”. The environment variable “ACC_DEVICE_TYPE” would apply to the whole program. Be sure to compile with the “-acc=gpu,multicore” flag so both targets are included when creating the binary.
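
For illustration, here’s a rough C sketch of both approaches; the “usegpu” flag and the function names are just placeholders, not anything from your code:

#include <openacc.h>

void add_vectors(int n, const double *a, const double *b, double *c, int usegpu)
{
  // Offload only when usegpu is true; otherwise the loop runs serially on the host.
  #pragma acc parallel loop if(usegpu) copyin(a[0:n], b[0:n]) copyout(c[0:n])
  for (int i = 0; i < n; i++)
    c[i] = a[i] + b[i];
}

void select_target(int usegpu)
{
  // Switch the target for all subsequent compute regions: the GPU, or the
  // host running the loops in parallel with multicore threads.
  if (usegpu)
    acc_set_device_type(acc_device_nvidia);
  else
    acc_set_device_type(acc_device_host);
}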

Hope this helps,
Mat


Thanks Mat. I have attached a sample code along with the output from compiling and running it as follows:

hybrid-test.zip (2.4 KB)

export PATH=/opt/nvhpc/2022_222/Linux_x86_64/22.2/compilers/bin/:$PATH
export LD_LIBRARY_PATH=/opt/nvhpc/2022_222/Linux_x86_64/22.2/compilers/lib:$LD_LIBRARY_PATH
export PATH=/opt/nvhpc/2022_222/Linux_x86_64/22.2/comm_libs/openmpi4/openmpi-4.0.5/bin/:$PATH
export LD_LIBRARY_PATH=/opt/nvhpc/2022_222/Linux_x86_64/22.2/comm_libs/openmpi4/openmpi-4.0.5/lib:$LD_LIBRARY_PATH

export MPI_PATH=/opt/nvhpc/2022_222/Linux_x86_64/22.2/comm_libs/openmpi4/openmpi-4.0.5/

mpicc -g -traceback -O2 -acc=gpu,multicore -o hybrid hybrid.c

${MPI_PATH}/bin/mpirun -x ACC_DEVICE_TYPE=nil -n 1 --host a100 sh -c './hybrid > out-pure-cpu.txt 2>&1' : -x ACC_DEVICE_TYPE=acc_device_nvidia -n 1 --host a100 sh -c './hybrid > out-gpu.txt 2>&1' : -x ACC_DEVICE_TYPE=acc_device_host -n 1 --host a100 sh -c './hybrid > out-cpu.txt 2>&1'

Could you please take a look at hybrid.c? The output files are:

a) out-pure-cpu.txt - Run on CPU, OpenACC switched off with if condition.
b) out-gpu.txt - Run on GPU
c) out-cpu.txt - Run on CPU with OpenACC

A couple of queries:

  1. Is there a cleaner way to check the value of ACC_DEVICE_TYPE set on the command line, perhaps through an OpenACC API call?
  2. I find that unsetting ACC_DEVICE_TYPE to run a pure CPU version ends in a seg fault, so I had to set it to nil and compare values inside the code. If I don’t unset it, does it default to acc_device_nvidia?
  3. What would the call acc_get_num_devices() return if OpenACC is run on the CPU?

Thanks again for the help.

Regards

Nitya

Hi Nitya,

  1. Is there a cleaner way to check the value of ACC_DEVICE_TYPE set on the command line, perhaps through an OpenACC API call?

Yes, “acc_get_device_type”. See Section 3.2.3 of the OpenACC Spec: https://www.openacc.org/sites/default/files/inline-images/Specification/OpenACC-3.2-final.pdf
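
For example, a minimal check (illustration only; the printed labels are mine):

#include <stdio.h>
#include <openacc.h>

int main(void)
{
  // Ask the runtime which device type it is currently targeting.
  acc_device_t dt = acc_get_device_type();
  if (dt == acc_device_nvidia)
    printf("targeting an NVIDIA GPU\n");
  else if (dt == acc_device_host)
    printf("targeting the host (serial or multicore)\n");
  else
    printf("other device type: %d\n", (int)dt);
  return 0;
}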

  2. I find that unsetting ACC_DEVICE_TYPE to run a pure CPU version ends in a seg fault, so I had to set it to nil and compare values inside the code. If I don’t unset it, does it default to acc_device_nvidia?

Not sure. The segv I’m seeing is in the strcmp. Setting ACC_DEVICE_TYPE to nil effectively unsets the variable, so I’m not sure what string is returned, most likely nothing. Using the env var to indicate that no OpenACC should be used is not going to work the way you want, so I’d suggest making this a program argument instead.
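
For reference, and just a guess at what the attached code is doing: strcmp on the result of getenv will segfault when the variable is unset, because getenv returns NULL. A NULL check avoids that:

#include <stdlib.h>
#include <string.h>

// Returns 1 if ACC_DEVICE_TYPE is set to the given value, 0 otherwise
// (including when the variable is unset and getenv returns NULL).
static int acc_env_matches(const char *value)
{
  const char *dev = getenv("ACC_DEVICE_TYPE");
  return (dev != NULL) && (strcmp(dev, value) == 0);
}

But as noted above, a program argument is the cleaner way to request “no OpenACC”.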

  3. What would the call acc_get_num_devices() return if OpenACC is run on the CPU?

It will return 1.
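
A quick way to see this (illustration only):

#include <stdio.h>
#include <openacc.h>

int main(void)
{
  // With the host as the target there is exactly one "device";
  // with a GPU target this reports the number of visible GPUs.
  printf("host devices:   %d\n", acc_get_num_devices(acc_device_host));
  printf("nvidia devices: %d\n", acc_get_num_devices(acc_device_nvidia));
  return 0;
}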

I’ve updated your code below to show how I’d do this. Pass a command line option to the program to toggle OpenACC on or off. If the code is running on a node with a GPU, it will default to using the GPUs; if no GPUs are present, it will run in parallel on the multicore CPU. You can also set ACC_DEVICE_TYPE=HOST when you want to run on the multicore CPU of a node that has a GPU.

Since you set different sizes for each of the cases, you’ve already identified the problem with this approach, i.e. load balancing the work. You need to give the GPUs a larger proportion of the work, else the ranks on the multicore or serial CPUs will become the performance bottleneck. This can be tricky to get right since the distribution of work depends on the system you’re using, how the ranks are scheduled on the nodes, and the workload itself. It might be best to put this logic into the program as part of the domain decomposition. Not to dissuade you from this, but it’s generally a model I tend to avoid; instead I have all the ranks run either on GPUs or on CPUs.
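
If you do pursue it, here’s a rough sketch of one way to weight the decomposition by rank type. The 10x GPU weight is invented and would need tuning for your system, and remainder handling is omitted:

#include <stdlib.h>
#include <mpi.h>

// Give each rank a share of n_total proportional to an assumed relative speed.
// The weights are placeholders; measure your own GPU-to-CPU ratio.
long my_share(MPI_Comm comm, long n_total, int on_gpu)
{
  int nranks;
  MPI_Comm_size(comm, &nranks);

  double my_w = on_gpu ? 10.0 : 1.0;   // made-up relative speeds
  double *w = (double*)malloc(nranks * sizeof(double));
  MPI_Allgather(&my_w, 1, MPI_DOUBLE, w, 1, MPI_DOUBLE, comm);

  double total = 0.0;
  for (int i = 0; i < nranks; i++) total += w[i];

  long share = (long)((double)n_total * (my_w / total));
  free(w);
  return share;   // shares may not sum exactly to n_total; adjust one rank
}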

Note that I also fixed issues with how you were allocating the device arrays. You were using an “enter data create”, so the arrays were only being created on the device, never updated. The “copyin” and “copyout” clauses on the parallel loop use “present_or” semantics, i.e. the copy only happens if the data is not already present. Since the arrays were already created, no data was copied. Also, an “enter data” directive needs a matching “exit data” directive so the data is deleted from the device.
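
In other words, the problematic pattern looks roughly like this (a schematic, not your exact code):

// Schematic of the problem: the arrays are created on the device but never
// filled there, and since they are already present, the copyin/copyout
// clauses on the compute construct become no-ops ("present_or" semantics).
#pragma acc enter data create(a[0:n], b[0:n], c[0:n])
#pragma acc kernels copyin(a[0:n], b[0:n]) copyout(c[0:n])
for (i = 0; i < n; i++) {
  c[i] = a[i] + b[i];
}
// ...and no matching "exit data" to delete the arrays from the device.

The corrected version is: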

% cat hybrid.c
#include<stdio.h>
#include<stdlib.h>
#include<openacc.h>
#include<string.h>
#include<math.h>
#include<mpi.h>
#include<stdbool.h>

double add_vector(long n, int if_acc);

double add_vector(long n, int if_acc)
{
  // Input vectors
  double *restrict a;
  double *restrict b;
  // Output vector
  double *restrict c;

  // Size, in bytes, of each vector
  size_t bytes = n*sizeof(double);

  // Allocate memory for each vector
  a = (double*)malloc(bytes);
  b = (double*)malloc(bytes);
  c = (double*)malloc(bytes);

  // Initialize input vectors a[i] = i, b[i] = i and zero the output vector c
  long i;
  for(i=0; i<n; i++) {
    a[i] = i;
    b[i] = i;
    c[i] = 0;
  }

  // sum component wise and save result into vector c
#pragma acc enter data copyin(a[0:n], b[0:n]) create(c[0:n]) if(if_acc)
#pragma acc kernels present(a,b,c) if(if_acc)
  for(i=0; i<n; i++) {
    c[i] = a[i] + b[i];
  }
#pragma acc exit data copyout(c[:n]) delete(a,b) if(if_acc)

  // Sum up vector c and divide by n; since c[i] = 2*i, the result should equal n-1
  double sum = 0.0;
  for(i=0; i<n; i++) {
    sum += c[i];
  }
  sum = sum/n;
  return sum;

}

int main(int argc, char **argv)
{
  int ThisTask;

  int if_acc;

  if (argc != 2) {
    printf("Error: missing arg to set if ACC is used\n");
    return 1;  // exit before reading a missing argv[1]
  }
  if_acc = atoi(argv[1]);

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &ThisTask);

  char msg[100];
  long size;

  if (!if_acc) {
     size = 10;
     strcpy(msg,"sum on serial cpu");
  } else {
    int num_devices=0;
    int myDevice;
    MPI_Comm shmcomm;
    int local_rank;
    acc_device_t myDeviceType = acc_get_device_type();


     // Rank to device binding
     MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                         MPI_INFO_NULL, &shmcomm);
     MPI_Comm_rank(shmcomm, &local_rank);
     num_devices = acc_get_num_devices(myDeviceType);
     myDevice = local_rank % num_devices;
     printf("DT=%d ND=%d MyDevice=%d\n",myDeviceType, num_devices, myDevice);
     acc_set_device_num(myDevice,myDeviceType);

    if (myDeviceType == acc_device_host) {
       size = 100;
       strcpy(msg,"sum on multicore cpu");
    } else {
       size = 10000;
       strcpy(msg,"sum on gpu");
    }
  }

   double sum = add_vector(size, if_acc);
   printf("%s SIZE=%d SUM=%f\n", msg, size, sum);

  MPI_Finalize();
  return 0;
}
ice2:/local/home/mcolgrove/uf/hybrid-test% mpicc -fast -acc=gpu,multicore -Minfo=accel hybrid.c
add_vector:
     38, Generating enter data create(c[:n])
         Generating enter data copyin(b[:n],a[:n])
         Generating present(a[:],c[:],b[:])
         Loop is parallelizable
         Generating NVIDIA GPU code
         38, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
     38, Generating Multicore code
         38, #pragma acc loop gang
     44, Generating exit data delete(a[:1])
         Generating exit data copyout(c[:n])
         Generating exit data delete(b[:1])
% mpirun -np 1 ./a.out 0
sum on serial cpu SIZE=10 SUM=9.000000
% unsetenv ACC_DEVICE_TYPE
% mpirun -np 1 ./a.out 1
DT=4 ND=2 MyDevice=0
sum on gpu SIZE=10000 SUM=9999.000000
% setenv ACC_DEVICE_TYPE HOST
% mpirun -np 1 ./a.out 1
DT=2 ND=1 MyDevice=0
sum on multicore cpu SIZE=100 SUM=99.000000

-Mat

Many thanks Mat! This is exactly what I was looking for. Thanks also for clarifying my queries.

I agree the workload distribution is difficult for such hybrid runs. I want to do hybrid runs to validate a workload when sufficient GPU nodes are not available, so performance will not be an issue.

Regards

Nitya
