Hi Nitya,
- Is there a cleaner way to check what value ACC_DEVICE_TYPE was set to on the command line, e.g. through an OpenACC API call?
Yes, “acc_get_device_type”. See Section 3.2.3 of the OpenACC Spec: https://www.openacc.org/sites/default/files/inline-images/Specification/OpenACC-3.2-final.pdf
- I find that unsetting ACC_DEVICE_TYPE to run a pure CPU version ends in a seg fault, so I had to set it to nil and compare values inside the code. If I don’t unset it, it’s acc_device_nvidia by default?
Not sure. The segv I’m seeing is in the strcmp: setting ACC_DEVICE_TYPE to nil basically unsets the variable, so getenv returns a NULL pointer, and passing that NULL to strcmp is what causes the segfault. Using the env var to state that no OpenACC should be used is not going to work how you want, so I’d suggest making this a program argument instead.
- What would the value of the call acc_get_num_devices() be if OpenACC is run on the CPU?
It will return 1, since the host itself counts as the one available device.
I’ve updated your code below to show how I’d do this. Pass in a command line option to the program to toggle OpenACC on or off. Then if the code is running on a node with a GPU, it will default to using the GPUs. If no GPUs are present, it will run parallel on the multicore CPUs. You can also set ACC_DEVICE_TYPE=HOST when you want to run multicore CPU on a node with a GPU.
Since you set different sizes for each of the cases, you’ve already identified the problem with this approach, i.e. load balancing the work. You need to give the GPUs a larger proportion of the work, else the ranks on the multicore or serial CPUs will become the performance bottleneck. This can be tricky to get correct since the distribution of work will depend on the system you’re using, how the ranks are scheduled on the nodes, and the workload being used. It might be best to put this logic into the program itself as part of the domain decomposition. Not to dissuade you from this, but it’s generally a model I tend to avoid. Instead I have the ranks run either using all GPUs or all CPUs.
Note that I also fixed issues with how you were allocating the device arrays. You were using an “enter data create”, so the arrays were only being created on the device, not updated. The “copyin” and “copyout” clauses on the parallel loop use “present_or” semantics, i.e. the copyin/copyout only occurs if the data is not already present. Since the data was already created, no copies happened. Also, an “enter data” directive needs a matching “exit data” directive so the data is deleted from the device.
% cat hybrid.c
#include <stdio.h>
#include <stdlib.h>
#include <openacc.h>
#include <string.h>
#include <math.h>
#include <mpi.h>
#include <stdbool.h>

double add_vector(long n, int if_acc);

double add_vector(long n, int if_acc)
{
    // Input vectors
    double *restrict a;
    double *restrict b;
    // Output vector
    double *restrict c;

    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(double);

    // Allocate memory for each vector
    a = (double*)malloc(bytes);
    b = (double*)malloc(bytes);
    c = (double*)malloc(bytes);

    // Initialize the input vectors: a[i] = b[i] = i
    long i;
    for(i=0; i<n; i++) {
        a[i] = i;
        b[i] = i;
        c[i] = 0;
    }

    // Sum component-wise and save the result into vector c
    #pragma acc enter data copyin(a[0:n], b[0:n]) create(c[0:n]) if(if_acc)
    #pragma acc kernels present(a,b,c) if(if_acc)
    for(i=0; i<n; i++) {
        c[i] = a[i] + b[i];
    }
    #pragma acc exit data copyout(c[:n]) delete(a,b) if(if_acc)

    // Sum up vector c and divide by n; with these inputs this equals n-1
    double sum = 0.0;
    for(i=0; i<n; i++) {
        sum += c[i];
    }
    sum = sum/n;

    free(a);
    free(b);
    free(c);
    return sum;
}
int main(int argc, char **argv)
{
    int ThisTask;
    int if_acc;

    if (argc != 2) {
        printf("Error: missing arg to set if ACC is used\n");
        return 1;
    }
    if_acc = atoi(argv[1]);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &ThisTask);

    char msg[100];
    long size;
    if (!if_acc) {
        size = 10;
        strcpy(msg,"sum on serial cpu");
    } else {
        int num_devices=0;
        int myDevice;
        MPI_Comm shmcomm;
        int local_rank;
        acc_device_t myDeviceType = acc_get_device_type();

        // Rank-to-device binding
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &shmcomm);
        MPI_Comm_rank(shmcomm, &local_rank);
        num_devices = acc_get_num_devices(myDeviceType);
        myDevice = local_rank % num_devices;
        printf("DT=%d ND=%d MyDevice=%d\n",myDeviceType, num_devices, myDevice);
        acc_set_device_num(myDevice,myDeviceType);

        if (myDeviceType == acc_device_host) {
            size = 100;
            strcpy(msg,"sum on multicore cpu");
        } else {
            size = 10000;
            strcpy(msg,"sum on gpu");
        }
    }
    double sum = add_vector(size, if_acc);
    printf("%s SIZE=%ld SUM=%f\n", msg, size, sum);

    MPI_Finalize();
    return 0;
}
ice2:/local/home/mcolgrove/uf/hybrid-test% mpicc -fast -acc=gpu,multicore -Minfo=accel hybrid.c
add_vector:
38, Generating enter data create(c[:n])
Generating enter data copyin(b[:n],a[:n])
Generating present(a[:],c[:],b[:])
Loop is parallelizable
Generating NVIDIA GPU code
38, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
38, Generating Multicore code
38, #pragma acc loop gang
44, Generating exit data delete(a[:1])
Generating exit data copyout(c[:n])
Generating exit data delete(b[:1])
% mpirun -np 1 ./a.out 0
sum on serial cpu SIZE=10 SUM=9.000000
% unsetenv ACC_DEVICE_TYPE
% mpirun -np 1 ./a.out 1
DT=4 ND=2 MyDevice=0
sum on gpu SIZE=10000 SUM=9999.000000
% setenv ACC_DEVICE_TYPE HOST
% mpirun -np 1 ./a.out 1
DT=2 ND=1 MyDevice=0
sum on multicore cpu SIZE=100 SUM=99.000000
-Mat