PGI accelerator model with OpenMP/MPI

Hi, I am testing how the PGI accelerator model works with OpenMP and MPI. I realize that we have to statically specify the number of threads/processes in the code. Here is an example with OpenMP:

Num_GPUs=2;
#pragma omp parallel num_threads(2)
{
acc_set_device_num(omp_get_thread_num()%Num_GPUs, acc_device_default);
low= omp_get_thread_num()*N/2;
high = low + N/2;
#pragma acc region
{
for (i = low; i < high; i++) {…}
}

}

Could someone tell me how to do a similar thing with MPI? The following code does not work.

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
int Numprocs =2;
Num_GPUs=2;
#pragma acc region
{
acc_set_device_num(rank%Num_GPUs, acc_device_default);
low = rank*(N / Numprocs );
high = low + N/Numprocs ;
#pragma acc region
{
for (i = low; i < high; i++) {…}
}
}

The message is:

93, Accelerator restriction: size of the GPU copy of an array depends on values computed in this loop
Accelerator region ignored


Thank you.
Tan.

Hi Tan,

Take out the outer acc region and it should work.

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
int Numprocs =2;
Num_GPUs=2;
#pragma acc region   <<<< Take this out
{
acc_set_device_num(rank%Num_GPUs, acc_device_default);
low = rank*(N / Numprocs );
high = low + N/Numprocs ;
#pragma acc region
{
for (i = low; i < high; i++) {...}
}
}

Hope this helps,
Mat

Sorry for this mistake. In the actual code I just use 1 directive.

#pragma acc region
acc_set_device_num(rank%2, acc_device_default);
low = rank*(N / 2);
high = low + N/2;
for(i=low; i< high; i++)
vector[i] = vector1[i] * vector2[i];
}

As you suggested, I took out the “acc_set_device_num” call, but the error was still there.

94, Accelerator restriction: size of the GPU copy of an array depends on values computed in this loop
Accelerator region ignored


Tan.

Just want to correct the code:

#pragma acc region
{
acc_set_device_num(rank%2, acc_device_default);
low = rank*(N / 2);
high = low + N/2;
for(i=low; i< high; i++)
vector[i] = vector1[i] * vector2[i];
}

Hi Tan,

You accidentally removed the inner pragma, not the outer one.

The problem is that the compiler will implicitly copy in your array at the start of an accelerator region. However, the size of the array is computed using the loop bound variables ‘low’ and ‘high’ which are computed within the accelerator region. Hence, the compiler doesn’t know how much of the array to copy over.

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
int Numprocs =2;
Num_GPUs=2;
acc_set_device_num(rank%Num_GPUs, acc_device_default);
low = rank*(N / Numprocs );
high = low + N/Numprocs ;
#pragma acc region   // implicitly copy the array once low and high are known.
{
for (i = low; i < high; i++) {...}
}

Note that you can also use the copy clauses (‘copy’, ‘copyin’, ‘copyout’) to define how much of the array to copy over.

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
int Numprocs =2;
Num_GPUs=2;
acc_set_device_num(rank%Num_GPUs, acc_device_default);
low = rank*(N / Numprocs );
high = low + N/Numprocs ;
#pragma acc region  copyin(myarrayname[0:N-1])
{
for (i = low; i < high; i++) {...}
}
- Mat
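
As a rough sketch of the host-side setup discussed above, the bounds and device index can be computed by small helpers before the accelerator region is entered. The names block_bounds and device_for_rank are hypothetical (they are not part of the PGI runtime or MPI), and the remainder handling is an addition for the case where N is not evenly divisible by the number of processes:

```c
#include <assert.h>

/* Hypothetical helper (not a PGI or MPI function): computes the
   half-open range [*low, *high) of the n elements owned by worker
   `rank` out of `nprocs`. The first n % nprocs workers each take one
   extra element, so n does not have to be divisible by nprocs. */
void block_bounds(int rank, int nprocs, int n, int *low, int *high)
{
    assert(nprocs > 0 && rank >= 0 && rank < nprocs && n >= 0);
    int base  = n / nprocs;   /* elements every worker gets */
    int extra = n % nprocs;   /* leftover elements to spread out */
    *low  = rank * base + (rank < extra ? rank : extra);
    *high = *low + base + (rank < extra ? 1 : 0);
}

/* Hypothetical helper: round-robin mapping of a worker to a GPU
   index, the same rank % Num_GPUs rule used in the posts above. */
int device_for_rank(int rank, int num_gpus)
{
    assert(num_gpus > 0 && rank >= 0);
    return rank % num_gpus;
}
```

In the MPI code above, these would be called right after MPI_Comm_rank, for example block_bounds(rank, Numprocs, N, &low, &high), with the result of device_for_rank(rank, Num_GPUs) passed to acc_set_device_num. Both calls come before the "#pragma acc region" line, so low and high are already known when the compiler-generated copy happens.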

Awesome, the pgcc compiler is working fine now. However, I ran into another problem when linking objects compiled by pgcc with the MPI compiler wrapper. Could you help me figure it out? Thanks.

mpic++ -O3 -o VectorProduct VectorProduct.o Timer.o

VectorProduct.o(.text+0x18e): In function 'main':
./VectorProduct.c:55: undefined reference to '_mp_malloc'
VectorProduct.o(.text+0x1a0):./VectorProduct.c:55: undefined reference to '_mp_malloc'
VectorProduct.o(.text+0x3ab):./VectorProduct.c:90: undefined reference to 'acc_set_device_num'
VectorProduct.o(.text+0x3ca):./VectorProduct.c:90: undefined reference to '__pgi_cu_init'
VectorProduct.o(.text+0x3de):./VectorProduct.c:90: undefined reference to '__pgi_cu_module'
VectorProduct.o(.text+0x3f7):./VectorProduct.c:90: undefined reference to '__pgi_cu_module_function'
VectorProduct.o(.text+0x41e):./VectorProduct.c:90: undefined reference to '__pgi_cu_alloc'
VectorProduct.o(.text+0x442):./VectorProduct.c:90: undefined reference to '__pgi_cu_alloc'
VectorProduct.o(.text+0x466):./VectorProduct.c:90: undefined reference to '__pgi_cu_alloc'
VectorProduct.o(.text+0x4dc):./VectorProduct.c:90: undefined reference to '__pgi_cu_uploadp'
VectorProduct.o(.text+0x552):./VectorProduct.c:90: undefined reference to '__pgi_cu_uploadp'
VectorProduct.o(.text+0x59a):./VectorProduct.c:90: undefined reference to '__pgi_cu_uploadc'
VectorProduct.o(.text+0x5cb):./VectorProduct.c:97: undefined reference to '__pgi_cu_paramset'
VectorProduct.o(.text+0x615):./VectorProduct.c:97: undefined reference to '__pgi_cu_launch'
VectorProduct.o(.text+0x6a3):./VectorProduct.c:98: undefined reference to '__pgi_cu_downloadp'
VectorProduct.o(.text+0x6ba):./VectorProduct.c:98: undefined reference to '__pgi_cu_free'
VectorProduct.o(.text+0x6cd):./VectorProduct.c:98: undefined reference to '__pgi_cu_free'
VectorProduct.o(.text+0x6e0):./VectorProduct.c:98: undefined reference to '__pgi_cu_free'
VectorProduct.o(.text+0x6e5):./VectorProduct.c:98: undefined reference to '__pgi_cu_close'
VectorProduct.o(.text+0x74d):./VectorProduct.c:107: undefined reference to '_mp_free'
VectorProduct.o(.text+0x755):./VectorProduct.c:107: undefined reference to '_mp_free'

You’re missing the PGI runtime libraries on your link line. I’m assuming mpic++ is using g++:

mpic++ -O3 -o VectorProduct VectorProduct.o Timer.o -L/usr/pgi/linux86-64/10.8/lib -Wl,-rpath,/usr/pgi/linux86-64/2010/cuda/2.3/lib -lacc1 -ldl -lnspgc -lpgc

Note: change the paths to the PGI libraries to match what’s on your system.

Alternatively, if your MPI installation is configured to use the PGI drivers, you could use ‘mpicc’ and add the “-ta=nvidia” flag to your link line. pgcc will add the correct libraries when the -ta flag is used.

Hope this helps,
Mat

I added the path to the PGI libraries and the link now works:
-L/usr/local/pgi/linux86-64/10.0/lib -lacc1 -ldl -lnspgc -lpgc

Thank you for your help.

Tan.