How to compile when functions are defined elsewhere

Dear all,

I am learning OpenACC with an example: compute pi using the fraction (midpoint-rule) method, repeated multiple times, so there are two for-loops that can be accelerated.
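For readers unfamiliar with it, the "fraction method" here is the midpoint-rule approximation of the integral that defines pi, with x_i the midpoint of the i-th subinterval:

```latex
\pi = \int_0^1 \frac{4}{1+x^2}\,dx
    \;\approx\; \frac{1}{N}\sum_{i=0}^{N-1}\frac{4}{1+x_i^2},
\qquad x_i = \frac{i+0.5}{N}
```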

I tried 1) to write everything in the same file

pi.cpp


#include <iostream>
#include <chrono>

#pragma acc routine worker
double fracpi(long N);

using namespace std;

int main() {
    long i, j, ntrial, N;
    double pi[10];
    N = 1000000000;
    ntrial = 10;
    cout << "Pi Computing Test" << endl;
    auto t0 = std::chrono::high_resolution_clock::now();

#pragma acc parallel
#pragma acc loop gang
    for (j = 0; j < ntrial; j++) {
        pi[j] = fracpi(N);
    }

    auto t1 = std::chrono::high_resolution_clock::now();
    for (i = 0; i < ntrial; i++) {
        cout << "Pi: " << pi[i] << endl;
    }
    cout << "Loop OpenACC Elapsed time: " << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count()
         << endl;
    return 0;
}

double fracpi(long N) {
    long i;
    double piN;
    double pi;
    piN = 0.0;
#pragma acc loop worker reduction(+:piN)
    for (i = 0; i < N; i++) {
        double t = (i + 0.5) / N;   // midpoint of the i-th subinterval
        piN += 4.0 / (1.0 + t * t);
    }
    pi = piN / N;
    return pi;
}

and compiled with command

pgc++ -acc -std=c++11 -Minfo=all -Mcuda=cuda8.0 -ta=tesla,cuda8.0 -mp pi.cpp -o pi.x

which compiles successfully with the message

main:
     25, Generating implicit copyout(pi[:])
         Accelerator kernel generated
         Generating Tesla code
         25, #pragma acc loop gang /* blockIdx.x */
fracpi(long):
     39, Generating Tesla code
         47, #pragma acc loop vector, worker /* threadIdx.x threadIdx.y */
     39, Generating acc routine worker
         reduction in routine disables compute capability 2.0 kernel
     47, Loop is parallelizable

However, I then tried 2) splitting it into separate files, with main in

main.cpp


#include <iostream>
#include <chrono>
#include "compute.h"
using namespace std;

int main() {
    long i, j, ntrial, N;
    double pi[10];
    N = 1000000000;
    ntrial = 10;
    cout << "Pi Computing Test" << endl;
    auto t0 = std::chrono::high_resolution_clock::now();

#pragma acc parallel
#pragma acc loop gang
    for (j = 0; j < ntrial; j++) {
        pi[j] = fracpi(N);
    }

    auto t1 = std::chrono::high_resolution_clock::now();
    for (i = 0; i < ntrial; i++) {
        cout << "Pi: " << pi[i] << endl;
    }
    cout << "Loop OpenACC Elapsed time: " << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count()
         << endl;
    return 0;
}

and the function in

compute.h


#ifndef COMPUTE_H
#define COMPUTE_H

#pragma acc routine worker
double fracpi(long N);

#endif // COMPUTE_H



compute.cpp


#include "compute.h"
double fracpi(long N) {
    long i;
    double piN;
    double pi;
    piN = 0.0;
#pragma acc loop worker reduction(+:piN)
    for (i = 0; i < N; i++) {
        double t = (i + 0.5) / N;   // midpoint of the i-th subinterval
        piN += 4.0 / (1.0 + t * t);
    }
    pi = piN / N;

    return pi;
}

and compiled with command

pgc++ -acc -std=c++11 -Minfo=all -Mcuda=cuda8.0 -ta=tesla,cuda8.0 -mp main.cpp compute.cpp -o pi.x

it fails to link. The error message is

compute.cpp:
fracpi(long):
      7, Generating Tesla code
         15, #pragma acc loop vector, worker /* threadIdx.x threadIdx.y */
      7, Generating acc routine worker
         reduction in routine disables compute capability 2.0 kernel
     15, Loop is parallelizable
main.cpp:
main:
     19, Generating implicit copyout(pi[:])
         Accelerator kernel generated
         Generating Tesla code
         19, #pragma acc loop gang /* blockIdx.x */
nvlink error : Undefined reference to ‘_Z6fracpil’ in ‘main.o’
pgacclnk: child process exit status 2: pgnvd

So would you please help me with:
A) How do I compile 2) properly?
B) When I increase N to 10,000,000,000 and run ./pi.x, an error occurs:

Pi Computing Test

Message from syslogd@czgpu2 at Apr 19 19:31:17 …
kernel:NMI watchdog: BUG: soft lockup - CPU#4 stuck for 23s! [a.out:12334]

Accelerator Kernel Timing data
/usr/include/c++/4.8.5/iostream
  main  NVIDIA  devicenum=0
    time(us): 0
    25: data region reached 1 time
    25: compute region reached 1 time
        25: kernel launched 1 time
            grid: [10]  block: [32x8]
            device time(us): total=0 max=0 min=0 avg=0

It seems that I did not allocate my grid/block/threads properly. Would you please give me some advice on allocating grid/block/threads?

Thank you very much!

Sincerely,
Lin

Hi Lin,

This appears to be an issue in 16.10 where the C++ compiler wasn't handling the naming of "routine worker" and "routine vector" functions correctly when they were placed in an external file. The problem has since been fixed in 17.1 (or the soon-to-be-released 17.4 if you're using the Community Edition). However, it looks like there's a different issue in 17.4, which I've reported as TPR#24182. There's an easy workaround, though, which is to add scoping brackets around your parallel region. (Sorry, there's no workaround for the 16.10 issue other than updating the compiler.)

#pragma acc parallel
{
#pragma acc loop gang
    for (j = 0; j < ntrial; j++) {
        pi[j] = fracpi(N);
    }
}



% pgc++ -acc -std=c++11 -Minfo=all -ta=tesla:cuda8.0 compute.cpp main.cpp -o pi.x -V16.10
compute.cpp:
fracpi(long):
      3, Generating Tesla code
          9, #pragma acc loop vector /* threadIdx.x */
      3, Generating acc routine vector
         reduction in routine disables compute capability 2.0 kernel
      9, Loop is parallelizable
main.cpp:
main:
     16, Generating implicit copyout(pi[:])
         Accelerator kernel generated
         Generating Tesla code
         18, #pragma acc loop gang /* blockIdx.x */
nvlink error   : Undefined reference to '_Z6fracpil' in 'main.o'
pgacclnk: child process exit status 2: /proj/pgi/linux86-64/16.10/bin/pgnvd
% pgc++ -acc -std=c++11 -Minfo=all -ta=tesla:cuda8.0 compute.cpp main.cpp -o pi.x -V17.4
compute.cpp:
fracpi(long):
      3, Generating Tesla code
          9, #pragma acc loop vector /* threadIdx.x */
      9, Loop is parallelizable
         FMA (fused multiply-add) instruction(s) generated
main.cpp:
main:
     16, Accelerator kernel generated
         Generating Tesla code
         18, #pragma acc loop gang /* blockIdx.x */
     16, Generating implicit copyout(pi[:])

For the second issue, I'm not able to reproduce it; the code runs fine for me, although it takes a bit longer. It looks like your OS kernel is killing your job because it thinks the process is stuck. I'm not sure which limit needs to be increased, so I'd advise talking with your local system administrator.
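If it helps that discussion: on Linux the "soft lockup" message comes from the kernel's watchdog, whose threshold can be inspected and, with root, raised via sysctl. This is only a sketch to talk over with your administrator (the 60-second value below is arbitrary), not a recommendation:

```shell
# Show the current soft-lockup watchdog threshold, in seconds (Linux only)
cat /proc/sys/kernel/watchdog_thresh

# Temporarily raise it to 60s (root required); note this only quiets the
# warning -- it does not make the job itself finish any faster
sudo sysctl -w kernel.watchdog_thresh=60
```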

-Mat

Hi Mat,

Thank you very much.

Yes, I am using Community Edition v16.10; I'll wait for CE v17.4.
As for the OS kernel issue, I am discussing it with the local administrator and will keep forum readers updated when I get an answer.

Sincerely,
Lin

Hi Lin,

The scoping issues should be resolved in 18.4 CE.

- Alex