Reduction prevents parallel execution on two GPUs

Hi,
Again I use a very simple Jacobi solver to solve the equation on two GPUs. I use acc_set_device_num(0,acc_device_nvidia) and acc_set_device_num(1,acc_device_nvidia) to select which part should be executed on device 0 and which on device 1. I do not use multiple MPI or OpenMP processes/threads for this; I simply distribute the rows of the matrix between the two devices.
However, the first Jacobi computation needs a reduction (see below). This reduction prevents the two parts from executing in parallel on the two GPUs, and I do not understand why. If I comment out the reduction computation for device 0, I can see in the NVIDIA profiler that both kernels run in parallel, as you would expect.
Is there a workaround for this issue? I tried the 14.1 and 14.2 compilers.
Sandra

    // j_cpu_start is the number of rows to compute on CPU
    // distribute other rows to two GPUs
    // put data on GPU 0 and 1
    while ( err > tol && iter < iter_max ) {

        acc_err0 = 0.0;
        acc_err1 = 0.0;
        err = 0.0;

        // do one part on GPU
        acc_set_device_num(0,acc_device_nvidia);
        #pragma acc parallel loop present(Anew[0:(j_cpu_start/2+1)*m],A[0:(j_cpu_start/2+1)*m]) async reduction(max:acc_err0)
        for( j = 1; j < j_cpu_start/2; j++) {
            for( i = 1; i < m-1; i++ ) {
                Anew[j *m+ i] = 0.25 * ( A[j     *m+ (i+1)] + A[j     *m+ (i-1)]
                                     +   A[(j-1) *m+ i]     + A[(j+1) *m+ i]);
                acc_err0 = fmax(acc_err0,fabs(Anew[j*m+i]-A[j*m+i]));
            }
        }

        acc_set_device_num(1,acc_device_nvidia);
        #pragma acc parallel loop present(Anew[(j_cpu_start/2-1)*m:(j_cpu_start/2+2)*m],A[(j_cpu_start/2-1)*m:(j_cpu_start/2+2)*m]) async reduction(max:acc_err1)
        for( j = j_cpu_start/2; j < j_cpu_start; j++) {
            for( i = 1; i < m-1; i++ ) {
                Anew[j *m+ i] = 0.25 * ( A[j     *m+ (i+1)] + A[j     *m+ (i-1)]
                                     +   A[(j-1) *m+ i]     + A[(j+1) *m+ i]);
                acc_err1 = fmax(acc_err1,fabs(Anew[j*m+i]-A[j*m+i]));
            }
        }

        // do other part on CPU
        #pragma omp parallel for default(none) shared(A,Anew,m,n,j_cpu_start) private(i,j) reduction(max:err)
        for( j = j_cpu_start; j<n-1; j++) {
            for( i = 1; i < m-1; i++ ) {
                Anew[j *m+ i] = 0.25 * ( A[j     *m+ (i+1)] + A[j     *m+ (i-1)]
                                     +   A[(j-1) *m+ i]     + A[(j+1) *m+ i]);
                err = fmax(err,fabs(Anew[j*m+i]-A[j*m+i]));
            }
        }
        // exchange halo
        // do matrix swap
    }

Hi Sandra,

It’s probably not the reduction itself but rather the copying back of the reduction variable to the host that’s causing the problem. Try putting acc_err0 and acc_err1 into a data region and then manage when the variables are copied back using the update directive. Something like:

    // j_cpu_start is the number of rows to compute on CPU
    // distribute other rows to two GPUs 
    // put data on GPU 0 and 1 
    while ( err > tol && iter < iter_max ) { 

        acc_err0 = 0.0; 
        acc_err1 = 0.0; 
        err = 0.0; 
        acc_set_device_num(1,acc_device_nvidia);
        #pragma acc update device(acc_err1)

        // do one part on GPU 
        acc_set_device_num(0,acc_device_nvidia); 
        #pragma acc update device(acc_err0)
        #pragma acc parallel loop present(Anew[0:(j_cpu_start/2+1)*m],A[0:(j_cpu_start/2+1)*m], acc_err0) async reduction(max:acc_err0) 
        for( j = 1; j < j_cpu_start/2; j++) { 
            for( i = 1; i < m-1; i++ ) { 
                Anew[j *m+ i] = 0.25 * ( A[j     *m+ (i+1)] + A[j     *m+ (i-1)] 
                                     +   A[(j-1) *m+ i]     + A[(j+1) *m+ i]); 
                acc_err0 = fmax(acc_err0,fabs(Anew[j*m+i]-A[j*m+i])); 
            } 
        } 

        acc_set_device_num(1,acc_device_nvidia); 
        #pragma acc parallel loop present(Anew[(j_cpu_start/2-1)*m:(j_cpu_start/2+2)*m],A[(j_cpu_start/2-1)*m:(j_cpu_start/2+2)*m], acc_err1) async reduction(max:acc_err1) 
        for( j = j_cpu_start/2; j < j_cpu_start; j++) { 
            for( i = 1; i < m-1; i++ ) { 
                Anew[j *m+ i] = 0.25 * ( A[j     *m+ (i+1)] + A[j     *m+ (i-1)] 
                                     +   A[(j-1) *m+ i]     + A[(j+1) *m+ i]); 
                acc_err1 = fmax(acc_err1,fabs(Anew[j*m+i]-A[j*m+i])); 
            } 
        } 

        // do other part on CPU 
        #pragma omp parallel for default(none) shared(A,Anew,m,n,j_cpu_start) private(i,j) reduction(max:err) 
        for( j = j_cpu_start; j<n-1; j++) { 
            for( i = 1; i < m-1; i++ ) { 
                Anew[j *m+ i] = 0.25 * ( A[j     *m+ (i+1)] + A[j     *m+ (i-1)] 
                                     +   A[(j-1) *m+ i]     + A[(j+1) *m+ i]); 
                err = fmax(err,fabs(Anew[j*m+i]-A[j*m+i])); 
            } 
        } 

        #pragma acc update host(acc_err1)
        acc_set_device_num(0,acc_device_nvidia); 
        #pragma acc update device(acc_err0)

        // exchange halo 
        // do matrix swap 
    }

Hi Mat,
I believe you have a copy-paste error in your code. I assume that the last “update device” should actually be an “update host”.
At first I did not notice that acc_err0/1 appear in the present clauses of the kernels. As a consequence, (1) I got wrong results and (2) the kernels did not run in parallel. Eventually I figured out that they definitely have to be in the present clauses.
With these changes, the code finally runs fast and correctly.
Still, I see this more as a workaround (thanks for getting it working!)… or is this really how it is supposed to work? Isn’t this rather a compiler issue?

BTW: I used unstructured data lifetimes (see below) to put the data on both devices. How would you use structured data regions with two devices? Would they have to be nested?

    acc_set_device_num(0,acc_device_nvidia);
    #pragma acc enter data create(Anew[0:(j_cpu_start/2+1)*m])
    #pragma acc enter data copyin(A[0:(j_cpu_start/2+1)*m])
    #pragma acc enter data create(acc_err0)

    acc_set_device_num(1,acc_device_nvidia);
    #pragma acc enter data create(Anew[(j_cpu_start/2-1)*m:(j_cpu_start/2+2)*m])
    #pragma acc enter data copyin(A[(j_cpu_start/2-1)*m:(j_cpu_start/2+2)*m])
    #pragma acc enter data create(acc_err1)

Thanks, Mat, for your help.
Sandra

I assume that the last “update device” should actually be an “update host”.

Yes, thanks for catching that. I didn’t actually test the code since I didn’t have the complete example.

Still, I see this more as a workaround (thanks for getting it working!)… or is this really how it is supposed to work? Isn’t this rather a compiler issue?

The OpenACC spec states that reduction results are available after the compute region. Since both acc_err0 and acc_err1 were host variables, their values needed to be updated immediately after the compute region.

I think what you’re saying is that since the kernel has the async clause, the compiler should delay copying back the reduction’s value. The question is when it would get copied back. The compiler can’t guess, so it needs a well-defined spot, and the only well-defined spot is immediately after the kernel ends. Ideally there would be some callback from the device, but that doesn’t exist, so the host has to block waiting for the reduction variable to come back.

Of course, we’re open to suggestions!

  • Mat

Dear Mat,
I see your point. But sometimes you (as a user) just miss a tiny detail in the specification and then get confused.
Nevertheless, the reduction in combination with the async clause suggests a different behavior. Why can’t you defer the copy-back to an acc wait? That is what I use anyway before updating. Or at least emit a warning during compilation…
Sandra

Why can’t you defer the copy-back to an acc wait?

“wait” is optional, or could be in a completely different part of the code.

But sometimes you (as a user) just miss a tiny detail in the specification and then get confused.

Understood. This is a topic the compiler folks here have thought about without finding a clear solution. Though I’m wondering if we could issue a warning when an async clause would be inhibited by data synchronization… Let me check.

  • Mat