Reduction prevents parallel execution on two GPUs

Hi,
Again I use a very simple Jacobi solver to solve the equation on two GPUs. I use acc_set_device_num(0,acc_device_nvidia) and acc_set_device_num(1,acc_device_nvidia) to select which part should be executed on device 0 and which on device 1. I do not use multiple MPI or OpenMP processes/threads for this; I simply distribute the rows of the matrix between the two devices.
However, the first Jacobi computation needs a reduction (see below). This reduction prevents the two parts from executing in parallel on the two GPUs, and I do not understand why. If I comment out the reduction computation for device 0, I can see in the NVIDIA profiler that both kernels run in parallel, as you would expect.
Is there a workaround for this issue? I tried the 14.1 and 14.2 compilers.
Sandra

    // j_cpu_start is the number of rows to compute on CPU
    // distribute other rows to two GPUs
    // put data on GPU 0 and 1
    while ( err > tol && iter < iter_max ) {

        acc_err0 = 0.0;
        acc_err1 = 0.0;
        err = 0.0;

        // do one part on GPU
        acc_set_device_num(0,acc_device_nvidia);
        #pragma acc parallel loop present(Anew[0:(j_cpu_start/2+1)*m],A[0:(j_cpu_start/2+1)*m]) async reduction(max:acc_err0)
        for( j = 1; j < j_cpu_start/2; j++) {
            for( i = 1; i < m-1; i++ ) {
                Anew[j *m+ i] = 0.25 * ( A[j     *m+ (i+1)] + A[j     *m+ (i-1)]
                                     +   A[(j-1) *m+ i]     + A[(j+1) *m+ i]);
                acc_err0 = fmax(acc_err0,fabs(Anew[j*m+i]-A[j*m+i]));
            }
        }

        acc_set_device_num(1,acc_device_nvidia);
        #pragma acc parallel loop present(Anew[(j_cpu_start/2-1)*m:(j_cpu_start/2+2)*m],A[(j_cpu_start/2-1)*m:(j_cpu_start/2+2)*m]) async reduction(max:acc_err1)
        for( j = j_cpu_start/2; j < j_cpu_start; j++) {
            for( i = 1; i < m-1; i++ ) {
                Anew[j *m+ i] = 0.25 * ( A[j     *m+ (i+1)] + A[j     *m+ (i-1)]
                                     +   A[(j-1) *m+ i]     + A[(j+1) *m+ i]);
                acc_err1 = fmax(acc_err1,fabs(Anew[j*m+i]-A[j*m+i]));
            }
        }

        // do other part on CPU
        #pragma omp parallel for default(none) shared(A,Anew,m,n,j_cpu_start) private(i,j) reduction(max:err)
        for( j = j_cpu_start; j<n-1; j++) {
            for( i = 1; i < m-1; i++ ) {
                Anew[j *m+ i] = 0.25 * ( A[j     *m+ (i+1)] + A[j     *m+ (i-1)]
                                     +   A[(j-1) *m+ i]     + A[(j+1) *m+ i]);
                err = fmax(err,fabs(Anew[j*m+i]-A[j*m+i]));
            }
        }
        // exchange halo
        // do matrix swap
    }

Hi Sandra,

It’s probably not the reduction itself but rather the copying back of the reduction variable to the host that’s causing the problem. Try putting acc_err0 and acc_err1 into a data region and then manage when the variables are copied back using the update directive. Something like:

    // j_cpu_start is the number of rows to compute on CPU
    // distribute other rows to two GPUs 
    // put data on GPU 0 and 1 
    while ( err > tol && iter < iter_max ) { 

        acc_err0 = 0.0; 
        acc_err1 = 0.0; 
        err = 0.0; 
        acc_set_device_num(1,acc_device_nvidia);
        #pragma acc update device(acc_err1)

        // do one part on GPU 
        acc_set_device_num(0,acc_device_nvidia); 
        #pragma acc update device(acc_err0)
        #pragma acc parallel loop present(Anew[0:(j_cpu_start/2+1)*m],A[0:(j_cpu_start/2+1)*m], acc_err0) async reduction(max:acc_err0) 
        for( j = 1; j < j_cpu_start/2; j++) { 
            for( i = 1; i < m-1; i++ ) { 
                Anew[j *m+ i] = 0.25 * ( A[j     *m+ (i+1)] + A[j     *m+ (i-1)] 
                                     +   A[(j-1) *m+ i]     + A[(j+1) *m+ i]); 
                acc_err0 = fmax(acc_err0,fabs(Anew[j*m+i]-A[j*m+i])); 
            } 
        } 

        acc_set_device_num(1,acc_device_nvidia); 
        #pragma acc parallel loop present(Anew[(j_cpu_start/2-1)*m:(j_cpu_start/2+2)*m],A[(j_cpu_start/2-1)*m:(j_cpu_start/2+2)*m], acc_err1) async reduction(max:acc_err1) 
        for( j = j_cpu_start/2; j < j_cpu_start; j++) { 
            for( i = 1; i < m-1; i++ ) { 
                Anew[j *m+ i] = 0.25 * ( A[j     *m+ (i+1)] + A[j     *m+ (i-1)] 
                                     +   A[(j-1) *m+ i]     + A[(j+1) *m+ i]); 
                acc_err1 = fmax(acc_err1,fabs(Anew[j*m+i]-A[j*m+i])); 
            } 
        } 

        // do other part on CPU 
        #pragma omp parallel for default(none) shared(A,Anew,m,n,j_cpu_start) private(i,j) reduction(max:err) 
        for( j = j_cpu_start; j<n-1; j++) { 
            for( i = 1; i < m-1; i++ ) { 
                Anew[j *m+ i] = 0.25 * ( A[j     *m+ (i+1)] + A[j     *m+ (i-1)] 
                                     +   A[(j-1) *m+ i]     + A[(j+1) *m+ i]); 
                err = fmax(err,fabs(Anew[j*m+i]-A[j*m+i])); 
            } 
        } 

        #pragma acc update host(acc_err1)
        acc_set_device_num(0,acc_device_nvidia); 
        #pragma acc update device(acc_err0)

        // exchange halo 
        // do matrix swap 
    }

Hi Mat,
I believe you have a copy-paste error in your code. I assume that the last “update device” should actually be an “update host”.
At first I did not notice that acc_err0/1 appear in the present clauses of the kernels. As a consequence, (1) I got wrong results and (2) the kernels did not run in parallel. Eventually I figured out that they definitely have to be in the present clauses.
With these changes, the code finally runs fast and correctly.
Still, I see this more as a workaround (thanks for getting it working!)… or is this really how it is supposed to work? Isn’t this rather a compiler issue?

BTW: I used unstructured data lifetimes (see below) to put the data on both devices. How would you use structured data regions with two devices? Would they have to be nested?

    acc_set_device_num(0,acc_device_nvidia);
    #pragma acc enter data create(Anew[0:(j_cpu_start/2+1)*m])
    #pragma acc enter data copyin(A[0:(j_cpu_start/2+1)*m])
    #pragma acc enter data create(acc_err0)

    acc_set_device_num(1,acc_device_nvidia);
    #pragma acc enter data create(Anew[(j_cpu_start/2-1)*m:(j_cpu_start/2+2)*m])
    #pragma acc enter data copyin(A[(j_cpu_start/2-1)*m:(j_cpu_start/2+2)*m])
    #pragma acc enter data create(acc_err1)

Thanks, Mat, for your help.
Sandra

I assume that the last “update device” should actually be an “update host”.

Yes, thanks for catching that. I didn’t actually test the code since I didn’t have the complete example.

Still, I see this more as a workaround (thanks for getting it working!)… or is this really how it is supposed to work? Isn’t this rather a compiler issue?

The OpenACC spec states that reduction results are available after the compute region. Since both acc_err0 and acc_err1 were host variables, their values needed to be updated immediately after the compute region.

I think what you’re saying is that since the kernel has the async clause, the compiler should delay copying back the reduction’s value. The question is when it would get copied back. The compiler can’t guess, so it needs a well-defined spot, and the only well-defined spot is immediately after the kernel ends. Ideally there would be some callback from the device, but that doesn’t exist, so the host has to block waiting for the reduction variable to come back.

Of course, we’re open to suggestions!

  • Mat

Dear Mat,
I see your point. But sometimes you (as a user) just miss a tiny detail in the specification and then get confused.
Nevertheless, the reduction in combination with the async clause suggests a different behavior. Why can’t you defer the copy-back to an acc wait? That is what I use anyway before updating. Or at least emit a warning during compilation…
Sandra

Why can’t you defer the copy-back to an acc wait?

“wait” is optional, or could be in a completely different part of the code.

But sometimes you (as a user) just miss a tiny detail in the specification and then get confused.

Understood. This is a topic the compiler folks here have thought about without finding a clear solution. Though I’m wondering if we could issue a warning when an async clause would be inhibited by data synchronization… Let me check.

  • Mat