OpenACC `vector_length` changes the result

fredpz · October 16, 2024, 11:43am

I have the following code that sums over several worker-local arrays:

#include <cstdio>   // printf
#include <stdlib.h>

int main() {
    int tot = 0;
    
    #pragma acc parallel loop gang worker vector_length(32) reduction(+:tot)
    for( int i = 0; i < 100; i++ ){
        int arr[32];
        #pragma acc data create(arr[:32])
        
        #pragma acc loop vector
        for( int j = 0; j < 32; j++ ){ arr[j] = i + j; }
        
        int s = 0;
        #pragma acc loop vector reduction(+:s)
        for( int j = 0; j < 32; j++ ){ s += arr[j]; }
        
        tot += s;
    }
    
    printf( "%d\n", tot );
    return 0;
}

It does not provide the correct result unless I remove either the worker or the vector_length(32) clauses. I can’t make sense of it as these clauses should only change the way the loops are split between resources. But somehow it seems to duplicate some of the loop iterations.

If I’m correct, arr is private to the worker, so there should be no atomics needed. And the result, although wrong, is reproducible.

Is there anything wrong here?

MatColgrove · October 16, 2024, 5:53pm

It’s a known issue (TPR#31948). Early in the compilation phase (before OpenACC is applied), the local array declaration gets hoisted out of the loop. If the loop used just “gang”, the compiler could automatically privatize the array. However, it doesn’t when multiple parallel levels are used (“gang worker” in this case).

The workaround is to manually hoist the array and then explicitly add it to a “private” clause.

I added your example to the previous report.

% cat test.cpp
#include <cstdio>   // printf
#include <stdlib.h>

int main() {
    int tot = 0;
    int arr[32];

    #pragma acc parallel loop gang worker vector_length(32) reduction(+:tot) private(arr)
    for( int i = 0; i < 100; i++ ){
//        #pragma acc data create(arr[:32])  << remove as data regions can only be used from the host

        #pragma acc loop vector
        for( int j = 0; j < 32; j++ ){ arr[j] = i + j; }

        int s = 0;
        #pragma acc loop vector reduction(+:s)
        for( int j = 0; j < 32; j++ ){ s += arr[j]; }

        tot += s;
    }

    printf( "%d\n", tot );
    return 0;
}
% nvc++ -Minfo=accel test.cpp
% a.out
208000

