How to parallelize this loop...

Hello,

Can somebody please tell me how to parallelize the inner loop using the kernels construct? The compiler keeps telling me that the loop is parallelizable, but it refuses to actually parallelize it.

inline void matvec(const struct MatrixCRS* A, restrict const float* g, restrict float* y){
	int i,j;
    int n = A->n;
    int nnz = A->nnz;
    restrict int *ptr = A->ptr;
    restrict int *index = A->index;
    restrict float *value = A->value;

#pragma acc kernels present(ptr[0:n+1],index[0:nnz],value[0:nnz], g[0:n], y[0:n])
    {
#pragma acc loop independent
        for(i=0; i<n; i++){
            float tmp = 0.0;
#pragma acc loop independent reduction(+:tmp)
            for(j=ptr[i]; j<ptr[i+1]; j++){
                tmp+=value[j]*g[index[j]];
            }
            y[i]=tmp;
        }
    }
}

The output is:

matvec:                                                                        
     54, Generating present(y[0:n])
         Generating present(g[0:n])
         Generating present(value[0:nnz])
         Generating present(index[0:nnz])
         Generating present(ptr[0:n+1])
         Generating compute capability 2.0 binary
     57, Loop is parallelizable
         Accelerator kernel generated
         57, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
             Cached references to size [(x+1)] block of 'ptr'
             CC 2.0 : 23 registers; 0 shared, 100 constant, 0 local memory bytes
     60, Loop is parallelizable

If I read the compiler feedback correctly, this parallel version parallelizes the inner loop as well:

inline void matvec(const struct MatrixCRS* A, restrict const floatType* g, restrict floatType* y){
	int i,j;
    int n = A->n;
    int nnz = A->nnz;
    restrict int *ptr = A->ptr;
    restrict int *index = A->index;
    restrict floatType *value = A->value;

#pragma acc parallel present(ptr[0:n+1],index[0:nnz],value[0:nnz], g[0:n], y[0:n]) vector_length(32)
    {
#pragma acc loop gang
        for(i=0; i<n; i++){
            floatType tmp = 0.0;
#pragma acc loop vector reduction(+:tmp)
            for(j=ptr[i]; j<ptr[i+1]; j++){
                tmp+=value[j]*g[index[j]];
            }
            y[i]=tmp;
        }
    }
}

This is the respective compiler feedback:

matvec:
     54, Accelerator kernel generated
         54, CC 2.0 : 20 registers; 32 shared, 92 constant, 0 local memory bytes
         57, #pragma acc loop gang /* blockIdx.x */
         60, #pragma acc loop vector(32) /* threadIdx.x */
     54, Generating present(y[0:n])
         Generating present(g[0:n])
         Generating present(value[0:nnz])
         Generating present(index[0:nnz])
         Generating present(ptr[0:n+1])
         Generating compute capability 2.0 binary
     60, Loop is parallelizable

Best,
Paul

Hi Paul,

Since the compiler has scheduled the outer loop as a “gang, vector(128)”, it ignores the inner loop directive. Scheduling the outer loop as the vector makes it impossible to perform the parallel sum reduction in the inner loop. You can override the compiler’s choice by explicitly setting the inner loop schedule to “vector”.

Try both schedules and see which one is optimal for your code. I’ve seen it go either way, but more often find parallelizing just the outer loop to be faster; a sketch of that variant follows after the example below.

Hope this helps,
Mat

% cat test.c
struct MatrixCRS {
    int n;
    int nnz;
    int * ptr;
    int * index;
    float * value;
};


void matvec(const struct MatrixCRS* A, restrict const float* g, restrict float* y){
   int i,j;
    int n = A->n;
    int nnz = A->nnz;
    restrict int *ptr = A->ptr;
    restrict int *index = A->index;
    restrict float *value = A->value;

#pragma acc kernels present(ptr[0:n+1],index[0:nnz],value[0:nnz], g[0:n], y[0:n])
    {
#pragma acc loop independent 
        for(i=0; i<n; i++){
            float tmp = 0.0;
#pragma acc loop independent vector reduction(+:tmp)
            for(j=ptr[i]; j<ptr[i+1]; j++){
                tmp+=value[j]*g[index[j]];
            }
            y[i]=tmp;
        }
    }
} 
% pgcc -c test.c -Minfo=accel -Msafeptr -acc
matvec:
     18, Generating present(y[0:n])
         Generating present(g[0:n])
         Generating present(value[0:nnz])
         Generating present(index[0:nnz])
         Generating present(ptr[0:n+1])
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     21, Loop is parallelizable
         Accelerator kernel generated
         21, #pragma acc loop gang /* blockIdx.x */
             CC 1.0 : 16 registers; 96 shared, 4 constant, 0 local memory bytes
             CC 2.0 : 16 registers; 32 shared, 92 constant, 0 local memory bytes
         24, #pragma acc loop vector(128) /* threadIdx.x */
         Loop is parallelizable
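
For the other schedule, here is a minimal sketch that parallelizes only the outer loop and keeps the inner loop sequential (same declarations as in your matvec; this just shows the shape of it, not a tuned version):

#pragma acc kernels present(ptr[0:n+1],index[0:nnz],value[0:nnz], g[0:n], y[0:n])
    {
#pragma acc loop independent
        for(i=0; i<n; i++){
            float tmp = 0.0;
            /* inner loop stays sequential; each thread handles one row */
#pragma acc loop seq
            for(j=ptr[i]; j<ptr[i+1]; j++){
                tmp += value[j]*g[index[j]];
            }
            y[i] = tmp;
        }
    }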

Hi Mat,

Thanks for your quick help. Unfortunately, the compiler still schedules the outer loop across vector lanes as well (“gang, vector(64)”). However, it does seem to parallelize the inner loop now.

inline void matvec(const struct MatrixCRS* A, restrict const floatType* g, restrict floatType* y){
	int i,j;
    int n = A->n;
    int nnz = A->nnz;
    restrict int *ptr = A->ptr;
    restrict int *index = A->index;
    restrict floatType *value = A->value;

#pragma acc kernels present(ptr[0:n+1],index[0:nnz],value[0:nnz], g[0:n], y[0:n])
    {
#pragma acc loop independent 
        for(i=0; i<n; i++){
            floatType tmp = 0.0;
#pragma acc loop independent vector reduction(+:tmp)
//#pragma unroll(32)
            for(j=ptr[i]; j<ptr[i+1]; j++){
                tmp+=value[j]*g[index[j]];
            }
            y[i]=tmp;
        }
    }
}

The compiler feedback is:

     53, Generating present(y[0:n])
         Generating present(g[0:n])
         Generating present(value[0:nnz])
         Generating present(index[0:nnz])
         Generating present(ptr[0:n+1])
         Generating compute capability 2.0 binary
     56, Loop is parallelizable
         Accelerator kernel generated
         56, #pragma acc loop gang, vector(64) /* blockIdx.x threadIdx.x */
             Cached references to size [(x+1)] block of 'ptr'
             CC 2.0 : 22 registers; 16 shared, 92 constant, 0 local memory bytes
         60, #pragma acc loop vector(32) /* threadIdx.y */
         Loop is parallelizable

This is still true if I add “gang” to the outer loop.

I’m using pgi 12.8 with the following options:
-Minfo=accel -acc -ta=nvidia,4.1,cc20

Best,
Paul

Hi Paul,

I’m using pgi 12.8 with the following options

I was using 12.9, hence the difference.

What’s the performance delta between the various schedules? Are there any correctness issues?

  • Mat

Hi Mat,

The version which parallelizes the inner loop runs for 4.5 seconds, while the version which does not parallelize the inner loop takes 6.15 seconds. If I change the vector length from 256 (chosen by the compiler) to 32, the runtime drops further to 2.67 seconds. So it makes quite a difference.

Moreover, the last version I’ve posted does not return correct results even though the compiler does not
issue any warnings.

Best,
Paul

Hi,

Here is yet another loop I’m having problems with. The compiler keeps telling me that it has loop-carried dependencies despite the fact that I’m using the restrict keyword.

#pragma acc parallel loop gang private(i,j) present(X[0:n*n], XNext[0:n*n]) vector_length(32)
    for(i=1;i < n-1 ;++i){
        double max = 0.0;

#pragma acc loop vector
        for(j=1;j < n-1 ;++j){

            //average the four neighbouring vertices
            XNext[IDX(i,j,n)] = (X[IDX(i+1,j,n)] +  X[IDX(i-1,j,n)] + X[IDX(i,j+1,n)] + X[IDX(i,j-1,n)])/4.0;

            //find maximum difference between old and new solution
            max = fabs(XNext[IDX(i,j,n)] - X[IDX(i,j,n)]);
            if(delta < max)
                delta = max;
        }
    }

    return delta;
}
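
(IDX here is just a plain 2-D index macro into the n-by-n grid, essentially along the lines of

#define IDX(i,j,n) ((i)*(n) + (j))

modulo the exact storage order.)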

Here is the compiler feedback:

22, Accelerator kernel generated
         22, CC 2.0 : 21 registers; 0 shared, 100 constant, 0 local memory bytes
         23, #pragma acc loop gang /* blockIdx.x */
         27, #pragma acc loop vector(32) /* threadIdx.x */
         34, Max reduction generated for delta
     22, Generating present(XNext[0:n*n])
         Generating present(X[0:n*n])
         Generating compute capability 2.0 binary
     27, Complex loop carried dependence of '*(X)' prevents parallelization
         Complex loop carried dependence of '*(XNext)' prevents parallelization

It seems that this loop has been parallelized, but why is there this dependence? If I compile with -Msafeptr these dependencies disappear and the program runs a little faster (14.7 s vs. 15.9 s, both with acc_init()).

I’m very thankful for any advice.

Best,
Paul

Hi Paul,

How are you using “restrict”? Users often put it in the wrong place, which changes its meaning. For example, this is correct:

float * restrict A;

But this is incorrect:

float restrict * A;
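
Applied to the declarations in your matvec, that would look like this, for example:

const float * restrict g;           /* pointer to const float; restrict qualifies the pointer */
int   * restrict ptr   = A->ptr;    /* rather than "restrict int *ptr" */
int   * restrict index = A->index;
float * restrict value = A->value;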
  • Mat

Hi Mat,

Thank you for the clarification; it’s working now.

Best,
Paul

Hi,

I recently compared the performance of the sparse matrix-vector multiplication (see page 1) with my NVIDIA CUDA implementation, and it turned out that the OpenACC version only reaches ~30% of the CUDA performance.

The OpenACC implementation follows the design of the CUDA version very closely. Moreover, both versions launch the same number of thread blocks and threads and use the same number of registers (i.e. 20 registers per thread).

Do you have a guess why there is such a big performance delta between the two implementations?

Best,
Paul

Hi Paul,

What does the profile information show, and how does it compare to the CUDA run? (i.e. set the environment variable PGI_ACC_TIME=1.) There are three things to compare: data movement, initialization time, and kernel execution time. Which one are you comparing, or are you looking at total time?
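
For example, with a csh-style shell (the executable name is just a placeholder):

% setenv PGI_ACC_TIME 1
% ./cg

The accelerator timing summary is then printed when the program exits.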

  • Mat

Hi Mat,

This kernel is part of a Conjugate Gradient (CG) method and gets called many times.
I’m only measuring the time for the CG method to complete (i.e. I call acc_init() before starting the timer). Moreover, a data region keeps the data on the device and avoids unnecessary transfers between host and device.
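
Simplified, the surrounding driver looks roughly like this (the vector and loop-control names here are just placeholders for the actual CG work vectors):

#pragma acc data copyin(ptr[0:n+1], index[0:nnz], value[0:nnz], b[0:n]) \
                 copy(x[0:n]) create(r[0:n], p[0:n], q[0:n])
    {
        for(k=0; k<maxIter && !converged; k++){
            /* ptr, index, value alias A->ptr etc., as in matvec(); the present()
               clauses in matvec() find everything already on the device */
            matvec(A, p, q);
            /* ... remaining CG kernels: dot products, axpy updates ... */
        }
    }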

Here is the profiling feedback for the OpenACC version:

  matvec
    53: region entered 1631 times
        time(us): total=1,533,762 init=109 region=1,533,653
                  kernels=1,508,950
        w/o init: total=1,533,653 max=1,060 min=937 avg=940
        53: kernel launched 1631 times
            grid: [16614]  block: [32]
            time(us): total=1,508,950 max=954 min=922 avg=925

Profiling the CUDA and OpenACC versions with NVIDIA’s Visual Profiler shows that the CUDA implementation of the sparse matrix-vector multiplication (SpMV) takes roughly 1/3 of the time of its OpenACC counterpart.

An additional note: My CUDA version is more or less a trivial translation of the OpenACC directives/clauses to CUDA.

Best,
Paul

Thanks Paul, so the difference is just in the kernel itself. Things to look at are how the memory is being accessed, how the inner loop reduction is being performed, and whether caching is being used (and used effectively).

Some quick ideas would be to:

  1. Disable auto-caching (-ta=nvidia,nocache); a compile line is sketched right after this list.
  2. Use a longer vector length. 32 may be too small; try 128 or 256.
  3. Accelerate just the outer loop and don’t perform the inner loop reduction.
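
For item 1, combined with the flags you listed earlier, the compile line would look something like this (the source file name is just a placeholder):

% pgcc -c matvec.c -Minfo=accel -acc -ta=nvidia,4.1,cc20,nocache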

My best guess would be that the overhead of the reduction code, especially with a small vector length, is hurting your performance.

If none of these help, it’s probably best for you to send me both codes so I can track down the difference.

  • Mat

Hi Mat,

Thanks for your advice. I’ll have a closer look at it tomorrow.

Some quick remarks:
(1) I had the same idea because the Visual Profiler showed that OpenACC uses slightly more memory per block. However, disabling caching did not affect the performance at all.

(2) The CUDA version uses exactly the same schedule (i.e. one thread block per row and 32 threads per block). Increasing this value to, say, 64 decreases performance because most rows do not even have 64 non-zeros.

(3) I’ll try that tomorrow, but it would differ from the CUDA schedule (i.e. it might result in poor performance).

Best,
Paul

Hi again,

here are some final remarks:

(1) The nocache option has no effect on the runtime.

(2) The compiler’s default vector length of 256 performs much worse than the manual choice of 32; the same goes for 128.

(3) Only parallelizing the outer loop (i.e. #pragma acc loop seq for the inner loop and leaving the schedule of the outer loop to the compiler) results in poor performance as well.

I’m very curious why this is the case. Can you tell me how you would profile this, so that I can give it a try?

In any case, if you want I can send you the source code.

UPDATE: I’ve prepared a .tar archive containing the source code and the input data (36 MB). How do you want me to send it, just via mail?

Best,
Paul

Hi Paul,

I’ve prepared a .tar archive containing the source code and the input data (36 MB). How do you want me to send it, just via mail?

That’s probably too big to send via email. Can you FTP it? (https://www.pgroup.com/support/ftp_access.php). Please let me know when it’s there so I can ask our webmaster to get it for me.

I’m very curious why this is the case. Can you tell me how you would profile this such that I can give it a try?

The next step would be to use the CUDA profiler and hardware counters to get a better idea of where the performance differences occur. I’ll also compare the CUDA code we’re generating with your hand-tuned CUDA.

Thanks,
Mat