MatMul with OpenACC

Hello,

Does someone have a simple matrix multiplication example that runs with pgcc on the GPU? I’ve tried a lot of things, but I always get stuck on the parallelization of the innermost loop: either I get invalid-loop errors (when I try to create a private array to resolve the dependence) or a “cannot be parallelized/vectorized” message. The code does run if I compile it normally (not for an accelerator).


Any help would be greatly appreciated. I’m a newbie trying to learn how to use OpenACC to accelerate my code on the GPU.

Hi Tapasya Patki,

Here are some examples using the PGI Accelerator Model; I’ll post the OpenACC versions next. The major difference is that OpenACC only allows 1-D arrays, so “float **” needs to be translated to “float *”, and the “copy” clauses in OpenACC take a starting bound plus the number of elements to copy.
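
For instance, sketching just the data clause (the array a and the dimensions m and n are the same placeholders used in the code below), the same m*n-element section is written with an end index in the PGI Accelerator Model but with an element count in OpenACC:

/* PGI Accelerator Model: [start : end index] */
#pragma acc data region copyout(a[0:(m*n)-1])

/* OpenACC: [start : number of elements] */
#pragma acc data copyout(a[0:(m*n)])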

$ cat MatrixMultiplication.c
#define M 1024
#define N 2048
#define P 4096

/* multiply an M-line, P-column matrix by a P-line, N-column matrix
   to produce an M-line, N-column matrix */
void
MatrixMultiplication1(restrict float a[M][N], restrict float b[M][P], restrict float c[P][N]) 
{
    int i, j, k ;

#pragma acc region for parallel, vector(8)
    for (i=0; i<M; i++) {
#pragma acc for parallel, vector(8)
        for (j=0; j<N; j++) {
#pragma acc for seq
            for (k=0; k<P; k++) 
                a[i][j] += b[i][k]*c[k][j] ;
        }
    }
}

void
MatrixMultiplication2(restrict float a[M][N], restrict float b[M][P], restrict float c[P][N]) 
{
    int i, j, k ;

#pragma acc region for parallel, vector(8)
    for (i=0; i<M; i++){
#pragma acc for parallel, vector(8)
        for (j=0; j<N; j++) {
            float sum = 0.0 ;
#pragma acc for seq
            for (k=0; k<P; k++) 
                sum += b[i][k]*c[k][j] ;
            a[i][j] = sum ;
        }
    }
}

void
MatrixMultiplication3(float * restrict a, float * restrict b, float * restrict c, int m, int n, int p) 
{
    int i, j, k ;

#pragma acc data region copyout(a[0:(m*n)-1]), copyin(b[0:(m*p)-1],c[0:(p*n)-1])
{
#pragma acc region for parallel, vector(8)
    for (i=0; i<m; i++){
#pragma acc for parallel, vector (8)
        for (j=0; j<n; j++) {
#pragma acc for seq
            for (k=0; k<p; k++) 
                a[i*n+j] += b[i*p+k]*c[k*n+j] ;
	}
    }
}
}

void
MatrixMultiplication4(float * restrict a,float * restrict b, float * restrict c, int m, int n, int p) 
{
    int i, j, k ;

#pragma acc data region copyout(a[0:(m*n)-1]), copyin(b[0:(m*p)-1],c[0:(p*n)-1])
{
#pragma acc region for parallel, vector(8)
    for (i=0; i<m; i++){
#pragma acc for parallel, vector (8)
        for (j=0; j<n; j++) 
        {
            float sum = 0.0 ;
#pragma acc for seq
            for (k=0; k<p; k++) 
                sum += b[i*p+k]*c[k*n+j] ;
            a[i*n+j] = sum ;
        }
    }
}
}
$ pgcc -ta=nvidia -c -Minfo=accel MatrixMultiplication.c  -Msafeptr -V12.3
MatrixMultiplication1:
     12, Generating copy(a[:1023][:])
         Generating copyin(b[:1023][:])
         Generating copyin(c[:4095][:])
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     13, Loop is parallelizable
     15, Loop is parallelizable
     17, Complex loop carried dependence of '*(a)' prevents parallelization
         Loop carried dependence of '*(a)' prevents parallelization
         Loop carried backward dependence of '*(a)' prevents vectorization
         Accelerator kernel generated
         13, #pragma acc for parallel, vector(8) /* blockIdx.y threadIdx.y */
         15, #pragma acc for parallel, vector(8) /* blockIdx.x threadIdx.x */
         17, #pragma acc for seq(8)
             Cached references to size [8x8] block of 'b'
             Cached references to size [8x8] block of 'c'
             CC 1.0 : 9 registers; 560 shared, 16 constant, 0 local memory bytes; 66% occupancy
             CC 2.0 : 17 registers; 520 shared, 56 constant, 0 local memory bytes; 33% occupancy
MatrixMultiplication2:
     28, Generating copyout(a[:1023][:])
         Generating copyin(b[:1023][:])
         Generating copyin(c[:4095][:])
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     29, Loop is parallelizable
     31, Loop is parallelizable
         Accelerator kernel generated
         29, #pragma acc for parallel, vector(8) /* blockIdx.y threadIdx.y */
         31, #pragma acc for parallel, vector(8) /* blockIdx.x threadIdx.x */
             CC 1.0 : 12 registers; 48 shared, 8 constant, 0 local memory bytes; 66% occupancy
             CC 2.0 : 20 registers; 8 shared, 56 constant, 0 local memory bytes; 33% occupancy
     34, Loop is parallelizable
MatrixMultiplication3:
     46, Generating copyout(a[:m*n-1])
         Generating copyin(c[:p*n-1])
         Generating copyin(b[:p*m-1])
     48, Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     49, Loop carried dependence of '*(a)' prevents parallelization
         Loop carried backward dependence of '*(a)' prevents vectorization
     51, Loop is parallelizable
     53, Complex loop carried dependence of '*(a)' prevents parallelization
         Loop carried dependence of '*(a)' prevents parallelization
         Loop carried backward dependence of '*(a)' prevents vectorization
         Accelerator kernel generated
         49, #pragma acc for parallel, vector(8) /* blockIdx.y threadIdx.y */
         51, #pragma acc for parallel, vector(8) /* blockIdx.x threadIdx.x */
         53, #pragma acc for seq
             CC 1.0 : 19 registers; 84 shared, 16 constant, 0 local memory bytes; 50% occupancy
             CC 2.0 : 25 registers; 8 shared, 92 constant, 0 local memory bytes; 33% occupancy
MatrixMultiplication4:
     65, Generating copyout(a[:m*n-1])
         Generating copyin(c[:p*n-1])
         Generating copyin(b[:p*m-1])
     67, Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     68, Loop carried dependence of '*(a)' prevents parallelization
         Loop carried backward dependence of '*(a)' prevents vectorization
     70, Loop is parallelizable
         Accelerator kernel generated
         68, #pragma acc for parallel, vector(8) /* blockIdx.y threadIdx.y */
         70, #pragma acc for parallel, vector(8) /* blockIdx.x threadIdx.x */
             CC 1.0 : 17 registers; 84 shared, 12 constant, 0 local memory bytes; 50% occupancy
             CC 2.0 : 24 registers; 8 shared, 92 constant, 0 local memory bytes; 33% occupancy
     74, Loop is parallelizable
- Mat

Here are the OpenACC versions:


$ cat MatrixMultiplication_openacc.c

void
MatrixMultiplication3(float * restrict a, float * restrict b, float * restrict c, int m, int n, int p) 
{
    int i, j, k ;

#pragma acc data copyout(a[0:(m*n)]), copyin(b[0:(m*p)],c[0:(p*n)])
{
#pragma acc kernels loop gang, vector(8)
    for (i=0; i<m; i++){
#pragma acc loop gang, vector (8)
        for (j=0; j<n; j++) {
#pragma acc loop seq
            for (k=0; k<p; k++) 
                a[i*n+j] += b[i*p+k]*c[k*n+j] ;
	}
    }
}
}

void
MatrixMultiplication4(float * restrict a,float * restrict b, float * restrict c, int m, int n, int p) 
{
    int i, j, k ;

#pragma acc data copyout(a[0:(m*n)]), copyin(b[0:(m*p)],c[0:(p*n)])
{
#pragma acc kernels loop gang, vector(8)
    for (i=0; i<m; i++){
#pragma acc loop gang, vector (8)
        for (j=0; j<n; j++) 
        {
            float sum = 0.0 ;
#pragma acc loop seq
            for (k=0; k<p; k++) 
                sum += b[i*p+k]*c[k*n+j] ;
            a[i*n+j] = sum ;
        }
    }
}
}

$ pgcc -acc -c -Minfo=accel MatrixMultiplication_openacc.c  -Msafeptr -V12.3
MatrixMultiplication3:
      7, Generating copyout(a[:m*n])
         Generating copyin(c[:p*n])
         Generating copyin(b[:p*m])
      9, Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     10, Loop carried dependence of '*(a)' prevents parallelization
         Loop carried backward dependence of '*(a)' prevents vectorization
     12, Loop is parallelizable
     14, Complex loop carried dependence of '*(a)' prevents parallelization
         Loop carried dependence of '*(a)' prevents parallelization
         Loop carried backward dependence of '*(a)' prevents vectorization
         Accelerator kernel generated
         10, #pragma acc loop gang, vector(8) /* blockIdx.y threadIdx.y */
         12, #pragma acc loop gang, vector(8) /* blockIdx.x threadIdx.x */
         14, #pragma acc loop seq
             Non-stride-1 accesses for array 'b'
             CC 1.0 : 16 registers; 84 shared, 12 constant, 0 local memory bytes; 66% occupancy
             CC 2.0 : 26 registers; 8 shared, 92 constant, 0 local memory bytes; 33% occupancy
MatrixMultiplication4:
     26, Generating copyout(a[:m*n])
         Generating copyin(c[:p*n])
         Generating copyin(b[:p*m])
     28, Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     29, Loop carried dependence of '*(a)' prevents parallelization
         Loop carried backward dependence of '*(a)' prevents vectorization
     31, Loop is parallelizable
         Accelerator kernel generated
         29, #pragma acc loop gang, vector(8) /* blockIdx.y threadIdx.y */
         31, #pragma acc loop gang, vector(8) /* blockIdx.x threadIdx.x */
             CC 1.0 : 17 registers; 84 shared, 12 constant, 0 local memory bytes; 50% occupancy
             CC 2.0 : 26 registers; 8 shared, 92 constant, 0 local memory bytes; 33% occupancy
     35, Loop is parallelizable

When trying to create an exe with “pgcc matmul4.c -Msafeptr” (I renamed the PGI Accelerator file to matmul4.c), I receive this error:

libcmt.lib(crt0.oobj) : error LNK2019: unresolved external symbol main referenced in function __tmainCRTStartup
matmul4.exe : fatal error LNK1120: 1 unresolved externals

I tried adding “int main(); int main() {}” and it was able to create the exe, but I have a feeling that’s not a real fix; it doesn’t seem to be running the matrix multiplication. Any ideas on a solution?

Thanks

In the main program that you create, you will need to allocate and initialize the arrays a, b, and c used in the matrix multiply functions, and then call one of the MatrixMultiplication functions shown earlier. For example, here is what the code might look like when calling the PGI Accelerator Model MatrixMultiplication1 function:

#include <stdio.h>
#include <stdlib.h>

/* must match the values used in MatrixMultiplication.c */
#define M 1024
#define N 2048
#define P 4096

void MatrixMultiplication1(float a[M][N], float b[M][P], float c[P][N]);

int main(void)
{
    float *a, *b, *c;
    int i;

    a = (float *) malloc(M*N*sizeof(float));
    b = (float *) malloc(M*P*sizeof(float));
    c = (float *) malloc(P*N*sizeof(float));

    for (i = 0; i <  M*N; i++) {
       a[i] = 0.0f;
    }
    for (i = 0; i <  M*P; i++) {
       b[i] = (float) i;
    }
    for (i = 0; i <  P*N; i++) {
       c[i] = 1.0f;
    }

    /* the flat buffers are passed as 2-D arrays to the fixed-size version */
    MatrixMultiplication1((float (*)[N])a, (float (*)[P])b, (float (*)[N])c);

    printf("%f %f\n", a[0], a[M*N-1]);

    free(a);
    free(b);
    free(c);
    return 0;
}
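
Assuming the driver above is saved as main.c alongside MatrixMultiplication.c (the file names are just for illustration), a build line along these lines, using the same flags as earlier, should link and run:

$ pgcc -ta=nvidia -Minfo=accel -Msafeptr MatrixMultiplication.c main.c -o matmul
$ ./matmul

The two printed values are simply the first and last elements of the result matrix.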

Hi Mat,

I compiled your matrix-matrix multiplication once with pgcc 12.9 and once with pgcc 12.5. The two compilers produced different schedules, which result in different runtimes.

Compiler feedback of pgcc 12.9:

     24, Generating present_or_copyin(B[0:n*n])
         Generating present_or_copyin(A[0:n*n])
         Generating present_or_copy(C[0:n*n])
         Generating compute capability 2.0 binary
     27, Loop carried dependence of '*(C)' prevents parallelization
         Loop carried backward dependence of '*(C)' prevents vectorization
     29, Loop is parallelizable
         Accelerator kernel generated
         29, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */
             CC 2.0 : 25 registers; 0 shared, 80 constant, 0 local memory bytes
     32, Loop is parallelizable

Compiler feedback of pgcc 12.5:

     24, Generating copyin(B[0:n*n])
         Generating copyin(A[0:n*n])
         Generating copy(C[0:n*n])
         Generating compute capability 2.0 binary
     27, Loop carried dependence of '*(C)' prevents parallelization
         Loop carried backward dependence of '*(C)' prevents vectorization
     29, Loop is parallelizable
         Accelerator kernel generated
         27, #pragma acc loop gang, vector(8) /* blockIdx.y threadIdx.y */
         29, #pragma acc loop gang, vector(8) /* blockIdx.x threadIdx.x */
             CC 2.0 : 22 registers; 8 shared, 80 constant, 0 local memory bytes
     32, Loop is parallelizable

To my surprise, pgcc 12.5 yields slightly better performance than pgcc 12.9 (2.4 s vs. 2.8 s for 1024 x 1024 matrices on a Quadro 6000, with acc_init()).

Why does the more recent compiler ignore my schedule, while pgcc 12.5 respects it?

Thank you.
Best,
Paul

Hi Paul,

12.9 is being a little more picky about the loop dependency and is only parallelizing the “j” loop. I should have done this in the first post, but the solution is to add the independent clause.

% cat mm4.c
void
MatrixMultiplication4(float * restrict a,float * restrict b, float * restrict c, int m, int n, int p)
{
    int i, j, k ;

#pragma acc data copyout(a[0:(m*n)]), copyin(b[0:(m*p)],c[0:(p*n)])
{
#pragma acc kernels loop gang, vector(8) independent
    for (i=0; i<m; i++){
#pragma acc loop gang, vector(8) independent 
        for (j=0; j<n; j++)
        {
            float sum = 0.0 ;
            for (k=0; k<p; k++)
                sum += b[i*p+k]*c[k*n+j] ;
            a[i*n+j] = sum ;
        }
    }
}
} 
% pgcc -acc -Minfo mm4.c -V12.9 -c
MatrixMultiplication4:
      6, Generating copyout(a[0:m*n])
         Generating copyin(c[0:p*n])
         Generating copyin(b[0:p*m])
      8, Generating present_or_copyout(a[0:m*n])
         Generating present_or_copyin(b[0:p*m])
         Generating present_or_copyin(c[0:p*n])
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
      9, Loop is parallelizable
     11, Loop is parallelizable
         Accelerator kernel generated
          9, #pragma acc loop gang, vector(8) /* blockIdx.y threadIdx.y */
         11, #pragma acc loop gang, vector(8) /* blockIdx.x threadIdx.x */
             CC 1.0 : 20 registers; 76 shared, 8 constant, 0 local memory bytes
             CC 2.0 : 22 registers; 0 shared, 92 constant, 0 local memory bytes
     14, Loop is parallelizable

Though, I doubt an 8x8 vector is still the ideal schedule. Personally, I’ve started to avoid using the loop schedule clauses altogether and let the compiler choose. I’m concerned that if I set the schedule for one device, it may not be the best for another.

% cat mm4.c
void
MatrixMultiplication4(float * restrict a,float * restrict b, float * restrict c, int m, int n, int p)
{
    int i, j, k ;

#pragma acc data copyout(a[0:(m*n)]), copyin(b[0:(m*p)],c[0:(p*n)])
{
#pragma acc kernels loop collapse(2) independent
    for (i=0; i<m; i++){
        for (j=0; j<n; j++)
        {
            float sum = 0.0 ;
            for (k=0; k<p; k++)
                sum += b[i*p+k]*c[k*n+j] ;
            a[i*n+j] = sum ;
        }
    }
}
} 
% pgcc -acc -Minfo mm4.c -V12.9 -c
MatrixMultiplication4:
      6, Generating copyout(a[0:m*n])
         Generating copyin(c[0:p*n])
         Generating copyin(b[0:p*m])
      8, Generating present_or_copyout(a[0:m*n])
         Generating present_or_copyin(b[0:p*m])
         Generating present_or_copyin(c[0:p*n])
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
      9, Loop is parallelizable
     10, Loop is parallelizable
         Accelerator kernel generated
          9, #pragma acc loop gang, vector(4) /* blockIdx.y threadIdx.y */
         10, #pragma acc loop gang, vector(64) /* blockIdx.x threadIdx.x */
             CC 1.0 : 19 registers; 76 shared, 8 constant, 0 local memory bytes
             CC 2.0 : 22 registers; 0 shared, 92 constant, 0 local memory bytes
     13, Loop is parallelizable
- Mat

True, true, that would be ideal in terms of portability for future devices.

Thank you for your help.

Best,
Paul