# MatMul with OpenACC

Hello,

Does anyone have a simple matrix multiplication example that runs with pgcc on the GPU? I’ve tried a lot of things, but I always get stuck on the parallelization of the innermost loop: either I get invalid-loop errors (if I try to create a private array to resolve the dependence), or cannot-be-parallelized/vectorized errors. The code does run if I compile it normally (not for an accelerator).

Any help would be greatly appreciated. I’m a newbie trying to learn how to use OpenACC to accelerate my code on the GPU.

Hi Tapasya Patki,

Here are some examples using the PGI Accelerator Model. I’ll post the OpenACC versions next. The major difference is that OpenACC only allows 1-D arrays here, so “float **” needs to be translated to “float *”, and the copy clauses in OpenACC take a starting bound plus the number of elements to copy (rather than start and end bounds).

``````$ cat MatrixMultiplication.c
#define M 1024
#define N 2048
#define P 4096

/* multiply an M-row, P-column matrix by a P-row, N-column matrix
   to produce an M-row, N-column matrix */
void
MatrixMultiplication1(restrict float a[M][N], restrict float b[M][P], restrict float c[P][N])
{
    int i, j, k;

#pragma acc region for parallel, vector(8)
    for (i = 0; i < M; i++) {
#pragma acc for parallel, vector(8)
        for (j = 0; j < N; j++) {
#pragma acc for seq
            for (k = 0; k < P; k++)
                a[i][j] += b[i][k]*c[k][j];
        }
    }
}

void
MatrixMultiplication2(restrict float a[M][N], restrict float b[M][P], restrict float c[P][N])
{
    int i, j, k;

#pragma acc region for parallel, vector(8)
    for (i = 0; i < M; i++) {
#pragma acc for parallel, vector(8)
        for (j = 0; j < N; j++) {
            float sum = 0.0;
#pragma acc for seq
            for (k = 0; k < P; k++)
                sum += b[i][k]*c[k][j];
            a[i][j] = sum;
        }
    }
}

void
MatrixMultiplication3(float * restrict a, float * restrict b, float * restrict c, int m, int n, int p)
{
    int i, j, k;

#pragma acc data region copyout(a[0:(m*n)-1]), copyin(b[0:(m*p)-1],c[0:(p*n)-1])
    {
#pragma acc region for parallel, vector(8)
        for (i = 0; i < m; i++) {
#pragma acc for parallel, vector(8)
            for (j = 0; j < n; j++) {
#pragma acc for seq
                for (k = 0; k < p; k++)
                    a[i*n+j] += b[i*p+k]*c[k*n+j];
            }
        }
    }
}

void
MatrixMultiplication4(float * restrict a, float * restrict b, float * restrict c, int m, int n, int p)
{
    int i, j, k;

#pragma acc data region copyout(a[0:(m*n)-1]), copyin(b[0:(m*p)-1],c[0:(p*n)-1])
    {
#pragma acc region for parallel, vector(8)
        for (i = 0; i < m; i++) {
#pragma acc for parallel, vector(8)
            for (j = 0; j < n; j++) {
                float sum = 0.0;
#pragma acc for seq
                for (k = 0; k < p; k++)
                    sum += b[i*p+k]*c[k*n+j];
                a[i*n+j] = sum;
            }
        }
    }
}
$ pgcc -ta=nvidia -c -Minfo=accel MatrixMultiplication.c -Msafeptr -V12.3
MatrixMultiplication1:
12, Generating copy(a[:1023][:])
Generating copyin(b[:1023][:])
Generating copyin(c[:4095][:])
Generating compute capability 1.0 binary
Generating compute capability 2.0 binary
13, Loop is parallelizable
15, Loop is parallelizable
17, Complex loop carried dependence of '*(a)' prevents parallelization
Loop carried dependence of '*(a)' prevents parallelization
Loop carried backward dependence of '*(a)' prevents vectorization
Accelerator kernel generated
13, #pragma acc for parallel, vector(8) /* blockIdx.y threadIdx.y */
15, #pragma acc for parallel, vector(8) /* blockIdx.x threadIdx.x */
17, #pragma acc for seq(8)
Cached references to size [8x8] block of 'b'
Cached references to size [8x8] block of 'c'
CC 1.0 : 9 registers; 560 shared, 16 constant, 0 local memory bytes; 66% occupancy
CC 2.0 : 17 registers; 520 shared, 56 constant, 0 local memory bytes; 33% occupancy
MatrixMultiplication2:
28, Generating copyout(a[:1023][:])
Generating copyin(b[:1023][:])
Generating copyin(c[:4095][:])
Generating compute capability 1.0 binary
Generating compute capability 2.0 binary
29, Loop is parallelizable
31, Loop is parallelizable
Accelerator kernel generated
29, #pragma acc for parallel, vector(8) /* blockIdx.y threadIdx.y */
31, #pragma acc for parallel, vector(8) /* blockIdx.x threadIdx.x */
CC 1.0 : 12 registers; 48 shared, 8 constant, 0 local memory bytes; 66% occupancy
CC 2.0 : 20 registers; 8 shared, 56 constant, 0 local memory bytes; 33% occupancy
34, Loop is parallelizable
MatrixMultiplication3:
46, Generating copyout(a[:m*n-1])
Generating copyin(c[:p*n-1])
Generating copyin(b[:p*m-1])
48, Generating compute capability 1.0 binary
Generating compute capability 2.0 binary
49, Loop carried dependence of '*(a)' prevents parallelization
Loop carried backward dependence of '*(a)' prevents vectorization
51, Loop is parallelizable
53, Complex loop carried dependence of '*(a)' prevents parallelization
Loop carried dependence of '*(a)' prevents parallelization
Loop carried backward dependence of '*(a)' prevents vectorization
Accelerator kernel generated
49, #pragma acc for parallel, vector(8) /* blockIdx.y threadIdx.y */
51, #pragma acc for parallel, vector(8) /* blockIdx.x threadIdx.x */
53, #pragma acc for seq
CC 1.0 : 19 registers; 84 shared, 16 constant, 0 local memory bytes; 50% occupancy
CC 2.0 : 25 registers; 8 shared, 92 constant, 0 local memory bytes; 33% occupancy
MatrixMultiplication4:
65, Generating copyout(a[:m*n-1])
Generating copyin(c[:p*n-1])
Generating copyin(b[:p*m-1])
67, Generating compute capability 1.0 binary
Generating compute capability 2.0 binary
68, Loop carried dependence of '*(a)' prevents parallelization
Loop carried backward dependence of '*(a)' prevents vectorization
70, Loop is parallelizable
Accelerator kernel generated
68, #pragma acc for parallel, vector(8) /* blockIdx.y threadIdx.y */
70, #pragma acc for parallel, vector(8) /* blockIdx.x threadIdx.x */
CC 1.0 : 17 registers; 84 shared, 12 constant, 0 local memory bytes; 50% occupancy
CC 2.0 : 24 registers; 8 shared, 92 constant, 0 local memory bytes; 33% occupancy
74, Loop is parallelizable
``````
• Mat

Here are the OpenACC versions:

``````$ cat MatrixMultiplication_openacc.c

void
MatrixMultiplication3(float * restrict a, float * restrict b, float * restrict c, int m, int n, int p)
{
    int i, j, k;

#pragma acc data copyout(a[0:(m*n)]), copyin(b[0:(m*p)],c[0:(p*n)])
    {
#pragma acc kernels loop gang, vector(8)
        for (i = 0; i < m; i++) {
#pragma acc loop gang, vector(8)
            for (j = 0; j < n; j++) {
#pragma acc loop seq
                for (k = 0; k < p; k++)
                    a[i*n+j] += b[i*p+k]*c[k*n+j];
            }
        }
    }
}

void
MatrixMultiplication4(float * restrict a, float * restrict b, float * restrict c, int m, int n, int p)
{
    int i, j, k;

#pragma acc data copyout(a[0:(m*n)]), copyin(b[0:(m*p)],c[0:(p*n)])
    {
#pragma acc kernels loop gang, vector(8)
        for (i = 0; i < m; i++) {
#pragma acc loop gang, vector(8)
            for (j = 0; j < n; j++) {
                float sum = 0.0;
#pragma acc loop seq
                for (k = 0; k < p; k++)
                    sum += b[i*p+k]*c[k*n+j];
                a[i*n+j] = sum;
            }
        }
    }
}

$ pgcc -acc -c -Minfo=accel MatrixMultiplication_openacc.c -Msafeptr -V12.3
MatrixMultiplication3:
7, Generating copyout(a[:m*n])
Generating copyin(c[:p*n])
Generating copyin(b[:p*m])
9, Generating compute capability 1.0 binary
Generating compute capability 2.0 binary
10, Loop carried dependence of '*(a)' prevents parallelization
Loop carried backward dependence of '*(a)' prevents vectorization
12, Loop is parallelizable
14, Complex loop carried dependence of '*(a)' prevents parallelization
Loop carried dependence of '*(a)' prevents parallelization
Loop carried backward dependence of '*(a)' prevents vectorization
Accelerator kernel generated
10, #pragma acc loop gang, vector(8) /* blockIdx.y threadIdx.y */
12, #pragma acc loop gang, vector(8) /* blockIdx.x threadIdx.x */
14, #pragma acc loop seq
Non-stride-1 accesses for array 'b'
CC 1.0 : 16 registers; 84 shared, 12 constant, 0 local memory bytes; 66% occupancy
CC 2.0 : 26 registers; 8 shared, 92 constant, 0 local memory bytes; 33% occupancy
MatrixMultiplication4:
26, Generating copyout(a[:m*n])
Generating copyin(c[:p*n])
Generating copyin(b[:p*m])
28, Generating compute capability 1.0 binary
Generating compute capability 2.0 binary
29, Loop carried dependence of '*(a)' prevents parallelization
Loop carried backward dependence of '*(a)' prevents vectorization
31, Loop is parallelizable
Accelerator kernel generated
29, #pragma acc loop gang, vector(8) /* blockIdx.y threadIdx.y */
31, #pragma acc loop gang, vector(8) /* blockIdx.x threadIdx.x */
CC 1.0 : 17 registers; 84 shared, 12 constant, 0 local memory bytes; 50% occupancy
CC 2.0 : 26 registers; 8 shared, 92 constant, 0 local memory bytes; 33% occupancy
35, Loop is parallelizable
``````

When trying to create an exe with “pgcc matmul4.c -Msafeptr” (I renamed the PGI Accelerator file to matmul4.c), I receive this error:

libcmt.lib(crt0.oobj) : error LNK2019: unresolved external symbol main referenced in function __tmainCRTStartup
matmul4.exe : fatal error LNK1120: 1 unresolved externals

I tried adding “int main(); int main() {}” and it was able to create the exe, but I have a feeling that’s not a real fix; it doesn’t seem to be running the matrix multiplication. Any ideas on a solution?

Thanks

In the main program that you created, you will need to allocate and initialize the arrays a, b, and c that are used in the matrix multiply functions. Then you will need to call one of the MatrixMultiplication functions shown earlier. For example, here is what the code might look like when calling the PGI Accelerator Model MatrixMultiplication1 function:

``````#include <stdio.h>
#include <stdlib.h>

/* assumes this is added to the file that defines M, N, P and
   MatrixMultiplication1 */
int main()
{
    float *a, *b, *c;
    int i;

    a = (float *) malloc(M*N*sizeof(float));
    b = (float *) malloc(M*P*sizeof(float));
    c = (float *) malloc(P*N*sizeof(float));

    for (i = 0; i < M*N; i++) {
        a[i] = 0.0f;
    }
    for (i = 0; i < M*P; i++) {
        b[i] = (float) i;
    }
    for (i = 0; i < P*N; i++) {
        c[i] = 1.0f;
    }

    MatrixMultiplication1((float (*)[N])a, (float (*)[P])b, (float (*)[N])c);

    printf("%f %f\n", a[0], a[M*N-1]);

    free(a);
    free(b);
    free(c);
    return 0;
}
``````

Hi Mat,

I compiled your matrix-matrix multiplication once with pgcc 12.9 and once with pgcc 12.5. The two compilers produced different schedules, which result in different runtimes.

Compiler feedback of pgcc 12.9:

``````     24, Generating present_or_copyin(B[0:n*n])
Generating present_or_copyin(A[0:n*n])
Generating present_or_copy(C[0:n*n])
Generating compute capability 2.0 binary
27, Loop carried dependence of '*(C)' prevents parallelization
Loop carried backward dependence of '*(C)' prevents vectorization
29, Loop is parallelizable
Accelerator kernel generated
29, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */
CC 2.0 : 25 registers; 0 shared, 80 constant, 0 local memory bytes
32, Loop is parallelizable
``````

Compiler feedback of pgcc 12.5:

``````     24, Generating copyin(B[0:n*n])
Generating copyin(A[0:n*n])
Generating copy(C[0:n*n])
Generating compute capability 2.0 binary
27, Loop carried dependence of '*(C)' prevents parallelization
Loop carried backward dependence of '*(C)' prevents vectorization
29, Loop is parallelizable
Accelerator kernel generated
27, #pragma acc loop gang, vector(8) /* blockIdx.y threadIdx.y */
29, #pragma acc loop gang, vector(8) /* blockIdx.x threadIdx.x */
CC 2.0 : 22 registers; 8 shared, 80 constant, 0 local memory bytes
32, Loop is parallelizable
``````

To my surprise, pgcc 12.5 yields slightly better performance than pgcc 12.9 (2.4 s vs. 2.8 s for matrices of size 1024 x 1024 on a Quadro 6000, with acc_init()).

Why does the more recent compiler ignore my schedule, while pgcc 12.5 respects it?

Thank you.
Best,
Paul

Hi Paul,

12.9 is being a little more picky about the loop dependency and is only parallelizing the “j” loop. I should have done this in the first post, but the solution is to add the independent clause.

``````% cat mm4.c
void
MatrixMultiplication4(float * restrict a, float * restrict b, float * restrict c, int m, int n, int p)
{
    int i, j, k;

#pragma acc data copyout(a[0:(m*n)]), copyin(b[0:(m*p)],c[0:(p*n)])
    {
#pragma acc kernels loop gang, vector(8) independent
        for (i = 0; i < m; i++) {
#pragma acc loop gang, vector(8) independent
            for (j = 0; j < n; j++) {
                float sum = 0.0;
                for (k = 0; k < p; k++)
                    sum += b[i*p+k]*c[k*n+j];
                a[i*n+j] = sum;
            }
        }
    }
}
% pgcc -acc -Minfo mm4.c -V12.9 -c
MatrixMultiplication4:
6, Generating copyout(a[0:m*n])
Generating copyin(c[0:p*n])
Generating copyin(b[0:p*m])
8, Generating present_or_copyout(a[0:m*n])
Generating present_or_copyin(b[0:p*m])
Generating present_or_copyin(c[0:p*n])
Generating compute capability 1.0 binary
Generating compute capability 2.0 binary
9, Loop is parallelizable
11, Loop is parallelizable
Accelerator kernel generated
9, #pragma acc loop gang, vector(8) /* blockIdx.y threadIdx.y */
11, #pragma acc loop gang, vector(8) /* blockIdx.x threadIdx.x */
CC 1.0 : 20 registers; 76 shared, 8 constant, 0 local memory bytes
CC 2.0 : 22 registers; 0 shared, 92 constant, 0 local memory bytes
14, Loop is parallelizable
``````

Though I doubt an 8x8 vector is still the ideal schedule. Personally, I’ve started avoiding the loop schedule clauses altogether and letting the compiler choose; I’m concerned that a schedule I set for one device may not be the best for another.

``````% cat mm4.c
void
MatrixMultiplication4(float * restrict a, float * restrict b, float * restrict c, int m, int n, int p)
{
    int i, j, k;

#pragma acc data copyout(a[0:(m*n)]), copyin(b[0:(m*p)],c[0:(p*n)])
    {
#pragma acc kernels loop collapse(2) independent
        for (i = 0; i < m; i++) {
            for (j = 0; j < n; j++) {
                float sum = 0.0;
                for (k = 0; k < p; k++)
                    sum += b[i*p+k]*c[k*n+j];
                a[i*n+j] = sum;
            }
        }
    }
}
% pgcc -acc -Minfo mm4.c -V12.9 -c
MatrixMultiplication4:
6, Generating copyout(a[0:m*n])
Generating copyin(c[0:p*n])
Generating copyin(b[0:p*m])
8, Generating present_or_copyout(a[0:m*n])
Generating present_or_copyin(b[0:p*m])
Generating present_or_copyin(c[0:p*n])
Generating compute capability 1.0 binary
Generating compute capability 2.0 binary
9, Loop is parallelizable
10, Loop is parallelizable
Accelerator kernel generated
9, #pragma acc loop gang, vector(4) /* blockIdx.y threadIdx.y */
10, #pragma acc loop gang, vector(64) /* blockIdx.x threadIdx.x */
CC 1.0 : 19 registers; 76 shared, 8 constant, 0 local memory bytes
CC 2.0 : 22 registers; 0 shared, 92 constant, 0 local memory bytes
13, Loop is parallelizable
``````
• Mat

True, true, that would be ideal in terms of portability to future devices.