Matrix multiplication parallelizing

Mikee · May 23, 2010, 5:23pm

Hello. I’m sorry for silly question, but i’m new in parallel programming, so i’ve got difficulties. I wrote simple triple loop that multiplies two matrixes (matrix B is transposed):

start = clock();
#pragma acc region copyin(a[0:n-1][0:n-1],b[0:n-1][0:n-1]) copyout(c[0:n-1][0:n-1])
{
	for(i=0;i<n;i++)
	{
		for(j=0;j<n;j++)
		{
			t=0;
			for(k=0;k<n;k++)
				t += (a[i][k]*b[j][k]);
			c[i][j]=t;
		}
	}
}
printf("on GPU = %f", (double) (clock() - start) / CLOCKS_PER_SEC);

Compiler writes:

C:\PGI\win32\10.4\bin>pgcc matrix.c -fast -ta=nvidia,time -Minfo
main:
     43, Generating copyout(c[:n-1][:n-1])
         Generating copyin(b[:n-1][:n-1])
         Generating copyin(a[:n-1][:n-1])
         Generating compute capability 1.0 kernel
         Generating compute capability 1.3 kernel
     45, Loop is parallelizable
         Accelerator kernel generated
         45, #pragma acc for parallel, vector(256)
     46, Loop is parallelizable
     49, Loop is parallelizable

So all loops are parallelizable and everything seems to be good, but execution times are strange, for matrixes 900x900 they are:
on GPU = 4.016000 sec
on CPU = 1.453000 sec
Please, give me advice how to optimize this code for GPU, or what is wrong with it.
Also there is strange time statistics with the key -ta=nvidia,time:

  main
    43: region entered 1 time
        time(us): total=8000000
                  kernels=3951644 data=4048356

I always got integer even “total” time like 2,4,6,8 seconds and so on, and this time is much greater than execution time. Also there is no statistics about init time, but if “total” time becomes less than 2 sec, it writes:

  main
    43: region entered 1 time
        time(us): init=0

My GPU is GeForce 9500 GT.

MatColgrove · May 27, 2010, 6:52pm

Hi Mikee,

I tried unsuccessfully to recreate your problem here. Can you please post your full source or send it to PGI Customer Support (trs@pgroup.com)?

Thanks,
Mat

Mikee · May 27, 2010, 8:06pm

Thank you Mat for the reply. I deleted all useless stuff, and here is the code with same problem:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main( int argc, char* argv[] )
{
	int n; //size of the matrix
	float **restrict a;
	float **restrict b;
	float **restrict c; //acc result matrix
	float **restrict d; //host result matrix
	int i,j,k;
	float t; //temp sum
	clock_t start;
	n = atoi( argv[1] );

	a = (float**) malloc( sizeof(float*) * n );
	a[0] = (float*) malloc( sizeof(float) * n * n );
	b = (float**) malloc( sizeof(float*) * n );
	b[0] = (float*) malloc( sizeof(float) * n * n );
	c = (float**) malloc( sizeof(float*) * n );
	c[0] = (float*) malloc( sizeof(float) * n * n );
	d = (float**) malloc( sizeof(float*) * n );
	d[0] = (float*) malloc( sizeof(float) * n * n );
	for( i = 1; i < n; i++ )
	{
		a[i] = a[i-1] + n;
		b[i] = b[i-1] + n;
		c[i] = c[i-1] + n;
		d[i] = d[i-1] + n;
	}
	
    for( i = 0; i < n; ++i )
		for(j=0;j<n;j++)
		{
			a[i][j] = (float)(0.4*i+2.3*j+1);
			b[j][i] = (float)(3*i+0.7*j+1); //transpose matrix
			c[i][j] = 0;
			d[i][j] = 0;
		}			

	start = clock();
	#pragma acc region
	{
		for(i=0;i<n;i++)
			for(j=0;j<n;j++)
			{
				t=0;
				for(k=0;k<n;k++)
					t += (a[i][k]*b[j][k]);
				c[i][j]=t;
			}
	}
	printf("time GPU = %f", (double) (clock() - start) / CLOCKS_PER_SEC);

	start = clock(); //compare on the host
	{
		for(i=0;i<n;i++)
			for(j=0;j<n;j++)
			{
				t=0;
				for(k=0;k<n;k++)
					t += (a[i][k]*b[j][k]);
				d[i][j]=t;
			}
    }
	printf("\ntime CPU = %f", (double) (clock() - start) / CLOCKS_PER_SEC);
	free(a);
	free(b);
	free(c);
	free(d);
	return 0;
}

Mikee · May 31, 2010, 8:16am

Ok, i’ve read PGIUG carefully and found the “kernel” loop scheduling clause. It is described: “The kernel clause tells the compiler that the body of this loop is to be the body of the computational kernel. Any loops contained within the kernel loop are executed sequentially on the accelerator”. So i rewrote the code to store matrixes in one-dimensional arrays. My plan was to make double loop instead of triple, and with using the “kernel” clause, force outer loop execute parallel with iterations independence and inner loop - sequentially.
Definitions of arrays are like this:

double *restrict a;
a = malloc( sizeof(double) *n*n );

Now my main loop is:

	#pragma acc data region copyin(a[0:n*n-1],b[0:n*n-1]) copyout(c[0:n*n-1]) local(t,p,r)
	#pragma acc region for kernel
	for(i=0;i<n*n;i++)
	{
		t=0;
		p=i/n;
		r=i-p*n;
		for(k=0;k<n;k++)
			t += (a[p*n+k]*b[n*r+k]);
		c[i]=t;
	}

And compiler still says:

	39, Loop is parallelizable

Where line No.39 is loop “for(k=0;k<n;k++)”
How can it parallelize this loop without sum reduction (it writes nothing about applying sum reduction)? Or phrase “Loop is parallelizable” means only possibility of parallelizing, but not that loop IS PARALLELIZED?

MatColgrove · June 1, 2010, 7:38pm

Hi Mikee,

Sorry about only getting back to this now.

How can it parallelize this loop without sum reduction (it writes nothing about applying sum reduction)? Or phrase “Loop is parallelizable” means only possibility of parallelizing, but not that loop IS PARALLELIZED?

Correct. The message means only that it is possible to parallelize the code, not that it was parallelized. The -Minfo message “Accelerator kernel generated” tells you what was actually accelerated.

The problem with the original loop is that since you’re using pointers, the compiler is only able to accelerate the outer loop. Hence, you’re only getting 900 threads and slower performance. Our compiler team is aware of the issue and hopefully can enhance our analysis so both loops are accelerated. In the meantime, moving to a single dimension is a good strategy.

Mat

Topic		Replies	Views
PGI Acc: Matrix-matrix-multiplication Legacy PGI Compilers	3	5238	September 10, 2010
MatMul with openACC Legacy PGI Compilers	7	13195	December 17, 2012
2D Matrix copy problem Legacy PGI Compilers	6	4701	October 10, 2015
How to parallelize this loop... Legacy PGI Compilers	14	7973	December 18, 2012
Loop is parallelizable Legacy PGI Compilers	2	1832	June 10, 2010
Need advices for optimizing "matrix.vector product&quot Legacy PGI Compilers	3	3317	November 23, 2016
multiple small matrices multiplication Legacy PGI Compilers	6	5776	June 22, 2015
Strangely long loop execution time Legacy PGI Compilers	5	6212	February 18, 2011
accelerator parallization issues Legacy PGI Compilers	18	26859	April 12, 2010
Six Loops iteration and reduction Legacy PGI Compilers	15	8035	March 27, 2012

Matrix multiplication parallelizing

Related topics