Matrix multiplication parallelizing

Hello. I’m sorry for silly question, but i’m new in parallel programming, so i’ve got difficulties. I wrote simple triple loop that multiplies two matrixes (matrix B is transposed):

start = clock();
#pragma acc region copyin(a[0:n-1][0:n-1],b[0:n-1][0:n-1]) copyout(c[0:n-1][0:n-1])
{
	for(i=0;i<n;i++)
	{
		for(j=0;j<n;j++)
		{
			t=0;
			for(k=0;k<n;k++)
				t += (a[i][k]*b[j][k]);
			c[i][j]=t;
		}
	}
}
printf("on GPU = %f", (double) (clock() - start) / CLOCKS_PER_SEC);

Compiler writes:

C:\PGI\win32\10.4\bin>pgcc matrix.c -fast -ta=nvidia,time -Minfo
main:
     43, Generating copyout(c[:n-1][:n-1])
         Generating copyin(b[:n-1][:n-1])
         Generating copyin(a[:n-1][:n-1])
         Generating compute capability 1.0 kernel
         Generating compute capability 1.3 kernel
     45, Loop is parallelizable
         Accelerator kernel generated
         45, #pragma acc for parallel, vector(256)
     46, Loop is parallelizable
     49, Loop is parallelizable

So all loops are parallelizable and everything seems to be good, but execution times are strange, for matrixes 900x900 they are:
on GPU = 4.016000 sec
on CPU = 1.453000 sec
Please, give me advice how to optimize this code for GPU, or what is wrong with it.
Also there is strange time statistics with the key -ta=nvidia,time:

  main
    43: region entered 1 time
        time(us): total=8000000
                  kernels=3951644 data=4048356

I always got integer even “total” time like 2,4,6,8 seconds and so on, and this time is much greater than execution time. Also there is no statistics about init time, but if “total” time becomes less than 2 sec, it writes:

  main
    43: region entered 1 time
        time(us): init=0

My GPU is GeForce 9500 GT.

Hi Mikee,

I tried unsuccessfully to recreate your problem here. Can you please post your full source or send it to PGI Customer Support (trs@pgroup.com)?

Thanks,
Mat

Thank you Mat for the reply. I deleted all useless stuff, and here is the code with same problem:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main( int argc, char* argv[] )
{
	int n; //size of the matrix
	float **restrict a;
	float **restrict b;
	float **restrict c; //acc result matrix
	float **restrict d; //host result matrix
	int i,j,k;
	float t; //temp sum
	clock_t start;
	n = atoi( argv[1] );

	a = (float**) malloc( sizeof(float*) * n );
	a[0] = (float*) malloc( sizeof(float) * n * n );
	b = (float**) malloc( sizeof(float*) * n );
	b[0] = (float*) malloc( sizeof(float) * n * n );
	c = (float**) malloc( sizeof(float*) * n );
	c[0] = (float*) malloc( sizeof(float) * n * n );
	d = (float**) malloc( sizeof(float*) * n );
	d[0] = (float*) malloc( sizeof(float) * n * n );
	for( i = 1; i < n; i++ )
	{
		a[i] = a[i-1] + n;
		b[i] = b[i-1] + n;
		c[i] = c[i-1] + n;
		d[i] = d[i-1] + n;
	}
	
    for( i = 0; i < n; ++i )
		for(j=0;j<n;j++)
		{
			a[i][j] = (float)(0.4*i+2.3*j+1);
			b[j][i] = (float)(3*i+0.7*j+1); //transpose matrix
			c[i][j] = 0;
			d[i][j] = 0;
		}			

	start = clock();
	#pragma acc region
	{
		for(i=0;i<n;i++)
			for(j=0;j<n;j++)
			{
				t=0;
				for(k=0;k<n;k++)
					t += (a[i][k]*b[j][k]);
				c[i][j]=t;
			}
	}
	printf("time GPU = %f", (double) (clock() - start) / CLOCKS_PER_SEC);

	start = clock(); //compare on the host
	{
		for(i=0;i<n;i++)
			for(j=0;j<n;j++)
			{
				t=0;
				for(k=0;k<n;k++)
					t += (a[i][k]*b[j][k]);
				d[i][j]=t;
			}
    }
	printf("\ntime CPU = %f", (double) (clock() - start) / CLOCKS_PER_SEC);
	free(a);
	free(b);
	free(c);
	free(d);
	return 0;
}

Ok, i’ve read PGIUG carefully and found the “kernel” loop scheduling clause. It is described: “The kernel clause tells the compiler that the body of this loop is to be the body of the computational kernel. Any loops contained within the kernel loop are executed sequentially on the accelerator”. So i rewrote the code to store matrixes in one-dimensional arrays. My plan was to make double loop instead of triple, and with using the “kernel” clause, force outer loop execute parallel with iterations independence and inner loop - sequentially.
Definitions of arrays are like this:

double *restrict a;
a = malloc( sizeof(double) *n*n );

Now my main loop is:

	#pragma acc data region copyin(a[0:n*n-1],b[0:n*n-1]) copyout(c[0:n*n-1]) local(t,p,r)
	#pragma acc region for kernel
	for(i=0;i<n*n;i++)
	{
		t=0;
		p=i/n;
		r=i-p*n;
		for(k=0;k<n;k++)
			t += (a[p*n+k]*b[n*r+k]);
		c[i]=t;
	}

And compiler still says:

	39, Loop is parallelizable

Where line No.39 is loop “for(k=0;k<n;k++)”
How can it parallelize this loop without sum reduction (it writes nothing about applying sum reduction)? Or phrase “Loop is parallelizable” means only possibility of parallelizing, but not that loop IS PARALLELIZED?

Hi Mikee,

Sorry about only getting back to this now.

How can it parallelize this loop without sum reduction (it writes nothing about applying sum reduction)? Or phrase “Loop is parallelizable” means only possibility of parallelizing, but not that loop IS PARALLELIZED?

Correct. The message means only that it is possible to parallelize the code, not that it was parallelized. The -Minfo message “Accelerator kernel generated” tells you what was actually accelerated.

The problem with the original loop is that since you’re using pointers, the compiler is only able to accelerate the outer loop. Hence, you’re only getting 900 threads and slower performance. Our compiler team is aware of the issue and hopefully can enhance our analysis so both loops are accelerated. In the meantime, moving to a single dimension is a good strategy.

  • Mat