Hello. I’m sorry for the silly question, but I’m new to parallel programming, so I’ve run into some difficulties. I wrote a simple triple loop that multiplies two matrices (matrix B is stored transposed):
start = clock();
#pragma acc region copyin(a[0:n-1][0:n-1],b[0:n-1][0:n-1]) copyout(c[0:n-1][0:n-1])
{
    for(i=0;i<n;i++)
    {
        for(j=0;j<n;j++)
        {
            t=0;
            for(k=0;k<n;k++)
                t += (a[i][k]*b[j][k]);
            c[i][j]=t;
        }
    }
}
printf("on GPU = %f", (double) (clock() - start) / CLOCKS_PER_SEC);
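For comparison, the same computation on the CPU alone can be sketched as below. This is a minimal sketch, not the exact program from this post: the function name matmul_bt, the float element type, and the flat row-major arrays are assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical helper: computes c = a * B for n x n matrices, where
   bt holds B already transposed, so bt[j*n + k] is B[k][j].
   All matrices are flat, row-major float arrays (an assumption). */
void matmul_bt(size_t n, const float *a, const float *bt, float *c)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            float t = 0.0f;
            /* Both operands are read along a row, so the inner loop
               walks contiguous memory in a and in bt. */
            for (size_t k = 0; k < n; k++)
                t += a[i * n + k] * bt[j * n + k];
            c[i * n + j] = t;
        }
    }
}
```

Calling matmul_bt(n, a, bt, c) between two clock() calls gives a CPU time to set against the GPU figure.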
The compiler reports:
C:\PGI\win32\10.4\bin>pgcc matrix.c -fast -ta=nvidia,time -Minfo
main:
     43, Generating copyout(c[:n-1][:n-1])
         Generating copyin(b[:n-1][:n-1])
         Generating copyin(a[:n-1][:n-1])
         Generating compute capability 1.0 kernel
         Generating compute capability 1.3 kernel
     45, Loop is parallelizable
         Accelerator kernel generated
         45, #pragma acc for parallel, vector(256)
     46, Loop is parallelizable
     49, Loop is parallelizable
So all the loops are parallelizable and everything seems fine, but the execution times are strange. For 900x900 matrices they are:
on GPU = 4.016000 sec
on CPU = 1.453000 sec
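To put these two times on a common scale, they can be converted to an achieved floating-point rate. The sketch below assumes the usual count of 2*n^3 operations for an n x n product (one multiply and one add per inner iteration); the helper names matmul_flops and gflops are hypothetical.

```c
/* Operation count for an n x n by n x n product:
   n^3 multiplies plus n^3 adds (an assumed cost model). */
double matmul_flops(int n)
{
    return 2.0 * n * n * n;
}

/* Achieved rate in GFLOP/s for a run that took `seconds`. */
double gflops(int n, double seconds)
{
    return matmul_flops(n) / seconds / 1e9;
}
```

With n = 900 this gives roughly 1.0 GFLOP/s for the 1.453 s CPU run and about 0.36 GFLOP/s for the 4.016 s GPU run, so the GPU version is about 2.8 times slower overall.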
Please advise me on how to optimize this code for the GPU, or tell me what is wrong with it.
The timing statistics printed with the -ta=nvidia,time option also look strange:
main
    43: region entered 1 time
        time(us): total=8000000
                  kernels=3951644 data=4048356
The “total” time is always an even whole number of seconds (2, 4, 6, 8 and so on), and it is much greater than the execution time I measure with clock(). There are also no statistics for the init time, except that when the “total” time drops below 2 seconds, the report shows only:
main
    43: region entered 1 time
        time(us): init=0
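As a sanity check on the first report, the kernel and data figures can be added up and converted to seconds. The sketch below only does arithmetic on the numbers printed in that report; the helper names us_to_seconds and print_breakdown are hypothetical.

```c
#include <stdio.h>

/* Figures copied from the first -ta=nvidia,time report (microseconds). */
static const long kernels_us = 3951644;
static const long data_us    = 4048356;
static const long total_us   = 8000000;

/* Convert a microsecond count to seconds. */
double us_to_seconds(long us)
{
    return (double)us / 1e6;
}

/* Print each part of the report in seconds, plus the sum of the parts. */
void print_breakdown(void)
{
    printf("kernels = %.3f s\n", us_to_seconds(kernels_us));
    printf("data    = %.3f s\n", us_to_seconds(data_us));
    printf("sum     = %.3f s\n", us_to_seconds(kernels_us + data_us));
    printf("total   = %.3f s\n", us_to_seconds(total_us));
}
```

The two parts sum exactly to the reported 8000000 us, that is about 3.95 s in kernels and 4.05 s in data movement, and this 8 s total is the figure that disagrees with the 4.016 s measured by clock().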
My GPU is a GeForce 9500 GT.