Hi,
I’ve recently taken an interest in using OpenACC in my code. To try it out, I’ve set up a small matrix-matrix multiplication like so:
int i, j, k, n = 4096;
float A[n][n], B[n][n], RES[n][n];

#pragma acc kernels create(A, B) copyout(RES)
{
    /* initialize A and B on the device */
    for (i = 0; i < n; ++i)
        for (j = 0; j < n; ++j) {
            A[i][j] = i + j;
            B[i][j] = j - i;
        }

    /* matrix multiplication */
    #pragma acc loop gang, vector(8)
    for (i = 0; i < n; ++i) {
        #pragma acc loop gang, vector(64)
        for (j = 0; j < n; ++j) {
            for (k = 0; k < n; ++k)
                RES[i][j] += A[i][k] * B[k][j];
        }
    }
}
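A side note while writing this up: each matrix is 4096 x 4096 floats, i.e. about 64 MB, so the three arrays come to roughly 192 MB on the device. I also notice that I never zero RES before accumulating into it with +=; since RES is only in copyout, I assume its device copy starts out uninitialized, so I would probably need something like the sketch below at the top of the kernels region (I haven’t verified whether this is related to any of the problems described here):

/* sketch only: zero RES on the device before the += accumulation */
for (i = 0; i < n; ++i)
    for (j = 0; j < n; ++j)
        RES[i][j] = 0.0f;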
Now, I understand the basic concept behind gangs, workers and vectors: a gang is a block of threads, and the vectors are the threads within that block. However, I don’t understand how this division actually maps onto the GPU. The choices of 8 and 64 are purely experimental; they just seem to give me the best results. Here are the main questions currently on my mind:
1- On what basis do I determine the best gang/vector division? (I’ve put a small sketch of how I currently picture the mapping after question 4 below.)
2- Why do certain specifications give wrong results, and how can that happen? As a simple test I set the number of gangs to 128 and the vector length to 64 on both loops. That gave me extremely fast execution (~5x speedup over my current settings), but the multiplication results were wrong.
3- When I try to multiply by the transpose of B as a way of optimizing the code, the program crashes and the GPU seems to hang for a bit:
int i, j, k, n = 4096;
float A[n][n], B[n][n], RES[n][n];

#pragma acc kernels create(A, B) copyout(RES)
{
    for (i = 0; i < n; ++i)
        for (j = 0; j < n; ++j) {
            A[i][j] = i + j;
            B[j][i] = j - i;    /* build B transposed */
        }

    #pragma acc loop gang, vector(8)
    for (i = 0; i < n; ++i) {
        #pragma acc loop gang, vector(64)
        for (j = 0; j < n; ++j) {
            for (k = 0; k < n; ++k)
                RES[i][j] += A[i][k] * B[j][k];   /* indices of B flipped */
        }
    }
}
4- What other optimization methods could I take advantage of?
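Regarding question 1, here is a small sketch of how I currently picture the mapping, based only on the compiler feedback and profiler output further down; the CUDA-style comments are my own interpretation, so please correct me if I have it wrong:

#pragma acc loop gang, vector(8)        /* outer i loop -> blockIdx.y / threadIdx.y (my reading) */
for (i = 0; i < n; ++i) {
    #pragma acc loop gang, vector(64)   /* inner j loop -> blockIdx.x / threadIdx.x (my reading) */
    for (j = 0; j < n; ++j) {
        for (k = 0; k < n; ++k)         /* k loop runs sequentially inside each thread */
            RES[i][j] += A[i][k] * B[k][j];
    }
}
/* i.e. each thread block would be 64 x 8 = 512 threads and the gangs form the
   grid, which seems to match the "block: [64x8]" in the profiler output below. */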
Here’s the output of the compilation:
  6, Generating present_or_create(B[0:][0:])
     Generating present_or_create(A[0:][0:])
     Generating present_or_copyout(RES[0:][0:])
     Generating compute capability 1.0 binary
     Generating compute capability 2.0 binary
  8, Loop is parallelizable
  9, Loop is parallelizable
     Accelerator kernel generated
      8, #pragma acc loop gang, vector(4) /* blockIdx.y threadIdx.y */
      9, #pragma acc loop gang, vector(64) /* blockIdx.x threadIdx.x */
     CC 1.0 : 16 registers; 64 shared, 8 constant, 0 local memory bytes
     CC 2.0 : 22 registers; 0 shared, 80 constant, 0 local memory bytes
 14, Loop is parallelizable
 16, Loop is parallelizable
 17, Complex loop carried dependence of 'RES' prevents parallelization
     Loop carried dependence of 'RES' prevents parallelization
     Loop carried backward dependence of 'RES' prevents vectorization
     Inner sequential loop scheduled on accelerator
     Accelerator kernel generated
     14, #pragma acc loop gang, vector(8) /* blockIdx.y threadIdx.y */
     16, #pragma acc loop gang, vector(64) /* blockIdx.x threadIdx.x */
 17, CC 1.0 : 16 registers; 92 shared, 12 constant, 4 local memory bytes
     CC 2.0 : 34 registers; 0 shared, 108 constant, 0 local memory bytes
and after running:
 6: region entered 1 time
    time(us): total=353,907 init=248,353 region=105,554
              kernels=101,665 data=3,183
    w/o init: total=105,554 max=105,554 min=105,554 avg=105,554
    9: kernel launched 1 times
       grid: [16x256] block: [64x4]
       time(us): total=421 max=421 min=421 avg=421
    17: kernel launched 1 times
       grid: [16x128] block: [64x8]
       time(us): total=101,244 max=101,244 min=101,244 avg=101,244
I’m using a Quadro FX 4800 GPU. Looking forward to your answers. Thank you.