Help understanding gang and vector specification

Hi,

I’ve recently taken an interest in using OpenACC in my code. To try it out, I’ve set up a small matrix multiplication like so:

int i, j, k, n = 4096;
float A[n][n], B[n][n], RES[n][n];

#pragma acc kernels create(A,B) copyout(RES)
{
    for (i = 0; i < n; ++i)
        for (j = 0; j < n; ++j) {
            A[i][j] = i + j;
            B[i][j] = j - i;
        }

    #pragma acc loop gang, vector(8)
    for (i = 0; i < n; ++i) {
        #pragma acc loop gang, vector(64)
        for (j = 0; j < n; ++j) {
            for (k = 0; k < n; ++k)
                RES[i][j] += A[i][k] * B[k][j];
        }
    }
}

Now I understand the basic concept behind gangs, workers, and vectors: a gang is a block of threads and the vectors are the threads within that block. However, I don’t understand how this division maps onto the GPU. The choice of 8 and 64 is purely experimental, as those values seem to give me the best results. Here are the main questions currently on my mind:

1- On what basis do I determine the best gang/vector division?

2- Why do certain specifications give wrong results, and how can that happen? As a simple test, I tried setting the number of gangs to 128 and the vectors to 64 for both loops. This gave me extremely fast execution (~5x speedup over my current settings), but the multiplication results were wrong.

3- When trying to multiply with the transpose of B as a way of optimizing the code, the program crashes and the GPU seems to halt for a bit.

int i, j, k, n = 4096;
float A[n][n], B[n][n], RES[n][n];

#pragma acc kernels create(A,B) copyout(RES)
{
    for (i = 0; i < n; ++i)
        for (j = 0; j < n; ++j) {
            A[i][j] = i + j;
            // Transpose B
            B[j][i] = j - i;
        }

    #pragma acc loop gang, vector(8)
    for (i = 0; i < n; ++i) {
        #pragma acc loop gang, vector(64)
        for (j = 0; j < n; ++j) {
            for (k = 0; k < n; ++k)
                // Flip the indices of B
                RES[i][j] += A[i][k] * B[j][k];
        }
    }
}

4- What other methods of optimization could I take advantage of?


Here’s the output of the compilation:

6, Generating present_or_create(B[0:][0:])
         Generating present_or_create(A[0:][0:])
         Generating present_or_copyout(RES[0:][0:])
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
      8, Loop is parallelizable
      9, Loop is parallelizable
         Accelerator kernel generated
          8, #pragma acc loop gang, vector(4) /* blockIdx.y threadIdx.y */
          9, #pragma acc loop gang, vector(64) /* blockIdx.x threadIdx.x */
             CC 1.0 : 16 registers; 64 shared, 8 constant, 0 local memory bytes
             CC 2.0 : 22 registers; 0 shared, 80 constant, 0 local memory bytes
     14, Loop is parallelizable
     16, Loop is parallelizable
     17, Complex loop carried dependence of 'RES' prevents parallelization
         Loop carried dependence of 'RES' prevents parallelization
         Loop carried backward dependence of 'RES' prevents vectorization
         Inner sequential loop scheduled on accelerator
         Accelerator kernel generated
         14, #pragma acc loop gang, vector(8) /* blockIdx.y threadIdx.y */
         16, #pragma acc loop gang, vector(64) /* blockIdx.x threadIdx.x */
         17, CC 1.0 : 16 registers; 92 shared, 12 constant, 4 local memory bytes
             CC 2.0 : 34 registers; 0 shared, 108 constant, 0 local memory bytes

and after running:

6: region entered 1 time
        time(us): total=353,907 init=248,353 region=105,554
                  kernels=101,665 data=3,183
        w/o init: total=105,554 max=105,554 min=105,554 avg=105,554
        9: kernel launched 1 times
            grid: [16x256]  block: [64x4]
            time(us): total=421 max=421 min=421 avg=421
        17: kernel launched 1 times
            grid: [16x128]  block: [64x8]
            time(us): total=101,244 max=101,244 min=101,244 avg=101,244

I’m using a Quadro FX 4800 GPU. Looking forward to your answers. Thank you.

Hi n54,

1- On what basis do I determine the best gang/vector division?

Personally, I let the compiler choose. This allows for better performance portability across multiple device types. In cases where you do want to change the schedule, there isn’t any methodology other than experimentation.
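For reference, here is a minimal sketch of the multiply loop nest from your post with the schedule left entirely to the compiler (this is just an illustration of the directives, not code I have run against your setup). The -Minfo=accel feedback, like the compilation output you posted, will show the gang/vector schedule it actually picked:

/* Sketch: the multiply loops with no gang/vector sizes, so the compiler
   picks a schedule for the target device.  This sits inside the same
   "acc kernels" region as before, after A and B have been filled in.
   RES is zeroed here because it is only created/copied out, never
   copied in. */
#pragma acc loop gang
for (i = 0; i < n; ++i) {
    #pragma acc loop vector
    for (j = 0; j < n; ++j) {
        RES[i][j] = 0.0f;
        for (k = 0; k < n; ++k)
            RES[i][j] += A[i][k] * B[k][j];
    }
}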

2- Why do certain specifications give wrong results, and how can that happen?

The schedule does affect code generation, so it’s possible for certain schedules to produce poor code. As for why your specific case got wrong answers, I’d need a reproducing example.
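In the meantime, a quick sanity check catches a bad schedule early: spot-check the accelerator result against a host reference before you time anything. A sketch using the names from your example (the row count and tolerance are arbitrary; it recomputes A and B from their formulas so the host copies are not needed, and it requires <stdio.h> and <math.h>):

/* Sketch: spot-check a few rows of RES against a host reference.
   Since A[i][k] = i+k and B[k][j] = j-k in your example, the reference
   is recomputed directly; checking the first few rows is usually
   enough to expose a broken schedule. */
int errors = 0;
for (i = 0; i < 8; ++i) {
    for (j = 0; j < n; ++j) {
        float ref = 0.0f;
        for (k = 0; k < n; ++k)
            ref += (float)(i + k) * (float)(j - k);
        if (fabsf(RES[i][j] - ref) > 1.0e-2f + 1.0e-3f * fabsf(ref)) {
            if (errors < 10)
                printf("mismatch at (%d,%d): got %g, expected %g\n",
                       i, j, RES[i][j], ref);
            ++errors;
        }
    }
}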

3- When trying to multiply with the transpose of B as a way of optimizing the code, the program crashes and the GPU seems to halt for a bit.

I don’t see anything obvious. Can you post the full reproducing example?

4- What other methods of optimization could I take advantage of?

Optimizing your host/device data movement is the most critical performance factor.
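For example (a sketch along the lines of your code, not something I have timed): a structured data region keeps A and B resident on the device across any number of compute regions, so only RES ever crosses the bus. Your single kernels region already achieves this for the one region it contains; the pattern matters once you have several.

/* Sketch: A and B are created on the device and never transferred;
   only RES is copied back when the data region ends.  Any number of
   compute regions inside the data region can reuse the same arrays. */
#pragma acc data create(A,B) copyout(RES)
{
    #pragma acc kernels
    {
        for (i = 0; i < n; ++i)
            for (j = 0; j < n; ++j) {
                A[i][j] = i + j;
                B[i][j] = j - i;
                RES[i][j] = 0.0f;   /* zero RES on the device as well */
            }
    }

    #pragma acc kernels
    {
        #pragma acc loop gang
        for (i = 0; i < n; ++i) {
            #pragma acc loop vector
            for (j = 0; j < n; ++j) {
                for (k = 0; k < n; ++k)
                    RES[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}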

The second most important factor is having the “vector” loop access stride-1 data (in C, that is the rightmost array index, i.e., consecutive elements within a row).

Note that one of the benefits of using OpenACC is that you don’t have to hand-optimize your code as you would in CUDA. Basically, you just need to worry about data movement and layout.

One thing you can try, though, is using a temporary scalar for the reduction. Something like:

#pragma acc loop gang collapse(2)
for (i = 0; i < n; ++i) {
    for (j = 0; j < n; ++j) {
        float sum = 0.0f;
        #pragma acc loop vector reduction(+:sum)
        for (k = 0; k < n; ++k)
            sum += A[i][k] * B[j][k];
        RES[i][j] = sum;
    }
}

The caveat is that reductions do add some overhead.

  • Mat