Help understanding gang and vector specification

Hi,

I’ve recently taken an interest in using OpenACC in my code. To try it out, I’ve set up a small matrix multiplication like so:

int i, j, k, n = 4096;
float A[n][n], B[n][n], RES[n][n];

#pragma acc kernels create(A,B) copyout(RES)
{
    for (i = 0; i < n; ++i)
        for (j = 0; j < n; ++j) {
            A[i][j] = i + j;
            B[i][j] = j - i;
        }

    #pragma acc loop gang, vector(8)
    for (i = 0; i < n; ++i) {
        #pragma acc loop gang, vector(64)
        for (j = 0; j < n; ++j) {
            for (k = 0; k < n; ++k)
                RES[i][j] += A[i][k] * B[k][j];
        }
    }
}

Now I understand the basic concept behind gangs, workers, and vectors: a gang is a block of threads and the vectors are the threads within that block. However, I don’t understand how this division maps onto the GPU. The choice of 8 and 64 is purely experimental, as those values seem to give me the best results. Here are the main questions currently on my mind:

1- On what basis do I determine the best gang/vector division?

2- Why do certain specifications give wrong results, and how can that happen? As a simple test, I tried setting the number of gangs to 128 and the vectors to 64 for both loops. This gave me extremely fast execution (~5x speedup over my current settings), but the multiplication results were wrong.

3- When trying to multiply with the transpose of B as a way of optimizing the code, the program crashes and the GPU seems to halt for a bit.

int i, j, k, n = 4096;
float A[n][n], B[n][n], RES[n][n];

#pragma acc kernels create(A,B) copyout(RES)
{
    for (i = 0; i < n; ++i)
        for (j = 0; j < n; ++j) {
            A[i][j] = i + j;
            // Transpose B
            B[j][i] = j - i;
        }

    #pragma acc loop gang, vector(8)
    for (i = 0; i < n; ++i) {
        #pragma acc loop gang, vector(64)
        for (j = 0; j < n; ++j) {
            for (k = 0; k < n; ++k)
                // Flip the indices of B
                RES[i][j] += A[i][k] * B[j][k];
        }
    }
}

4- What other methods of optimization could I take advantage of?


Here’s the output of the compilation:

6, Generating present_or_create(B[0:][0:])
         Generating present_or_create(A[0:][0:])
         Generating present_or_copyout(RES[0:][0:])
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
      8, Loop is parallelizable
      9, Loop is parallelizable
         Accelerator kernel generated
          8, #pragma acc loop gang, vector(4) /* blockIdx.y threadIdx.y */
          9, #pragma acc loop gang, vector(64) /* blockIdx.x threadIdx.x */
             CC 1.0 : 16 registers; 64 shared, 8 constant, 0 local memory bytes
             CC 2.0 : 22 registers; 0 shared, 80 constant, 0 local memory bytes
     14, Loop is parallelizable
     16, Loop is parallelizable
     17, Complex loop carried dependence of 'RES' prevents parallelization
         Loop carried dependence of 'RES' prevents parallelization
         Loop carried backward dependence of 'RES' prevents vectorization
         Inner sequential loop scheduled on accelerator
         Accelerator kernel generated
         14, #pragma acc loop gang, vector(8) /* blockIdx.y threadIdx.y */
         16, #pragma acc loop gang, vector(64) /* blockIdx.x threadIdx.x */
         17, CC 1.0 : 16 registers; 92 shared, 12 constant, 4 local memory bytes
             CC 2.0 : 34 registers; 0 shared, 108 constant, 0 local memory bytes

and after running:

6: region entered 1 time
        time(us): total=353,907 init=248,353 region=105,554
                  kernels=101,665 data=3,183
        w/o init: total=105,554 max=105,554 min=105,554 avg=105,554
        9: kernel launched 1 times
            grid: [16x256]  block: [64x4]
            time(us): total=421 max=421 min=421 avg=421
        17: kernel launched 1 times
            grid: [16x128]  block: [64x8]
            time(us): total=101,244 max=101,244 min=101,244 avg=101,244

I’m using a Quadro FX 4800 GPU. Looking forward to your answers. Thank you.

Hi n54,

1- On what basis do I determine the best gang/vector division?

Personally, I let the compiler choose. This allows for better performance portability across multiple device types. In cases where you do want to change the schedule, there isn’t any methodology other than experimentation.
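For reference, here is a minimal sketch of the multiply loop nest from your post with the schedule left entirely to the compiler (this is just an illustration of the directives, not code I have run against your setup). The -Minfo=accel feedback, like the compilation output you posted, will show the gang/vector schedule it actually picked:

/* Sketch: the multiply loops with no gang/vector sizes, so the compiler
   picks a schedule for the target device.  This sits inside the same
   "acc kernels" region as before, after A and B have been filled in.
   RES is zeroed here because it is only created/copied out, never
   copied in. */
#pragma acc loop gang
for (i = 0; i < n; ++i) {
    #pragma acc loop vector
    for (j = 0; j < n; ++j) {
        RES[i][j] = 0.0f;
        for (k = 0; k < n; ++k)
            RES[i][j] += A[i][k] * B[k][j];
    }
}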

2- Why do certain specifications give wrong results, and how can that happen?

The schedule does affect code generation, so it’s possible for certain schedules to produce poor code. As for why your specific case got wrong answers, I’d need a reproducing example.
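In the meantime, a quick sanity check catches a bad schedule early: spot-check the accelerator result against a host reference before you time anything. A sketch using the names from your example (the row count and tolerance are arbitrary; it recomputes A and B from their formulas so the host copies are not needed, and it requires <stdio.h> and <math.h>):

/* Sketch: spot-check a few rows of RES against a host reference.
   Since A[i][k] = i+k and B[k][j] = j-k in your example, the reference
   is recomputed directly; checking the first few rows is usually
   enough to expose a broken schedule. */
int errors = 0;
for (i = 0; i < 8; ++i) {
    for (j = 0; j < n; ++j) {
        float ref = 0.0f;
        for (k = 0; k < n; ++k)
            ref += (float)(i + k) * (float)(j - k);
        if (fabsf(RES[i][j] - ref) > 1.0e-2f + 1.0e-3f * fabsf(ref)) {
            if (errors < 10)
                printf("mismatch at (%d,%d): got %g, expected %g\n",
                       i, j, RES[i][j], ref);
            ++errors;
        }
    }
}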

3- When trying to multiply with the transpose of B as a way of optimizing the code, the program crashes and the GPU seems to halt for a bit.

I don’t see anything obvious. Can you post the full reproducing example?

4- What other methods of optimization could I take advantage of?

Optimizing your host/device data movement is the most critical performance factor.
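For example (a sketch along the lines of your code, not something I have timed): a structured data region keeps A and B resident on the device across any number of compute regions, so only RES ever crosses the bus. Your single kernels region already achieves this for the one region it contains; the pattern matters once you have several.

/* Sketch: A and B are created on the device and never transferred;
   only RES is copied back when the data region ends.  Any number of
   compute regions inside the data region can reuse the same arrays. */
#pragma acc data create(A,B) copyout(RES)
{
    #pragma acc kernels
    {
        for (i = 0; i < n; ++i)
            for (j = 0; j < n; ++j) {
                A[i][j] = i + j;
                B[i][j] = j - i;
                RES[i][j] = 0.0f;   /* zero RES on the device as well */
            }
    }

    #pragma acc kernels
    {
        #pragma acc loop gang
        for (i = 0; i < n; ++i) {
            #pragma acc loop vector
            for (j = 0; j < n; ++j) {
                for (k = 0; k < n; ++k)
                    RES[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}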

The second most important factor is having the “vector” loop access stride-1 data (in C, that is the rightmost array index, i.e., consecutive elements within a row).

Note that one of the benefits of using OpenACC is that you don’t have to hand-optimize your code as you would in CUDA. Basically, you just need to worry about data movement and layout.

One thing you can try, though, is using a temporary scalar for the reduction. Something like:

#pragma acc loop gang collapse(2)
for (i = 0; i < n; ++i) {
    for (j = 0; j < n; ++j) {
        float sum = 0.0f;
        #pragma acc loop vector reduction(+:sum)
        for (k = 0; k < n; ++k)
            sum += A[i][k] * B[j][k];
        RES[i][j] = sum;
    }
}

The caveat is that reductions do add some overhead.

  • Mat