Contiguous, flattened multidimensional array

Hi,

To be sure that all arrays are contiguous and accessed as fast as possible, I want to flatten all arrays in my program.
Before implementing it, I've done a test with a 5D array.

#include<stdio.h>
#include<stdlib.h>

#define flat5(m0, u4, u3, u2, u1, l5, l4, l3, l2, l1, i5, i4, i3, i2, i1) m0[(i5-l5) * (u4-l4) * (u3-l3) * (u2-l2) * (u1-l1) + (i4-l4) * (u3-l3) * (u2-l2) * (u1-l1) + (i3-l3) * (u2-l2) * (u1-l1) + (i2-l2) * (u1-l1) + (i1-l1)]

int l5, l4, l3, l2, l1, u5, u4, u3, u2, u1;

void prin(int *m, int u5, int u4, int u3, int u2, int u1, int l5, int l4, int l3, int l2, int l1) {
	for(int i5 = l5; i5 < u5; i5++) {
		for(int i4 = l4; i4 < u4; i4++) {
	 		for(int i3 = l3; i3 < u3; i3++) {
	 			 for(int i2 = l2; i2 < u2; i2++) {
	 			 	for(int i1 = l1; i1 < u1; i1++) {
						printf("\t");
						printf("%d", flat5(m, u4, u3, u2, u1, l5, l4, l3, l2, l1, i5, i4, i3, i2, i1));
					}
					printf("\n");
				}
				printf("    %d\n", i3);
			}
			printf("   %d ----------------\n", i4);
		}
		printf("%d ------------------------------\n", i5);
	}		
}
#pragma acc routine // line26
void gpu(int *restrict m, int u4, int u3, int u2, int u1, int l5, int l4, int l3, int l2, int l1, int i5, int i4, int i3, int i2, int i1) { // line 27
//void gpu(int *restrict m, int i5, int i4, int i3, int i2, int i1) { //line 28
	flat5(m, u4, u3, u2, u1, l5, l4, l3, l2, l1, i5, i4, i3, i2, i1) = 10000*i5 + 1000*i4 + 100*i3 + 10*i2 + i1;
}

int main(void) {
	l5 = 1, l4 = 1, l3 = 2, l2 = 1, l1 = 0;  //lower limits
	u5 = 3, u4 = 2, u3 = 4, u2 = 5, u1 = 3;  //upper limits

	int *m = (int *)calloc((u5 - l5) * (u4 - l4) * (u3 - l3) * (u2 - l2) * (u1 - l1), sizeof(int));
	
	#pragma acc data copyout(m[0:(u5 - l5) * (u4 - l4) * (u3 - l3) * (u2 - l2) * (u1 - l1)])
	{
		#pragma acc parallel loop gang collapse(2)
		for(int i5 = l5; i5 < u5; i5++)
	 		for(int i4 = l4; i4 < u4; i4++)
	 			#pragma acc loop worker collapse(2)
	 			for(int i3 = l3; i3 < u3; i3++)
	 				for(int i2 = l2; i2 < u2; i2++)
	 					#pragma acc loop vector
	 			 		for(int i1 = l1; i1 < u1; i1++)
	 			 				gpu(m, u4, u3, u2, u1, l5, l4, l3, l2, l1, i5, i4, i3, i2, i1); //line 48
	 			 				//gpu(m, i5, i4, i3, i2, i1); // line 49
	}
  						

	//prin(m, u5, u4, u3, u2, u1, l5, l4, l3, l2, l1);

	free(m);
	return 0;
}

first case - OpenACC
lines 26, 27, 48 uncommented and lines 28, 49 commented

gpu(int *, int, int, int, int, int, int, int, int, int, int, int, int, int, int):
     33, Generating acc routine seq
         Generating Tesla code
main:
     47, Generating copyout(m[:(u1-l1)*((u2-l2)*((u3-l3)*((u4-l4)*(u5-l5))))])
         Accelerator kernel generated
         Generating Tesla code
         49, #pragma acc loop gang collapse(2) /* blockIdx.x */
         50,   /* blockIdx.x collapsed */
         52, #pragma acc loop worker(4) collapse(2) /* threadIdx.y */
         53,   /* threadIdx.y collapsed */
         55, #pragma acc loop vector(32) /* threadIdx.x */
     52, Loop is parallelizable
     53, Loop is parallelizable
     55, Loop is parallelizable

Question_1
Do I need to pass the limits u4, u3 … l2, l1 to the function when they are declared globally?
Aren't they firstprivate in "#pragma acc data" by default?
Passing over 30 parameters to every function (in my program) is a bit too much.

Question_2
I've tried "#pragma acc declare". I saw some posts saying it's not ready yet; does it work in version 17.10?

second case - (it runs only on the CPU)
lines 26, 27, 48 commented and lines 28, 49 uncommented
Question_3
Only "Generating Tesla code" at line 33 is missing from the output, which tells me it's running on the CPU, am I right?

Question_1
Do I need to pass the limits u4, u3 … l2, l1 to the function when they are declared globally?

Not necessarily. You could put these global scalars into a “declare create” directive in which case they would have a global reference on the device that could then be accessed within device routines. Be sure to use an “update” directive to set the device values after you have updated the host copies.

Are they not firstprivate in “#pragma acc data” by default?

Scalars are firstprivate by default, but this only applies to scalars used within the compute kernels. To have global data accessed within a device routine, you need a device-global reference to it (via the declare directive).

Question_2
I've tried "#pragma acc declare". I saw some posts saying it's not ready yet; does it work in version 17.10?

It does work in 17.10 (see the example below). I think I know the post you're referring to; I can't remember the details of the problem, but I think it's unrelated to your case.

Question_3
Only "Generating Tesla code" at line 33 is missing from the output, which tells me it's running on the CPU, am I right?

“Generating Tesla code” is saying that the compiler generated code that may run on a Tesla device. If it’s missing, then most likely no GPU code was created.

Here's your example using "declare create" on the global variables. Note that while the code will be offloaded to the GPU, some of the loops are not parallelized due to the message "Loop without integer trip count will be executed in sequential mode".

The trip count, i.e. the number of loop iterations, needs to be known when entering the loop. Here, however, you pass an integer array to a device routine and use global integers for the loop bounds. Since the compiler has no visibility into what the device routine is doing, it must assume that the loop-bound variables could be changed inside the routine (unlikely, but possible), therefore making the loops uncountable. This was the case for your original example as well.

% pgcc tasica.2.c -ta=tesla:cc60 -Minfo=accel -Msafeptr
gpu:
     28, Generating acc routine seq
         Generating Tesla code
main:
     39, Generating update device(l5,u1,u2,u3,u4,u5,l1,l2,l3,l4)
         Generating copyout(m[:(u1-l1)*((u2-l2)*((u3-l3)*((u4-l4)*(u5-l5))))])
     41, Accelerator kernel generated
         Generating Tesla code
         42, #pragma acc loop seq collapse(2)
         43,   /* collapsed */
         45, #pragma acc loop seq collapse(2)
         46,   /* collapsed */
         48, #pragma acc loop vector(128) /* threadIdx.x */
     43, Loop without integer trip count will be executed in sequential mode
     45, Loop is parallelizable
     46, Loop without integer trip count will be executed in sequential mode
     48, Loop is parallelizable

The workaround is to inline the routine so the compiler can see what it is doing. Alternatively, you can assign the global variables to local variables and use the locals as the loop bounds.

% cat tasica.2.c
#include<stdio.h>
#include<stdlib.h>

#define flat5(m0, u4, u3, u2, u1, l5, l4, l3, l2, l1, i5, i4, i3, i2, i1) m0[(i5-l5) * (u4-l4) * (u3-l3) * (u2-l2) * (u1-l1) + (i4-l4) * (u3-l3) * (u2-l2) * (u1-l1) + (i3-l3) * (u2-l2) * (u1-l1) + (i2-l2) * (u1-l1) + (i1-l1)]

int l5, l4, l3, l2, l1, u5, u4, u3, u2, u1;
#pragma acc declare create( l5, l4, l3, l2, l1, u5, u4, u3, u2, u1)

 void prin(int *m) {
    for(int i5 = l5; i5 < u5; i5++) {
       for(int i4 = l4; i4 < u4; i4++) {
           for(int i3 = l3; i3 < u3; i3++) {
               for(int i2 = l2; i2 < u2; i2++) {
                  for(int i1 = l1; i1 < u1; i1++) {
                   printf("\t");
                   printf("%d", flat5(m, u4, u3, u2, u1, l5, l4, l3, l2, l1, i5, i4, i3, i2, i1));
                }
                printf("\n");
             }
             printf("    %d\n", i3);
          }
          printf("   %d ----------------\n", i4);
       }
       printf("%d ------------------------------\n", i5);
    }
 }
 #pragma acc routine // line26
 void gpu(int *restrict m, int i5, int i4, int i3, int i2, int i1) { // line 27
    flat5(m, u4, u3, u2, u1, l5, l4, l3, l2, l1, i5, i4, i3, i2, i1) = 10000*i5 + 1000*i4 + 100*i3 + 10*i2 + i1;
 }

 int main(void) {
    l5 = 1, l4 = 1, l3 = 2, l2 = 1, l1 = 0;  //lower limits
    u5 = 3, u4 = 2, u3 = 4, u2 = 5, u1 = 3;  //upper limits

    int * restrict m = (int *)calloc((u5 - l5) * (u4 - l4) * (u3 - l3) * (u2 - l2) * (u1 - l1), sizeof(int));
#pragma acc update device( l5, l4, l3, l2, l1, u5, u4, u3, u2, u1)

    #pragma acc data copyout(m[0:(u5 - l5) * (u4 - l4) * (u3 - l3) * (u2 - l2) * (u1 - l1)])
    {
       #pragma acc parallel loop gang collapse(2)
       for(int i5 = l5; i5 < u5; i5++)
           for(int i4 = l4; i4 < u4; i4++)
              #pragma acc loop worker collapse(2)
              for(int i3 = l3; i3 < u3; i3++)
                 for(int i2 = l2; i2 < u2; i2++)
                    #pragma acc loop vector
                     for(int i1 = l1; i1 < u1; i1++)
                           gpu(m, i5, i4, i3, i2, i1); //line 48
    }


    prin(m);
    free(m);
    return 0;
 }
% pgcc tasica.2.c -ta=tesla:cc60 -Minfo=accel -Minline
gpu:
     28, Generating acc routine seq
         Generating Tesla code
main:
     39, Generating update device(l5,u1,u2,u3,u4,u5,l1,l2,l3,l4)
         Generating copyout(m[:(u1-l1)*((u2-l2)*((u3-l3)*((u4-l4)*(u5-l5))))])
     41, Accelerator kernel generated
         Generating Tesla code
         42, #pragma acc loop gang collapse(2) /* blockIdx.x */
         43,   /* blockIdx.x collapsed */
         45, #pragma acc loop worker(4) collapse(2) /* threadIdx.y */
         46,   /* threadIdx.y collapsed */
         48, #pragma acc loop vector(32) /* threadIdx.x */
     45, Loop is parallelizable
     46, Loop is parallelizable
     48, Loop is parallelizable
hsw8:/scratch/colgrove% a.out
        11210   11211   11212
        11220   11221   11222
        11230   11231   11232
        11240   11241   11242
    2
        11310   11311   11312
        11320   11321   11322
        11330   11331   11332
        11340   11341   11342
    3
   1 ----------------
1 ------------------------------
        21210   21211   21212
        21220   21221   21222
        21230   21231   21232
        21240   21241   21242
    2
        21310   21311   21312
        21320   21321   21322
        21330   21331   21332
        21340   21341   21342
    3
   1 ----------------
2 ------------------------------

Hi,

thanks for the tips, I have implemented them in my program and it works!!! I am very happy :-)
For speed, I have manually inlined almost everything related to the device…
I will check every line of the output and come back with questions.
I think there is still a lot to do, but the good news is that both device functions show "Generating Tesla code".


UPDATE
For beginners like me, I can recommend 3 things to boost the speed:
- flatten all arrays (maybe contiguous has the same results, I don't know)
- create all arrays on the device right at the beginning of the program and then update them
- manually inline the device code