gang and worker

To test the gang and worker clauses, here is a small program.

#include <stdio.h>
#include <stdlib.h>
#define N 1000
#define M 1000

int main(void)
{
	int *A;

	A = (int *)malloc(N * M * sizeof(int));

	for (int i = 0; i < N * M; i++) {
		A[i] = -1;
	}

	#pragma acc kernels loop gang(100),worker(128)
	for (int i = 0; i < N * M; i++)
	{
		A[i] = i;
	}

	for (int i = 0; i < 10; i++)
		printf("A=%d\n", A[i]);
	return 0;
}

On Linux, the compiler output is:
[wcj@localhost example]$ pgcc -acc -Minfo gang.c
NOTE: your trial license will expire in 8 days, 13.7 hours.
main:
     12, Memory set idiom, loop replaced by call to __c_mset4
     17, Generating present_or_copyout(A[0:1000000])
         Generating NVIDIA code
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
     18, Loop is parallelizable
         Accelerator kernel generated
         18, #pragma acc loop gang(100), vector(128) /* blockIdx.x threadIdx.x */
And the execution output:
[wcj@localhost example]$ ./a.out
A=0
A=1
A=2
A=3
A=4
A=5
A=6
A=7
A=8
A=9

Accelerator Kernel Timing data
/home/wcj/Yunio/openacc/example/gang.c
  main  NVIDIA  devicenum=0
        time(us): 703
        18: kernel launched 1 times
            grid: [7813]  block: [128]
             device time(us): total=80 max=80 min=80 avg=80
            elapsed time(us): total=96 max=96 min=96 avg=96
        27: data copyout reached 1 times
             device time(us): total=623 max=623 min=623 avg=623
My question is: why is the grid size (7813) not equal to the value I set in the gang clause (100)?
I also tested the program on Windows using PGI Workstation. There, the execution output shows a grid size equal to the value set in the gang clause.
Why does the grid size differ between the two systems? Maybe it is a bug in the Linux version of the PGI compiler?


Maybe it is a bug in the Linux version of the PGI compiler?

Yes, it’s a known compiler issue (TPR#19149) that’s expected to be fixed in next month’s (May 2013) 13.5 release.

Note that you should be using “vector” instead of “worker” since “worker” corresponds to the warp size which is fixed on NVIDIA GPUs.

  • Mat
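As a minimal sketch of the suggestion above, the loop could be written with “vector” in place of “worker”. The function name fill_indices and the explicit copyout clause are assumptions added for illustration; the original program relies on the compiler to generate the data movement for a locally allocated array.

```c
/* Hypothetical helper (not from the original post): fills a[0..n-1]
 * with the index values, asking for 100 gangs of 128 vector lanes.
 * The copyout clause is an assumption added so the compiler knows the
 * array bounds when the array arrives as a pointer parameter. */
void fill_indices(int *a, int n)
{
    #pragma acc kernels loop gang(100), vector(128) copyout(a[0:n])
    for (int i = 0; i < n; i++) {
        a[i] = i;
    }
}
```

When built without OpenACC support the pragma is ignored and the loop runs serially on the host, so the result is the same either way.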

Thanks, Mat.

FYI, I’ve confirmed that 13.5 will give you the correct gang size:

% pgcc -acc -Minfo=accel -V13.5 uestc0626.c 
main:
     16, Generating present_or_copyout(A[0:1000000])
         Generating NVIDIA code
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
     17, Loop is parallelizable
         Accelerator kernel generated
         17, #pragma acc loop gang(100), vector(128) /* blockIdx.x threadIdx.x */
% setenv PGI_ACC_TIME 1
% a.out
A=0
A=1
A=2
A=3
A=4
A=5
A=6
A=7
A=8
A=9

Accelerator Kernel Timing data
uestc0626.c
  main  NVIDIA  devicenum=0
        time(us): 381
        17: kernel launched 1 times
            grid: [100]  block: [128]
             device time(us): total=53 max=53 min=53 avg=53
            elapsed time(us): total=68 max=68 min=68 avg=68
        22: data copyout reached 1 times
             device time(us): total=328 max=328 min=328 avg=328
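The two grid sizes reported in this thread can be checked with a little arithmetic. The sketch below assumes that the affected Linux build ignored gang(100) and launched one 128-thread block per 128 iterations; that is an inference from the reported numbers, not something the logs state. The function names are hypothetical.

```c
/* Number of blocks needed so that blocks * block_size >= n,
 * i.e. one thread per loop iteration (ceiling division). */
int blocks_to_cover(int n, int block_size)
{
    return (n + block_size - 1) / block_size;
}

/* Largest number of iterations a single thread handles when n
 * iterations are spread evenly over gangs * vector_len threads.
 * Even spreading is an assumption; the logs do not show the schedule. */
int max_iterations_per_thread(int n, int gangs, int vector_len)
{
    int threads = gangs * vector_len;
    return (n + threads - 1) / threads;
}
```

For the program above, blocks_to_cover(1000 * 1000, 128) evaluates to 7813, which matches the grid: [7813] line from the first run. With gang(100) honored, as in the 13.5 run, there are 100 * 128 = 12800 threads, so max_iterations_per_thread(1000 * 1000, 100, 128) evaluates to 79.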