OpenACC Gang-Vector Performance

Hi,

In an application I have two nested for-loops.
The outer loop has n=128 iterations, the inner one k=512000.
I measured the best performance when I parallelised the outer one with gang and the inner one with vector.
However, when I skip the first m iterations of the outer loop, I see performance behaviour that I don’t understand:
For m=0 and m=1 the time for the whole kernel is about the same.
For m = 2 the time drops to about half and stays roughly constant for m = 3, …, m = 127.

Here is runnable code:

#include <stdio.h>
#include <math.h>
#include <omp.h>

int main () {
	const int n = 128;
	const int k = 512000;

	/* mat is about 500 MB, so this needs a large stack limit
	   (e.g. ulimit -s unlimited) or a switch to heap allocation */
	double mat[n * k];
	double res[n];
	int i;

	for (i = 0; i < n * k; i++) {
		mat[i] = 2.1337;
	}


	#pragma acc data copyout(res[0:n]) copyin(mat[0:n*k])
	{
	int m;

	for (m = 0; m < n; m++) {

		double start = omp_get_wtime();

		#pragma acc parallel present(res[0:n])
		#pragma acc loop gang
		for (i = m; i < n; i++) {
			int j;
			double sum = 0.0;

			#pragma acc loop vector reduction(+:sum)
			for (j = 0; j < k; j++) {
				sum += pow(mat[i * k + j], i);
			}

			res[i] = sum;
		}

		double end = omp_get_wtime();

		printf("m = %d, time = %fms\n", m, (end - start) * 1.0e3);
	}

	} /* acc data */

	return 0;
}

In a similar CUDA implementation the time decreases with increasing m, as I would expect.
Why does OpenACC behave differently here?

Thanks,
Fabian

Hi Fabian,

Here’s what I’m seeing. By default with 15.5, the times decrease steadily:

% pgcc -acc -Minfo=accel main.c -V15.5
main:
     21, Generating copyout(res[:n])
         Generating copyin(mat[:k*n])
     29, Generating present(res[:n])
         Accelerator kernel generated
         31, #pragma acc loop gang /* blockIdx.x */
         36, #pragma acc loop vector(128) /* threadIdx.x */
             Sum reduction generated for sum
     29, Generating Tesla code
     36, Loop is parallelizable
% a.out
m = 0, time = 12.600183ms
m = 1, time = 12.555838ms
m = 2, time = 12.560129ms
m = 3, time = 12.552977ms
m = 4, time = 12.401819ms
... cut ... 
m = 21, time = 11.390924ms
m = 22, time = 11.324883ms
m = 23, time = 10.879040ms
m = 24, time = 10.879040ms
... cut ...
m = 125, time = 9.262800ms
m = 126, time = 9.274006ms
m = 127, time = 9.259939ms
res=1092454.400000

If I change to use a vector length of 256, I see a similar halving of the time, but much later (the directive change is sketched after the output below):

% pgcc -acc -Minfo=accel main.c -V15.5
main:
     21, Generating copyout(res[:n])
         Generating copyin(mat[:k*n])
     29, Generating present(res[:n])
         Accelerator kernel generated
         31, #pragma acc loop gang /* blockIdx.x */
         36, #pragma acc loop vector(256) /* threadIdx.x */
             Sum reduction generated for sum
     29, Generating Tesla code
     36, Loop is parallelizable
% a.out
m = 0, time = 16.785860ms
m = 1, time = 16.723871ms
m = 2, time = 16.201973ms
m = 3, time = 16.170025ms
m = 4, time = 15.873194ms
m = 5, time = 15.868187ms
m = 6, time = 15.871048ms
m = 7, time = 15.732050ms
m = 8, time = 11.610031ms
...cut...
m = 66, time = 10.181904ms
m = 67, time = 10.140896ms
m = 68, time = 5.842924ms
m = 69, time = 5.838871ms
m = 70, time = 5.809069ms
m = 71, time = 5.835056ms
m = 72, time = 5.827188ms
m = 73, time = 5.826950ms
m = 74, time = 5.815029ms
... cut ...
m = 124, time = 4.744053ms
m = 125, time = 4.712105ms
m = 126, time = 4.716158ms
m = 127, time = 4.704952ms
res=1092454.400000
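
In case it helps, here is a sketch of that change in your code; I’m assuming the length is set with a vector_length clause on the parallel construct:

	/* vector_length(256) requests 256 threads per gang instead of the default 128 */
	#pragma acc parallel present(res[0:n]) vector_length(256)
	#pragma acc loop gang
	for (i = m; i < n; i++) {
		int j;
		double sum = 0.0;

		#pragma acc loop vector reduction(+:sum)
		for (j = 0; j < k; j++) {
			sum += pow(mat[i * k + j], i);
		}

		res[i] = sum;
	}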

Most likely, m = 68 is the point at which the number of remaining gangs (n - m) drops to the maximum number my K40 can run at the same time. Before that, one or more gangs have to wait until there is space on the device, which adds a second wave of work and effectively doubles the time.
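
As a rough back-of-the-envelope model (my own estimate, not something the compiler or a profiler reports): if the device can keep G gangs resident at once, a launch of n - m gangs runs in about ceil((n - m) / G) waves, and the kernel time scales with the wave count. The G = 60 below is a hypothetical value chosen only because it matches the break at m = 68:

/* Back-of-the-envelope wave count for a kernel that launches (n - m) gangs
   on a device that can run G gangs concurrently. */
int waves(int n, int m, int G) {
	return (n - m + G - 1) / G;	/* integer ceiling of (n - m) / G */
}

/* With a hypothetical G = 60: waves(128, 67, 60) == 2 but waves(128, 68, 60) == 1,
   so the kernel time roughly halves once the remaining gangs fit in a single wave. */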

Something similar may be happening on your system.

  • Mat

Hi Mat,

Thanks for your hint; increasing the vector_length helped!
In my CUDA implementation of this code I used dynamic parallelism to launch a reduction kernel for each iteration of the outer loop, after each thread had computed its partial sum over k/blockDim entries. This probably helped to saturate the GPU, since the reduction kernels were launched asynchronously.

One more question: if I use something other than a power of two for vector_length, the program crashes with

call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

Is this normal (in CUDA this works fine)?

Thanks,
Fabian

“if I use something other than a power of two for vector_length, the program crashes with …”

No, it should be fine. You’ll just be wasting threads if the vector length isn’t a multiple of 32: the hardware schedules threads in warps of 32, so a vector length of 245, for example, still occupies 8 warps and simply leaves 11 lanes of the last warp idle.

What value are you using? I tried several odd values (245, 253, 999), and the program ran correctly.

  • Mat

Hi Mat,

Your values work for me, too.
However, when I use, for example, 192, 219, 224, or 129, it crashes (225 works).
My suspicion is that below 225 only powers of two work as the vector length, and from 225 onwards every value works (of course I didn’t test them all).
I am using PGI 15.4 and compiling with

pgcc -fastsse -mp -acc -Minfo=accel -ta=nvidia,pin,cc35 -c main.c
pgcc -fastsse -mp -acc -Minfo=accel -ta=nvidia,pin,cc35 -o test.exe main.o -lm -pgcpplibs

Thanks,
Fabian