Overlapping acc enter data with computation

Hello,

I’m trying to overlap my computation with the transfer of the needed data. This currently does not work as I would expect and I have reduced my code to this small example:

#include <stdlib.h>

#define SIZE	256 * 1024 * 1024

// #define FASTER

int main(int argc, char* argv[])
{
	int i, j;
	double *a = (double*) malloc(sizeof(double) * SIZE);

	#pragma acc enter data copyin(a[0:SIZE/2]) async(1)
#ifndef FASTER
	#pragma acc enter data copyin(a[SIZE/2:SIZE/2]) async(2) wait(1)
#endif

	#pragma acc kernels present(a[0:SIZE/2]) async(1)
	for (i = 0; i < SIZE / 2; i++) {
		a[i] = a[i] * a[i] + a[i];
	}

#ifdef FASTER
	#pragma acc enter data copyin(a[SIZE/2:SIZE/2]) async(2) wait(1)
#endif

	#pragma acc kernels present(a[SIZE/2:SIZE/2]) async(2)
	for (i = SIZE / 2; i < SIZE; i++) {
		a[i] *= 2;
	}

	#pragma acc exit data delete(a[0:SIZE/2]) async(1)
	#pragma acc exit data delete(a[SIZE/2:SIZE/2]) async(2)

	#pragma acc wait

	free(a);

	return EXIT_SUCCESS;
}

I’m using version 16.9 of the compiler and the NVIDIA Visual Profiler tells me that the first computation is only started after the second transfer has already finished.
If I define ‘FASTER’, then instead the second transfer is not started until the first computation has finished.

Is this expected or am I doing something wrong here?

Thanks,
Jonas

Hi Jonas,

There are a couple of things going on here. First, let’s make a few corrections to your code. As written, you actually create two separate device copies of the “a” array: the first of length “SIZE/2” and a second of length “SIZE”. To preserve indexing on the device, the compiler must create the second array starting at element 0. However, the copy only covers elements SIZE/2 through SIZE, leaving elements 0 through SIZE/2 uninitialized.

To fix your code, let’s create all of “a” on the device, then copy the subarrays using “update” directives.

% cat test_10_04_16.c
#include <stdlib.h>

#define SIZE   256 * 1024 * 1024

// #define FASTER

int main(int argc, char* argv[])
{
    int i, j;
    double *a = (double*) malloc(sizeof(double) * SIZE);

    #pragma acc enter data create(a[0:SIZE])

    #pragma acc update device(a[0:SIZE/2]) async(1)
    #pragma acc update device(a[SIZE/2:SIZE/2]) async(2) wait(1)

    #pragma acc kernels present(a[0:SIZE/2]) async(1)
    for (i = 0; i < SIZE / 2; i++) {
       a[i] = a[i] * a[i] + a[i];
    }

    #pragma acc kernels present(a[SIZE/2:SIZE/2]) async(2) wait(1)
    for (i = SIZE / 2; i < SIZE; i++) {
       a[i] *= 2;
    }

    #pragma acc wait
    #pragma acc exit data delete(a)

    free(a);

    return EXIT_SUCCESS;
}

To perform the transfer, data must first be copied from pageable virtual memory into physically pinned memory. To accomplish this, each async queue creates a double-buffering system where chunks of the array are copied from virtual memory into the pinned buffers. While one buffer is being transferred, the second buffer is being filled. The caveat is that the host is busy filling the buffers. Normally this isn’t much of a problem when using async, but because your 1 GB sub-arrays are much larger than the buffers, you’re only seeing host/device overlap after the last buffer is transferred.
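The staging scheme described above can be illustrated with a host-only sketch (the function name, buffer size, and structure here are purely illustrative, not the runtime’s actual implementation): the array is streamed through two small buffers, filling one while the other is drained, so the host is occupied for the whole transfer when the array is much larger than a buffer.

```c
#include <string.h>

#define CHUNK 4096  /* illustrative staging-buffer size, in doubles */

/* Copy n doubles from pageable src to dst through two small staging
   buffers, alternating between them the way a double-buffered transfer
   queue would: in the real runtime, the DMA engine drains one buffer
   while the host fills the other. */
static void staged_copy(double *dst, const double *src, size_t n)
{
    double buf[2][CHUNK];
    int which = 0;
    for (size_t off = 0; off < n; off += CHUNK) {
        size_t len = (n - off < CHUNK) ? (n - off) : CHUNK;
        /* host fills the current staging buffer... */
        memcpy(buf[which], src + off, len * sizeof(double));
        /* ...then the buffer is drained to the destination */
        memcpy(dst + off, buf[which], len * sizeof(double));
        which ^= 1;  /* switch to the other buffer for the next chunk */
    }
}
```

Note that the host loop cannot move on until every chunk has passed through a staging buffer, which is why overlap with device computation only begins once the final buffer is handed off.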

The easiest thing to do here is to compile with “-ta=tesla:pinned”, where “pinned” will attempt to allocate host data in pinned rather than pageable virtual memory, so transfers can skip the staging buffers entirely. The caveats are that the OS may not honor the request and that your data needs to fit in physical memory.

Another method would be to increase the size of the buffers via the environment variable “PGI_ACC_BUFFERSIZE”. The caveat is that if you make it too big, you’ll lose the benefits of double buffering.
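A hypothetical build/run sequence showing both options (the exact flags and the accepted value format for the environment variable are assumptions; check the PGI documentation for your compiler version):

```shell
# Option 1: pin host allocations so transfers bypass the staging buffers
pgcc -fast -acc -ta=tesla:pinned -Minfo=accel test_10_04_16.c -o test

# Option 2: keep pageable memory but enlarge the runtime's transfer buffers
export PGI_ACC_BUFFERSIZE=33554432   # 32 MB per buffer (value format may vary)
./test
```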

Hope this helps,
Mat

Hi Mat,

thanks for your quick answer!

Is the second one really of length SIZE, or does the compiler transform the indices? I’m asking because my real code has “a” larger than the device memory, and with the async transfers I’m trying to make sure that the device memory is never exceeded…

So the current implementation is not capable of doing the copy to the pinned buffers in the background?

Thanks, this has worked and gives me the expected overlap.

Cheers,
Jonas

Is the second one really of length SIZE or does the compiler transform the indices?

Transforming the indices would be fraught with errors. What would happen if you passed the array to a subroutine? No, per the OpenACC spec, the compiler must preserve the indexing and create an array of length SIZE.

I’m trying to make sure that the device memory is never exceeded…

In this case you’ll need to use a blocking algorithm where you have a pointer into the array. I like to create the block on the device and then use the “acc_map_data” API call to map the current host block to the device block.
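A minimal sketch of that blocking approach (the function name and sizes are made up for illustration): allocate one device-side pool with acc_malloc, then for each host block, map it onto the pool, update, compute, and unmap. Built without an OpenACC compiler, the pragmas are ignored and the loop simply runs on the host.

```c
#include <stdlib.h>
#ifdef _OPENACC
#include <openacc.h>
#endif

/* Process a large host array in blocks of n_block elements, reusing a
   single device pool so device memory use never exceeds one block. */
void process_blocked(double *a, long n_total, long n_block)
{
#ifdef _OPENACC
    void *pool = acc_malloc(sizeof(double) * n_block);  /* device pool */
#endif
    for (long start = 0; start < n_total; start += n_block) {
        double *blk = a + start;  /* pointer into the current host block */
        long len = (n_total - start < n_block) ? (n_total - start) : n_block;
#ifdef _OPENACC
        /* Map the current host block onto the persistent device pool. */
        acc_map_data(blk, pool, sizeof(double) * len);
#endif
        #pragma acc update device(blk[0:len])
        #pragma acc kernels present(blk[0:len])
        for (long i = 0; i < len; i++) {
            blk[i] *= 2.0;  /* stand-in for the real computation */
        }
        #pragma acc update self(blk[0:len])
#ifdef _OPENACC
        acc_unmap_data(blk);  /* free the mapping for the next block */
#endif
    }
#ifdef _OPENACC
    acc_free(pool);
#endif
}
```

Mapping and unmapping avoids repeated device allocation per block; only the update transfers remain in the loop.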

So the current implementation is not capable of doing the copy to the pinned buffers in the background?

No. This requires spawning a host helper thread. We tried various implementations of this, but each had other problems.

Mat

I’m then wondering why the following code works on a GPU with 6 GB of memory

#include <stdlib.h>

#define SIZE_HOST	(8 * 1024 * 1024 * 1024L / sizeof(double))
#define SIZE_DEVICE	(4 * 1024 * 1024 * 1024L / sizeof(double))

int main(int argc, char* argv[])
{
	double* A = (double*) malloc(sizeof(double) * SIZE_HOST);

	#pragma acc enter data copyin(A[SIZE_HOST - SIZE_DEVICE:SIZE_DEVICE])

	#pragma acc exit data copyout(A[SIZE_HOST - SIZE_DEVICE:SIZE_DEVICE])

	free(A);

	return EXIT_SUCCESS;
}

but does not if I increase SIZE_DEVICE to 6 GB…

You are correct that there is an exception in the OpenACC standard that says “if a subarray is specified in a data clause, the implementation may choose to allocate memory for only that subarray on the accelerator.”

So yes, this may work, but given it’s an implementation-dependent feature, I still advise blocking with a device memory pool that is mapped and unmapped to each block of the sub-array.