Can I initialize some data within a compute region?

Hello all,

I’m currently doing a task that needs to dynamically allocate some GPU memory within a routine / compute region. It’s something similar to the below code block.

#include <iostream>

using namespace std;

#pragma acc routine seq
int *routine(int init) {
	int *ptr;
	#pragma acc data create(ptr[:10])
	for (int i = 0; i < 10; ++i) {
		ptr[i] = init + i;
	}
	return ptr;
}

void print_array(int *arr) {
	for (int i = 0; i < 10; ++i) {
		cout << arr[i] << " ";
	}
	cout << endl;
}

int main(void) {
	int *arrs[5];

#pragma acc kernels
	for (int i = 0; i < 5; ++i) {
		arrs[i] = routine(i);
	}

	for (int i = 0; i < 5; ++i) {
		print_array(arrs[i]);
	}
	return 0;
}

In this demo code, I call the routine a few times, and inside the routine I try to create some data. This is similar to my real task. The problem is that in my task I cannot create the data out of the region because the size of the elements to be allocated cannot be decided until runtime (and the estimation of maximum possible elements will explode my GPU).

I can compile it, and below is the compiler output for this demo:

lisanhu@lisanhu-XPS-15-9550:create_and_copyout$ pgc++ -o test main.cc -acc -Minfo=accel
routine(int):
      6, Generating acc routine seq
main:
     23, Generating implicit copyout(arrs[:])
     26, Accelerator restriction: size of the GPU copy of arrs is unknown
         Loop is parallelizable
         Generating implicit copy(arrs[:][:])
         Accelerator kernel generated
         Generating Tesla code
         26, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */

The result of running it:

lisanhu@lisanhu-XPS-15-9550:create_and_copyout$ ./test 
call to cuStreamSynchronize returned error 715: Illegal instruction

Is there some way for me to do this? Really appreciate it.

Hi SanhuLi,

There are a number of problems here. First, your program isn’t a valid C/C++ program since “ptr” is never allocated. Also, you can’t put an OpenACC data region within a compute region or “routine”. You can dynamically allocate data on the device by calling malloc, but it is not recommended: having each thread allocate data on the device is very slow, and by default there’s only 8MB of heap. Even if you do create device-side memory, it doesn’t persist between kernel calls and wouldn’t be accessible from the host.
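
For the fixed-size case in your demo, a valid version might look roughly like the sketch below. It assumes every row has the same length known up front, so one flat buffer can be allocated on the host and the data clause can sit on the compute construct instead of inside the routine. The names `fill_row`, `build_rows`, `ROWS`, and `COLS` are made up for illustration.

```c
#include <assert.h>
#include <stdlib.h>

#define ROWS 5   /* hypothetical: number of rows, as in the demo */
#define COLS 10  /* hypothetical: fixed length of each row */

/* Device-callable helper: fills a caller-provided row.
   No allocation and no data directive inside the routine. */
#pragma acc routine seq
void fill_row(int *row, int init) {
    for (int j = 0; j < COLS; ++j) {
        row[j] = init + j;
    }
}

/* Allocates one flat ROWS*COLS buffer on the host, fills it in a
   compute region, and returns it. The caller frees the buffer. */
int *build_rows(void) {
    int *buf = (int *) malloc(ROWS * COLS * sizeof(int));
    assert(buf != NULL);

    /* The data clause is on the compute construct, outside the routine. */
#pragma acc parallel loop copyout(buf[0:ROWS*COLS])
    for (int i = 0; i < ROWS; ++i) {
        fill_row(&buf[i * COLS], i);
    }
    return buf;
}
```

Row `i` then lives at `buf[i*COLS]` through `buf[i*COLS + COLS - 1]` on the host. This does not cover the variable-size case, which is what the question below is about.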

> The problem is that in my task I cannot create the data out of the region because the size of the elements to be allocated cannot be decided until runtime (and the estimation of maximum possible elements will explode my GPU).

Do you have an example from your original code showing what you’re trying to do? Once I have a better understanding of the problem, I can hopefully give you better recommendations on how to solve it.

- Mat

Hello Mat,

Thank you so much for your help. I know that demo is not valid but that’s the best I can do to summarize my problem.

I’m building a project to do sequence alignment for genes and sequence reads. For example, a simple task is to align perhaps more than a million queries to a thousand genes. It’s similar to substring matching: each query can match at several positions. For example, the query ‘a’ matches within ‘ababaaaa’ at 6 positions.

An obvious way to improve performance is to assign each worker a single query. However, I can’t know in advance how many positions a query will match within a gene, so I need to store a different number of positions for each query and gene. In the worst case, the number of positions is as large as the length of the gene. That is obviously a poor estimate, but it is the worst case. The size of genes in real data can vary, perhaps from a few hundred to a few thousand, but there is no guarantee. So I need space to store the results and return them to the host.
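
To make the matching concrete, here is a minimal host-side sketch of what one worker does for one query and one gene. It is only an illustration, not my real code: `find_matches` and its parameters are hypothetical names, and it uses plain exact matching.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: stores up to max_out start positions of `query`
   within `gene` into `out` and returns the total number of matches
   (which may be larger than max_out). */
int find_matches(const char *gene, const char *query, int *out, int max_out) {
    size_t glen = strlen(gene);
    size_t qlen = strlen(query);
    int count = 0;

    assert(qlen > 0);
    for (size_t pos = 0; pos + qlen <= glen; ++pos) {
        if (memcmp(gene + pos, query, qlen) == 0) {
            if (count < max_out) {
                out[count] = (int) pos;  /* keep only what fits */
            }
            ++count;
        }
    }
    return count;
}
```

The return value is never larger than the gene length, which is the worst case described above, but it is usually far smaller, and that gap is what makes pre-allocating the worst case so wasteful.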

That’s roughly how my program works. I can email the code to you if you want, since it is fairly long. My problem here is that I don’t know the size of each worker’s output, so I can’t pre-allocate the memory for the result.

Hi SanhuLi,

What I’m thinking is that you might want to use a pool of memory on the device and then keep track of where each gene’s values are stored, as well as their length. You’d need to use an atomic operation, which is slow, and the max size of the pool would be fixed. But you could then access the pool from the host, re-use it between compute regions, avoid the overhead of dynamic allocation on the device, and not be limited to 32MB (max device heap size).

Here’s a basic example I put together:

% cat pool.c

#include <stdlib.h>
#include <stdio.h>

#define MAX_GENES 512
#define MAX_SIZE (MAX_GENES*400)

int main() {

    int * pool_data;
    int *idx, *idx_size;
    int pool_size, i, index, total_errors;

    pool_size=MAX_SIZE;
    pool_data = (int*) malloc(pool_size*sizeof(int));
    idx = (int*) malloc(MAX_GENES*sizeof(int));
    idx_size = (int*) malloc(MAX_GENES*sizeof(int));
    index = 0;

#pragma acc enter data create(idx[0:MAX_GENES], idx_size[0:MAX_GENES], pool_data[0:pool_size]) copyin(index)

#pragma acc parallel loop present(pool_data,idx,idx_size,index)
    for (i=0; i <MAX_GENES;++i) {
        int myidx, mysize;
        if (i%4) {
           mysize=100;
        } else {
           mysize=1000;
        }
#pragma acc atomic capture
        {
          myidx = index;
          index += mysize;
        }

        if (myidx+mysize <= pool_size) {  // slice [myidx, myidx+mysize) fits
           idx[i] = myidx;
           idx_size[i] = mysize;
           for (int j=0; j < mysize; ++j) {
             pool_data[myidx+j]  = i;
           }
        } else {
           // Error, overflowed the pool
           idx[i] = -1;
           idx_size[i] = mysize;
        }
    }

#pragma acc update self(idx[0:MAX_GENES],idx_size[0:MAX_GENES],pool_data[0:pool_size])

    total_errors = 0;
    for (i=0;i<MAX_GENES;++i) {
        int myidx, mysize;
        int err=0;
        myidx=idx[i];
        mysize=idx_size[i];
        if (myidx < 0) {
           printf("Gene #%d (size %d) overflowed the pool\n",i,mysize);
           ++total_errors;
        } else {
           for (int j=0;j<mysize;++j) {
              if (pool_data[j+myidx] != i) {
                ++err;
              }
           }
           if (err > 0) {
             ++total_errors;
             printf("Index %d had %d errors.\n",i,err);
           }
        }
     }
     if (total_errors == 0) {
        printf("No Errors\n");
     }

#pragma acc exit data delete(idx,idx_size,pool_data)
     free(idx);
     free(idx_size);
     free(pool_data);


}
% pgcc pool.c -acc -Minfo=accel
main:
     22, Generating enter data create(idx_size[:512])
         Generating enter data copyin(index)
         Generating enter data create(pool_data[:pool_size],idx[:512])
     24, Generating present(index,pool_data[:],idx[:],idx_size[:])
         Accelerator kernel generated
         Generating Tesla code
         25, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
         41, #pragma acc loop seq
     41, Loop carried reuse of pool_data-> prevents parallelization
     53, Generating update self(pool_data[:pool_size],idx_size[:512],idx[:512])
     78, Generating exit data delete(pool_data[:1],idx_size[:1],idx[:1])
% a.out
No Errors
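
If the fixed pool size is a concern, one alternative (a different technique from the atomic pool in pool.c) is a two-pass scheme: first only count the results per work item, then compute start offsets with an exclusive prefix sum on the host, allocate exactly that much, and fill in a second pass. The sketch below is a rough illustration under that assumption; `result_count` is a hypothetical stand-in for the real per-item work, and `two_pass_fill` is a made-up name.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the real work: item i produces this many results. */
#pragma acc routine seq
int result_count(int i) {
    return (i % 4) ? 100 : 1000;
}

/* Two-pass fill. `offsets` must hold n+1 ints; on return, item i owns
   pool[offsets[i]] .. pool[offsets[i+1]-1]. The caller frees the pool. */
int *two_pass_fill(int n, int *offsets) {
    assert(n > 0);
    int *counts = (int *) malloc(n * sizeof(int));
    assert(counts != NULL);

    /* Pass 1: count only, so no result storage is needed yet. */
#pragma acc parallel loop copyout(counts[0:n])
    for (int i = 0; i < n; ++i) {
        counts[i] = result_count(i);
    }

    /* Exclusive prefix sum on the host gives each item its start offset. */
    offsets[0] = 0;
    for (int i = 0; i < n; ++i) {
        offsets[i + 1] = offsets[i] + counts[i];
    }
    int total = offsets[n];
    free(counts);
    assert(total > 0);

    int *pool = (int *) malloc(total * sizeof(int));
    assert(pool != NULL);

    /* Pass 2: each item writes into its own disjoint slice, so no atomics. */
#pragma acc parallel loop copyin(offsets[0:n+1]) copyout(pool[0:total])
    for (int i = 0; i < n; ++i) {
        for (int j = offsets[i]; j < offsets[i + 1]; ++j) {
            pool[j] = i;
        }
    }
    return pool;
}
```

The trade-off is that the per-item work runs twice, so this only pays off when counting is cheap relative to the cost of over-allocating or overflowing a fixed pool.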

Hope this helps,
Mat