OpenACC reporting "Illegal address during kernel execut

I’m building a project using OpenACC. It compiles fun, but reports “Illegal address during kernel execution” errors during executions. (the last 2 lines below)

lisanhu@lisanhu-XPS-15-9550:runtime$ ./AccSeqC sequence-2.fa query-2-4.fa 
Start reading sequence.
Done reading sequence: 0.080215s
Start indexing...
Done indexing 92.7572s
Start loading queries...
Done loading queries 0.000438929s
Start alignment...
Processing sequence 0
C size: 4 O size: 312504 R size: 2500001 ans size: 993840
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
call to cuMemFreeHost returned error 700: Illegal address during kernel execution

It’s strange because I’m pretty sure that the vmem on my GPU is enough (2G for tens of MBs).

I also tried to run my project with nvprof. It reports that an assertion fails and have something related to stack. (information comes from the last few lines of below)

lisanhu@lisanhu-XPS-15-9550:runtime$ nvprof ./AccSeqC sequence-2.fa query-2-4.fa 
Start reading sequence.
Done reading sequence: 0.0743251s
Start indexing...
Done indexing 95.5028s
Start loading queries...
Done loading queries 0.000409126s
Start alignment...
==5034== NVPROF is profiling process 5034, command: ./AccSeqC sequence-2.fa query-2-4.fa
Processing sequence 0
C size: 4 O size: 312504 R size: 2500001 ans size: 993840
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
call to cuMemFreeHost returned error 700: Illegal address during kernel execution
==5034== Profiling application: ./AccSeqC sequence-2.fa query-2-4.fa
==5034== Warning: Found 1 invalid records in the result.
==5034== Warning: This can happen if device ran out of memory or if a device kernel was stopped due to an assertion.
==5034== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
100.00%  39.138us         8  4.8920us  3.4770us  12.084us  [CUDA memcpy HtoD]

==5034== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 84.85%  106.03ms         1  106.03ms  106.03ms  106.03ms  cuDevicePrimaryCtxRetain
  9.33%  11.660ms         1  11.660ms  11.660ms  11.660ms  cuMemHostAlloc
  2.32%  2.9004ms         3  966.80us  2.3350us  2.8907ms  cuStreamSynchronize
  2.00%  2.5032ms        10  250.32us  102.73us  448.39us  cuMemAlloc
  0.63%  781.04us         1  781.04us  781.04us  781.04us  cuModuleLoadData
  0.54%  679.33us         1  679.33us  679.33us  679.33us  cuLaunchKernel
  0.25%  315.06us         1  315.06us  315.06us  315.06us  cuMemAllocHost
  0.05%  58.717us         8  7.3390us  5.2280us  17.797us  cuMemcpyHtoDAsync
  0.02%  20.325us         1  20.325us  20.325us  20.325us  cuStreamCreate
  0.00%  2.8410us         1  2.8410us  2.8410us  2.8410us  cuModuleGetFunction
  0.00%  1.9240us         3     641ns     139ns  1.5980us  cuDeviceGetCount
  0.00%  1.5510us         2     775ns     582ns     969ns  cuCtxSetCurrent
  0.00%     946ns         4     236ns     164ns     349ns  cuDeviceGetAttribute
  0.00%     899ns         2     449ns     240ns     659ns  cuCtxSynchronize
  0.00%     827ns         3     275ns     207ns     367ns  cuDeviceGet
  0.00%     565ns         2     282ns     134ns     431ns  cuMemFree
  0.00%     483ns         1     483ns     483ns     483ns  cuMemFreeHost
  0.00%     235ns         1     235ns     235ns     235ns  cuDeviceComputeCapability
  0.00%     212ns         1     212ns     212ns     212ns  cuCtxGetDevice
nvprof: /home/buildmeister/build/rel/gpgpu/toolkit/r8.0/nvprof/src/openacc/nvprof_openacc.c:341: push_openacc_activity_stack: Assertion `activity->start <= activity->end' failed.
Aborted (core dumped)

Could someone help me with my code? What part did I do wrong? (code snippets are show below )

The first snippet is where I start using OpenACC. In the project, I need to deal with something similar to 2d-array, this is 2d-string. Because the length of each string is unknown, I concat all of the strings into qqs and build an array of offsets and lengths to compute actual start pointer and end position of the string. For example, qs should be started with qqs + offset and its length is lengths.

I’m not sure whether I’m using the correct directives or I miss something for this part. However, it’s also possible the problem happens in the stack of the callee function. I put it in the next code block.

#pragma acc data copyin(qqs[:total_size], lengths[:sz], offsets[:sz])
		for (int i = 0; i < genes.size(); ++i) {
			cout << "Processing sequence " << i << endl;
			size_t c_sz = ACCSEQ_ALPHABET_SIZE;
			array_size r_len = refs[i].len;
			size_t o_sz = c_sz * (r_len / COMP_RATIO + 1);
			const array_size *c = refs[i].c;
			const array_size *o = refs[i].o;
			const char *ref_bwt = refs[i].ref_bwt;
			const array_size *rev_o = refs[i].rev_o;
			const char *rev_bwt = refs[i].rev_bwt;
			const size_t q_size = queries.size();

			cout << "C size: " << c_sz << " O size: " << o_sz << " R size: " << r_len << " ans size: " << ans_sz << endl;

#pragma acc kernels copyin(c[:c_sz], o[:o_sz], ref_bwt[:r_len], rev_o[:o_sz], rev_bwt[:r_len]) copyout(ans[:ans_sz])
			for (int j = 0; j < q_size; ++j) {
				const char *query = qqs + offsets[j];
				array_size q_len = lengths[j];

				inexact_dfs_iter_search(query, q_len, ALLOWED_DIFFS, c, o, r_len, ref_bwt, rev_o,
												  rev_bwt, ans + (i * q_size + j) * 101, 100);


Below is the callee function. Originally, the stack size for each thread is different and controlled by q_length, however I find that if I put a variable here, the compiler will try to call a _mp_malloc routine and it can’t be compiled, so I put a constant int here for a demo. The value lll is 200 and it’s actually far from enough. Is it the reason for the problem?

This problem is driving me crazy. I’ll really appreciate it if someone could help me with it.

inexact_dfs_iter_search(const char *query, const array_size q_length, array_size allowed_diffs,
						const array_size *c, const array_size *o, array_size r_length,
						const char *ref_bwt, const array_size *rev_o, const char *rev_bwt,
						Range *res, int num_of_res) {
	const int lll = 200;
	array_size d[lll];
	calculateD(query, q_length, r_length, c, rev_o, rev_bwt, d);
	Profile p{q_length - 1, 1, r_length - 1, (int) allowed_diffs};
	const int prof_size = 9 * lll + 1;
	Profile profs[prof_size];
	Stack<Profile> profiles(profs, prof_size);

	Heap<Range> results(res, num_of_res);
	while (!profiles.empty() && !results.full()) {
		if (profiles.full()) {
			return 1;
		p = profiles.pop();
		inex_dfs_process_profile(query, p, c, o, ref_bwt, d, profiles, results);
	return 0;

What does your setup code look like, and could you specify your PGI and CUDA versions? I recently had a similar problem where I got illegal address errors in OpenACC code if I set a device ID other than zero, probably in conjunction with CUDA Fortran code I’m running in the same program. That one went away when switching to CUDA 8 / PGI 16.9.

Hi SanhuLi

An illegal address error on the GPU is a generic error similar to a seg fault on the host where the code is trying to access a bad memory location.

Some common causes are out-of-bounds accesses, accessing a host address on the device, or you’re out of heap space.

The value lll is 200 and it’s actually far from enough. Is it the reason for the problem?

In this case, my best guess is that you’re out of heap space or possibly stack. So yes, it could be lll’s size. Your program will have many thousands of threads so having each create arrays of size 1800*sizeof(Profile) will use up a lot of space.

Try running your code under the “cuda-memcheck” utility. It might give more details as to what’s wrong.

Also, try adding “loop gang(1) vector(1)” to your “kernels” region and change “kernels” to “parallel”. This should force the loop to be run sequentially and my get the program below any heap or stack limits. Rerun under cuda-memcheck.

  • Mat

Thank you so much for your help. I’ve found what’s the reason.

Although it reports something related to stack, actually I’ve found it’s a strange behaviour with data copyin.

Below is a simple demo that can illustrate the same scenario and works correctly

#include <openacc.h>

using namespace std;

int test_sum(int *arr, size_t sz, int base) {
for (int i = 0; i < sz; ++i) {
base += arr> ;
return base;

int main(int argc, char const *argv[]) {
int lenlens[] = {5, 4};
int all[] = {1,2,3,4,5,6,7,8,9};
int offsets[] = {0,5};

int vals[] = {0,1,2};
int ans[2];

#pragma acc data copyin(all[:9], lenlens[:2], offsets[:2])
for (size_t i = 0; i < 3; i++) {
int val = vals> ;
int *start = all;
start = (int *) acc_deviceptr((void *) start);
cout << hex << start << dec << endl;

#pragma acc parallel loop copyout(ans[:2])
for (size_t j = 0; j < 2; j++) {
int *arr = start + offsets[j];
size_t len = lenlens[j];

ans[j] = test_sum(arr, len, val);

for (size_t j = 0; j < 2; j++) {
cout << ans[j] << " ";
cout << endl;
// #pragma acc exit data delete(all[:9], lenlens[:2], offsets[:2], arrs[:2])

cout << “========ground truth=========” << endl;
for (size_t i = 0; i < 3; i++) {
int *start = all;
int val = vals> ;
cout << hex << start << dec << endl;

for (size_t j = 0; j < 2; j++) {
// int base = bases[j];
int *arr = start + offsets[j];
// cout << hex << arr << endl;
size_t len = lenlens[j];
cout << test_sum(arr, len, val) << endl;
// ans > = test_sum(j);

return 0;

If I remove the line start = (int *) acc_deviceptr((void *) start);, it will report the same error with my original code. I think after data copyin construct, the value of the pointer is not changed, it just copy the data into the device. I used to think this construct will also treat the arrays’ addresses to be the device ptr rather than the host ptr because I don’t need to call acc_deviceptr within a kernel region and it seems that my understanding is incorrect. Thank you so much for your help still.

Hi SanhuLi,

In this context, “start” is a scalar so is “firstprivate” by default. Hence it’s host value will be copied in and used to initialize each thread’s private copy. So when you set “start” to the device pointer value, all is well, but you’re passing in the host address when you’re not calling acc_deviceptr.

Without “acc_deviceptr”, you need to put “start” in a “present” clause to make it a global variable and have it’s device address be set.

#pragma acc parallel loop copyout(ans[:2]) present(start)
 for (size_t j = 0; j < 2; j++) {

Hope this helps,

Thank you so much. I don’t know how to use it that way and this is exactly what I need. I’ll read carefully about the reference when I’m free.