I’m building a project using OpenACC. It compiles fun, but reports “Illegal address during kernel execution” errors during executions. (the last 2 lines below)
lisanhu@lisanhu-XPS-15-9550:runtime$ ./AccSeqC sequence-2.fa query-2-4.fa
Start reading sequence.
Done reading sequence: 0.080215s
Start indexing...
Done indexing 92.7572s
Start loading queries...
Done loading queries 0.000438929s
Start alignment...
Processing sequence 0
C size: 4 O size: 312504 R size: 2500001 ans size: 993840
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
call to cuMemFreeHost returned error 700: Illegal address during kernel execution
It’s strange because I’m pretty sure that the vmem on my GPU is enough (2G for tens of MBs).
I also tried to run my project with nvprof. It reports that an assertion fails and have something related to stack. (information comes from the last few lines of below)
lisanhu@lisanhu-XPS-15-9550:runtime$ nvprof ./AccSeqC sequence-2.fa query-2-4.fa
Start reading sequence.
Done reading sequence: 0.0743251s
Start indexing...
Done indexing 95.5028s
Start loading queries...
Done loading queries 0.000409126s
Start alignment...
==5034== NVPROF is profiling process 5034, command: ./AccSeqC sequence-2.fa query-2-4.fa
Processing sequence 0
C size: 4 O size: 312504 R size: 2500001 ans size: 993840
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
call to cuMemFreeHost returned error 700: Illegal address during kernel execution
==5034== Profiling application: ./AccSeqC sequence-2.fa query-2-4.fa
==5034== Warning: Found 1 invalid records in the result.
==5034== Warning: This can happen if device ran out of memory or if a device kernel was stopped due to an assertion.
==5034== Profiling result:
Time(%) Time Calls Avg Min Max Name
100.00% 39.138us 8 4.8920us 3.4770us 12.084us [CUDA memcpy HtoD]
==5034== API calls:
Time(%) Time Calls Avg Min Max Name
84.85% 106.03ms 1 106.03ms 106.03ms 106.03ms cuDevicePrimaryCtxRetain
9.33% 11.660ms 1 11.660ms 11.660ms 11.660ms cuMemHostAlloc
2.32% 2.9004ms 3 966.80us 2.3350us 2.8907ms cuStreamSynchronize
2.00% 2.5032ms 10 250.32us 102.73us 448.39us cuMemAlloc
0.63% 781.04us 1 781.04us 781.04us 781.04us cuModuleLoadData
0.54% 679.33us 1 679.33us 679.33us 679.33us cuLaunchKernel
0.25% 315.06us 1 315.06us 315.06us 315.06us cuMemAllocHost
0.05% 58.717us 8 7.3390us 5.2280us 17.797us cuMemcpyHtoDAsync
0.02% 20.325us 1 20.325us 20.325us 20.325us cuStreamCreate
0.00% 2.8410us 1 2.8410us 2.8410us 2.8410us cuModuleGetFunction
0.00% 1.9240us 3 641ns 139ns 1.5980us cuDeviceGetCount
0.00% 1.5510us 2 775ns 582ns 969ns cuCtxSetCurrent
0.00% 946ns 4 236ns 164ns 349ns cuDeviceGetAttribute
0.00% 899ns 2 449ns 240ns 659ns cuCtxSynchronize
0.00% 827ns 3 275ns 207ns 367ns cuDeviceGet
0.00% 565ns 2 282ns 134ns 431ns cuMemFree
0.00% 483ns 1 483ns 483ns 483ns cuMemFreeHost
0.00% 235ns 1 235ns 235ns 235ns cuDeviceComputeCapability
0.00% 212ns 1 212ns 212ns 212ns cuCtxGetDevice
nvprof: /home/buildmeister/build/rel/gpgpu/toolkit/r8.0/nvprof/src/openacc/nvprof_openacc.c:341: push_openacc_activity_stack: Assertion `activity->start <= activity->end' failed.
Aborted (core dumped)
Could someone help me with my code? What part did I do wrong? (code snippets are show below )
The first snippet is where I start using OpenACC. In the project, I need to deal with something similar to 2d-array, this is 2d-string. Because the length of each string is unknown, I concat all of the strings into qqs and build an array of offsets and lengths to compute actual start pointer and end position of the string. For example, qs should be started with qqs + offset and its length is lengths.
I’m not sure whether I’m using the correct directives or I miss something for this part. However, it’s also possible the problem happens in the stack of the callee function. I put it in the next code block.
#pragma acc data copyin(qqs[:total_size], lengths[:sz], offsets[:sz])
{
for (int i = 0; i < genes.size(); ++i) {
cout << "Processing sequence " << i << endl;
size_t c_sz = ACCSEQ_ALPHABET_SIZE;
array_size r_len = refs[i].len;
size_t o_sz = c_sz * (r_len / COMP_RATIO + 1);
const array_size *c = refs[i].c;
const array_size *o = refs[i].o;
const char *ref_bwt = refs[i].ref_bwt;
const array_size *rev_o = refs[i].rev_o;
const char *rev_bwt = refs[i].rev_bwt;
const size_t q_size = queries.size();
cout << "C size: " << c_sz << " O size: " << o_sz << " R size: " << r_len << " ans size: " << ans_sz << endl;
#pragma acc kernels copyin(c[:c_sz], o[:o_sz], ref_bwt[:r_len], rev_o[:o_sz], rev_bwt[:r_len]) copyout(ans[:ans_sz])
for (int j = 0; j < q_size; ++j) {
const char *query = qqs + offsets[j];
array_size q_len = lengths[j];
inexact_dfs_iter_search(query, q_len, ALLOWED_DIFFS, c, o, r_len, ref_bwt, rev_o,
rev_bwt, ans + (i * q_size + j) * 101, 100);
}
}
}
Below is the callee function. Originally, the stack size for each thread is different and controlled by q_length, however I find that if I put a variable here, the compiler will try to call a _mp_malloc routine and it can’t be compiled, so I put a constant int here for a demo. The value lll is 200 and it’s actually far from enough. Is it the reason for the problem?
This problem is driving me crazy. I’ll really appreciate it if someone could help me with it.
int
inexact_dfs_iter_search(const char *query, const array_size q_length, array_size allowed_diffs,
const array_size *c, const array_size *o, array_size r_length,
const char *ref_bwt, const array_size *rev_o, const char *rev_bwt,
Range *res, int num_of_res) {
const int lll = 200;
array_size d[lll];
calculateD(query, q_length, r_length, c, rev_o, rev_bwt, d);
Profile p{q_length - 1, 1, r_length - 1, (int) allowed_diffs};
const int prof_size = 9 * lll + 1;
Profile profs[prof_size];
Stack<Profile> profiles(profs, prof_size);
profiles.push(p);
Heap<Range> results(res, num_of_res);
while (!profiles.empty() && !results.full()) {
if (profiles.full()) {
return 1;
}
p = profiles.pop();
inex_dfs_process_profile(query, p, c, o, ref_bwt, d, profiles, results);
}
return 0;
}