I’m getting the following error when I run a simple C++ program with an OpenACC pragma: “call to cuStreamSynchronize returned error 700: Illegal address during kernel execution.”
Here’s the full program:
#include <vector>

void processArray(std::vector<int> &arr) {
    int size = arr.size();
    #pragma acc parallel loop
    for (int i = 0; i < size; i++) {
        arr[i] = i;
    }
}

int main(void) {
    std::vector<int> arr(1);
    processArray(arr);
    return 0;
}
However, there’s no error if array processing happens inside “main”:
#include <vector>

int main(void) {
    std::vector<int> arr(1);
    int size = arr.size();
    #pragma acc parallel loop
    for (int i = 0; i < size; i++) {
        arr[i] = i;
    }
    return 0;
}
Could someone please help me understand what the issue with the former example is?
Compilation command:
pgc++ -acc -ta=tesla:managed main.cpp -o main
GPU: GeForce GTX 1060 6GB
Compiler: pgc++ 18.10-1 64-bit target on x86-64 Linux -tp zen
OS: Ubuntu Linux 18.04.2 LTS (Bionic Beaver)
CUDA Unified Memory only manages dynamically allocated (heap) data, not static or stack data. So while the data that arr points to will be managed, the arr object itself isn’t.
In the second, working example, the compiler implicitly copies arr for you. In the first example, however, arr is a reference, so the compiler won’t be able to implicitly copy it. The fix is to explicitly copy in arr.
% setenv PGI_ACC_TIME 1
% cat test1.cpp
#include <vector>

void processArray(std::vector<int> &arr) {
    int size = arr.size();
    #pragma acc parallel loop copyin(arr)
    for (int i = 0; i < size; i++) {
        arr[i] = i;
    }
}

int main(void) {
    std::vector<int> arr(1);
    processArray(arr);
    return 0;
}
% pgc++ -ta=tesla:managed -Minfo=accel test1.cpp; a.out
processArray(std::vector<int, std::allocator<int>> &):
      5, Generating copyin(arr[:])
         Generating Tesla code
          6, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
std::vector<int, std::allocator<int>>::operator [](unsigned long):
      1, include "vector"
          57, include "vector"
                7, include "stl_vector.h"
                   771, Generating implicit acc routine seq
                        Generating acc routine seq
                        Generating Tesla code

Accelerator Kernel Timing data
/local/home/colgrove/test1.cpp
  _Z12processArrayRSt6vectorIiSaIiEE NVIDIA devicenum=0
    time(us): 786
    5: compute region reached 1 time
        5: kernel launched 1 time
            grid: [1] block: [128]
            device time(us): total=765 max=765 min=765 avg=765
            elapsed time(us): total=832 max=832 min=832 avg=832
    5: data region reached 2 times
        5: data copyin transfers: 1
            device time(us): total=21 max=21 min=21 avg=21
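
A possible alternative, offered only as a sketch and not tested on the setup above: since the managed allocator covers the vector’s heap buffer but not the vector object, the kernel can be written to touch only a raw pointer obtained from arr.data(). The function name processArrayRaw is hypothetical, and the behavior assumes the same -ta=tesla:managed build as in the question.

```cpp
#include <vector>

// Hypothetical variant of processArray: the kernel dereferences only a raw
// pointer into the vector's heap buffer, never the std::vector object itself.
void processArrayRaw(std::vector<int> &arr) {
    int size = static_cast<int>(arr.size());
    // Pointer to the heap allocation; assumed to be managed memory when the
    // program is built with -ta=tesla:managed.
    int *data = arr.data();

    // Assumption: without managed memory, an explicit data clause such as
    // copy(data[0:size]) would be needed on this directive.
    #pragma acc parallel loop
    for (int i = 0; i < size; i++) {
        data[i] = i;
    }
}
```

Calling processArrayRaw(arr) from main in place of processArray(arr) should fill arr with 0, 1, 2, ... in the same way. When compiled without OpenACC support, the pragma is ignored and the loop simply runs on the host.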