error: Failing in Thread:1 call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

Hello. The day before yesterday I asked a question: https://forums.developer.nvidia.com/t/pgcc-s-1000-call-in-openacc-region-to-procedure-memmove-which-has-no-acc-routine-information/136234/1
I substantially reworked the code following the advice given there: https://github.com/AndStorm/openacc.git.
The code now compiles for the GPU (using OpenACC directives) without errors, but when I launch it, it fails with this error:

Failing in Thread:1
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

I use the compile line:

cmake . -DCMAKE_C_COMPILER=pgcc -DCMAKE_CXX_COMPILER=pgc++ -DCMAKE_CXX_FLAGS="-acc -mcmodel=medium -ta=tesla:cc30,managed -Mnollvm -fast -Mcuda=cuda10.1" -DCMAKE_CXX_STANDARD=17

-DCMAKE_CXX_STANDARD=17 is necessary to compile the T3ParticleTable.cpp and T3MaterilaTable.cpp files.
OpenACC directives appear only in the T3Process.h and T3DataHolder.h files.
When I try to use the following in main() of nbody.cpp

DataHolder<FloatingType> data;
#pragma acc data create(data)
{...}

instead of

 
cudaMalloc((void**) &data, sizeof(DataHolder<FloatingType>));
 ...
cudaFree(data);

the program fails with:

Segmentation Fault (core dumped)

What is causing this? Please help me get the code to work on the GPU.
I use the PGI 19.4 pgc++ compiler, CUDA 10.1, CUDA driver version 418.67, Fedora 23 x86_64 and a GeForce GTX 650 Ti GPU. gcc 5.3.1 is also installed on my machine.
The program works properly on the CPU with the following compile line (please comment out #define OPENACC at line 12 of T3Defs.h):

cmake . -DCMAKE_C_COMPILER=pgcc -DCMAKE_CXX_COMPILER=pgc++ -DCMAKE_CXX_FLAGS="-acc -ta=multicore -fast" -DCMAKE_CXX_STANDARD=17

Hi @and,

I downloaded and ran your code, but I see a different error than the one you report. In my case, the code aborts because “particles” isn’t present when entering the first parallel region. The reason is that the class variable “data” is allocated using cudaMalloc, so you’d need to use “deviceptr” in place of “present” to tell the compiler that this is a CUDA pointer. “present” can only be used with data managed via the OpenACC data directives. I simplified this to just “deviceptr(this)”, since what you really need is to tell the compiler that the this pointer is a CUDA pointer.

While this gets me past the present error, the code later seg faults since “data” is dereferenced on the host. Device pointers can’t be dereferenced on the host.

If I change nbody.cpp to allocate data on the host and then let CUDA unified memory (i.e. -ta=tesla:managed) manage the data, the code runs fine.
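
A minimal sketch of the managed-memory variant, assuming a simplified stand-in for DataHolder (the ManagedHolder struct, its pos/vel members and managed_demo are hypothetical names, not taken from the repository). The object is allocated with plain new on the host, and the loop carries no present or deviceptr clause, relying on -ta=tesla:managed to make the heap allocation reachable from the device:

```cpp
#include <cassert>

// Hypothetical stand-in for DataHolder<FloatingType>; not the repository's class.
struct ManagedHolder {
    static constexpr int N = 256;
    float pos[N];
    float vel[N];

    // No present/deviceptr clause here: this sketch assumes the build uses
    // -ta=tesla:managed, so the heap object is visible to host and device.
    void advance(float dt) {
        #pragma acc parallel loop
        for (int i = 0; i < N; ++i) {
            pos[i] += vel[i] * dt;
        }
    }
};

// Allocates on the host, runs one step, and returns pos[4] for inspection.
float managed_demo() {
    ManagedHolder *data = new ManagedHolder();  // host allocation, not cudaMalloc
    for (int i = 0; i < ManagedHolder::N; ++i) {
        data->pos[i] = 0.0f;
        data->vel[i] = static_cast<float>(i);
    }

    data->advance(0.5f);

    // Dereferencing on the host is fine because data is not a raw device pointer.
    float result = data->pos[4];
    delete data;
    return result;
}
```

Built without OpenACC support, the pragma is ignored and the loop runs on the host, which gives a quick way to compare CPU and GPU results.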

I can also manually manage the data (i.e. just -ta=tesla, no managed) by adding “enter data create(data)” after the creation of data and then using “present(this)” in place of your “present(particles)”.
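
A minimal sketch of the manual variant, again with a hypothetical stand-in class (Holder, pos, vel and manual_demo are illustrative names). It assumes the arrays are stored inside the object, so a single data clause on the object covers them, and it writes the clause as data[0:1] because data is a pointer here. The two update directives are the kind of host/device synchronization that may still be missing in the real code:

```cpp
#include <cassert>

// Hypothetical stand-in for DataHolder; arrays are members of the object itself.
struct Holder {
    static constexpr int N = 256;
    float pos[N];
    float vel[N];

    void advance(float dt) {
        // present(this) asserts that an OpenACC data directive has already
        // placed the object on the device.
        #pragma acc parallel loop present(this)
        for (int i = 0; i < N; ++i) {
            pos[i] += vel[i] * dt;
        }
    }
};

// Runs `steps` updates with explicit data management and returns pos[4].
float manual_demo(int steps) {
    assert(steps >= 0);
    Holder *data = new Holder();
    for (int i = 0; i < Holder::N; ++i) {
        data->pos[i] = 0.0f;
        data->vel[i] = static_cast<float>(i);
    }

    // Allocate the device copy, then push the initialized host values to it.
    #pragma acc enter data create(data[0:1])
    #pragma acc update device(data[0:1])

    for (int s = 0; s < steps; ++s) {
        data->advance(0.5f);
    }

    // Pull the results back before reading them on the host, then release.
    #pragma acc update self(data[0:1])
    #pragma acc exit data delete(data[0:1])

    float result = data->pos[4];
    delete data;
    return result;
}
```

If the real class holds pointers to separately allocated arrays rather than in-object arrays, each array would need its own data clause as well; this sketch does not cover that case.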

Not sure it’s getting correct answers since there may be missing updates, but it does run to completion.

Hope this helps,
Mat

I tried:

  1. Replaced all “present(particles,…)” clauses with “deviceptr(this)”, as you suggested, allocated “DataHolder *data=new DataHolder();” on the host, and set “-ta=tesla:cc30,managed” so that CUDA unified memory manages the data. When I launch the code, it prints at startup:
    pool allocator: Specified pool size too big for this device
    no matter what the size of the data is (I minimized it).
    The code runs on the GPU, but its results differ from the CPU runs (with either OpenMP or OpenACC), although they should be equal. Maybe the cause is the “pool allocator: Specified pool size too big for this device” message, which appears for any data size. How can I avoid it?
  2. Added “#pragma acc enter data create(data)” after “DataHolder *data=new DataHolder();” and replaced all “present(particles,…)” with “present(this)” in DataHolder.h, but the execution fails with the error:
Failing in Thread:1
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

If you could send me your updated code, I would be very grateful.