error: call to cuStreamSynchronize returned error 719: Launch failed (often invalid pointer dereference)

Thank you very much for answering my question from yesterday: https://forums.developer.nvidia.com/t/error-type-of-argument-does-not-match-formal-parameter-pgi-fixuk-param-0/136438/1 Indeed, if I drop the -Mnollvm compiler option, the code at https://github.com/AndStorm/QUESTION.git compiles without errors.
I have a small follow-up question. My code uses the cudaMemcpy() CUDA function. I thought that using cudaMemcpy() required the “-Mnollvm” PGI compiler flag. But, as I understand it now, I can use cudaMemcpy() without -Mnollvm, i.e. with the LLVM compiler. Is that correct?

The code at https://github.com/AndStorm/QUESTION.git works properly on Intel Core i7 CPUs and on a 64-core KNL with the PGI 19.4 compiler. But when I compile it to run on a Titan V GPU (-ta=tesla:cc70) installed in the KNL machine (-tp=haswell) using PGI 19.4 + OpenACC (OS: Ubuntu Stable, gcc 7.4.0), it compiles but fails at runtime with this error:

size of DataHolder is ~0.0019564GB, size of d is ~0.0019564GB
##################################################
push=500000 INJ=500000 N=99999999
##################################################
Failing in Thread:1
call to cuStreamSynchronize returned error 719: Launch failed (often invalid pointer dereference)

Failing in Thread:1
call to cuMemFreeHost returned error 719: Launch failed (often invalid pointer dereference)

I use the compile line:

cmake . -DCMAKE_C_COMPILER=pgcc -DCMAKE_CXX_COMPILER=pgc++
-DCMAKE_C_FLAGS="-acc -Minfo=acc -mcmodel=medium -ta=tesla:cc70 -tp=haswell -Minline -Mcuda=cuda10.1"
-DCMAKE_CXX_FLAGS="-acc -Minfo=acc -mcmodel=medium -ta=tesla:cc70 -tp=haswell -Minline -Mcuda=cuda10.1" -DCMAKE_CXX_STANDARD=17 -DACC=ON -DCUDA=ON

Using printf() on the GPU, I found that the error occurs in the T3DataHolder.h file, somewhere in the Propagate() function.
I have lost all hope of finding the error and do not know how to debug it on the GPU.
Could you please help me fix this error and get the code to work?
Andrey

I thought that using cudaMemcpy() required the “-Mnollvm” PGI compiler flag. But, as I understand it now, I can use cudaMemcpy() without -Mnollvm, i.e. with the LLVM compiler. Is that correct?

CUDA API calls are separate and distinct from which compiler code generator back-end is used. So cudaMemcpy can be used with either the LLVM or older non-LLVM back-end.

Could you please help me fix this error and get the code to work?

I can try, but the code errors since it can’t find the data directory. Can you please provide a data set so I can run the code?

% ./Test
-Warning-T3NSGangular_RW::default_file:Cant't get $T3_DATA. Use ./data
pdg=2112
secondaryZA=1 Z=0 secondaryA=1
Check: ./data/angular/N/incD/50/T3DSGangular_100010020.bin
RRR: Check in T3NSGangular_RW.cc:
RRR: fname=./data/angular/N/incD/50/T3DSGangular_100010020.bin
terminate called after throwing an instance of 'std::out_of_range'
  what():  vector::_M_range_check: __n (which is 0) >= this->size() (which is 0)
Abort

Though my guess is that you have a bad address in there someplace. You might try running with CUDA Unified Memory enabled (i.e. -ta=tesla:managed) to see if that fixes the issue.
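
For readers unfamiliar with error 719: as the message itself hints, it often means that device code touched an address that is not valid on the GPU, for example a host pointer whose data was never mapped to the device. Below is a minimal sketch, not taken from the QUESTION repository, of an OpenACC loop whose data clauses state the array extents explicitly. The names scale_on_device and scaled_copy are hypothetical. When the file is built without OpenACC support the pragma is ignored and the loop runs on the host, so the same source can be sanity-checked on the CPU first.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original code): writes src[i] * factor
// into dst[i] for i in [0, n).
void scale_on_device(const double* src, double* dst, std::size_t n, double factor)
{
    // Catch a null pointer on the host, before it can become a device fault.
    assert(n == 0 || (src != nullptr && dst != nullptr));

    // copyin/copyout tell the OpenACC runtime exactly which ranges to map.
    // A missing or too-short range is one typical way to end up with an
    // invalid device address at run time.
    #pragma acc parallel loop copyin(src[0:n]) copyout(dst[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = src[i] * factor;
    }
}

// Hypothetical convenience wrapper: the std::vector objects stay on the host,
// only their raw storage is handed to the device loop.
std::vector<double> scaled_copy(const std::vector<double>& in, double factor)
{
    std::vector<double> out(in.size());
    scale_on_device(in.data(), out.data(), in.size(), factor);
    return out;
}
```

With -ta=tesla:managed, dynamically allocated data is placed in CUDA Unified Memory instead, which is why that flag is a quick way to test whether a faulty or missing data clause is the culprit.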

-Mat

I am afraid you used the code that was at https://github.com/AndStorm/QUESTION.git the day before yesterday, when I asked the question https://forums.developer.nvidia.com/t/error-type-of-argument-does-not-match-formal-parameter-pgi-fixuk-param-0/136438/1 Yesterday, before posting the question here, I updated the repository. Since then the code does not depend on any external data sets and works properly on the CPU on my machine. I tried the managed flag as -ta=tesla:cc70,managed, but the code failed to compile with this error:

PGCC/x86-64 Linux 19.4-0: compilation completed with warnings
/opt/pgi/19/u4/linux86-64-llvm/19.4/share/llvm/bin/opt: /tmp/pgc++0fMbCJ9L17dP.ll:12340:13: error: ‘@__pgi_managed_delete’ defined with type 'void (i8*)
call void @__pgi_managed_delete (i8* %11, i64 64) nounwind, !dbg !2750
^
make[2]: *** [CMakeFiles/Test.dir/build.make:63: CMakeFiles/Test.dir/nbody.cpp.o] Error 2
make[1]: *** [CMakeFiles/Makefile2:73: CMakeFiles/Test.dir/all] Error 2
make: *** [Makefile:84: all] Error 2

I sent the code to PGI customer service and asked them to forward it to you, to be sure you receive the correct version.
Thank you.
Andrey.

Hi Andrey,

I was using the package PGI Customer Service forwarded to me, though Alex may not have yet forwarded any additional packages you sent.

Using the updated git repo, I no longer see the data issue, but still get the throw:

% ./Test
terminate called after throwing an instance of 'std::out_of_range'
  what():  vector::_M_range_check: __n (which is 0) >= this->size() (which is 0)
Abort

Any suggestions on how to work-around this issue?

-Mat

Hi, Mat.
I do not know what may be wrong with the git repo, so I have just sent an email with a .zip archive containing the source code to PGI customer service. I have just run this code on my laptop on Ubuntu with gcc 7.4.0 using the compile line

cmake . -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++
-DCMAKE_CXX_FLAGS="-march=native -mtune=native -O3"

Although the code printed some hard-coded warning messages at the beginning, it ran to the end properly.
I have checked that all the data files are in the project directory,
so this error should not occur again.
Please let me know if it works.
I hope it will.
Andrey.

Hi Andrey,

I determined my issue. Since I do multiple builds, I run your cmake command from a “build” sub-directory. Running “Test” from this directory causes the error. Running from the top level directory works.
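
The warnings in this thread show that the program falls back to ./data when $T3_DATA is unset, so the lookup depends on the current working directory, and a missing file only surfaces later as a std::out_of_range. A minimal sketch of a more defensive lookup follows. It is not the repository's code: the helper names resolve_data_dir and open_data_file are hypothetical, and only the T3_DATA variable and the ./data fallback are taken from the messages above.

```cpp
#include <cassert>
#include <cstdlib>
#include <fstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: use $T3_DATA when it is set and non-empty,
// otherwise fall back to ./data (relative to the working directory).
std::string resolve_data_dir()
{
    const char* env = std::getenv("T3_DATA");
    if (env != nullptr && *env != '\0') {
        return std::string(env);
    }
    return std::string("./data");
}

// Hypothetical helper: open a data file below the data directory and fail
// immediately with a readable message instead of continuing with empty data.
std::ifstream open_data_file(const std::string& relative_name)
{
    assert(!relative_name.empty());
    const std::string full = resolve_data_dir() + "/" + relative_name;
    std::ifstream in(full, std::ios::binary);
    if (!in) {
        throw std::runtime_error(
            "cannot open data file '" + full +
            "'; set T3_DATA or run from the directory that contains ./data");
    }
    return in;
}
```

With a check like this, running Test from a build sub-directory would stop at the first missing file and name the path it tried, and exporting T3_DATA as an absolute path would make the working directory irrelevant.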

With that sorted out, I’m able to reproduce the error 719 failure you reported. While I haven’t had time to dig into the problem, it does appear to have something to do with inlining.

As a work-around, I’m able to run successfully if I replace the “-Minline” flag with “-fast -Mnoautoinline” so no inlining is performed.

I’ll try and get back to this as soon as I can to determine the root cause. However, I’ll be attending a GPU Hackathon next week so am very time limited.

-Mat

% cmake ../ -DCMAKE_C_COMPILER=pgcc -DCMAKE_CXX_COMPILER=pgc++ -DCMAKE_C_FLAGS="-acc -Minfo=acc -mcmodel=medium -ta=tesla:cc70 -fast -Mnoautoinline -Mcuda=cuda10.1" -DCMAKE_CXX_FLAGS="-fast -Mnoautoinline -acc -Minfo=acc -mcmodel=medium -ta=tesla:cc70  -Mcuda=cuda10.1" -DCMAKE_CXX_STANDARD=17 -DACC=ON -DCUDA=ON
...
% make
...
% cd ../
% bldpgi3/Test
-Warning-T3R_RW::default_file: Cant't get $T3_DATA. Using ./data

-Warning-T3R_RW::default_file: Cant't get $T3_DATA. Using ./data

OPENACC IS DEFINED, CUDA IS DEFINED
-Warning-T3R_RW::default_file: Cant't get $T3_DATA. Using ./data

-Warning-T3R_RW::default_file: Cant't get $T3_DATA. Using ./data

size of DataHolder is ~0.0019564GB, size of d is ~0.0019564GB
##################################################
push=200000 INJ=200000 N=999999
##################################################
NEUTRON_COUNT=0 He3_COUNT=0 PROTON_COUNT=0 TRITON_COUNT=0 COUNT_IR3_IN_PROPAGATE=0 COUNT_IR3_IN_REACT=0
1   320384    0   200000 9241.27
##################################################
push=400000 INJ=200000 N=999999
##################################################
NEUTRON_COUNT=0 He3_COUNT=0 PROTON_COUNT=0 TRITON_COUNT=0 COUNT_IR3_IN_PROPAGATE=0 COUNT_IR3_IN_REACT=0
2   624116    320384   400000 33282.7
##################################################
push=600000 INJ=200000 N=999999
##################################################
NEUTRON_COUNT=0 He3_COUNT=0 PROTON_COUNT=0 TRITON_COUNT=0 COUNT_IR3_IN_PROPAGATE=0 COUNT_IR3_IN_REACT=0
3   900434    624116   600000 75346.5
##################################################
push=800000 INJ=200000 N=999999
##################################################
NEUTRON_COUNT=0 He3_COUNT=0 PROTON_COUNT=0 TRITON_COUNT=0 COUNT_IR3_IN_PROPAGATE=0 COUNT_IR3_IN_REACT=0
4   1124057    900434   800000 136422
NEUTRON_COUNT=0 He3_COUNT=0 PROTON_COUNT=0 TRITON_COUNT=0 COUNT_IR3_IN_PROPAGATE=0 COUNT_IR3_IN_REACT=0
5   972373    1.12406e+06   800000 205245
##################################################
push=1000000 INJ=200000 N=999999
##################################################
NEUTRON_COUNT=0 He3_COUNT=0 PROTON_COUNT=0 TRITON_COUNT=0 COUNT_IR3_IN_PROPAGATE=0 COUNT_IR3_IN_REACT=0
6   1130436    972373   1000000 281228
NEUTRON_COUNT=0 He3_COUNT=0 PROTON_COUNT=0 TRITON_COUNT=0 COUNT_IR3_IN_PROPAGATE=0 COUNT_IR3_IN_REACT=0
7   904540    1.13044e+06   1000000 355891
NEUTRON_COUNT=0 He3_COUNT=0 PROTON_COUNT=0 TRITON_COUNT=0 COUNT_IR3_IN_PROPAGATE=0 COUNT_IR3_IN_REACT=0
8   718581    904540   1000000 420946
NEUTRON_COUNT=0 He3_COUNT=0 PROTON_COUNT=0 TRITON_COUNT=0 COUNT_IR3_IN_PROPAGATE=0 COUNT_IR3_IN_REACT=0
9   533643    718581   1000000 475297
NEUTRON_COUNT=0 He3_COUNT=0 PROTON_COUNT=0 TRITON_COUNT=0 COUNT_IR3_IN_PROPAGATE=0 COUNT_IR3_IN_REACT=0
10   366060    533643   1000000 516973
NEUTRON_COUNT=0 He3_COUNT=0 PROTON_COUNT=0 TRITON_COUNT=0 COUNT_IR3_IN_PROPAGATE=0 COUNT_IR3_IN_REACT=0
11   239688    366060   1000000 546297
NEUTRON_COUNT=0 He3_COUNT=0 PROTON_COUNT=0 TRITON_COUNT=0 COUNT_IR3_IN_PROPAGATE=0 COUNT_IR3_IN_REACT=0
12   151191    239688   1000000 565848
NEUTRON_COUNT=0 He3_COUNT=0 PROTON_COUNT=0 TRITON_COUNT=0 COUNT_IR3_IN_PROPAGATE=0 COUNT_IR3_IN_REACT=0
13   84446    151191   1000000 578349
NEUTRON_COUNT=0 He3_COUNT=0 PROTON_COUNT=0 TRITON_COUNT=0 COUNT_IR3_IN_PROPAGATE=0 COUNT_IR3_IN_REACT=0
14   40528    84446   1000000 585464
NEUTRON_COUNT=0 He3_COUNT=0 PROTON_COUNT=0 TRITON_COUNT=0 COUNT_IR3_IN_PROPAGATE=0 COUNT_IR3_IN_REACT=0
15   18798    40528   1000000 588918
NEUTRON_COUNT=0 He3_COUNT=0 PROTON_COUNT=0 TRITON_COUNT=0 COUNT_IR3_IN_PROPAGATE=0 COUNT_IR3_IN_REACT=0
16   8942    18798   1000000 590516
NEUTRON_COUNT=0 He3_COUNT=0 PROTON_COUNT=0 TRITON_COUNT=0 COUNT_IR3_IN_PROPAGATE=0 COUNT_IR3_IN_REACT=0
17   3078    8942   1000000 591272
NEUTRON_COUNT=0 He3_COUNT=0 PROTON_COUNT=0 TRITON_COUNT=0 COUNT_IR3_IN_PROPAGATE=0 COUNT_IR3_IN_REACT=0
18   1824    3078   1000000 591533
NEUTRON_COUNT=0 He3_COUNT=0 PROTON_COUNT=0 TRITON_COUNT=0 COUNT_IR3_IN_PROPAGATE=0 COUNT_IR3_IN_REACT=0
19   1856    1824   1000000 591688
NEUTRON_COUNT=0 He3_COUNT=0 PROTON_COUNT=0 TRITON_COUNT=0 COUNT_IR3_IN_PROPAGATE=0 COUNT_IR3_IN_REACT=0
20   832    1856   1000000 591847
NEUTRON_COUNT=0 He3_COUNT=0 PROTON_COUNT=0 TRITON_COUNT=0 COUNT_IR3_IN_PROPAGATE=0 COUNT_IR3_IN_REACT=0
21   64    832   1000000 591917
NEUTRON_COUNT=0 He3_COUNT=0 PROTON_COUNT=0 TRITON_COUNT=0 COUNT_IR3_IN_PROPAGATE=0 COUNT_IR3_IN_REACT=0
22   64    64   1000000 591923
NEUTRON_COUNT=0 He3_COUNT=0 PROTON_COUNT=0 TRITON_COUNT=0 COUNT_IR3_IN_PROPAGATE=0 COUNT_IR3_IN_REACT=0
23   0    64   1000000 591929
NEUTRON_COUNT=0 He3_COUNT=0 PROTON_COUNT=0 TRITON_COUNT=0 Ti48_COUNT=0 KILLED_DEUTERON_COUNT=2153070 COUNT_IR3_IN_PROPAGATE=0 COUNT_IR3_IN_REACT=0 IR1=1212406 IR2=4162634 IR1/IR2=0.291259
deltaTls=0.09
Integral=5.21768e-08 ntid2=1.35942e+23
time=650 ms, G=27, K=1000000, Ntop=1130436, SumDG=591929
Nbin=64 FloatingType=d
NNN=1 cuba=1 INJ=200000 ag=0.5 um TARGET_WIDTH=1 um

I look forward to your answer.
Thank you.
Andrey.

I reworked and simplified the code, so the git repo no longer contains the current version.
I have just sent an email to PGI Customer Service with the current version of the code, asking them to forward it to you. If you work on the code, please use the updated version attached to that email.
Andrey.

Alex forwarded me your code. I’ll take a look as soon as I can, but it may be a day or two since I need to catch up after being gone for a week.

-Mat