Hi Po Chun,
For #1, what I was writing about would only apply if you’re using “malloc” within device code. If you’re only using malloc from the host, then this error is coming from something else. Though in the code snippet you showed, “t” and “q” were within an “acc loop independent”, which, assuming it’s inside an outer compute region, would be performed on the device.
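To clarify what I mean by “malloc within device code”, here’s a minimal sketch (my own made-up example, not your code) of the difference between a host-side and a device-side allocation:

```cpp
#include <cstdlib>

void host_vs_device_malloc(int n)
{
    // Host-side malloc: runs on the CPU and has nothing to do with the
    // device heap.
    double *a = (double *) malloc(sizeof(double) * n);

    #pragma acc parallel loop copyout(a[0:n])
    for (int i = 0; i < n; ++i) {
        // Device-side malloc: executed by each thread on the GPU, so it
        // comes out of the small, fixed-size CUDA device heap. This is
        // the case my earlier comments would apply to.
        double *tmp = (double *) malloc(sizeof(double) * 4);
        tmp[0] = (double) i;
        a[i] = tmp[0];
        free(tmp);
    }

    free(a);
}
```

If all of your mallocs look like the first case, then my earlier comments don’t apply and we need to look elsewhere.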
Could you post an example where I could reproduce the issue, or send one to PGI Customer Service (trs@pgroup.com)? I don’t want to point you in the wrong direction, which can happen if I’m only given partial snippets of the code.
Note, an “illegal address” error is similar to a host-side seg fault in that a bad device address was dereferenced. It’s somewhat generic in that it could be coming from dereferencing a host address, an out-of-bounds memory access, accessing a null pointer, a stack/heap overflow, etc. Although it’s listed as coming from the call to “cuMemcpyDtoHAsync”, it’s most likely that the kernel which runs before this call is actually causing the error, but the CUDA runtime doesn’t report the error until the next API call.
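Since the most common culprit I see is dereferencing a host address, and you mention “limage->rdata” below, here’s a hypothetical sketch (the struct layout and names are my guesses, not your actual code) of the pattern that typically produces this error:

```cpp
typedef struct {
    int    n;
    float *rdata;   // dynamically allocated array
} Image;

void zero_image(Image *limage)
{
    // copyin(limage[0:1]) copies the struct itself, including the *host*
    // value of the rdata pointer, but not the array it points to. The
    // kernel then dereferences a host address -> "illegal address".
    #pragma acc parallel loop copyin(limage[0:1])
    for (int i = 0; i < limage->n; ++i)
        limage->rdata[i] = 0.0f;

    // One common fix (untested sketch) is to also copy the member array so
    // the runtime can attach it to the device copy of the struct:
    //   #pragma acc parallel loop copyin(limage[0:1]) copy(limage->rdata[0:limage->n])
}
```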
I would suggest setting the environment variable “PGI_ACC_DEBUG=1” to see which OpenACC compute region (i.e. which CUDA kernel) is executed before the error. It won’t tell you why it’s erroring, just where to start looking.
For #2, the compiler will attempt to auto-parallelize loops when using the “kernels” compute region, or loops within a “parallel” compute region that are not explicitly marked with a “loop” directive. Though to safely auto-parallelize a loop, the compiler must first prove that the loop contains no dependencies.
In C/C++, since pointers of the same type are allowed to alias, the compiler must assume that they do. Since aliased pointers aren’t safe to parallelize, this often prevents auto-parallelization. You can force parallelization by adding a “loop” directive when using a “parallel” compute region, or a “loop independent” directive when using “kernels”. Though it’s up to you to make sure that the loop really is independent and safe to parallelize.
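For example, here’s a rough sketch (made-up names, untested) of the aliasing issue and the two typical ways around it:

```cpp
// With plain pointers the compiler must assume "a" and "b" may overlap,
// so under "kernels" it will usually refuse to auto-parallelize this loop
// and instead report a potential loop-carried dependency.
void scale(float *a, float *b, int n)
{
    #pragma acc kernels copyin(b[0:n]) copyout(a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] = 2.0f * b[i];
}

// Either assert independence yourself with "loop independent", or declare
// the pointers restrict (spelled "__restrict" in C++) so the compiler can
// prove there is no aliasing.
void scale_independent(float *__restrict a, float *__restrict b, int n)
{
    #pragma acc kernels loop independent copyin(b[0:n]) copyout(a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] = 2.0f * b[i];
}
```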
Again, having a full example would help me here. I can’t tell from this info whether the problem is potential aliasing between “rdata”, “data”, and “limage->rdata”, or something else.
Also, are you using std::vector? Vectors are not thread safe, so they should be used with caution in parallel execution. It’s usually fine if all you’re doing is reading from or writing to an existing vector, but it becomes problematic if you try to push or pop elements. Vectors are also difficult to do device data management on, since under the hood they’re just three pointers which all need to get translated to device pointers. It’s possible to do, but it’s much easier to use CUDA Unified Memory (-ta=tesla:managed) with vectors, since then both the device and host can access the same pointer addresses.
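Something along these lines is what I have in mind (an untested sketch, assuming you compile with -ta=tesla:managed so the vector’s underlying storage is placed in Unified Memory):

```cpp
#include <cstdio>
#include <vector>

int main()
{
    int n = 1 << 20;
    std::vector<float> v(n, 1.0f);   // storage comes from managed memory under -ta=tesla:managed
    float *p = v.data();             // raw pointer into the vector's storage

    // Only reading/writing existing elements here; no push_back/pop_back
    // inside the compute region.
    #pragma acc parallel loop
    for (int i = 0; i < n; ++i)
        p[i] *= 2.0f;

    printf("v[0] = %f\n", v[0]);     // expect 2.000000
    return 0;
}
```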
-Mat