Libgomp: cuStreamSynchronize error: an illegal memory access was encountered

Hi, all!

I’ve got a very strange error from a simple piece of OpenACC code:

libgomp: cuStreamSynchronize error: an illegal memory access was encountered

The code below is so simple that I cannot see why several CUDA streams would even be involved:

double AdditiveSynapseGroupOpenAcc::GetCurrentSynapticInput() const
{
    double ret = 0.;
    #pragma acc parallel loop reduction(+:ret)
    for (unsigned i = 0; i < nsynapses_; ++i)
        if (syn_[i].delay_)
        {
            if (syn_[i].delayed_spikes_ & (1 << (syn_[i].delay_ - 1)))
                ret += syn_[i].weight_;
        }
    return ret;
}

syn_ is a member of AdditiveSynapseGroupOpenAcc

synapse *syn_;

struct synapse
{
    unsigned delay_ = 0;          // 0 means unconnected
    float weight_ = 0.F;
    size_t delayed_spikes_ = 0;
};

The compiler is gcc 9.
It runs on Ubuntu 20.04 with 4 Titan Xp GPUs.

Please help!

Sorry, the compiler is gcc 10.

Hi mikekis,

Do you have a complete reproducing example you can provide? In particular, I’d want to see how you are creating the “syn_” structure on the device. You may also consider adding a “present(syn_)” clause on the parallel loop so the compiler doesn’t attempt to copy it implicitly. (Of course, this assumes that you’re copying “syn_” at a higher level.)
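Roughly, that would look like the following (just a sketch; it assumes “syn_” has already been copied to the device elsewhere, for example with an earlier enter data directive, and it spells out the array section explicitly):

double ret = 0.;
// Assumes syn_ is already resident on the device, e.g. via an earlier
// "#pragma acc enter data copyin(syn_[0:nsynapses_])".
#pragma acc parallel loop reduction(+:ret) present(syn_[0:nsynapses_])
for (unsigned i = 0; i < nsynapses_; ++i)
    ...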

-Mat

Thank you for your answer. This code is part of a big project, and we plan to gradually move its most parallel parts to OpenACC. This code is one of the first steps in that direction, so it is not easy to extract it as a self-contained, separate example.
I do not create syn_ on the device explicitly. I expected that OpenACC would transfer it to the device (knowing its size from the loop limits [0, nsynapses_]), perform the processing, and destroy it. Am I wrong? Should I transfer it explicitly?
I understand that this code is not very efficient, but it is rather a first test step…

Early versions of the OpenACC standard left it open to the compiler implementation how to handle implicit copying of pointers. While I don’t use GNU myself, nor know their interpretation, I would expect them to follow the current standard, which clarifies the behavior: pointers implicitly use the “no_create” clause, which creates a local private copy of the pointer with its value initialized from the host address. I highly doubt they will implicitly copy the full array using the loop bounds.

Instead, try adding “copyin(syn_[0:nsynapses_])” to the parallel directive to make the copy explicit and in line with the current OpenACC standard.
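Applied to the loop from your first post, that would be roughly (untested):

double ret = 0.;
// Explicit copy of the synapse array to the device for this compute region.
#pragma acc parallel loop reduction(+:ret) copyin(syn_[0:nsynapses_])
for (unsigned i = 0; i < nsynapses_; ++i)
    ...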

-Mat

Thank you. That would be nice, but I cannot do it:

error: “snn_library::calculation_backend::AdditiveSynapseGroupOpenAcc::syn_” is not a variable in map clause
16 | #pragma acc parallel loop reduction(+:ret) copyin(syn_[0:nsynapses_])

while syn_ is declared as a member variable in AdditiveSynapseGroupOpenAcc:

synapse *syn_ = NULL;

The separate pragma
#pragma acc enter data copyin(syn_[0:nsynapses_])
gives the same error.

My guess is that this is a limitation in GNU’s support of OpenACC. I recall that they had issues with aggregate types, which may or may not have been fixed in later versions. You might try g++ 11, or install nvc++, part of the NVIDIA HPC SDK, which I’ll be better able to help with since I’m part of the NV HPC compiler team.

while syn_ is declared as member variable in AdditiveSynapseGroupOpenAcc

In this case, you also need to copy the this pointer over to the device. What I typically do is add an enter data directive in the class constructor which copies the this pointer as well as its data members.
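As a rough sketch of that pattern (I’m guessing at the constructor signature and the allocation, since I haven’t seen the rest of your class):

AdditiveSynapseGroupOpenAcc::AdditiveSynapseGroupOpenAcc(unsigned nsynapses)
{
    nsynapses_ = nsynapses;
    syn_ = new synapse[nsynapses_];
    // Copy the object itself first, then its array. Copying syn_ while
    // "this" is present attaches the device copy of the syn_ pointer to
    // the device allocation of the array.
    #pragma acc enter data copyin(this)
    #pragma acc enter data copyin(syn_[0:nsynapses_])
}

AdditiveSynapseGroupOpenAcc::~AdditiveSynapseGroupOpenAcc()
{
    #pragma acc exit data delete(syn_[0:nsynapses_])
    #pragma acc exit data delete(this)
    delete[] syn_;
}

With the data resident like this, the reduction loop can use “present(this, syn_[0:nsynapses_])”, and whenever the host changes the synapses you’d refresh the device copy with “#pragma acc update device(syn_[0:nsynapses_])”.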

For an example, please see the code from the chapter I wrote for the book Parallel Programming with OpenACC: ParallelProgrammingWithOpenACC/Chapter05 at master · rmfarber/ParallelProgrammingWithOpenACC · GitHub
In particular, “accList” is a basic vector-like container class which may or may not apply to your code, but you can use the same techniques.

-Mat