Using Volatile/shared memory in OpenACC Kernels Region

I’m currently working on some code where I want to play some tricks with OpenACC’s Vector form. Specifically, I know that the compiler generates the right code for a vector loop (inside of a gang) that requires intra-warp synchronization, but no more than this. Since I know that I don’t have to synchronize outside of a warp and I know that warps execute in lockstep, I want to declare some memory volatile and then have the PGI OpenACC compiler generate the right stuff. Here’s an example:

#define DO(i,n) for(size_t i = 0; i < (n); i ++)
#pragma acc loop independent gang
DO(i,...){volatile int ta[...];
  #pragma acc loop independent vector(1024)
    ta[...] += ....;
    ta[...] += ...;}}

I’ve examined the kernels that the compiler generates and they are basically what I want, except that I can’t get the compiler to accept the volatile keyword. Actually, I have two problems. I want ta to be in shared memory for faster work, but if the array ever gets into shared memory (such as the size is small enough to automatically be put there by the compiler, or if I use a cache directive, or I lift it out and use a private directive), then the code computes the incorrect result. The second problem is that it won’t accept the volatile keyword.

The result is that in the kernel, the pointer to the ta array is not declare volatile, and the intrawarp synchronization doesn’t work because the backend compiler will do some optimizations on those writes, leading to the incorrect results.

I’d like to do this strictly in OpenACC, as I don’t want to have a separate .cu file with device functions that I create for this purpose if possible.

Other than the compiler not accepting the volatile keywords, when the compiler puts ta or any other array declared at that level in shared memory, I get the following error:

call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
call to cuMemFreeHost returned error 700: Illegal address during kernel execution

I might also get some other error about launch problems or the like, but this one seems like a common one. I am not sure why the shared memory isn’t working right here.

Now, assuming that I don’t have shared memory problems, if I split each of the lines in that vector loop into its own loop and execute each one right after the other, I get the results I expect. This is evident from the kernel code, because a __syncthreads() is inserted after each loop. I want to avoid the need for the synchthreads after each loop, since it isn’t necessary here.

Any hints on how to deal with these issues?

Hi Aaron,

The PGI compiler does accept “volatile”, it’s just applying it to the code you wrote. What you want is for there to be a way to pass “volatile” to the back end CUDA code generator. There is not a way to do this from the C code and we really resist adding target specific extensions to OpenACC. OpenACC is meant to be portable across many target devices. For low level device specific optimization like you’re wanting here, the suggestion would be to write this kernel in CUDA C and compile with nvcc. OpenACC is full interoperable with CUDA C, so you can link the object as part of you’re larger program.

  • Mat

Thanks for the response. I appreciate that OpenACC is kept fairly clean of target specific extensions. The main problem is that I don’t really get the opportunity to pick and choose here. I could do this in CUDA if I wanted to do so, but the nature of what I’m doing would pretty much necessitate generating CUDA for everything else that I’m doing as well, since this is part of a large file that is being automatically generated. It could be done, but at the moment I’m resisting doing that as much as possible.

At least I have a clear answer on this now.