CUDA shared memory in OpenACC

Hello,

I have a relatively messy nested loop that is parallelized with OpenACC using either the kernels or parallel construct. I have now added some more logic to it which requires a new small private array of size float[5]. I specify this array as private on either the kernels or parallel construct. With kernels it works fine (I get the correct result), but with parallel the compiler says:
583, CUDA shared memory used for ztrl
and the result is incorrect, which I suspect may be due to the shared-memory placement.

Am I assuming correctly that if a variable is placed in shared memory, it is shared among the threads?

If that is the case, what can I do to make it not shared? I tried shuffling the private(ztrl) clause around (putting it on the parallel construct or on the pertinent loop), but to no avail.

Thanks,
MC

Hi MC,

In some cases, we will put private arrays in shared memory when the “private” is used on a “gang” loop or on the “parallel” construct (without “loop”). The array is private to the gang but will be shared amongst the workers and vectors, since “private” applies at the same level (gang, worker, vector) as the loop.

One difference between “kernels” and “parallel” is that with “parallel” you’re telling the compiler where and how to parallelize the loop, while with “kernels” the compiler does the analysis and decides. It could be that with “kernels” the compiler is applying the “private” to a “vector” loop, so the array is not shared amongst threads, but with “parallel” you have the array shared within a gang.
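
To illustrate, here’s a minimal sketch (the names tmp, a, and the loop bounds are placeholders, not your code):

float a[1024], tmp[5];
int i, j;

/* Case 1: "private" on the "parallel" construct. tmp is private to each
   gang but shared by the workers/vectors within that gang. */
#pragma acc parallel copyout(a) private(tmp)
#pragma acc loop
for (i = 0; i < 1024; i++)
{
    for (j = 0; j < 5; j++) tmp[j] = (float)(i + j);
    a[i] = tmp[i % 5];
}

/* Case 2: "private" on the "loop" construct. tmp is private at the
   schedule used by the loop (typically gang+vector), so each iteration
   gets its own copy. */
#pragma acc parallel copyout(a)
#pragma acc loop private(tmp)
for (i = 0; i < 1024; i++)
{
    for (j = 0; j < 5; j++) tmp[j] = (float)(i + j);
    a[i] = tmp[i % 5];
}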

Do you have a sample code you could post?

Thanks,
Mat

Hi Mat,

thanks for the quick reply (and sorry for the late response). The code is quite messy, so let me start by just showing the basic setup. If that is not sufficient, I’ll try to excerpt a piece of it that is enough to show the problem and that compiles and runs.

Essentially, I have something like this:
float …, ztrl[5];

#pragma acc parallel present(…) copyin(…) copyout(…) private(…,ztrl)
#pragma acc loop independent
for (i=0;i<N;i++)
{
    // … calculate some indices, etc.
    #pragma acc loop seq
    for (j=0;j<M;j++)
    {
        ztrl[0]=something; ztrl[1]=something; … ztrl[4]=something;
        // … do some calculation
    }
}

With this setup, ztrl gets placed in CUDA shared memory, and the result is incorrect.

If I replace the #pragma acc parallel and #pragma acc loop with #pragma acc kernels independent, ztrl stays private (the compiler does not report placing it in CUDA shared memory), and the result is correct.

Please let me know how this looks, and whether we should proceed with a functional code sample.

Also, is there any way to tell the compiler not to put variables in shared memory? I thought the private clause would do that.

Thanks,
MC

Hi MC,

This is your issue:

#pragma acc parallel present(...) copyin(...) copyout(...) private(...,ztrl) 
#pragma acc loop independent 
for (i=0;i<N;i++)

A “private” on the “parallel” construct is private to the gang but shared between vectors. Try moving “private” to the loop construct, where it is applied at the schedule used by the loop, which in this case is most likely gang+vector (see the output from -Minfo=accel to see what schedule the compiler is using).

#pragma acc parallel present(...) copyin(...) copyout(...) 
#pragma acc loop independent private(...,ztrl) 
for (i=0;i<N;i++)

Note that “private” is not allowed on a “kernels” construct and should have given you an error. Hence, I’m assuming you were using “kernels loop private”, which explains why the “kernels” version gets the correct answers.
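
In other words, I’m guessing the working version looks something like this (just a guess at the form, not your actual code), where the “private” ends up applied to the loop:

#pragma acc kernels loop independent private(...,ztrl)
for (i=0;i<N;i++)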

Hope this helps,
Mat

Thanks Mat,

yes, that was the problem; I had figured that out independently too.

Thanks for your help.

I am still fighting with this kernel: the extra logic I added made it very slow (100x slower than before). I’ll do some more investigation and may open a new thread if I have questions. Because of all the divergence this extra logic causes in the kernel, I am debating whether it would be worth moving to CUDA for more fine-grained control, but I need to check a few things first.

Perhaps one more question here: I tried nvvp to look at the timing of the OpenACC kernel. Some of the information was useful, but I’d like to see a line-level profile showing how much time is spent on each line (function) or condition. Is that possible with nvvp? Looking around, I don’t see it. Or is there another tool that can do CUDA/OpenACC line profiling?

Thanks,
MC

What are the compiler feedback messages telling you? (-Minfo=accel)

Most likely you’ve introduced code which the compiler can’t tell is independent across loop iterations, such as computed indices or the potential for pointer aliasing. If the code really doesn’t have dependencies, then you can add the “independent” clause to the loop construct, which tells the compiler that it’s safe to parallelize. Note that this is only required for “kernels” since it’s implicit for “parallel”.
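
For example, in a sketch like this (idx, a, b, and N are just placeholders), the indirect index keeps the compiler from proving the iterations are independent under “kernels”, so you assert it yourself:

#pragma acc kernels
#pragma acc loop independent
for (i = 0; i < N; i++)
    a[idx[i]] = b[i] + 1.0f;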

For NVVP line-level information, add the flag “-ta=tesla:lineinfo”. While it will get you line profiling information, the code has been heavily transformed, so it may or may not easily correlate back to your original source.
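
For example, assuming pgcc and a source file named myprog.c (adjust to match your actual build line):

pgcc -acc -ta=tesla:lineinfo -Minfo=accel -o myprog myprog.c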

Hope this helps,
Mat