I have a relatively messy nested loop that is parallelized with OpenACC, using either the kernels or the parallel construct. I recently added some more logic to it, which requires a new small private array of type float. I declare this array as private on either the kernels or the parallel construct. With kernels it works fine (I get the correct result), but with parallel the compiler reports:
583, CUDA shared memory used for ztrl
and the result is incorrect, which I suspect may be due to the shared memory placement.
Am I correct in assuming that if a variable is placed in shared memory, it is shared among the threads?
If that is the case, what can I do to make it not shared? I tried moving the private(ztrl) clause around (putting it on the parallel construct or on the pertinent loop), but to no avail.
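
To make the two placements of private(ztrl) concrete, here is a minimal sketch in C. It is not the real code: the function names, the array size NTMP, and the arithmetic are all hypothetical, and only the name ztrl comes from the compiler message. As far as I understand the OpenACC specification, private on the parallel construct creates one copy per gang (so the vector lanes within a gang would share it), while private on a loop creates one copy per unit of parallelism that the loop is partitioned over.

```c
#include <stddef.h>

#define NTMP 4  /* hypothetical size of the small scratch array */

/* Placement 1: private on the parallel construct.
 * Assumption: this gives one copy of ztrl per gang, so with a
 * "gang vector" loop the vector lanes of a gang may race on it. */
void sum_private_on_parallel(const float *in, float *out, int n)
{
    float ztrl[NTMP];
    #pragma acc parallel copyin(in[0:n]) copyout(out[0:n]) private(ztrl)
    {
        #pragma acc loop gang vector
        for (int i = 0; i < n; ++i) {
            for (int k = 0; k < NTMP; ++k)
                ztrl[k] = in[i] * (float)(k + 1);  /* fill scratch array */
            float s = 0.0f;
            for (int k = 0; k < NTMP; ++k)
                s += ztrl[k];                      /* consume scratch array */
            out[i] = s;
        }
    }
}

/* Placement 2: private on the loop that is actually partitioned.
 * Assumption: this gives one copy of ztrl per vector lane. */
void sum_private_on_loop(const float *in, float *out, int n)
{
    float ztrl[NTMP];
    #pragma acc parallel copyin(in[0:n]) copyout(out[0:n])
    {
        #pragma acc loop gang vector private(ztrl)
        for (int i = 0; i < n; ++i) {
            for (int k = 0; k < NTMP; ++k)
                ztrl[k] = in[i] * (float)(k + 1);
            float s = 0.0f;
            for (int k = 0; k < NTMP; ++k)
                s += ztrl[k];
            out[i] = s;
        }
    }
}
```

Compiled without OpenACC support, the pragmas are ignored and both functions run serially, which gives a reference result to compare the accelerated output against.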