acc copied data apparently not being copied back to CPU

Hi,

I am trying to adapt a program that runs successfully with OpenMP to OpenACC.

The problem I am having is that the array I am using for state information appears to be returning to the CPU context without being properly copied from the GPU.

It’s quite possible that the data is simply not being set correctly in the first place. It’s hard to know what is going on. This is the first time I’m trying to program on a GPU.

I have this construct:

#pragma acc kernels loop
for (s = 0; s < STATE_MAX; s++) {
    operate(&calc_states[s]);
}

Now, when I iterate through calc_states after that point, I am not seeing the correct data set. Mostly the members are set to zero. The structure itself is a series of byte_t arrays (no pointers).

With regular CPU programming, this works fine. For OpenACC, I added the pragma

#pragma acc routine worker

before each of the functions called ‘down the tree’ of execution, at the behest of the compiler, since it said I need to define an acc routine for each function call. Is that where my problem is? Are threads exiting too fast and producing corrupt data? I’d appreciate any pointers, no pun intended.

Thanks.

Hi Delphis,

My best guess is that you’re not copying “calc_states”. Do you have this array in a data region or are you relying on the compiler to perform the copy? Can you please post the compiler feedback messages (i.e. the output when compiling with -Minfo=accel)?
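For illustration, here is a minimal sketch of what an explicit copy of “calc_states” might look like, rather than relying on the compiler to work out the transfer. The structure layout, the sizes, the body of “operate”, and the wrapper “run_states” are all assumptions made up for this example, since the real code was not posted. The “independent” clause is also an assumption: it only holds if no iteration touches another iteration's state.

```c
#include <assert.h>
#include <stddef.h>

#define STATE_MAX   1024  /* assumed size; the real value was not posted */
#define STATE_BYTES 16    /* hypothetical member length */

typedef unsigned char byte_t;  /* assumed definition of byte_t */

/* Hypothetical stand-in for the real structure: byte arrays, no pointers. */
typedef struct {
    byte_t data[STATE_BYTES];
} calc_state_t;

/* Hypothetical stand-in for operate(): no parallel loop inside, so "seq". */
#pragma acc routine seq
void operate(calc_state_t *st)
{
    for (int i = 0; i < STATE_BYTES; i++) {
        st->data[i] = (byte_t)(st->data[i] + 1);
    }
}

/* Hypothetical wrapper around the loop from the question. */
void run_states(calc_state_t *calc_states)
{
    int s;

    assert(calc_states != NULL);

    /* copy(...) names the array section explicitly: it is copied to the
       device before the loop and copied back to the host afterwards. */
    #pragma acc kernels loop independent copy(calc_states[0:STATE_MAX])
    for (s = 0; s < STATE_MAX; s++) {
        operate(&calc_states[s]);
    }
}
```

Compiling with -Minfo=accel should then report how the compiler handled the loop and the data movement for “calc_states”. Without OpenACC enabled the pragmas are ignored and the same code runs on the CPU, which gives a reference result to compare against.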

As for the routine directive, do you have a parallel loop in “operate”? If not, then you should probably just use “acc routine seq”. Otherwise, be sure to decorate the loop in “operate” with an “acc loop” directive. If you have multiple parallel nested loops in “operate”, then “worker” would be appropriate. Otherwise, I’d use “acc routine vector” if there’s only one loop.
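To make the difference concrete, here is a rough sketch of the “seq” and “vector” cases side by side. All names here (“operate_seq”, “operate_vector”, “run_seq”, “run_vector”, “states_equal”), the structure layout, and the loop bodies are hypothetical, chosen only to show where the directives go.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define STATE_BYTES 16         /* hypothetical member length */
typedef unsigned char byte_t;  /* assumed definition of byte_t */
typedef struct { byte_t data[STATE_BYTES]; } calc_state_t;  /* assumed layout */

/* Case 1: no parallel loop inside the routine. Each call runs
   sequentially within whichever device thread called it. */
#pragma acc routine seq
void operate_seq(calc_state_t *st)
{
    for (int i = 0; i < STATE_BYTES; i++)
        st->data[i] = (byte_t)(i * 2);
}

/* Case 2: one parallel loop inside the routine. The routine is declared
   "vector" and the loop is decorated to match. */
#pragma acc routine vector
void operate_vector(calc_state_t *st)
{
    #pragma acc loop vector
    for (int i = 0; i < STATE_BYTES; i++)
        st->data[i] = (byte_t)(i * 2);
}

void run_seq(calc_state_t *states, int n)
{
    assert(states != NULL && n >= 0);
    #pragma acc parallel loop copy(states[0:n])
    for (int s = 0; s < n; s++)
        operate_seq(&states[s]);
}

void run_vector(calc_state_t *states, int n)
{
    assert(states != NULL && n >= 0);
    /* The outer loop is gang-only so that the vector level stays
       available for the loop inside operate_vector(). */
    #pragma acc parallel loop gang copy(states[0:n])
    for (int s = 0; s < n; s++)
        operate_vector(&states[s]);
}

/* Helper for checking one result against another, e.g. GPU vs CPU. */
int states_equal(const calc_state_t *a, const calc_state_t *b, int n)
{
    return memcmp(a, b, (size_t)n * sizeof *a) == 0;
}
```

Both variants should produce identical arrays. A routine declared at a parallel level it does not actually use, such as “worker” with no worker loop inside, is the kind of mismatch worth ruling out first.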

If you can, please post an example of the code. A full reproducible example would be very helpful.

-Mat

I was confused about the different levels of parallelism and what they all mean. I thought ‘worker’ was like separate ‘threads’ on a CPU. After changing them all to ‘seq’ (for now), the data comes back properly.

The computation is incorrect compared to the CPU version, but at least I’m getting things back now. I just need to study more. Thank you for pointing me in the right direction!