1) As far as I understand, if I write
#pragma acc declare device_resident(number1,number2)
#pragma acc parallel
#pragma acc loop
for(int i=0; i<N1; ++i)
for(int i=0; i<N2; ++i)
the loops will be executed in parallel on the GPU and the line number1+=number2; will be executed sequentially on the GPU. Is it correct that the line number1+=number2; will be executed sequentially, in a single thread on the GPU?
2) If I write
#pragma acc parallel loop
for(int i=0; i<N; ++i)
must the variable N be allocated only on the host, or may it also be allocated on the GPU?
For #1, code within a parallel region but not within a loop will be executed redundantly by each gang, not sequentially. Here, number1 would be incremented once per gang. However, because the unsynchronized update is a race condition, the final value may end up smaller than that, depending on whether several gangs read and write the variable at the same time.
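To make the gang-redundant behaviour concrete, here is a minimal sketch. It is not taken from the original code: the function names, the array a, and num_gangs(4) are assumptions chosen for illustration. The first function counts how often a statement outside any loop runs, using an atomic update so the race does not distort the count. The second shows one possible way to get exactly one update, by putting the statement in a serial region.

```c
#include <assert.h>

/* Hypothetical helper: returns how many times the statement after the loop
   was executed. num_gangs(4) is an arbitrary choice for illustration. */
int count_redundant_updates(int n, float *a)
{
    int hits = 0;
    assert(n >= 0);

    #pragma acc parallel num_gangs(4) copy(hits, a[0:n])
    {
        /* Iterations of this loop are shared out among the gangs. */
        #pragma acc loop
        for (int i = 0; i < n; ++i)
            a[i] = 2.0f * a[i];

        /* Outside any loop: gang-redundant, so every gang executes this.
           The atomic removes the race, so each gang's increment is kept. */
        #pragma acc atomic update
        hits += 1;
    }
    return hits;
}

/* Hypothetical helper: performs the update exactly once on the device.
   The serial construct assumes a compiler supporting OpenACC 2.5 or later. */
float add_once(float number1, float number2)
{
    #pragma acc serial copy(number1) copyin(number2)
    {
        number1 += number2;
    }
    return number1;
}
```

With an OpenACC compiler targeting a GPU, count_redundant_updates would typically return the number of gangs actually launched. If the pragmas are ignored by a compiler without OpenACC support, it returns 1. add_once returns number1 + number2 in either case.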
For #2, “N” becomes part of the CUDA kernel’s launch configuration with the value used being the host value of “N”. Hence, it only needs to be on the host.
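As a small illustration of the point about the loop bound, here is a sketch in which the bound is an ordinary host variable. The function scale_first_n and its parameters are hypothetical names, not part of the original question. Nothing is done to place n on the device explicitly; only the array needs a data clause.

```c
#include <assert.h>

/* Hypothetical helper: multiplies the first n elements of x by s.
   n is a plain host scalar; its host value at the time of the call
   determines how many iterations the offloaded loop performs. */
void scale_first_n(int n, float *x, float s)
{
    assert(n >= 0);

    #pragma acc parallel loop copy(x[0:n])
    for (int i = 0; i < n; ++i)
        x[i] *= s;
}
```

Calling scale_first_n(2, x, 10.0f) and then scale_first_n(4, x, 10.0f) changes the trip count between launches simply by passing a different host value.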
Thank you for the answer.