Hi Mat, many thanks for the answer. What do you mean by “host_data and device resident the others”?
Btw, could you explain the reason a little bit more?
From my understanding, each thread gets its own copy of a threadprivate common block. When I do something like this:
      subroutine mysub()
      integer N,i
      parameter (N=1048576)
      common/blk/t1(N),t2(N),t3(N)
c$omp threadprivate (/blk/)
!$acc kernels
!$acc loop
      do i=1,N
         t2(i)=t1(i)*t1(i)+t2(i)*t3(N)
      enddo
!$acc end kernels
      return
      end
will OpenACC see different copies of t2, or do the multiple copies cause the problem? Reading t1, t2, and t3 seems to be fine; it is the writing that causes the problem. Even when I use only one thread, I still get the same runtime error, which suggests that the cause lies somewhere deep inside the OpenACC implementation. If you could explain it or point me to a reference, that would be great!
I didn’t realize you meant the OpenMP threadprivate. OpenACC also has a “threadprivate” directive, but it’s still in development.
I’m away at a conference, so I will ask one of our other application engineers to investigate using an OpenMP threadprivate variable within an OpenACC compute region. I’ve not tried it before.
I am not using PGI Accelerator. I am programming in CUDA Fortran and OpenMP. Is the threadprivate feature supported in CUDA Fortran? I am using PGI version 12.6.
Can I declare a device allocatable array as threadprivate in CUDA Fortran? If so, could you please show an example? I am using PGI version 12.9.
If not, how can I map an array declared threadprivate on the CPU to a GPU array?
Is there any direct mechanism to do that?