Hi all!

I have a problem with multi-GPU: the code runs about 100 times slower than on a single GPU.

**Single GPU version.**

Memory allocation:

```
!$acc data copyin(h1(0:n1), h2(0:n2), h3(0:n3), h1plus(-1:n1), h2plus(-1:n2), h3plus(-1:n3))&
!$acc copyin(u(-1:n3,-1:n2,-1:n1 + 1), v(-1:n3,-1:n2 + 1,-1:n1)) &
!$acc copyin(w(-1:n3 + 1,-1:n2,-1:n1), t(0:n3,-1:n2 + 1,-1:n1 + 1)) &
!$acc copyin(p(0:n3 - 1,0:n2 - 1,0:n1 - 1), eig(0:n3 - 1,0:n2 - 1,0:n1 - 1)) &
!$acc create(dp(0:n3 - 1,0:n2 - 1,0:n1 - 1), u_new(-1:n3,-1:n2,-1:n1 + 1), v_new(-1:n3,-1:n2 + 1,-1:n1), w_new(-1:n3 + 1,-1:n2,-1:n1), t_new(0:n3,-1:n2 + 1,-1:n1 + 1))
```

After that, I use a kernels region to run several 3D loops:

```
!$acc kernels present(h1,h2,h3,h1plus,h2plus,h3plus,u,v,w,u_new,v_new,w_new,p,t)
!$acc loop collapse(3) gang
do i = 1, n1 - 1
do j = 0, n2 - 1
do k = 0, n3 - 1
...
end do
end do
end do
!$acc loop collapse(3) gang
do i = 0, n1 - 1
do j = 1, n2 - 1
do k = 0, n3 - 1
...
end do
end do
end do
!$acc loop collapse(3) gang
do i = 0, n1 - 1
do j = 0, n2 - 1
do k = 1, n3 - 1
...
end do
end do
end do
!$acc end kernels
```

It takes about 3.09E-03 seconds to run this code on a single Tesla V100.

Since there are 4 GPUs available on a single cluster node, I decided to take advantage of them.

**Multiple GPU version.**

I changed the memory allocation:

```
!$omp parallel private(iam, devicenum) num_threads(ngpus)
iam = omp_get_thread_num()
devicenum = iam + 1
call acc_set_device_num(devicenum,acc_device_nvidia)
!$acc enter data copyin(h1(0:n1), h2(0:n2), h3(0:n3), h1plus(-1:n1), h2plus(-1:n2), h3plus(-1:n3))&
!$acc copyin(u(-1:n3,-1:n2,-1:n1 + 1), v(-1:n3,-1:n2 + 1,-1:n1)) &
!$acc copyin(w(-1:n3 + 1,-1:n2,-1:n1), t(0:n3,-1:n2 + 1,-1:n1 + 1)) &
!$acc copyin(p(0:n3 - 1,0:n2 - 1,0:n1 - 1), eig(0:n3 - 1,0:n2 - 1,0:n1 - 1)) &
!$acc create(dp(0:n3 - 1,0:n2 - 1,0:n1 - 1), u_new(-1:n3,-1:n2,-1:n1 + 1), v_new(-1:n3,-1:n2 + 1,-1:n1), w_new(-1:n3 + 1,-1:n2,-1:n1), t_new(0:n3,-1:n2 + 1,-1:n1 + 1))
!$omp end parallel
```

I divided the work between the GPUs along the last dimension. The loop start and end indices are now stored in arrays:

```
!$omp parallel private (iam) num_threads(ngpus)
iam = omp_get_thread_num()
!$acc kernels present(h1,h2,h3,h1plus,h2plus,h3plus,u,v,w,u_new,v_new,w_new,p,t)
!$acc loop collapse(3) gang
do i = 1, n1 - 1
do j = 0, n2 - 1
do k = UVPz_start(iam), UVPz_end(iam)
...
end do
end do
end do
!$acc loop collapse(3) gang
do i = 0, n1 - 1
do j = 1, n2 - 1
do k = UVPz_start(iam), UVPz_end(iam)
...
end do
end do
end do
!$acc loop collapse(3) gang
do i = 0, n1 - 1
do j = 0, n2 - 1
do k = Wz_start(iam), Wz_end(iam)
...
end do
end do
end do
!$acc end kernels
!$omp end parallel
```

This version takes about 0.3 seconds to run.

I don’t really understand what I am doing wrong. Can somebody help me with this?

P.S. I already sent my code to Mat in this thread: OpenACC on GPU help

Also, I would like to know whether OpenACC uses peer-to-peer transfers, since I will need an FFT similar to the one in the link above and I don’t have much time to write an MPI program.