Quite puzzled

Excuse me. My code contains the directive "!$acc data copyout(umin,umout)", and the informational messages also told me "1204, Generating copyout(umout), Generating copyout(umin)".
However, the values are definitely not transferred back to my CPU code. (Inside the kernel the result is correct, but outside it is always zero.)
I just can't figure it out.

Hi Kevin,

However, the values are definitely not transferred back to my CPU code. (Inside the kernel the result is correct, but outside it is always zero.)

Just so I understand: you have a copyout of an array, but the halo coming back, which should be all zeros, has garbage values? copyout doesn't initialize data on the GPU, so you must set all values within your compute region. Otherwise, garbage values will be returned for the uninitialized elements. Using "copy" instead will initialize the values.
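
To illustrate, here is a minimal sketch of why copyout returns garbage for elements the kernel never writes (the program name, the array u, and its size n are made up for illustration):

program halo_demo
  implicit none
  integer, parameter :: n = 8
  real :: u(0:n+1)
  integer :: i
  u = 0.0                 ! host initializes everything, including the halo u(0), u(n+1)
!$acc data copyout(u)     ! device copy of u starts uninitialized (garbage)
!$acc kernels
  do i = 1, n             ! only the interior is written on the device
    u(i) = real(i)
  end do
!$acc end kernels
!$acc end data            ! copyout overwrites ALL of u, so u(0), u(n+1) return garbage
  print *, u              ! halo values are indeterminate here
end program halo_demo

With copy(u) instead of copyout(u), the host values (including the zeroed halo) are transferred to the device first, so elements the kernel never touches come back unchanged.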

  • Mat

Dear Mat,

Thanks a lot for your answer. With your explanation of copy/copyout I've actually solved the problem I met in the getu12 subroutine discussed in another topic. But it still doesn't work here.

It was my fault for not making my words clear. I had set the values within the compute region and indeed got the right values after the computation. But when my subroutine returns, I get zero every time, even though I have changed the directive from copyout to copy. It seems unbelievable; is it beyond help?

But really, why is that? Well, I'll leave it for the moment.

Would you please suggest a way to deal with a five-level loop nest? Right now I add the acc directives as shown below. Is there any better approach? As written, the kernel is launched once for every iteration of k and i.
do k = ks, ke+2
  do i = is, ie+2
!$acc data create(dis1,dis2), copyout(AAx1,AAy1,AAz1,AAx2,AAy2,AAz2)
!$acc kernels
    AAx1 = c0
    AAx2 = c0
    AAy1 = c0
    AAy2 = c0
    AAz1 = c0
    AAz2 = c0
    do n = ks+1, ke+1
      do m = js+1, je+2*yele-1
        do l = is+1, ie+1
          dis1 = sqrt( disx2(i,l) + disy2(js,m) + disz2(k,n) )
          dis2 = sqrt( disx2(i,l) + disy2(je+2*yele,m) + disz2(k,n) )
          AAx1 = AAx1 + muf4pi * jx(l,m,n) * dv(m,n) / dis1
          AAx2 = AAx2 + muf4pi * jx(l,m,n) * dv(m,n) / dis2
          AAy1 = AAy1 + muf4pi * jy(l,m,n) * dv(m,n) / dis1
          AAy2 = AAy2 + muf4pi * jy(l,m,n) * dv(m,n) / dis2
          AAz1 = AAz1 + muf4pi * jz(l,m,n) * dv(m,n) / dis1
          AAz2 = AAz2 + muf4pi * jz(l,m,n) * dv(m,n) / dis2
        end do
      end do
    end do
!$acc end kernels
!$acc end data
    Ax(i,js,k) = AAx1
    Ax(i,je+2*yele,k) = AAx2
    Ay(i,js,k) = AAy1
    Ay(i,je+2*yele,k) = AAy2
    Az(i,js,k) = AAz1
    Az(i,je+2*yele,k) = AAz2
  end do
end do
Thanks a lot anyway.

Excuse me, actually I've tried the mirror directive for a change. It still fails at the moment. Probably this time it is due to my limited knowledge of mirror and reflected.

Hi Kevin,

But when my subroutine returns, I get zero every time, even though I have changed the directive from copyout to copy. It seems unbelievable; is it beyond help?

What would help is to have a small example that reproduces the problem. If you’re unable to create a small example, you can send the code to PGI Customer Service (trs@pgroup.com) and ask them to forward it to me.

Assuming the trip counts of the "k" and "i" loops are large, I'd put these two into a 2D gang and then put the inner loops into a vector with a reduction clause. If "k" and "i" are small, move the data region outside of "k".

Also, you don't want to put the "AA" variables in a copyout. Instead, let the compiler create a reduction. By putting "dis1" and "dis2" in a create clause, you've promoted them to global scalar variables shared by all threads. This will cause a race condition and wrong answers. Finally, you do want to put your arrays in the data region.

Version 1 would look something like the following. Though, since I don't have the full code to test, you may need to make a few changes.

!$acc data copyin(disx2,disy2,disz2,jx,jy,jz,dv), copy(Ax,Ay,Az)
!$acc kernels
!$acc loop collapse(2) gang
do k = ks, ke+2
  do i = is, ie+2
    AAx1 = c0
    AAx2 = c0
    AAy1 = c0
    AAy2 = c0
    AAz1 = c0
    AAz2 = c0
!$acc loop vector reduction(+:AAx1,AAx2,AAy1,AAy2,AAz1,AAz2)
    do n = ks+1, ke+1
      do m = js+1, je+2*yele-1
        do l = is+1, ie+1
          dis1 = sqrt( disx2(i,l) + disy2(js,m) + disz2(k,n) )
          dis2 = sqrt( disx2(i,l) + disy2(je+2*yele,m) + disz2(k,n) )
          AAx1 = AAx1 + muf4pi * jx(l,m,n) * dv(m,n) / dis1
          AAx2 = AAx2 + muf4pi * jx(l,m,n) * dv(m,n) / dis2
          AAy1 = AAy1 + muf4pi * jy(l,m,n) * dv(m,n) / dis1
          AAy2 = AAy2 + muf4pi * jy(l,m,n) * dv(m,n) / dis2
          AAz1 = AAz1 + muf4pi * jz(l,m,n) * dv(m,n) / dis1
          AAz2 = AAz2 + muf4pi * jz(l,m,n) * dv(m,n) / dis2
        end do
      end do
    end do
    Ax(i,js,k) = AAx1
    Ax(i,je+2*yele,k) = AAx2
    Ay(i,js,k) = AAy1
    Ay(i,je+2*yele,k) = AAy2
    Az(i,js,k) = AAz1
    Az(i,je+2*yele,k) = AAz2
  end do
end do
!$acc end kernels
!$acc end data

Version 2 where you only accelerate the inner loops:

!$acc data copyin(disx2,disy2,disz2,jx,jy,jz,dv), copy(Ax,Ay,Az)
do k = ks, ke+2
  do i = is, ie+2
    AAx1 = c0
    AAx2 = c0
    AAy1 = c0
    AAy2 = c0
    AAz1 = c0
    AAz2 = c0
!$acc kernels loop
    do n = ks+1, ke+1
      do m = js+1, je+2*yele-1
        do l = is+1, ie+1
          dis1 = sqrt( disx2(i,l) + disy2(js,m) + disz2(k,n) )
          dis2 = sqrt( disx2(i,l) + disy2(je+2*yele,m) + disz2(k,n) )
          AAx1 = AAx1 + muf4pi * jx(l,m,n) * dv(m,n) / dis1
          AAx2 = AAx2 + muf4pi * jx(l,m,n) * dv(m,n) / dis2
          AAy1 = AAy1 + muf4pi * jy(l,m,n) * dv(m,n) / dis1
          AAy2 = AAy2 + muf4pi * jy(l,m,n) * dv(m,n) / dis2
          AAz1 = AAz1 + muf4pi * jz(l,m,n) * dv(m,n) / dis1
          AAz2 = AAz2 + muf4pi * jz(l,m,n) * dv(m,n) / dis2
        end do
      end do
    end do
    Ax(i,js,k) = AAx1
    Ax(i,je+2*yele,k) = AAx2
    Ay(i,js,k) = AAy1
    Ay(i,je+2*yele,k) = AAy2
    Az(i,js,k) = AAz1
    Az(i,je+2*yele,k) = AAz2
  end do
end do
!$acc end data
  • Mat

Dear Mat,

Thank you very much for sending me both versions.

I've tried version 1 and it told me:
"6112, Accelerator restriction: induction variable live-out from loop: i
Accelerator restriction: induction variable live-out from loop: k
6113, Accelerator restriction: induction variable live-out from loop: i
Accelerator restriction: induction variable live-out from loop: k
......"
I don't know how to add the "do private" directive, even after some tests.

The code of version 2 is the one I am using. I suppose I should have put the !$acc data region outside. It also seems like the !$acc kernels loop should be !$acc kernels.

I've tried version 1 and it told me:
6112, Accelerator restriction: induction variable live-out from loop: i
Accelerator restriction: induction variable live-out from loop: k
6113, Accelerator restriction: induction variable live-out from loop: i
Accelerator restriction: induction variable live-out from loop: k

I'd need to see the full code to tell why, but most likely you're using i and k later in the program without initializing them (or in a conditional branch). To work around this, add the private clause:

!$acc loop collapse(2) gang private(i,k)

I suppose I should have put the !$acc data region outside.

Yes, you want the data region above the outermost loop so that you don't repeatedly copy data over for each iteration of i and k. Though, I made a mistake adding the "A" arrays in version 2. They are updated in host code, so they shouldn't be part of the data region.
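
A sketch of how version 2's data region would look with that correction (assuming, as discussed above, that the "A" arrays are only written on the host):

!$acc data copyin(disx2,disy2,disz2,jx,jy,jz,dv)
do k = ks, ke+2
  do i = is, ie+2
    ! ... reduction kernel over n, m, l exactly as in version 2 ...
    Ax(i,js,k) = AAx1        ! host-side stores; Ax, Ay, Az stay in host memory
    Ax(i,je+2*yele,k) = AAx2
    Ay(i,js,k) = AAy1
    Ay(i,je+2*yele,k) = AAy2
    Az(i,js,k) = AAz1
    Az(i,je+2*yele,k) = AAz2
  end do
end do
!$acc end data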

It also seems like the !$acc kernels loop should be !$acc kernels.

Either works. You just need to remember to add the end kernels directive at the end of the n loop.
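
For example, a sketch of the same inner loop nest written with an explicit kernels region (same bounds and body as version 2):

!$acc kernels
do n = ks+1, ke+1
  do m = js+1, je+2*yele-1
    do l = is+1, ie+1
      ! ... loop body as in version 2 ...
    end do
  end do
end do
!$acc end kernels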

  • Mat

To become familiar with the accelerator directives, I've tried adding them to a couple of test programs. Some ran at almost the same speed as my GPU code (CUDA Fortran), while others failed.
I thought I had found the problem, but unfortunately I am still struggling with why and how. For example:
Case 1:
  region entered 1638 times
  time(us): total=653,108 init=138 region=652,970   ---A---
            kernels=601,373                         ---B---
            w/o init: total=652,970 max=496 min=393 avg=398
  68: kernel launched 1638 times
      grid: [2x32]  block: [64x4]
      time(us): total=601,373 max=377 min=361 avg=367

Case 2:
  region entered 360 times
  time(us): total=16,644 init=25 region=16,619   ---C---
            kernels=3,954                        ---D---
            w/o init: total=16,619 max=190 min=43 avg=46
  2140: kernel launched 360 times
      grid: [1-52]  block: [128]
      time(us): total=3,954 max=18 min=10 avg=10

As you can see, case 1's kernels time is nearly equal to its region time (marked A and B), while case 2's region time is about four times its kernels time. Another thing that may matter is the grid message: one is fixed at [2x32] while the other varies over [1~52].

Since I added all of those directives myself, I did not expect such different results and just can't figure it out.

Hi Kevin,

The "region" time is measured from the CPU while the "kernels" time is measured from the device. The difference between the two is basically the overhead of launching the kernels. For case 1, the overhead per kernel launch is ~31.4 us: (652,970 - 601,373 - 138) / 1638. In case 2 the overhead is just a bit higher at ~35.1 us: (16,619 - 3,954 - 25) / 360.

The main difference between case 1 and case 2 is that the average kernel time is much smaller in case 2, causing the overhead to dominate the total time.
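
Concretely, using the averages above: in case 1 each kernel runs for ~367 us against ~31 us of launch overhead, under a tenth of the region time, while in case 2 each kernel runs for only ~10 us against ~35 us of overhead, so the launch cost is roughly three-quarters of the region time.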

  • Mat

Dear Mat,

That is it. I see your point, and I know what to do now.

Thank you very much.