# Problem while parallelizing loops

Hello, I have the following problem and hope you have some advice for me. I am trying to move the following code to the GPU:

``````
!Temp Array!
temp(1) = nin1
temp(2) = nout1
temp(3) = atb
temp(4) = after
temp(5) = atn
temp2(1) = cr2
temp2(2) = ci2
temp2(3) = cr3
temp2(4) = ci3
temp2(5) = bb
!END Temp Array!

!$acc data region local(r,r1,r2,r3,s,s1,s2,s3,ninout)
!$acc* copyin(temp(1:5),temp2(1:5),zin(1:2,1:nfft,:))
!$acc* copyout(zout)
!$acc region do independent
do ib=1,before
!Ninout Array!
ninout(1)=temp(1)+(ib*temp(4))
ninout(2)=ninout(1)+temp(3)
ninout(3)=ninout(2)+temp(3)
ninout(4)=temp(2)+(ib*temp(5))
ninout(5)=ninout(4)+temp(4)
ninout(6)=ninout(5)+temp(4)
!END Ninout Array!
do j=1,nfft
r1=zin(1,j,ninout(1))
s1=zin(2,j,ninout(1))
r=zin(1,j,ninout(2))
s=zin(2,j,ninout(2))
r2=r*temp2(1) - s*temp2(2)
s2=r*temp2(2) + s*temp2(1)
r=zin(1,j,ninout(3))
s=zin(2,j,ninout(3))
r3=r*temp2(3) - s*temp2(4)
s3=r*temp2(4) + s*temp2(3)
r=r2 + r3
s=s2 + s3
zout(1,j,ninout(4)) = r + r1
zout(2,j,ninout(4)) = s + s1
r1=r1 - .5d0*r
s1=s1 - .5d0*s
r2=temp2(5)*(r2-r3)
s2=temp2(5)*(s2-s3)
zout(1,j,ninout(5)) = r1 - s2
zout(2,j,ninout(5)) = s1 + r2
zout(1,j,ninout(6)) = r1 + s2
zout(2,j,ninout(6)) = s1 - r2
enddo
enddo
!$acc end region
!$acc end data region
``````

The compiler tells me:

``````
710, Generating local(ninout(:))
     Generating local(s3)
     Generating local(s2)
     Generating local(s1)
     Generating local(s)
     Generating local(r3)
     Generating local(r2)
     Generating local(r1)
     Generating local(r)
     Generating copyout(zout(:,:,:))
     Generating copyin(zin(:,:nfft,:))
     Generating copyin(temp2(:))
     Generating copyin(temp(:))
713, Generating compute capability 1.3 binary
     Generating compute capability 2.0 binary
714, Loop is parallelizable
     Accelerator kernel generated
    714, !$acc do parallel, vector(256) ! blockidx%x threadidx%x
         Cached references to size [5] block of 'temp'
         Cached references to size [6] block of 'ninout'
         Cached references to size [5] block of 'temp2'
         CC 1.3 : 22 registers; 180 shared, 16 constant, 0 local memory bytes; 50% occupancy
         CC 2.0 : 33 registers; 92 shared, 100 constant, 0 local memory bytes; 50% occupancy
723, Loop is parallelizable
``````

But when I run the program, the following error occurs:

``````
call to EventSynchronize returned error 700: Launch failed
CUDA driver version: 3020
``````

``````
Accelerator Kernel Timing data
/home/gast/SOURCE/./gfft.f
  fftstp
    713: region entered 1 time
        time(us): init=0
        714: kernel launched 1 times
            grid: [1]  block: [256]
            time(us): total=0 max=0 min=0 avg=0
/home/gast/SOURCE/./gfft.f
  fftstp
    710: region entered 1 time
        time(us): init=78930 data=88
``````

I guess it has something to do with the definition of the array, because the compiler does not seem to take the values I entered for the dimensions. Where I wrote (1:2,1:nfft,:), it generates (:,:nfft,:). Or am I wrong here? It's a very large program and I'm trying to parallelize only some time-consuming parts. I don't know exactly what the extent of the third dimension of the arrays zin and zout will be. Is there a way to leave the third dimension open?

Thanks very much

Hi uni-gw

> I guess it has something to do with the definition of the array, because the compiler does not seem to take the values I entered for the dimensions. Where I wrote (1:2,1:nfft,:), it generates (:,:nfft,:).

I doubt that this is the problem. The ":" just means the full extent, which I assume for zin is "1:2", and ":nfft" is shorthand for "1:nfft" since 1 is the lower bound.

> call to EventSynchronize returned error 700: Launch failed

This typically means that your device kernel aborted abnormally for some reason. The first thing I would check is whether all values of ninout stay within the bounds of zout's third dimension. An out-of-bounds access is the most common cause (at least for the ones I've looked at).
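One way to check this is to run the index arithmetic on the host, before entering the accelerator region, and verify the results against the array bounds. A minimal sketch, where `n3out` is a stand-in name for the actual extent of zout's third dimension in your program:

``````fortran
! Sketch: host-side sanity check of the computed indices.
! n3out is a hypothetical name for the extent of zout's third dimension.
do ib = 1, before
   ninout(1) = temp(1) + (ib*temp(4))
   ninout(2) = ninout(1) + temp(3)
   ninout(3) = ninout(2) + temp(3)
   ninout(4) = temp(2) + (ib*temp(5))
   ninout(5) = ninout(4) + temp(4)
   ninout(6) = ninout(5) + temp(4)
   if (minval(ninout) < 1 .or. maxval(ninout) > n3out) then
      print *, 'index out of bounds at ib =', ib, ' ninout =', ninout
      stop
   end if
end do
``````

If this check fires, the kernel was reading or writing past the end of zin/zout, which would explain the launch failure.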

> grid: [1] block: [256]

A secondary issue is that the schedule being generated is poor and will hurt performance: the "grid: [1]" in the timing output shows that only a single thread block is launched. You may wish to move the ninout initialization code inside the j loop, since it currently inhibits parallelization of the j loop. Unless 'before' is very large, the cost of the extra computation should be offset by the additional parallelism.

Also, if possible, you should move 'j' to the first dimension of your arrays. This will allow contiguous data access across threads and limit memory divergence. (http://www.pgroup.com/lit/articles/insider/v2n1a5.htm)

Something like:

``````
!$acc data region local(r,r1,r2,r3,s,s1,s2,s3,ninout)
!$acc* copyin(temp(1:5),temp2(1:5),zin(1:nfft,1:2,:))
!$acc* copyout(zout)
!$acc region do independent
do ib=1,before
!$acc do independent
do j=1,nfft

!Ninout Array!
ninout(1)=temp(1)+(ib*temp(4))
ninout(2)=ninout(1)+temp(3)
ninout(3)=ninout(2)+temp(3)
ninout(4)=temp(2)+(ib*temp(5))
ninout(5)=ninout(4)+temp(4)
ninout(6)=ninout(5)+temp(4)
!END Ninout Array!

r1=zin(j,1,ninout(1))
s1=zin(j,2,ninout(1))
...
``````

Hope this helps,
Mat