Problem while parallelizing loops

Hello, I have the following problem and hope you have some advice for me. I am trying to port the following code to the GPU:

	!Temp Array!
	temp(1) = nin1
	temp(2) = nout1
	temp(3) = atb
	temp(4) = after
	temp(5) = atn
	temp2(1) = cr2
	temp2(2) = ci2
	temp2(3) = cr3
	temp2(4) = ci3
	temp2(5) = bb
	!END Temp Array

!$acc data region local(r,r1,r2,r3,s,s1,s2,s3,ninout) 
!$acc* copyin(temp(1:5),temp2(1:5),zin(1:2,1:nfft,:))
!$acc* copyout(zout)
!$acc region do independent
        do ib=1,before
	!Ninout Array!
	ninout(1)=temp(1)+(ib*temp(4))
	ninout(2)=ninout(1)+temp(3)
        ninout(3)=ninout(2)+temp(3)
        ninout(4)=temp(2)+(ib*temp(5))
	ninout(5)=ninout(4)+temp(4)
	ninout(6)=ninout(5)+temp(4)
	!END Ninout Array!
        do j=1,nfft
        r1=zin(1,j,ninout(1))
        s1=zin(2,j,ninout(1))
        r=zin(1,j,ninout(2))
        s=zin(2,j,ninout(2))
        r2=r*temp2(1) - s*temp2(2)
        s2=r*temp2(2) + s*temp2(1)
        r=zin(1,j,ninout(3))
        s=zin(2,j,ninout(3))
        r3=r*temp2(3) - s*temp2(4)
        s3=r*temp2(4) + s*temp2(3)
        r=r2 + r3
        s=s2 + s3
        zout(1,j,ninout(4)) = r + r1
        zout(2,j,ninout(4)) = s + s1
        r1=r1 - .5d0*r
        s1=s1 - .5d0*s
        r2=temp2(5)*(r2-r3)
        s2=temp2(5)*(s2-s3)
        zout(1,j,ninout(5)) = r1 - s2 
        zout(2,j,ninout(5)) = s1 + r2
        zout(1,j,ninout(6)) = r1 + s2 
        zout(2,j,ninout(6)) = s1 - r2
	enddo
	enddo
!acc end region
!$acc end data region

The compiler tells me:

710, Generating local(ninout(:))
Generating local(s3)
Generating local(s2)
Generating local(s1)
Generating local(s)
Generating local(r3)
Generating local(r2)
Generating local(r1)
Generating local(r)
Generating copyout(zout(:,:,:))
Generating copyin(zin(:,:nfft,:))
Generating copyin(temp2(:))
Generating copyin(temp(:))
713, Generating compute capability 1.3 binary
Generating compute capability 2.0 binary
714, Loop is parallelizable
Accelerator kernel generated
714, !$acc do parallel, vector(256) ! blockidx%x threadidx%x
Cached references to size [5] block of ‘temp’
Cached references to size [6] block of ‘ninout’
Cached references to size [5] block of ‘temp2’
CC 1.3 : 22 registers; 180 shared, 16 constant, 0 local memory bytes; 50% occupancy
CC 2.0 : 33 registers; 92 shared, 100 constant, 0 local memory bytes; 50% occupancy
723, Loop is parallelizable


But when I run the program, the following error occurs:

call to EventSynchronize returned error 700: Launch failed
CUDA driver version: 3020

Accelerator Kernel Timing data
/home/gast/SOURCE/./gfft.f
fftstp
713: region entered 1 time
time(us): init=0
714: kernel launched 1 times
grid: [1] block: [256]
time(us): total=0 max=0 min=0 avg=0
/home/gast/SOURCE/./gfft.f
fftstp
710: region entered 1 time
time(us): init=78930
data=88

I guess it has something to do with the definition of the arrays, because the compiler does not seem to use the dimensions I entered. Where I wrote (1:2,1:nfft,:), it generates (:,:nfft,:). Or am I wrong here? It's a very large program, and I am only trying to parallelize some time-consuming parts. I don't know exactly what extent the third dimension of zin and zout will have. Is there a way to leave the third dimension open?

Thanks very much

Hi uni-gw

I guess it has something to do with the definition of the arrays, because the compiler does not seem to use the dimensions I entered. Where I wrote (1:2,1:nfft,:), it generates (:,:nfft,:).

I doubt that this is the problem. The “:” just means the full extent, which I assume for zin is “1:2”, and “:nfft” is shorthand for “1:nfft” since 1 is the lower bound.
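If you want to convince yourself of this, a small host-only test is enough. The sketch below is free-form Fortran with made-up sizes (nfft = 4 and a third extent n3 = 3 are assumptions for illustration, not values from your program). It only shows that the two section notations describe the same elements:

```fortran
program section_shorthand
  implicit none
  ! Illustrative sizes only, not taken from the original code.
  integer, parameter :: nfft = 4, n3 = 3
  double precision :: zin(2, nfft, n3)

  call random_number(zin)

  ! Explicit bounds and the shorthand give sections of the same shape ...
  print *, shape(zin(1:2, 1:nfft, :))
  print *, shape(zin(:, :nfft, :))

  ! ... and they refer to exactly the same elements (prints T).
  print *, all(zin(1:2, 1:nfft, :) == zin(:, :nfft, :))
end program section_shorthand
```

Both shape lines report the extents 2, 4 and 3, so the compiler message is just a different spelling of the section you wrote.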

call to EventSynchronize returned error 700: Launch failed

This typically means that your device kernel aborted abnormally for some reason. The first thing I would check is whether all values of ninout are within the bounds of zout’s third dimension (and zin’s, for the reads). An out-of-bounds access is the most common cause (at least for the ones I’ve looked at).
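One way to do that check is to recompute the indices on the host before entering the region and compare them against the array bounds. The following is only a sketch in free-form Fortran. The index formulas are copied from your loop, but every numeric value (nin1, nout1, atb, after, atn, before, nfft and the third extent n3) is a placeholder I made up, so substitute the real values from fftstp:

```fortran
program check_ninout
  implicit none
  ! Placeholder sizes and values; the real ones come from the calling code.
  integer, parameter :: nfft = 8, n3 = 12
  double precision :: zin(2, nfft, n3), zout(2, nfft, n3)
  integer :: nin1, nout1, atb, after, atn, before
  integer :: ib, ninout(6)
  logical :: ok

  nin1 = 0; nout1 = -2; atb = 4; after = 1; atn = 3; before = 4

  ok = .true.
  do ib = 1, before
     ! Same index arithmetic as in the accelerated loop.
     ninout(1) = nin1 + ib*after
     ninout(2) = ninout(1) + atb
     ninout(3) = ninout(2) + atb
     ninout(4) = nout1 + ib*atn
     ninout(5) = ninout(4) + after
     ninout(6) = ninout(5) + after

     ! ninout(1:3) index zin, ninout(4:6) index zout.
     if (any(ninout(1:3) < lbound(zin, 3)) .or. &
         any(ninout(1:3) > ubound(zin, 3))) then
        print *, 'zin index out of bounds at ib =', ib, ':', ninout(1:3)
        ok = .false.
     end if
     if (any(ninout(4:6) < lbound(zout, 3)) .or. &
         any(ninout(4:6) > ubound(zout, 3))) then
        print *, 'zout index out of bounds at ib =', ib, ':', ninout(4:6)
        ok = .false.
     end if
  end do

  if (ok) print *, 'all ninout values are in bounds'
end program check_ninout
```

If this reports an out-of-bounds index with your real values, that is very likely what is killing the kernel.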


grid: [1] block: [256]

A secondary issue is that the schedule being generated is poor (a single block of 256 threads) and will lead to poor performance. You may wish to consider moving the ninout initialization code inside the j loop, since in its current position it inhibits the parallelization of the j loop. Unless ‘before’ is very large, the cost of the extra computation should be offset by the additional parallelization.

Also, if it’s possible, you should move ‘j’ to the first dimension of your arrays. This will allow for contiguous data access across the threads and limit memory divergence. (http://www.pgroup.com/lit/articles/insider/v2n1a5.htm)

Something like:

!$acc data region local(r,r1,r2,r3,s,s1,s2,s3,ninout)
!$acc* copyin(temp(1:5),temp2(1:5),zin(1:2,1:nfft,:))
!$acc* copyout(zout)
!$acc region do independent
        do ib=1,before
!$acc do independent
        do j=1,nfft

        !Ninout Array!
        ninout(1)=temp(1)+(ib*temp(4))
        ninout(2)=ninout(1)+temp(3)
        ninout(3)=ninout(2)+temp(3)
        ninout(4)=temp(2)+(ib*temp(5))
        ninout(5)=ninout(4)+temp(4)
        ninout(6)=ninout(5)+temp(4)
        !END Ninout Array!

        r1=zin(j,1,ninout(1))
        s1=zin(j,2,ninout(1))
...
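To see why moving ‘j’ to the first dimension helps, recall that Fortran stores arrays in column-major order, so the first index varies fastest in memory. The sketch below (free-form Fortran; the helper function offset and the size nfft = 8 are mine, purely for illustration) computes how far apart two consecutive values of j are for each layout:

```fortran
program layout_stride
  implicit none
  integer, parameter :: nfft = 8   ! illustrative size only

  ! Layout zin(2,nfft,:): j is the second index, neighbours are 2 elements apart.
  print *, 'stride in j for zin(2,nfft,:):', &
           offset(1, 2, 1, 2, nfft) - offset(1, 1, 1, 2, nfft)

  ! Layout zin(nfft,2,:): j is the first index, neighbours are adjacent.
  print *, 'stride in j for zin(nfft,2,:):', &
           offset(2, 1, 1, nfft, 2) - offset(1, 1, 1, nfft, 2)

contains

  ! Column-major element offset of (i1,i2,i3) in an array whose first two
  ! extents are n1 and n2 (all lower bounds assumed to be 1).
  pure integer function offset(i1, i2, i3, n1, n2)
    integer, intent(in) :: i1, i2, i3, n1, n2
    offset = (i1 - 1) + n1*(i2 - 1) + n1*n2*(i3 - 1)
  end function offset

end program layout_stride
```

With the original layout the stride is 2 elements. With ‘j’ first it is 1, so neighbouring threads of the vectorized j loop read neighbouring memory locations.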

Hope this helps,
Mat