Hello, I have the following problem and hope you can advise me. I am trying to port the following code to the GPU:

```
!Temp Array!
temp(1) = nin1
temp(2) = nout1
temp(3) = atb
temp(4) = after
temp(5) = atn
temp2(1) = cr2
temp2(2) = ci2
temp2(3) = cr3
temp2(4) = ci3
temp2(5) = bb
!END Temp Array
!$acc data region local(r,r1,r2,r3,s,s1,s2,s3,ninout)
!$acc* copyin(temp(1:5),temp2(1:5),zin(1:2,1:nfft,:))
!$acc* copyout(zout)
!$acc region do independent
do ib=1,before
!Ninout Array!
ninout(1)=temp(1)+(ib*temp(4))
ninout(2)=ninout(1)+temp(3)
ninout(3)=ninout(2)+temp(3)
ninout(4)=temp(2)+(ib*temp(5))
ninout(5)=ninout(4)+temp(4)
ninout(6)=ninout(5)+temp(4)
!END Ninout Array!
do j=1,nfft
r1=zin(1,j,ninout(1))
s1=zin(2,j,ninout(1))
r=zin(1,j,ninout(2))
s=zin(2,j,ninout(2))
r2=r*temp2(1) - s*temp2(2)
s2=r*temp2(2) + s*temp2(1)
r=zin(1,j,ninout(3))
s=zin(2,j,ninout(3))
r3=r*temp2(3) - s*temp2(4)
s3=r*temp2(4) + s*temp2(3)
r=r2 + r3
s=s2 + s3
zout(1,j,ninout(4)) = r + r1
zout(2,j,ninout(4)) = s + s1
r1=r1 - .5d0*r
s1=s1 - .5d0*s
r2=temp2(5)*(r2-r3)
s2=temp2(5)*(s2-s3)
zout(1,j,ninout(5)) = r1 - s2
zout(2,j,ninout(5)) = s1 + r2
zout(1,j,ninout(6)) = r1 + s2
zout(2,j,ninout(6)) = s1 - r2
enddo
enddo
!$acc end region
!$acc end data region
```

The compiler reports:

710, Generating local(ninout(:))

Generating local(s3)

Generating local(s2)

Generating local(s1)

Generating local(s)

Generating local(r3)

Generating local(r2)

Generating local(r1)

Generating local(r)

Generating copyout(zout(:,:,:))

Generating copyin(zin(:,:nfft,:))

Generating copyin(temp2(:))

Generating copyin(temp(:))

713, Generating compute capability 1.3 binary

Generating compute capability 2.0 binary

714, Loop is parallelizable

Accelerator kernel generated

714, !$acc do parallel, vector(256) ! blockidx%x threadidx%x

Cached references to size [5] block of 'temp'

Cached references to size [6] block of 'ninout'

Cached references to size [5] block of 'temp2'

CC 1.3 : 22 registers; 180 shared, 16 constant, 0 local memory bytes; 50% occupancy

CC 2.0 : 33 registers; 92 shared, 100 constant, 0 local memory bytes; 50% occupancy

723, Loop is parallelizable

But when I run the program, the following error occurs:

call to EventSynchronize returned error 700: Launch failed

CUDA driver version: 3020

Accelerator Kernel Timing data

/home/gast/SOURCE/./gfft.f

fftstp

713: region entered 1 time

time(us): init=0

714: kernel launched 1 times

grid: [1] block: [256]

time(us): total=0 max=0 min=0 avg=0

/home/gast/SOURCE/./gfft.f

fftstp

710: region entered 1 time

time(us): init=78930

data=88

I suspect it has something to do with how the arrays are specified, because the compiler does not seem to use the bounds I entered for the dimensions: where I wrote (1:2,1:nfft,:), it generates (:,:nfft,:). Or am I wrong here?

It is a very large program, and I am trying to parallelize only some of the time-consuming parts. I do not know exactly what the extent of the third dimension of zin and zout will be. Is there a way to leave the third dimension open?
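To make the question concrete, here is a minimal sketch of one way the third-dimension bounds could be spelled out instead of left as `:`. It assumes zin and zout are explicit-shape or assumed-shape arrays, so that lbound and ubound can be queried at run time. The program name, the sizes nfft = 4 and n3 = 6, the variables lo3 and hi3, and the simple doubling loop are all hypothetical stand-ins, not the real fftstp code. Whether explicit bounds avoid the launch failure is untested.

```fortran
program bounds_sketch
  implicit none
  ! Hypothetical sizes for illustration only; the real program sets these elsewhere.
  integer, parameter :: nfft = 4, n3 = 6
  double precision :: zin(2, nfft, n3), zout(2, nfft, n3)
  integer :: lo3, hi3, j, k

  zin = 1.0d0
  zout = 0.0d0

  ! Query the actual bounds of the third dimension at run time.
  ! This works for explicit-shape and assumed-shape arrays, but not for
  ! the last dimension of an assumed-size dummy such as zin(2,nfft,*).
  lo3 = lbound(zin, 3)
  hi3 = ubound(zin, 3)

  ! Pass the bounds explicitly instead of leaving the third dimension as ':'.
!$acc data region copyin(zin(1:2,1:nfft,lo3:hi3)) copyout(zout(1:2,1:nfft,lo3:hi3))
!$acc region
  do k = lo3, hi3
    do j = 1, nfft
      ! Placeholder computation standing in for the real FFT step.
      zout(1, j, k) = 2.0d0 * zin(1, j, k)
      zout(2, j, k) = 2.0d0 * zin(2, j, k)
    end do
  end do
!$acc end region
!$acc end data region

  print *, 'third dimension bounds:', lo3, hi3
  print *, 'sum(zout) =', sum(zout)
end program bounds_sketch
```

Compiled without accelerator support, the `!$acc` lines are treated as comments and the loop simply runs on the host, which makes it easy to check the bounds before trying the GPU build.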

Thanks very much