Hello Mat,
I tested the code on another computer running 32-bit Windows 7 with a cc 2.1 card. Even though the OS is 32-bit, the same error occurs.
Therefore I think the error is not caused by the OS. There may be a bug in my code, or the error may be related to the compute capability of the different cards. After checking, I have found the bug.
It is in the kernel CceKernel, which is launched with the configuration <<<1,ngs>>>.
attributes(global) subroutine CceKernel(x,xf,r,&
                                        icall,&
                                        cx,cf,s,sf,x1,xf1)
  ...
  igs=threadIdx%x
  ...
  ! Evolve sub-population igs for nspl steps
  do iloop = 1 , nspl_c
    ! ---------------- ----------------
    ! Select simplex by sampling the complex
    ! according to a linear probability distribution
    lcs(1,igs) = 1
    do k3 = 2 , nps_c
      do
        lpos = 1 + &
          int(real(npg_c) + 0.5 - ((real(npg_c) + 0.5) ** 2.0 - real(npg_c) * (real(npg_c) + 1.0) * rnd(rr)) ** 0.5)
        found=.false.
        k2=k3-1
        do k1=1,k2
          if (lcs(k1,igs)==lpos) then
            found=.true.
            exit
          end if
        end do
        if (.not. found) exit
      end do
      lcs(k3,igs) = lpos
    end do
    ...
end subroutine
The bug lies in this loop:
do
  lpos = 1 + &
    int(real(npg_c) + 0.5 - ((real(npg_c) + 0.5) ** 2.0 - real(npg_c) * (real(npg_c) + 1.0) * rnd(rr)) ** 0.5)
  found=.false.
  k2=k3-1
  do k1=1,k2
    if (lcs(k1,igs)==lpos) then
      found=.true.
      exit
    end if
  end do
  if (.not. found) exit
end do
The number of iterations of this loop may differ from thread to thread. The “if (.not. found) exit” statement is there so that the loop exits as soon as an unused position is found and does not waste computational time.
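For reference, the index formula in this loop can be read as inverse-transform sampling. The short derivation below writes n for npg_c and u for the value returned by rnd(rr), and assumes u is uniform on [0,1) (the code shown does not state this):

$$
X = n + \tfrac{1}{2} - \sqrt{\left(n + \tfrac{1}{2}\right)^{2} - n(n+1)\,u},
\qquad
\mathrm{lpos} = 1 + \lfloor X \rfloor
$$

$$
X < k \iff u < \frac{k\,(2n+1-k)}{n(n+1)}
$$

$$
P(\mathrm{lpos} = i) = \frac{i\,(2n+1-i) - (i-1)(2n+2-i)}{n(n+1)} = \frac{2\,(n+1-i)}{n(n+1)},
\qquad i = 1,\dots,n
$$

Under that assumption the low indices are drawn most often, so a draw often repeats an index already stored in lcs(1:k3-1,igs) and is rejected. The number of retries is therefore random, which is consistent with the iteration count differing between threads.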
On cc 1.1 cards (I tested a GeForce 8400M GS and a GeForce 9800 GT; it may have been a 9800 or a 9600, I do not remember clearly because it is not my card), the code runs and gives the right result.
But on a cc 3.5 card (I tested a GeForce GT 730M), it does not run.
Therefore I think there may be a bug in my code, and the problem may be caused by the architecture of the different cards.
I revised this loop by adding a fixed upper bound on the number of iterations, like:
do itmp=1,1000
  …
  if (.not. found) exit
end do
After this revision, the code runs with the right result on the GeForce GT 730M.
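A minimal host-side sketch of the bounded selection loop, in plain Fortran rather than CUDA Fortran so it can be compiled on its own. It is only an illustration under assumptions: random_number stands in for rnd(rr), the values of npg, nps and max_tries are made up, and the clamp and fallback steps are additions that are not in the kernel above.

```fortran
program bounded_select
  implicit none
  integer, parameter :: npg = 10          ! points per complex (assumed value)
  integer, parameter :: nps = 5           ! points per simplex (assumed value)
  integer, parameter :: max_tries = 1000  ! iteration upper bound, as in the revision
  integer :: lcs(nps), k3, itmp, lpos
  real    :: u
  logical :: found

  lcs(1) = 1
  do k3 = 2, nps
    found = .true.
    do itmp = 1, max_tries
      call random_number(u)               ! stand-in for rnd(rr), uniform on [0,1)
      lpos = 1 + int(real(npg) + 0.5 - sqrt((real(npg) + 0.5)**2 &
                     - real(npg) * (real(npg) + 1.0) * u))
      lpos = min(lpos, npg)               ! guard against rounding giving npg+1
      found = any(lcs(1:k3-1) == lpos)    ! already selected?
      if (.not. found) exit
    end do
    if (found) then
      ! Bound reached without success: fall back to the first unused index.
      ! One always exists because nps <= npg.
      do lpos = 1, npg
        if (.not. any(lcs(1:k3-1) == lpos)) exit
      end do
    end if
    lcs(k3) = lpos
  end do

  print '(a,5i4)', 'selected indices:', lcs
end program bounded_select
```

The fallback only matters if the bound is ever reached; with a bound of 1000 that should be very rare, but it keeps lcs(k3) from silently receiving a duplicate index.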
I also tried revising the loop as:
do itmp=1,100000
  …
  if (.not. found) exit
end do
As the iteration upper bound increases (from 1000 to 100000), the code takes more computational time on the GeForce GT 730M.
However, on the GeForce 8400M GS, the computational time does not depend on the iteration upper bound.
Actually, I know the algorithm of my code, and this loop does not need to run as many as 1000 or 100000 times. Usually it loops fewer than 1000 times and exits without delay as soon as “if (.not. found) exit” is satisfied. This means the computational time of this loop should not depend on the iteration upper bound. On the cc 1.1 card that is the case, but on the cc 3.5 card it is not.
In summary, I have two questions:
- Why does the original do loop not work on the newer card when it runs on the old cards? And why does it run on the newer card once I add an iteration upper bound (such as 1000 or 100000)?
- Why does the iteration upper bound affect the computational time on the cc 3.5 card but have no influence on the cc 1.1 card?
Thank you very much!
Nightwish