32-bit CUDA Fortran exe cannot run on 64-bit Windows 7

Hello!

I developed a 32-bit CUDA Fortran console application (exe) with the 32-bit PGI Visual Fortran 13.9 on 32-bit Windows 7. However, when I copy this exe to a 64-bit Windows 7 machine (on which both the 32-bit and 64-bit PGI Visual Fortran 13.9 are installed), it does not run. The error message is something like “Copyout <…> … Failed 30…”.

I also tried creating a 32-bit console application on the 64-bit Windows 7 machine and recompiling the source code. It compiles successfully, but the run fails with the same error message.

I wrote some simple examples to check whether the compilers are installed correctly on the 32-bit and 64-bit Windows 7 machines. The examples run successfully on both, which suggests that the compilers are installed correctly.

I wonder what is wrong and how to solve this problem.

Thank you!

Nightwish

Hi Nightwish,

I don’t know what would cause this so let’s see if we can narrow it down.

My guess is that the actual failure isn’t the copy out but rather the kernel that’s launched right before it. Do you know where this is failing and can you add error checking after the kernel launch?

     
      call mykernel<<<...>>>()
      istat = cudaGetLastError()
      if (istat .ne. 0) then
          print *, "Error at mykernel: ", cudaGetErrorString(istat)
          stop
      endif
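
One caveat: kernel launches are asynchronous, so cudaGetLastError() right after the launch mainly reports launch-time problems. Errors raised while the kernel is running usually surface only after the host synchronizes with the device. The following is a minimal sketch of checking both, not your actual code: the kernel body, the array a_d and the size n are placeholders made up for illustration.

```fortran
module kernels
  use cudafor
  implicit none
contains
  ! Placeholder kernel (hypothetical): doubles each element of a
  attributes(global) subroutine mykernel(a, n)
    integer, value :: n
    real :: a(n)
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) a(i) = 2.0 * a(i)
  end subroutine mykernel
end module kernels

program check_launch
  use cudafor
  use kernels
  implicit none
  integer, parameter :: n = 256      ! placeholder problem size
  real, device :: a_d(n)
  real :: a(n)
  integer :: istat

  a = 1.0
  a_d = a
  call mykernel<<<1, n>>>(a_d, n)

  ! Step 1: errors detected at launch time (e.g. a bad configuration)
  istat = cudaGetLastError()
  if (istat /= cudaSuccess) then
    print *, "Launch error at mykernel: ", trim(cudaGetErrorString(istat))
    stop
  end if

  ! Step 2: wait for the kernel to finish so that errors raised
  ! during execution are reported here, not at a later copy
  istat = cudaDeviceSynchronize()
  if (istat /= cudaSuccess) then
    print *, "Execution error at mykernel: ", trim(cudaGetErrorString(istat))
    stop
  end if

  a = a_d
  print *, "first element after kernel: ", a(1)
end program check_launch
```

With the synchronize in place, a failure inside the kernel is attributed to the kernel itself rather than to whatever copy happens to follow it.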

Also where are you running the binary from? A PGI DOS cmd window? If so, make sure you’re running from the 32-bit DOS window. Running from the 64-bit window will cause problems since it points to the 64-bit libraries.

What device driver versions are you running?

  • Mat

Thanks for your answer!

I know which kernel fails, and I added error checking as you suggested. I get the error message:

Error at CceKernel:
unknown error

I was running the exe file by double-clicking it, not from the command window.
As you suggested, I ran the exe from the 32-bit command window (PVF for VS 2012 Cmd), but I get the same error message as listed above.

The device driver version is 332.88. The card is a GeForce GT 730M. The compiler is PGI Visual Fortran 13.9 and the CUDA toolkit is 5.5. I have also installed the CUDA 6.0 toolkit.

Could you give me an email address so that I can send you my source code, and could you test it on your 64-bit Windows 7? The error message is not very clear and is a little confusing. I think that if you can test it yourself, it will be easier to solve the problem.

Thank you very much!

Nightwish

Hi Nightwish,

Please send the code to PGI Customer Service (trs@pgroup.com) and ask them to forward it to me. I’ll take a look and see what I can determine.

  • Mat

Thank you Mat!

I have sent a zip file of all the source code to the email address you listed. The title of the email is “CUDA Fortran source code from Nightwish, Please send to Mat to debuging”.

Waiting for your good news. Thank you very much!

Nightwish

Hello Mat!

I tested the code on another 32-bit Windows 7 machine, which has a cc 2.1 card. Even though it is a 32-bit OS, the same error occurs.
Therefore I think that the error may not be caused by the OS. There may be some bugs in my code, or the error may be related to the compute capability of the different cards. After checking, I have found the bug.

In the kernel CceKernel, the launch configuration is <<<1,ngs>>>.

attributes(global) subroutine CceKernel(x,xf,r,&
                                        icall,&
                                        cx,cf,s,sf,x1,xf1)
    ...
    igs=threadIdx%x
    ...
    ! Evolve sub-population igs for nspl steps
    do iloop = 1 , nspl_c
        ! ----------------  ----------------

        ! Select simplex by sampling the complex
        ! according to a linear probability distribution
        lcs(1,igs) = 1
        do k3 = 2 , nps_c
            do
                lpos = 1 + &
                       int(real(npg_c) + 0.5 - ((real(npg_c) + 0.5) ** 2.0 - real(npg_c) * (real(npg_c) + 1.0) * rnd(rr)) ** 0.5)
                found=.false.
                k2=k3-1
                do k1=1,k2
                    if (lcs(k1,igs)==lpos) then
                        found=.true.
                        exit
                    end if
                end do
                if (.not. found) exit
            end do
            lcs(k3,igs) = lpos
        end do
        ...
end subroutine

The bug lies in the loop

do
    lpos = 1 + &
           int(real(npg_c) + 0.5 - ((real(npg_c) + 0.5) ** 2.0 - real(npg_c) * (real(npg_c) + 1.0) * rnd(rr)) ** 0.5)
    found=.false.
    k2=k3-1
    do k1=1,k2
        if (lcs(k1,igs)==lpos) then
            found=.true.
            exit
        end if
    end do
    if (.not. found) exit
end do

The number of iterations of this loop may differ between threads. The “if (.not. found) exit” inside the loop ensures that the do loop does not waste computational time.

On cc 1.1 cards (I tested a GeForce 8400M GS and a GeForce 9800 GT; it may have been a 9800 or a 9600, I do not remember clearly because it is not my card), it runs and gives the right result.

But on the cc 3.5 card (I tested the GeForce GT 730M), it does not run.

Therefore I think that there may be some bugs in my code, and the problem may be caused by the different architectures of the cards.

I revised this loop by adding a fixed iteration limit, like this:

do itmp=1,1000

if (.not. found) exit
end do

After the revision, it runs and gives the right result on the GeForce GT 730M.

I also tried revising the loop as:
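
To make the bounded version concrete, here is a minimal host-side sketch in plain Fortran of the same selection logic for a single sub-population. It is an illustration under assumptions, not the kernel itself: random_number stands in for rnd(rr), the values of npg and nps and the name max_tries are made up, and the min() clamp and the warning message are additions that the original loop does not have.

```fortran
program bounded_selection
  implicit none
  integer, parameter :: npg = 9           ! points per complex (hypothetical value)
  integer, parameter :: nps = 5           ! points per simplex (hypothetical value)
  integer, parameter :: max_tries = 1000  ! iteration upper bound (hypothetical name)
  integer :: lcs(nps)
  integer :: k1, k3, itmp, lpos
  real :: r
  logical :: found

  lcs(1) = 1
  do k3 = 2, nps
    found = .true.
    do itmp = 1, max_tries
      call random_number(r)               ! stands in for rnd(rr)
      lpos = 1 + int(real(npg) + 0.5 - sqrt((real(npg) + 0.5)**2 &
                 - real(npg) * (real(npg) + 1.0) * r))
      lpos = min(lpos, npg)               ! guard against rounding when r is close to 1

      ! Reject lpos if it was already selected for this simplex
      found = .false.
      do k1 = 1, k3 - 1
        if (lcs(k1) == lpos) then
          found = .true.
          exit
        end if
      end do
      if (.not. found) exit
    end do

    ! If the bound was reached, lpos is still a duplicate
    if (found) print *, "no unused position found for k3 =", k3
    lcs(k3) = lpos
  end do

  print *, "selected positions:", lcs
end program bounded_selection
```

Note that when the bound is reached without finding an unused position, the last candidate is stored anyway, so it is worth reporting that case instead of letting it pass silently.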

do itmp=1,100000

if (.not. found) exit
end do

When the iteration upper bound increases (from 1000 to 100000), the code takes more computational time on the GeForce GT 730M.

However, on the GeForce 8400M GS card, the computational time does not depend on the iteration upper bound.

Actually, from the algorithm I know that this loop does not need to run as many as 1000 or 100000 times. Usually it loops fewer than 1000 times and exits immediately once the “if (.not. found) exit” condition is satisfied. This means that the computational time of this loop should not depend on the iteration upper bound. On the cc 1.1 card, that is the case. However, on the cc 3.5 card, it is not.

As a summary, I have two questions:

  1. Why does the original do loop not work on the newer card although it runs on the old cards? Why does it run on the newer card when I add an iteration upper bound (such as 1000 or 100000)?
  2. Why does the iteration upper bound affect the computational time on the cc 3.5 card but not on the cc 1.1 card?

Thank you very much!

Nightwish

Hi Nightwish,

We took a look at the code and don’t think the problem has to do with the loop. Rather, lcs(:,:) is a shared memory array. The first time through the iloop loop, the only value initialized is lcs(1,igs), where igs is threadIdx%x. When the condition (lcs(k1,igs)==lpos) is executed, all values except k1=1 are garbage.
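
One possible way to rule this out is to have each thread fill its own column of the shared array with a sentinel before the selection starts. The sketch below is only an illustration under assumptions: the sizes nps_max and ngs_max, the kernel name init_demo and the output array are made up, and the real declaration of lcs in CceKernel may differ. It uses 0 as the sentinel because lpos is always at least 1.

```fortran
module init_sketch
  use cudafor
  implicit none
  integer, parameter :: nps_max = 8    ! hypothetical upper bound on nps
  integer, parameter :: ngs_max = 64   ! hypothetical upper bound on ngs
contains
  attributes(global) subroutine init_demo(out, nps, ngs)
    integer, value :: nps, ngs
    integer :: out(nps, ngs)
    integer, shared :: lcs(nps_max, ngs_max)
    integer :: igs, k

    igs = threadIdx%x

    ! Shared memory is not zeroed at launch: give every entry of this
    ! thread's column a defined value before anything reads it
    do k = 1, nps
      lcs(k, igs) = 0                  ! 0 is never a valid position
    end do
    lcs(1, igs) = 1

    ! ... the selection loop would go here ...

    ! Copy the column out so the host can inspect it
    do k = 1, nps
      out(k, igs) = lcs(k, igs)
    end do
  end subroutine init_demo
end module init_sketch

program run_init
  use cudafor
  use init_sketch
  implicit none
  integer, parameter :: nps = 5, ngs = 4   ! hypothetical sizes
  integer, device :: out_d(nps, ngs)
  integer :: out(nps, ngs), istat

  call init_demo<<<1, ngs>>>(out_d, nps, ngs)
  istat = cudaDeviceSynchronize()
  if (istat /= cudaSuccess) then
    print *, "Error at init_demo: ", trim(cudaGetErrorString(istat))
    stop
  end if

  out = out_d
  print *, "column of thread 1: ", out(:, 1)
end program run_init
```

Because each thread only touches its own column lcs(:,igs), no syncthreads() call is needed between the initialization and the later reads in this sketch.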

Hope this helps,
Mat