quick_GeForce680_x64.exe is not a valid win32 application.

Hi all,

Has anyone encountered this issue before? I compiled my Fortran 90 code using PVF 12.4 and created both 32-bit and 64-bit executables (named: Quick_GeForce680_x64.exe). I also copied the required CUDA runtime DLLs into the same directory as the exe files. They run on my machine, a Dell XPS 8500 with Windows 7, a Core i7 at 3.4 GHz, 16 GB of RAM, and a GeForce 680 GPU.
However, when I sent the executables to another company to try, they could not run them. They are using Windows XP 64-bit with Service Pack 2, a Xeon processor, 8 GB of RAM, and also a GeForce 680 GPU. When they run the code, they get this error message:

Quick_GeForce680_x64.exe is not a valid win32 application.

What could be the issue? Why can’t Windows XP run the exe file? What should I install to prevent this from happening?

cheers,
Dolf

Hi Dolf,

I know that you’ll get this error if you run exes built with Visual Studio 2012 on Windows XP (MS doesn’t support XP anymore), but we didn’t start shipping VS2012 until PGI 13.0, so I don’t think that’s the issue here.

The next thing I’d try is to have them install the Microsoft Visual Studio 2010 C runtime libraries. In “C:\Program Files\PGI\Microsoft Open Tools 10\redist” there are two directories with files starting with “vcredist”. The “amd64” directory has the 64-bit version, while “x86” contains the 32-bit version.

Though, I’d expect a “Side by Side” error rather than “is not a valid win32 application” if that were the issue, so I’m not sure this will work.

Unfortunately, the engineer in charge of Visual Studio integration is on vacation, otherwise I’d ask her.

  • Mat

Hi Mat,

I am using VS 2010, not 2012. Not sure if this could be the problem here.
I will ask them to install the run time library and see what they get.

thanks,
Dolf

Hi Mat,

When they tried to install the Visual Studio runtime on Windows XP 64-bit with SP2, this error message appeared: “this can only be installed on windows vista SP2”. How can we eliminate this problem?

thanks,
Dolf

Hi Dolf,

I’ll ask Annemarie when she gets back, but I don’t think there’s much that can be done here. Typically, binaries are only forward compatible with respect to operating systems, not backward: an exe built on an older OS will usually run on a newer one, but not the other way around. The fact that Microsoft no longer supports XP makes things more difficult.

If you want to support XP, you’ll need to install XP on your system and then rebuild the program. The rule of thumb is to build on the lowest common denominator.

  • Mat

Hi Mat,

I had the company install Windows 7 on their system, but the code I sent them (which works perfectly on my machine) is still not working. The error message now is:
Quick5_GeForce680_x64.exe has stopped working.

So now, what do you think the problem is? Why does the .exe exit like that? Did I miss something?
I told the IT tech to install the latest display driver, and they did. I provided them with cudart64_50_35.dll to be placed in the same folder as the .exe. Is that enough, or do they need to install other software to make it work?
Does it matter that they have an Intel Xeon processor rather than an Intel i7? This is one difference between my machine (which works) and their machine.
The other difference is that I installed the CUDA development tools version 5, which come with the PVF 13.4 compiler. Do they need to install them?

Please advise.
Dolf

Hi Dolf,

Does it matter if they have Intel XEON processor? not Intel i7? this is one difference between my machine (which works) and their machine

Xeon and i7 are just brand names. What you need to know is the processor model. If you had a model that supports AVX instructions and the customer doesn’t, then yes, this can cause illegal instructions to be generated.

Granted, I don’t know if this is the actual problem, since “stopped working” is very generic. You can, though, try using the PGI Unified Binary feature to have the compiler target multiple architectures, or target a generic 64-bit host processor (-tp=px-64).

Let me check with Annemarie to see if she has any other ideas.

  • Mat

From Annemarie:

Hi Mat,

Couple thoughts. You’ve been on the right track with the redist exes. The “stopped working” messages that Dolf describes are probably happening because of missing DLLs at runtime. Is the user seeing that message launching the exe from a command prompt? By double-clicking the app in Windows Explorer? Sometimes launching from a command prompt yields more error information than just double-clicking the app.

Then I’d have them use depends to track down missing DLLs. Download the x86 version if the exe to be tested is 32-bit; otherwise have them download the x64 version:

http://www.dependencywalker.com/

Dolf could look at the results of his depends run versus the one his colleagues run.

Annemarie

Hi Annemarie and Mat,
thank you for the prompt response, I really appreciate your help.

If you had a model that supports AVX instructions and the customer doesn’t, then yes, this can cause illegal instructions to be generated

What does that mean? What is AVX?

Then I’d have them use depends to track down missing DLLs. Download the x86 version if the exe to be tested is 32-bit; otherwise have them download the x64 version:

I ran Dependency Walker on my computer, and it gave me the following error message:
Errors were detected when processing “quick5_geforce680.exe”. See the log window for details.
Error log:
Error: At least one module has an unresolved import due to a missing export function in an implicitly dependent module.

Any thoughts?
Do you think I am missing kernel32.dll?
thanks,
Dolf

Hi Dolf,

AVX is a set of newer processor instructions used for vectorization (See: http://en.wikipedia.org/wiki/Advanced_Vector_Extensions). If you build a binary with these instructions and then run the binary on another system with a processor that doesn’t support them, you will get an illegal instruction error and your binary won’t run.

It’s highly doubtful that you’re missing “kernel32.dll”, more likely depends just isn’t finding it. The important thing is to see what the differences are between running depends on your system and running it on the system where the exe won’t run.

  • Mat

Hi Mat,

Latest update: After copying and sending the whole working folder to the company, they were able to run the code. Thanks for the reply.

Now there is a problem if we run multiple cases. It gives the error: Quick5_GeForce680_x64.exe has stopped working.
It seems the code runs fine the first time. When you run it again with the same input files, it exits. Why do you think this is happening?
Also, I have noticed that the size of the .exe file gets bigger after the first run. Is the memory (GPU RAM) getting filled? Do I have to do garbage cleaning in the code to free memory before ending? Last time I asked whether CUDA Fortran clears all memory before exit, you said it does so automatically. I am just trying to pinpoint the problem.

Regards,
Dolf

Hi Dolf,

Latest update: After copying and sending the whole working folder to the company, they were able to run the code. Thanks for the reply.

OK, so you were most likely missing a dependent file from your project.


When you run it again with the same input files, it exits. Why do you think this is happening?

The only thing I can think of offhand is that you’re trying to open a file with the “new” status and the file already exists (from the first run). Try running the executable from the command line and see if there are additional error messages.
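
A minimal sketch of that failure mode in standard Fortran. The file name `results.out`, the unit number, and the helper `try_open` are assumptions for illustration and are not taken from Dolf's code. Opening with status='new' fails on a rerun because the file from the first run is still there, while status='replace' works either way. Without iostat=, a failed open would typically abort the program instead of returning an error code.

```fortran
module open_demo
  implicit none
contains
  ! Try to open fname for writing with the given STATUS= keyword and
  ! return the IOSTAT value (0 means success). The file is closed again.
  integer function try_open(fname, status_kw) result(ios)
    character(len=*), intent(in) :: fname, status_kw
    integer, parameter :: u = 10
    open(unit=u, file=fname, status=status_kw, action='write', iostat=ios)
    if (ios == 0) close(u)
  end function try_open
end module open_demo

program open_status_demo
  use open_demo
  implicit none
  character(len=*), parameter :: fname = 'results.out'   ! hypothetical output file
  integer :: ios

  ios = try_open(fname, 'replace')   ! "first run": creates the file
  ios = try_open(fname, 'new')       ! "second run": the file already exists
  if (ios /= 0) print *, 'status=new failed on rerun, iostat =', ios
  ios = try_open(fname, 'replace')   ! overwriting works on every run
  if (ios == 0) print *, 'status=replace succeeded on rerun'

  open(unit=10, file=fname, status='old')
  close(10, status='delete')         ! remove the demo file
end program open_status_demo
```

If rerunning with the same inputs should overwrite earlier output, status='replace' (or status='unknown') is the usual choice.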


Also, I have noticed that the size of the .exe file gets bigger after the first run. Is the memory (GPU RAM) getting filled?

What do you mean by size? The size in bytes of the exe itself? The size of the memory used on the host?

Last time I asked whether CUDA Fortran clears all memory before exit, you said it does so automatically.

Yes, all memory is implicitly freed after an exe exits (this is true for all programs, not just CUDA Fortran).

  • Mat

What do you mean by size? The size in bytes of the exe itself? The size of the memory used on the host?

Yes, the size in kilobytes of the executable file gets bigger when you run it the second time.
I just ran some cases using input files. Sometimes the code gives exact results; other times it gives NaN in the middle of the calculation for one of the real numbers (which is calculated using GPU kernels). Is there a reason why it gives NaN (which means infinite number)?

regards,
Dolf

Yes, the size in kilobytes of the executable file gets bigger when you run it the second time.

This doesn’t make sense to me. The physical size of the binary image on the disk shouldn’t change. I can think of theoretical cases where this would happen (self compiling code, the binary being overwritten) but these are unlikely.

If you’re talking about the image size once loaded, then this could be the DLLs that are being loaded.

I just ran some cases using input files. Sometimes the code gives exact results; other times it gives NaN in the middle of the calculation for one of the real numbers (which is calculated using GPU kernels). Is there a reason why it gives NaN (which means infinite number)?

This indicates to me that you have a UMR (uninitialized memory read) or other memory issue in your code. There’s a nice utility called Valgrind that can help analyze these types of errors, but it’s only available on Linux. Otherwise, you need to use the debugger to try to isolate where the NaNs start occurring and see if you can figure out why it’s happening.
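
If a debugger session is awkward, one way to narrow down where the NaNs first appear is to copy the device array back to the host after each kernel and scan it. The sketch below is plain Fortran using the standard `ieee_arithmetic` module; the array `p` and the helper `first_nan` are hypothetical names, and the NaN is injected by hand for the demo. Note that NaN means "not a number" and is distinct from infinity; it typically comes from operations such as 0/0 or from arithmetic on garbage values.

```fortran
module nan_check
  use, intrinsic :: ieee_arithmetic, only: ieee_is_nan
  implicit none
contains
  ! Return the (i,j) indices of the first NaN in a (scanning in
  ! column-major order), or (0,0) if there is none.
  function first_nan(a) result(idx)
    real(8), intent(in) :: a(:,:)
    integer :: idx(2)
    integer :: i, j
    idx = 0
    do j = 1, size(a, 2)
       do i = 1, size(a, 1)
          if (ieee_is_nan(a(i,j))) then
             idx = (/ i, j /)
             return
          end if
       end do
    end do
  end function first_nan
end module nan_check

program nan_check_demo
  use, intrinsic :: ieee_arithmetic, only: ieee_value, ieee_quiet_nan
  use nan_check
  implicit none
  real(8) :: p(4,3)      ! stand-in for a host copy of a device array
  integer :: idx(2)

  p = 1.0d0
  p(2,3) = ieee_value(1.0d0, ieee_quiet_nan)   ! inject a NaN for the demo
  idx = first_nan(p)
  if (any(idx /= 0)) print *, 'first NaN at (i,j) =', idx
end program nan_check_demo
```

Calling such a check after each kernel in turn shows which kernel first produces a NaN, which is usually a good place to start looking for the uninitialized read.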

Also, it turns out that if I run two simulations at the same time, the second one exits. Is it possible to run two codes on the same GPU simultaneously, or will the GPU be busy? How can I make the codes share GPU resources, just like codes running on the CPU?

This will depend upon the type of device you have. If you have a compute capability (CC) 3.x device, then, yes, you can have multiple host contexts attached to a single device. For CC 2.x and CC 1.x devices, it may sometimes “work”, but it’s not supported and each host process should have its own GPU.

I also tried running the code from the command prompt to capture the error. It does not say much in the prompt; it just pops up the error message: Quick5_GeForce680_x64.exe has stopped working. A problem caused the program to stop working correctly. Windows will close the program and notify you if a solution is available.
You have two options: Debug - Close program
I clicked on Debug, and a new instance of VS 2010 opened with: Unhandled exception at 0x000000014001b9fb in Quick5_Geforce680_x64.exe:
0xC0000005: Access violation writing location 0x0000000141941000.
Also, in the “Call Stack” window I can see that the code stopped in the adpt_grid() subroutine; I don’t know why.
Then, on a later line in the same “Call Stack”, there is the message “frames below may be incorrect and/or missing, no symbols loaded for cudart64_50_35.dll”
Quick5_GeForce680_x64.exe!adaptnonnested_() subroutine.

It looks like it’s seg faulting. Can you try running the code in the debugger? It may be the same cause as the NaN where you have un-initialized memory.

  • Mat

Hi Mat,

I have GeForce680, I am compiling using CC 3.0. I just realized that the exit of the code is because of illegal memory access. Not sure why.

Here are some of the methods I use to allocate memory and to copy from device to host and from host to device:

        allocate(pDev(nx,ny),p1Dev(nx1,ny1),p2Dev(nx2,ny2),p3Dev(nx3,ny3),p4Dev(nx4,ny4), STAT=istat)
        if (istat /= 0) print *, 'error initializing pDev matrix...'

Copy from device to host memory:

        p1(1:nx1,1:ny1) = p1Dev

Copy from host to device:

        pDev = p(1:nx,1:ny)

Is that the correct way to do it?
Please advise if there is a better and safer way. I am just trying to figure out the cause of the uninitialized memory.

Dolf

I just realized that the exit of the code is because of illegal memory access

Assuming the sizes are correct and the nx/ny variables are initialized, then this code looks fine. Though, the “illegal memory access” may be in your kernel. Did you remember to guard your array accesses so that if you launch a kernel with more threads than there are elements in the array, you don’t have these threads access the arrays?

  • Mat

Did you remember to guard your array accesses so that if you launch a kernel with more threads than there are elements in the array, you don’t have these threads access the arrays?

I do not understand, can you give an example? I don’t think I am using this technique yet.

Dolf

In the following simple kernel, the test checks that the element being computed (“j”) is less than or equal to the total number of elements (“n”). Without it, if the total number of threads (set when the kernel is launched) is greater than “n”, the code would get an access violation.

Having fewer threads is bad too since not all elements would be computed.

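        attributes(global) subroutine stream_add(c, a, b, n)
          real*8, device :: c(*), a(*), b(*)
          integer, value :: n
          integer :: j
          ! global 1-based index of this thread
          j = threadIdx%x + (blockIdx%x-1) * blockDim%x
          ! guard: threads past the end of the arrays do nothing
          if (j .le. n) c(j) = a(j) + b(j)
          return
        end subroutine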
  • Mat
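
To avoid launching too few threads, the number of blocks is usually rounded up so that blocks times threads per block is at least “n”, and the guard in the kernel idles the surplus threads. A small sketch of that rounding in integer arithmetic follows; the helper name `blocks_for` and the choice of 16 threads per block are only illustrative assumptions.

```fortran
module launch_util
  implicit none
contains
  ! Smallest number of blocks such that nblocks * threads_per_block >= n.
  pure integer function blocks_for(n, threads_per_block)
    integer, intent(in) :: n, threads_per_block
    blocks_for = (n + threads_per_block - 1) / threads_per_block
  end function blocks_for
end module launch_util

program launch_util_demo
  use launch_util
  implicit none
  integer :: n, nblocks
  n = 100
  nblocks = blocks_for(n, 16)
  ! 7 blocks * 16 threads = 112 threads; the kernel's guard idles the last 12.
  print *, 'n =', n, ' blocks =', nblocks, ' threads =', nblocks * 16
end program launch_util_demo
```

In CUDA Fortran the result would be used as the grid dimension in the `<<<...>>>` launch. Integer rounding also avoids any floating-point conversion of the array size.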

I applied this method to all my kernels, as shown below.

Here is how I call the kernel:

        threads = dim3(16,16,1)
        grid = dim3(ceiling(real(nx1)/threads%x),ceiling(real(ny1)/threads%y), 1)

        call restrictPressure_kernel<<<grid,threads>>>(pDev,p1Dev,xrefDev,yrefDev,xref1Dev,yref1Dev,nx, ny, nx1,ny1,enclosingFineRectX1Dev,enclosingFineRectY1Dev)

        istat = cudaThreadSynchronize()
        if (istat .ne. 0 ) write(*,*) 'error restrictPressure kernel'

Here is the kernel subroutine:

        attributes (global) subroutine restrictPressure_kernel(fineMesh, coarseMesh, xrefFine, yrefFine, xrefCoarse, yrefCoarse,nxFine, nyFine, nxCoarse, nyCoarse, enclosingFineRectX, enclosingFineRectY)

          implicit none
          integer, value :: nxFine, nxCoarse, nyFine, nyCoarse
          real(8) :: fineMesh(nxFine, nyFine), coarseMesh(nxCoarse, nyCoarse), &
                     xrefFine(nxFine), yrefFine(nyFine), xrefCoarse(nxCoarse), yrefCoarse(nyCoarse)
          real(8) :: enclosingFineRectX(nxCoarse), enclosingFineRectY(nyCoarse)
          integer :: xIndex, yIndex, i, j
          real(8) :: length, height, b, c, xx, yy, H1, H2, H3, H4

          i = (blockidx%x - 1) * blockDim%x + threadidx%x
          j = (blockidx%y - 1) * blockDim%y + threadidx%y

          if( i <= nxCoarse ) then
            if ( j <= nyCoarse ) then

              >> the rest of the subroutine here <<
            end if
          end if

As you can see, I restricted the execution to only the valid threads.

I hope I am doing it right.
Dolf
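
For reference, a self-contained sketch of the same pattern (guarded kernel, rounded-up launch, error check after the launch) in CUDA Fortran. The kernel `scale_kernel`, the array size, and the block size are made up for illustration and are not from the code in this thread. It assumes the `cudafor` module's `cudaGetLastError`, `cudaDeviceSynchronize`, and `cudaGetErrorString`, which report more detail than a fixed message. It needs a CUDA Fortran compiler and a GPU to run.

```fortran
module guard_demo
  use cudafor
  implicit none
contains
  ! Multiply each element of a by s; one thread per element.
  attributes(global) subroutine scale_kernel(a, n, s)
    integer, value :: n
    real(8), value :: s
    real(8) :: a(n)
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) a(i) = s * a(i)      ! guard: surplus threads do nothing
  end subroutine scale_kernel
end module guard_demo

program check_launch
  use cudafor
  use guard_demo
  implicit none
  integer, parameter :: n = 1000, tpb = 128   ! illustrative sizes
  real(8) :: a(n)
  real(8), device, allocatable :: a_d(:)
  integer :: istat

  a = 1.0d0
  allocate(a_d(n), stat=istat)
  if (istat /= 0) stop 'device allocation failed'
  a_d = a                                   ! host to device

  ! Round the block count up so that every element gets a thread.
  call scale_kernel<<<(n + tpb - 1) / tpb, tpb>>>(a_d, n, 2.0d0)

  istat = cudaGetLastError()                ! launch/configuration errors
  if (istat /= cudaSuccess) print *, 'launch error: ', trim(cudaGetErrorString(istat))
  istat = cudaDeviceSynchronize()           ! errors raised while the kernel ran
  if (istat /= cudaSuccess) print *, 'kernel error: ', trim(cudaGetErrorString(istat))

  a = a_d                                   ! device to host
  print *, 'max deviation from 2.0:', maxval(abs(a - 2.0d0))
  deallocate(a_d)
end program check_launch
```

Stopping on a failed allocation, rather than only printing a message and continuing, keeps a later copy or kernel from touching an unallocated device array.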