invalid device function

I'm getting error messages from my CUDA card that I'm finding all but impossible to solve; after four weeks I've got nowhere. The card is a Quadro FX 1600M that I'm using for development. I'm aware it's CC 1.1 and that the second message relates to it being less than CC 1.3, but I don't know what the error means or how to get around it (if indeed I can). Is it simply that I can't call kernels on this card?

I'm simply trying to run the mmul example from the manual.

I compile with
pgfortran.exe -V -ta=nvidia:cuda4.0,nofma,time,wait -Mconcur -Mvect -Mcuda=cc11,ptxinfo -mp -Mchkfpstk -Mipa=fast,inline -Mpfi -Minfo -o mgeno2_MC.exe mgeno2_MC.f90


The first error message I get is:
42
limit is not supported on this architecture
when I execute this line:
Cdev = Csub(1:N,1:L)

and
8
invalid device function

Error! Kernel failed!
when calling the kernel:
call mmul_kernel<<<dimGrid,dimBlock>>>( Adev, Bdev, Cdev, N, M, L )

The properties of the card are:
Cuda REset device:
no error

number of cuda devices: 1
got cuda device: 0
Cuda set device:
no error

cuda_prop name:
Quadro FX 1600M



cuda_prop major: 100
cuda_prop minor: 10
cuda_prop MP count: 4
cuda_prop GlobalMem: 268107776
cuda_prop ConstMem: 65536
cuda_prop MemPerBlock: 16384
cuda_prop WarpSize: 32
cuda_prop maxThreadsPerBlock: 512
cuda_prop maxThreadsDim(1): 512
cuda_prop maxThreadsDim(2): 512
cuda_prop maxThreadsDim(3): 64
cuda_prop maxGridSize(1): 65535
cuda_prop maxGridSize(2): 65535
cuda_prop maxGridSize(3): 1
cuda_prop asyncEngineCount: 1
cuda_prop integrated: 0
cuda_prop canMapHostMemory: 1
cuda_prop concurrentKernels: 0
cuda_prop ECCEnabled: 0
cudaLimitStackSize: 0
cudaLimitMallocHeapSize: 0
Cuda Driver Version: 4000
Cuda RunTime Version: 4000

I note the stack size is zero; I'm not sure why.

Any help would be very gratefully received, since I'm now going mad.
Mike

Hi Mike,

Is it simply that I can't call kernels on this card?

While I don’t know the specifics of your device, I would think it’s CUDA capable. I can run the example on my laptop’s Quadro FX 880M.

Are you able to run CUDA C code?

What happens if you simplify your flag set to just:

pgfortran.exe -V -Mcuda=cc11 -Minfo -o mgeno2_MC.exe mgeno2_MC.f90



I note the stack size is zero; I'm not sure why.

I find this suspicious as well, especially given the “limit is not supported on this architecture” error.

  • Mat

Mat, thanks for the reply. I ran the matrix multiply example from the CUDA4 SDK Browser, which I think is written in C; the output is below (after the deviceQuery results).
So it looks like the card can run CUDA. Are you suggesting it's a Fortran issue?

I can send the code and data file if that will help, although all it is is an implementation of the matmul example.

Thanks
Mike

Device 0: “Quadro FX 1600M”
CUDA Driver Version / Runtime Version 4.0 / 4.0
CUDA Capability Major/Minor version number: 1.1
Total amount of global memory: 256 MBytes (268107776 bytes)
( 4) Multiprocessors x ( 8) CUDA Cores/MP: 32 CUDA Cores
GPU Clock Speed: 1.25 GHz
Memory Clock rate: 800.00 Mhz
Memory Bus Width: 128-bit
Max Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(8192) x 512, 2D=(8192,8192) x 512
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: No
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 1 / 0
Compute Mode:

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4.0, CUDA Runtime Version = 4.0, NumDevs = 1, Device = Quadro FX 1600M
[deviceQuery.exe] test results…
PASSED


[matrixMul.exe] starting…
[ matrixMul ]
C:\Documents and Settings\All Users\Application Data\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.0\C\Bin\win32\release\matrixMul.exe Starting (CUDA and CUBLAS tests)…

Device 0: “Quadro FX 1600M” with Compute 1.1 capability

Using Matrix Sizes: A(160 x 320), B(160 x 320), C(160 x 320)

Runing Kernels…

CUBLAS Throughput = 22.5992 GFlop/s, Time = 0.00072 s, Size = 16384000 Ops

CUDA matrixMul Throughput = 18.6016 GFlop/s, Time = 0.00088 s, Size = 16384000 Ops, NumDevsUsed = 1, Workgroup = 256

Comparing GPU results with Host computation…

Comparing CUBLAS & Host results
CUBLAS compares OK

Comparing CUDA matrixMul & Host results
CUDA matrixMul compares OK

So it looks like the card can run CUDA. Are you suggesting it's a Fortran issue?

Possible. If you’re just running our basic matmul.cuf example (http://www.pgroup.com/lit/samples/matmul.CUF) then it’s most likely a compatibility issue.

Have you tried compiling with the reduced flag set? My concern is that you have both a PGI Accelerator Model flag (-ta) and CUDA Fortran flags (-Mcuda), but there are inconsistencies between them (cuda4.0 under -ta vs. cc11 under -Mcuda). Since this is a CUDA Fortran code, I'd like to know what happens when you compile without any flags except “-Mcuda=cc11”.

If this works, then start adding back flags until you get the failure. If it still fails, then we'll need to dig deeper.

Also, are you running Win64 or Win32? Which compiler version?

  • Mat

I forgot: Windows XP 32-bit and PGI 11.8.

Hmmm, it looks like we may be getting somewhere. I'm sorry, I didn't realise that you couldn't use -ta=nvidia and -Mcuda at the same time.

I removed all flags except
pgfortran.exe -V -Mcuda=cc11 -Minfo -o mgeno2_MC.exe mgeno2_MC.f90

and compiled, but it gave me a few unresolved externals (cudaDeviceReset()), which I find odd because other CUDA calls, e.g. cuda_info = cudaGetDeviceCount(cuda_numdevices), are OK. Does that mean they are in different libraries?

I commented the offending ones out and it compiles and runs. Error 42 has gone, but the invalid device function error remains:

sizeAsub=: 23936
sizeBsub=: 23936
sizeCsub=: 64
sizeCsub1=: 8
sizeCsub2=: 8
mmulmc: Allocating Adev, Bdev and Cdev
mmulmc: dimGrid= 2 2 1
mmulmc: dimBlock= 4 4 1
mmulmc: assigning grid
mmulmc: starting main loop
mmulmc: zeroising Csub
mmulmc: Copy zeroised Csub to Cdev
mmulmc: calling mmul_kernel
8
invalid device function

Warning: ieee_inexact is signaling
Error! Kernel failed!
thread 0 stack: max 2241KB, used 0KB
thread 1 stack: max 0KB, used 0KB

Hi Mike,

“cudaDeviceReset” is new in CUDA 4.0, which is why you're getting an undefined external error. Try again with “-Mcuda=cuda4.0”; by default, CUDA 3.2 is used.
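For example, combined with the reduced flag set from before:

pgfortran.exe -V -Mcuda=cc11,cuda4.0 -Minfo -o mgeno2_MC.exe mgeno2_MC.f90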

I'm sorry, I didn't realise that you couldn't use -ta=nvidia and -Mcuda at the same time.

You can, but you had mismatched settings.
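If you do want both models enabled, the suboptions need to agree; for example, something like the following (assuming your release accepts the same cc11/cuda4.0 suboptions for -ta=nvidia):

pgfortran.exe -ta=nvidia:cuda4.0,cc11 -Mcuda=cuda4.0,cc11 -Minfo -o mgeno2_MC.exe mgeno2_MC.f90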

Besides adding “cudaDeviceReset”, have you made other changes to the Matmul example?

  • Mat

Mat, I'm very grateful for your help with this. I've been at it for weeks and been getting nowhere!

When I include the following code, I get error 42:

cuda_info = cudaDeviceGetLimit(retVal, cudaLimitStackSize)
print *, "cudaLimitStackSize:", cudaGetErrorString(cuda_info)
print *, "cudaLimitStackSize:", retVal

as well as in the line

Cdev = Csub(1:N,1:L)

When I comment out the three lines, I get no error in either place; it seems the call to cudaDeviceGetLimit causes errors elsewhere.
I've adapted the mmul routine to be generic and multiply large matrices in blocks, roughly along the lines of the sketch below. My intention is to make a matmul routine that will run on any CUDA card with any amount of memory; we need to multiply and then invert matrices of 20,000 × 50,000.
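For context, the host side of the blocking looks roughly like this (a simplified sketch: panel, j0 and jw are illustrative names, and for matrices this large A itself would also have to be blocked rather than held whole on the device):

! Multiply C = A*B one column panel of B (and C) at a time so the
! device arrays fit within the card's 256 MB of global memory.
do j0 = 1, L, panel
   jw = min(panel, L - j0 + 1)           ! width of this panel
   Bdev(1:M,1:jw) = B(1:M, j0:j0+jw-1)   ! host-to-device copy of the panel
   ! dimGrid would be recomputed from jw in the full routine
   call mmul_kernel<<<dimGrid,dimBlock>>>( Adev, Bdev, Cdev, N, M, jw )
   C(1:N, j0:j0+jw-1) = Cdev(1:N,1:jw)   ! copy the result panel back
end do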

I'm still getting error 8, invalid device function.
Mike

Hi Mike,

I tested “cudaDeviceGetLimit” on my laptop and get the same error. It also fails on a CC 1.3 device but succeeds on a CC 2.0 device, so the error seems to be expected on yours.
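If you want to keep the query in your code, you can guard it by compute capability. A minimal sketch, assuming the cudafor module from your release:

program check_limit
use cudafor
implicit none
type(cudaDeviceProp) :: prop
integer(kind=cuda_count_kind) :: stackSize
integer :: ierr

! Query the compute capability first; cudaDeviceGetLimit is only
! supported on CC 2.0 (Fermi) and newer devices.
ierr = cudaGetDeviceProperties(prop, 0)
if (prop%major >= 2) then
   ierr = cudaDeviceGetLimit(stackSize, cudaLimitStackSize)
   print *, "cudaLimitStackSize:", stackSize
else
   print *, "skipping cudaDeviceGetLimit: CC", prop%major, ".", prop%minor
end if
end program check_limit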

I'm still getting error 8, invalid device function.

Can you post a reproducing example or send the full source to PGI Customer Service (trs@pgroup.com)? It's probably easiest if I just look at the code instead of guessing.

  • Mat

OK Mat, will do, and thanks very much for your help; it's driving me crazy! I'll email the source and a little of the data.

Mike

Mike and I corresponded via email and determined that the invalid device function error was due to a double-precision value in his kernel; CC 1.1 devices don't support double precision.
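For illustration, here is a hypothetical kernel showing how easily this happens; the subroutine name and body are made up, but the rule is the same for any kernel:

module scale_mod
use cudafor
contains
  attributes(global) subroutine scale_kernel( A, n )
    real(4) :: A(*)        ! must stay single precision on CC 1.1;
    integer, value :: n    ! declaring real(8) here invalidates the kernel
    integer :: i
    i = (blockidx%x - 1) * blockdim%x + threadidx%x
    ! 0.5 is a default (single precision) real literal; writing 0.5d0
    ! instead would also make this an invalid device function on CC 1.1
    if (i <= n) A(i) = A(i) * 0.5
  end subroutine scale_kernel
end module scale_mod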

In addition, he was using dynamic shared memory (automatics) in his kernel but had not added the shared-memory size at the kernel launch, which was causing the kernel to crash. To fix it, he needed to either use fixed-size shared-memory arrays or pass the shared-memory size at the kernel launch:

call mmul_kernel<<<dimGrid,dimBlock,(BLOCK_SIZE*BLOCK_SIZE*8)>>>( Adev, Bdev, Cdev, N, M, L )
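For reference, the fixed-size alternative looks roughly like this (a sketch following the standard mmul example; it assumes N, M and L are multiples of BLOCK_SIZE and that dimBlock = dim3(BLOCK_SIZE,BLOCK_SIZE,1)):

module mmul_mod
use cudafor
integer, parameter :: BLOCK_SIZE = 16
contains
  attributes(global) subroutine mmul_kernel( A, B, C, N, M, L )
    real :: A(N,M), B(M,L), C(N,L)
    integer, value :: N, M, L
    ! Fixed-size shared tiles: the compiler reserves the memory, so no
    ! shared-memory byte count is needed in the launch chevrons.
    real, shared :: Asub(BLOCK_SIZE,BLOCK_SIZE), Bsub(BLOCK_SIZE,BLOCK_SIZE)
    real :: Cij
    integer :: i, j, kb, k, tx, ty
    tx = threadidx%x
    ty = threadidx%y
    i = (blockidx%x - 1) * BLOCK_SIZE + tx
    j = (blockidx%y - 1) * BLOCK_SIZE + ty
    Cij = 0.0
    do kb = 1, M, BLOCK_SIZE
       ! each thread loads one element of the A and B tiles
       Asub(tx,ty) = A(i, kb + ty - 1)
       Bsub(tx,ty) = B(kb + tx - 1, j)
       call syncthreads()
       do k = 1, BLOCK_SIZE
          Cij = Cij + Asub(tx,k) * Bsub(k,ty)
       end do
       call syncthreads()
    end do
    C(i,j) = Cij
  end subroutine mmul_kernel
end module mmul_mod

This version is launched with plain chevrons, call mmul_kernel<<<dimGrid,dimBlock>>>( Adev, Bdev, Cdev, N, M, L ), with no third argument.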
  • Mat