CUDA 3.2 on GTX 480 is "busy or unavailable"

I can’t get any of the shipped SDK samples to run, other than deviceQuery. All of the samples that actually launch a kernel fail with:

“Runtime API error : all CUDA-capable devices are busy or unavailable.”

With the 3.1 toolkit the samples run fine, and 3.2 on a different card (a GTX 295) also works. I’m using driver 260.99, but it doesn’t work with 260.93 either.

Maybe there’s something obvious I’m overlooking? Any advice is appreciated.

Here’s the deviceQuery output

deviceQuery.exe Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

There is 1 device supporting CUDA

Device 0: "GeForce GTX 480"

  CUDA Driver Version:                           3.20

  CUDA Runtime Version:                          3.20

  CUDA Capability Major/Minor version number:    2.0

  Total amount of global memory:                 1576468480 bytes

  Multiprocessors x Cores/MP = Cores:            15 (MP) x 32 (Cores/MP) = 480 (Cores)

  Total amount of constant memory:               65536 bytes

  Total amount of shared memory per block:       49152 bytes

  Total number of registers available per block: 32768

  Warp size:                                     32

  Maximum number of threads per block:           1024

  Maximum sizes of each dimension of a block:    1024 x 1024 x 64

  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1

  Maximum memory pitch:                          2147483647 bytes

  Texture alignment:                             512 bytes

  Clock rate:                                    0.81 GHz

  Concurrent copy and execution:                 Yes

  Run time limit on kernels:                     Yes

  Integrated:                                    No

  Support host page-locked memory mapping:       Yes

  Compute mode:                                  Default (multiple host threads can use this device simultaneously)

  Concurrent kernel execution:                   Yes

  Device has ECC support enabled:                No

  Device is using TCC driver mode:               No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.20, CUDA Runtime Version = 3.20, NumDevs = 1, Device = GeForce GTX 480

PASSED

Thanks

The exact same thing happens to me: three GTX 480s, no SLI. I use the GPUs for computation, and deviceQuery is the only sample I can get to work. Particles fails with the “busy or unavailable” error. Hopefully someone can shed some light on a fix.

Any progress? I’m looking at 4 x GTX 570 in a Gigabyte UD9 mainboard.

But it would be a shame to bump into show-stoppers under 64 bit Win7.

My original post was regarding a GTX 480, not a 570. I recently tried the 263.06 driver and I still can’t use CUDA 3.2 without getting the “busy or unavailable” error.

If the current driver/toolkit has an issue with GTX 480/470, I’d fear similar for 580/570. Are you running Vista or Win7?

I wonder if this is another of those “Vista things”. Have you heard of anything similar under XP64?

I’m currently using Win7 x64 (dual 6-core CPUs, 48GB of memory). I haven’t seen any mention of this problem with XP, although there are some posts on the “General CUDA” forum describing the same problem that don’t identify the OS.

I’ve tried a lot of different configurations and I can’t get the 480 to work with 3.2. The previous version has some performance glitches but otherwise works fine, and after a month of trying to figure this out I’m going back to 3.1.

I can assure you that CUDA 3.2 does work on Win7x64 with GeForce GTX 480 (and other GeForce cards as well)… I have that exact configuration in my desktop machine.

The “busy or unavailable” error message indicates that context creation in the CUDA driver failed for some reason. There are a number of possibilities as to why that might happen, so we’ll have to dig a little to figure out what’s causing it for you. Can you tell me more about your system configuration? How much system memory, etc.?

One thing that would be helpful is to try one of the Driver API based SDK samples (the ones with names ending in “Drv”). When you run one of those, it should error out at context creation time (the same thing that is happening internally in the CUDA Runtime with the other SDK samples) and tell you which driver error occurred during context creation. That error code number would be useful to know as well.
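
If it’s easier than rebuilding one of the samples, here is a minimal sketch of the same probe the “Drv” samples perform at startup. This is just an illustration I’m putting together here, not one of the shipped samples; compile it as a regular host program and link against cuda.lib:

// Minimal Driver API probe: attempt the same context creation the "Drv"
// samples perform and print the raw CUresult code if any step fails.
#include <cuda.h>
#include <cstdio>

int main()
{
    CUresult err = cuInit(0);
    if (err != CUDA_SUCCESS) {
        printf("cuInit failed with error %d\n", (int)err);
        return 1;
    }

    CUdevice dev;
    err = cuDeviceGet(&dev, 0);
    if (err != CUDA_SUCCESS) {
        printf("cuDeviceGet failed with error %d\n", (int)err);
        return 1;
    }

    CUcontext ctx;
    err = cuCtxCreate(&ctx, 0, dev);  // the call that fails with "busy or unavailable"
    if (err != CUDA_SUCCESS) {
        printf("cuCtxCreate failed with error %d\n", (int)err);
        return 1;
    }

    printf("Context created successfully\n");
    cuCtxDestroy(ctx);
    return 0;
}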

Thanks,

Cliff

Cliff, thank you for your attention to this problem.

Since my last post two weeks ago I didn’t end up going back to version 3.1; I just swapped my 480 for a 295 and everything started working again. Since the non-Fermi card isn’t a long-term solution for me, I’ve since bought a C2070, and much to my dismay it produces the same errors as the 480 did.

But there’s hope! When the Tesla is running in TCC mode everything works fine, which is good since I was intending to use TCC anyway. My current setup has a C2070 and a Quadro FX1700 to run the display. To help identify what is going wrong I turned off TCC so that I could reproduce the errors.
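
(In case it’s useful, here is roughly how I check which driver mode each device is in. It’s just a small sketch of mine using the runtime API; it assumes the tccDriver field in cudaDeviceProp, which is the same information the 3.2 deviceQuery prints.)

// Small sketch: list each CUDA device and whether it is running under the
// TCC driver (the same information deviceQuery reports).
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) {
        printf("cudaGetDeviceCount failed\n");
        return 1;
    }

    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s, TCC driver mode: %s\n",
               i, prop.name, prop.tccDriver ? "Yes" : "No");
    }
    return 0;
}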

Here’s the output from matrixMul and matrixMulDrv:

C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 3.2\C\bin\win64\Release>matrixMul.exe

[ matrixMul ]

matrixMul.exe Starting...

Device 0: "Tesla C2070" with Compute 2.0 capability

Using Matrix Sizes: A(80 x 160), B(80 x 80), C(80 x 160)

d:/bld_sdk10_x64.pl/rel/gpgpu/toolkit/r3.2/sdk/SDK10/Compute/C/src/matrixMul/matrixMul.cu(127) : cudaSafeCall()

 Runtime API error : all CUDA-capable devices are busy or unavailable.

C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 3.2\C\bin\win64\Release>matrixMulDrv.exe

[ matrixMulDrv (Driver API) ]

> Using CUDA Device [0]: Tesla C2070

> GPU Device has SM 2.0 compute capability

  Total amount of global memory:     5589499904 bytes

  64-bit Memory Address:             YES

cuSafeCallNoSync() Driver API error = 0002 from file <.\matrixMulDrv.cpp>, line 89.

Here’s my current deviceQuery information:

deviceQuery.exe Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

There are 2 devices supporting CUDA

Device 0: "Tesla C2070"

  CUDA Driver Version:                           3.20

  CUDA Runtime Version:                          3.20

  CUDA Capability Major/Minor version number:    2.0

  Total amount of global memory:                 5589499904 bytes

  Multiprocessors x Cores/MP = Cores:            14 (MP) x 32 (Cores/MP) = 448 (Cores)

  Total amount of constant memory:               65536 bytes

  Total amount of shared memory per block:       49152 bytes

  Total number of registers available per block: 32768

  Warp size:                                     32

  Maximum number of threads per block:           1024

  Maximum sizes of each dimension of a block:    1024 x 1024 x 64

  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1

  Maximum memory pitch:                          2147483647 bytes

  Texture alignment:                             512 bytes

  Clock rate:                                    1.15 GHz

  Concurrent copy and execution:                 Yes

  Run time limit on kernels:                     Yes

  Integrated:                                    No

  Support host page-locked memory mapping:       Yes

  Compute mode:                                  Default (multiple host threads can use this device simultaneously)

  Concurrent kernel execution:                   Yes

  Device has ECC support enabled:                Yes

  Device is using TCC driver mode:               No

Device 1: "Quadro FX 1700"

  CUDA Driver Version:                           3.20

  CUDA Runtime Version:                          3.20

  CUDA Capability Major/Minor version number:    1.1

  Total amount of global memory:                 511246336 bytes

  Multiprocessors x Cores/MP = Cores:            4 (MP) x 8 (Cores/MP) = 32 (Cores)

  Total amount of constant memory:               65536 bytes

  Total amount of shared memory per block:       16384 bytes

  Total number of registers available per block: 8192

  Warp size:                                     32

  Maximum number of threads per block:           512

  Maximum sizes of each dimension of a block:    512 x 512 x 64

  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1

  Maximum memory pitch:                          2147483647 bytes

  Texture alignment:                             256 bytes

  Clock rate:                                    0.92 GHz

  Concurrent copy and execution:                 Yes

  Run time limit on kernels:                     Yes

  Integrated:                                    No

  Support host page-locked memory mapping:       Yes

  Compute mode:                                  Default (multiple host threads can use this device simultaneously)

  Concurrent kernel execution:                   No

  Device has ECC support enabled:                No

  Device is using TCC driver mode:               No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.20, CUDA Runtime Version = 3.20, NumDevs = 2, Device = Tesla C2070, Device = Quadro FX 1700

PASSED

The host workstation:

Dell T7500, Dual Xeon X5680s, 48GB RAM in NUMA, Tesla C2070, Quadro FX 1700, Win7x64. I’ve tried many different driver versions (>260) and I’m using the precompiled version 3.2 binaries.

Thanks

This is very useful information, thanks. The key is in this line from the matrixMulDrv output:

cuSafeCallNoSync() Driver API error = 0002 from file <.\matrixMulDrv.cpp>, line 89.

What’s happening here is exactly as I suspected: cuCtxCreate() is failing. It’s failing with error 2, which you can find in cuda.h as CUDA_ERROR_OUT_OF_MEMORY.
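
For reference, the beginning of the CUresult enumeration in cuda.h looks roughly like this (abbreviated and written from memory here, so check the header that ships with your toolkit for the authoritative list):

// Abbreviated CUresult values from cuda.h (check your installed header);
// error 2 is the out-of-memory case reported above.
typedef enum cudaError_enum {
    CUDA_SUCCESS               = 0,
    CUDA_ERROR_INVALID_VALUE   = 1,
    CUDA_ERROR_OUT_OF_MEMORY   = 2,
    CUDA_ERROR_NOT_INITIALIZED = 3,
    CUDA_ERROR_DEINITIALIZED   = 4
    /* ... many more values follow ... */
} CUresult;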

So then the question becomes, why would you run out of memory during context creation? It turns out there was a driver bug a few months ago on Windows systems with relatively large amounts of system memory (where here “large” was defined as >= 48GB, if I recall correctly, which is exactly what you have), where the symptom would be more or less exactly what you’re seeing.

However, this should have been fixed already in the drivers you said you’ve tried. As an experiment, can you try reducing the amount of sysmem in the system down to say 16GB or 32GB and see if the issue can still be reproduced? That will help me to know whether this is the same issue or different issues with similar-sounding symptoms.

Thanks,

Cliff

Cliff, I have an almost identical system to Croow’s: Supermicro SuperServer 7046GT, dual Xeon X5680s at 3.33GHz, Tesla C2050, Quadro FX 580, Win7 x64. I’m using the newest development driver, version 263.06, with CUDA Toolkit 3.2 and GPU Computing SDK 3.2.

I experienced the same errors when trying to run any of the SDK “C” binaries:

cudaSafeCall() RuntimeAPI error : out of memory
cudaSafeCall() RuntimeAPI error : all CUDA-capable devices are busy or unavailable
(Note that all of the OpenCL binaries work with no problems with 48GB RAM.)

I reduced the amount of physical memory visible to Windows 7 x64 via Run → msconfig → Boot → Advanced Options, setting Maximum memory to 32 GB (32,768 MB). Suddenly everything works fine! Thank you for the insight.
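
(For anyone who wants to double-check that the cap actually took effect after rebooting, here is a quick sketch using the Win32 GlobalMemoryStatusEx call. It has nothing to do with CUDA; it just reports how much physical memory the OS now sees.)

// Quick check of how much physical memory Windows reports after capping it
// in msconfig (plain Win32, nothing CUDA-specific).
#include <windows.h>
#include <cstdio>

int main()
{
    MEMORYSTATUSEX status;
    status.dwLength = sizeof(status);
    GlobalMemoryStatusEx(&status);
    printf("Physical memory visible to the OS: %llu MB\n",
           (unsigned long long)(status.ullTotalPhys / (1024ULL * 1024ULL)));
    return 0;
}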

I must say that it is truly embarrassing that this sort of bug was not fixed for “a few months.”

Further on…

I can confirm what Croow discovered: enabling TCC on the Tesla C2050 cures the problem as well. The SDK C binaries run fine with 48GB RAM and the Tesla in TCC mode. That’s progress; we now have two workarounds for the problem.

Hi guys,

Thanks for confirming this. I will reopen the bug internally and pass along the details you’ve all reported, noting that the bug has resurfaced. (For what it’s worth, this was thought to have been fixed a while ago; perhaps the fix did not get carried forward to the newer driver branches by some oversight? I will investigate.)

Thanks,

Cliff

I have the same problem with a VS2010 x64 build running on 266.58 WHQL (the Win32 build works). Until this is fixed, could you tell me which is the newest driver that doesn’t have this problem?

Thanks,
Stefan

We have the same problem with a VS2008 x64 build running on 266.58 WHQL and CUDA Toolkit v3.2. The system has a server mainboard with 64GB of RAM and a GTX 580 for CUDA calculations. We have also tested the beta driver (267.31) with the same result (“Runtime API error : all CUDA-capable devices are busy or unavailable.”).

Is there any driver that doesn’t have this problem?

Thanks,
Thomas

Yes, this bug should be fixed in the 270.32 driver that was released with CUDA Toolkit 4.0 RC1 to our Registered Developer site last Friday. Please let me know if that fixes up the issue for you.

Thanks,

Cliff

Hello.

With this driver (270.32) the bug is solved :-).

Thanks for the fast reply,

Thomas

Hmm. With 270.32 (WDDM on a GTX 480, Win7 x64), Nsight isn’t working (it doesn’t stop at breakpoints), even with 3.2 builds. So for now I have to limit the system memory to 32GB on my Dell Precision T7500 workstation. Is there a chance of getting a driver soon that doesn’t contain this bug and works with Nsight, or do we have to wait for a CUDA 4.0-capable Nsight?

Hi there,

In versions of Parallel Nsight up to now, there has been a fairly tight bond between Nsight and the NVIDIA driver, so the currently released version of Nsight is not yet compatible with drivers from the 270.xx series (which are technically still pre-release drivers, I should point out). The next version of Nsight will have support for 270.xx; I’m told it will also start to relax the version compatibility requirements somewhat for future drivers.

So until the new version of Nsight comes out, you’ll have to choose between >32GB and Nsight, unfortunately. Sorry about that.

Thanks,

Cliff

Hm, I’ve got a similar problem. It’s a Linux machine with a Tesla C2050, CUDA 3.2, driver version 260.19.12, and Ubuntu 10.10. It was working just fine earlier today; no updates in between, I was just playing with some code.

The SDK samples fail:

[ matrixMul ]

./matrixMul Starting...

Device 0: "Tesla C2050" with Compute 2.0 capability

Using Matrix Sizes: A(80 x 160), B(80 x 80), C(80 x 160)

matrixMul.cu(127) : cudaSafeCall() Runtime API error : all CUDA-capable devices are busy or unavailable.

[ matrixMulDrv (Driver API) ]

> Using CUDA Device [0]: Tesla C2050

> GPU Device has SM 2.0 compute capability

  Total amount of global memory:     2817720320 bytes

  64-bit Memory Address:             NO

cuSafeCallNoSync() Driver API error = 0214 from file <matrixMulDrv.cpp>, line 89.

The error code is different; cuda.h says: CUDA_ERROR_ECC_UNCORRECTABLE = 214

Would a driver reinstall also help?

Same symptom here, but different cause. When a double-bit (unrecoverable) ECC error is detected, the GPU will (by design) prevent additional work from executing until you manually intervene. This is one of the three or four reasons why the CUDA Runtime might see that GPU as being “unavailable”.

Use nvidia-smi to check and clear the ECC errors, and you should be able to get back to work.
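
If you want your application to detect this state itself, here is a rough sketch (not required, just an illustration) that queries the ECC attribute through the Driver API and treats the CUDA_ERROR_ECC_UNCORRECTABLE code you decoded above as a special case at context creation:

// Sketch: report whether ECC is enabled on device 0 and recognize the
// ECC-uncorrectable failure at context creation time.
#include <cuda.h>
#include <cstdio>

int main()
{
    cuInit(0);

    CUdevice dev;
    cuDeviceGet(&dev, 0);

    int eccEnabled = 0;
    cuDeviceGetAttribute(&eccEnabled, CU_DEVICE_ATTRIBUTE_ECC_ENABLED, dev);
    printf("ECC enabled: %s\n", eccEnabled ? "Yes" : "No");

    CUcontext ctx;
    CUresult err = cuCtxCreate(&ctx, 0, dev);
    if (err == CUDA_ERROR_ECC_UNCORRECTABLE) {
        printf("Uncorrectable ECC error pending; clear it with nvidia-smi.\n");
        return 1;
    } else if (err != CUDA_SUCCESS) {
        printf("cuCtxCreate failed with error %d\n", (int)err);
        return 1;
    }

    cuCtxDestroy(ctx);
    return 0;
}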