invalid device function, all CUDA-capable devices are busy or unavailable

I ran into some issues running CUDA code on Linux.

I am not sure whether they are related to the super or not.

I checked LD_LIBRARY_PATH, and both RC3.1 and RC4.0 have the same issues.

Can anyone help me?

Thank you very much.

Issue 1: when I run an SDK example, it shows 'Runtime API error : all CUDA-capable devices are busy or unavailable'.

[jwang@super-496 ~]$ cd NVIDIA_GPU_Computing_SDK_4.0/C/bin/linux/release

[jwang@super-496 release]$ echo $LD_LIBRARY_PATH


[jwang@super-496 release]$ ./histogram


./histogram Starting…

CUDA device [Tesla C1060] has 30 Multi-Processors, Compute 1.3

Initializing data…

…allocating CPU memory.

…generating input data

…allocating GPU memory and copying input data

main.cpp(102) : cudaSafeCall() Runtime API error : all CUDA-capable devices are busy or unavailable.
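
One possible cause of "all CUDA-capable devices are busy or unavailable" is a device set to an exclusive or prohibited compute mode while another process holds it. This is only one candidate explanation, not a confirmed diagnosis for this machine. The sketch below prints the compute mode of each device so that possibility can be ruled in or out. The helper name `computeModeName` is made up for illustration. It reads the `computeMode` field of `cudaDeviceProp` as it exists in the CUDA 4.0-era runtime; newer toolkits may expect `cudaDeviceGetAttribute` instead.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: map the computeMode value to a readable name.
const char* computeModeName(int mode) {
    switch (mode) {
        case cudaComputeModeDefault:    return "Default";
        case cudaComputeModeExclusive:  return "Exclusive";
        case cudaComputeModeProhibited: return "Prohibited";
        default:                        return "Other";
    }
}

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            printf("Device %d: query failed: %s\n", dev, cudaGetErrorString(err));
            continue;
        }
        // A mode other than Default means access to the device is restricted.
        printf("Device %d: %s, compute mode: %s\n",
               dev, prop.name, computeModeName(prop.computeMode));
    }
    return 0;
}
```

If the mode is not Default, it is worth checking whether another process is already using the GPU before looking further.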

Issue 2: my CUDA code worked correctly two weeks ago, but now it reports an "invalid device function" error after launching a kernel, even a very simple or empty one.

[jwang@super-496 HHT]$ ./Test_HHT_CUDA

Cuda error: INDEX_CUDA: invalid device function.

I am using an mbp5 with Mac OS X SL 10.6.7, a GeForce 9400M (256 MB),
CMake 2.8 and CUDA 4.0.

And I get exactly the same error. I can't run a kernel, even an empty one, without catching the error.

The relevant part is:
__global__ void kernel_SetToOne(float* a, int nx, int ny, int pitch) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    if (c < nx && r < ny) {
        int index = c * pitch + r;
        a[index] = 1.;
    }
}

int nx = 256;
int ny = 256;
float* matrix;
// Pass the address of the pointer so cudaMalloc can store the device address in it.
cudaMalloc( (void **)&matrix, nx * ny * sizeof(float) );

kernel_SetToOne <<<dim3(16, 16), dim3(16,16) >>>( matrix, nx, ny, nx);
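
To see where an error like "invalid device function" comes from, it can help to check the launch and the execution of a kernel separately. The following is a minimal sketch, not the poster's code: `kernel_Empty` and `checkCuda` are names invented for illustration. It launches an empty kernel and reports the first failing step.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative empty kernel; it does no work.
__global__ void kernel_Empty() {}

// Hypothetical helper: returns true on success, otherwise prints the error.
bool checkCuda(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        printf("%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

int main() {
    kernel_Empty<<<1, 1>>>();

    // Errors detected at launch time are returned by cudaGetLastError().
    if (!checkCuda(cudaGetLastError(), "kernel launch")) return 1;

    // Errors that occur while the kernel runs show up once the host waits for it.
    if (!checkCuda(cudaDeviceSynchronize(), "kernel execution")) return 1;

    printf("kernel ran without error\n");
    return 0;
}
```

If even this empty kernel fails at the launch step, the problem is more likely in how the code was built or loaded than in the kernel body.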

NVidia never provided a solution to this very common problem. Bad NVidia!!

try the things mentioned here:

I found a solution.

I had compiled with the “-arch=sm_13” option, but the GPU in an mbp5 only supports compute capability 1.1:

deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

There is 1 device supporting CUDA

Device 0: "GeForce 9400M"

  CUDA Driver Version:                           4.0

  CUDA Runtime Version:                          4.0

  CUDA Capability Major/Minor version number:    1.1

  Total amount of global memory:                 265945088 bytes

  Multiprocessors x Cores/MP = Cores:            2 (MP) x 8 (Cores/MP) = 16 (Cores)

  Total amount of constant memory:               65536 bytes

  Total amount of shared memory per block:       16384 bytes

  Total number of registers available per block: 8192

  Warp size:                                     32

  Maximum number of threads per block:           512

  Maximum sizes of each dimension of a block:    512 x 512 x 64

  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1

  Maximum memory pitch:                          2147483647 bytes

  Texture alignment:                             256 bytes

  Clock rate:                                    1.10 GHz

  Concurrent copy and execution:                 No

  Run time limit on kernels:                     Yes

  Integrated:                                    Yes

  Support host page-locked memory mapping:       Yes

  Compute mode:                                  Default (multiple host threads can use this device simultaneously)

  Concurrent kernel execution:                   No

  Device has ECC support enabled:                No

  Device is using TCC driver mode:               No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4.0, CUDA Runtime Version = 4.0, NumDevs = 1, Device = GeForce 9400M

I’m not sure, but it seems to be a matter of which architecture the code is built for.

After changing all from


entries to


and recompiling the whole thing, it works fine.
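
A mismatch between the build architecture and the device can also be checked from inside a program. The sketch below is an illustration under assumptions, not part of the original posts: `kernel_Probe` and `ccAsInt` are invented names, and device 0 is assumed to be the GPU of interest. It prints the device's compute capability and asks the runtime, through `cudaFuncGetAttributes`, which code version it would use for the kernel. If the executable contains no code usable on the device, that call is expected to fail and the error string is printed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel used only as a probe; it does no work.
__global__ void kernel_Probe() {}

// Hypothetical helper: encode a compute capability as major*10 + minor (1.1 -> 11),
// the same form used by the binaryVersion and ptxVersion fields.
int ccAsInt(int major, int minor) { return major * 10 + minor; }

int main() {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // assumes device 0
    if (err != cudaSuccess) {
        printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("Device 0: %s, compute capability %d.%d\n",
           prop.name, prop.major, prop.minor);

    cudaFuncAttributes attr;
    err = cudaFuncGetAttributes(&attr, kernel_Probe);
    if (err != cudaSuccess) {
        // Likely outcome when the build targets an architecture the device lacks.
        printf("No usable code for kernel_Probe: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("Kernel binary version: %d, PTX version: %d, device: %d\n",
           attr.binaryVersion, attr.ptxVersion, ccAsInt(prop.major, prop.minor));
    return 0;
}
```

Running this once per build configuration gives a quick way to confirm that the `-arch` setting matches what deviceQuery reports.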

I suddenly got that error too, without much info. The weird thing is that the code was running perfectly before I added a kernel. The added kernel caused this error. I switched the code back to its initial state and the invalid device function error persists!

Just to answer my own question: the function that caused the above is this:

__global__ void kFindNumBlocks(cudaGridSimulation_t cGridSimIn, cudaNBSimulation_t cSimIn) {
	if (threadIdx.x == 0) {
		uint flatCounter = 0;
		for (int i = 0; i < cGridSimIn.xs * cGridSimIn.ys; ++i) {
			uint zOffset = cGridSimIn.offsets[i];
			if (zOffset > 0) {
				uint numZs = getNumZs2(zOffset, 256);
				for (int j = 0; j < numZs; ++j) {
		cSimIn.numBlocks = flatCounter;
		printf("nd: %d\n", cSimIn.numBlocks);

The kernel call was commented out while the error was present. The error disappeared when I deleted (commented out) the above code from the .cu file. Hmm.