Kernel Launch Failure: Very simple kernel


I wanted to try CUDA programming, so I found a very simple program with a very simple kernel to try it out.

Here is the code, which is supposed to square an array:

#include "cuda/cuda_runtime.h"
#include "cuda/cuda.h"
#include <iostream>
#include <algorithm>
#include <iterator>
#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>

// Kernel that executes on the CUDA device
__global__ void square_array(float *a, int N)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx<N) a[idx] = a[idx] * a[idx];
}

// main routine that executes on the host
int main(void)
{
  float *a_h, *a_d;  // Pointer to host & device arrays
  const int N = 10;  // Number of elements in arrays
  size_t size = N * sizeof(float);
  a_h = (float *)malloc(size);        // Allocate array on host
  cudaMalloc((void **) &a_d, size);   // Allocate array on device

  // Initialize host array and copy it to CUDA device
  for (int i=0; i<N; i++) a_h[i] = (float)i;
  cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);

  // Do calculation on device:
  int block_size = 4;
  int n_blocks = N/block_size + (N%block_size == 0 ? 0:1);
  square_array <<< n_blocks, block_size >>> (a_d, N);

  // Retrieve result from device and store it in host array
  cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);

  // Print results
  for (int i=0; i<N; i++) printf("%d %f\n", i, a_h[i]);

  // Cleanup
  free(a_h); cudaFree(a_d);
  return 0;
}

The problem is that the kernel is not launched: the output array is not modified, and cudaGetLastError confirms that the launch failed, but I don’t know why.

I’m running on CentOS 6.1.

Here are the specs of my NVIDIA card, which should be sufficient:


CUDA Device Query...

There are 1 CUDA devices.

CUDA Device #0

Major revision number:         1

Minor revision number:         1

Name:                          Quadro NVS 295

Total global memory:           267714560

Total shared memory per block: 16384

Total registers per block:     8192

Warp size:                     32

Maximum memory pitch:          2147483647

Maximum threads per block:     512

Maximum dimension 0 of block:  512

Maximum dimension 1 of block:  512

Maximum dimension 2 of block:  64

Maximum dimension 0 of grid:   65535

Maximum dimension 1 of grid:   65535

Maximum dimension 2 of grid:   1

Clock rate:                    1300000

Total constant memory:         65536

Texture alignment:             256

Concurrent copy and execution: No

Number of multiprocessors:     1

Kernel execution timeout:      Yes

Press any key to exit...

If I use cudaCheckError I get the following message; the only problem is that I don’t have any device…

cudaCheckError() failed at : invalid device function .

I have looked at several forums but didn’t find an answer. I can copy data to and from GPU memory without trouble; the only thing that fails is the kernel launch.

Thank you in advance for your help, feel free to ask me if you need more information.


You should pass the address of “a_d”, shouldn’t you?

square_array <<< n_blocks, block_size >>> (&a_d, N);



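No, he should not. a_d already holds the device pointer; &a_d would be the host address of that pointer (a float **), which does not match the kernel’s float * parameter.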

To the original poster: if you are getting invalid device function, that usually means that you are trying to run code which has been compiled for the wrong architecture compared with what you are trying to run it on. Could you edit how you are compiling this code into your original question, and add what CUDA version and OS you are using?
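
As a rough illustration of what "compiled for the wrong architecture" means, the sketch below encodes the compatibility rules as I understand them from the CUDA documentation: binary code built for sm_XY runs only on devices with the same major revision and a minor revision of at least Y, while PTX built for compute_XY can be JIT-compiled for any equal or higher capability. The Capability struct and both function names are made up for this example and are not part of the CUDA API; newer architecture-specific targets may have extra restrictions that are not modelled here.

```cpp
// Hypothetical type for this sketch: a compute capability such as 1.1 or 1.3.
struct Capability {
    int major_rev;
    int minor_rev;
};

// Binary code (-code=sm_XY): same major revision, device minor >= built minor.
bool binary_runs_on(Capability built_for, Capability device) {
    return built_for.major_rev == device.major_rev &&
           built_for.minor_rev <= device.minor_rev;
}

// PTX (-arch=compute_XY): the driver can JIT-compile it for any device whose
// capability is equal to or higher than the one it was built for.
bool ptx_runs_on(Capability built_for, Capability device) {
    if (device.major_rev != built_for.major_rev) {
        return device.major_rev > built_for.major_rev;
    }
    return device.minor_rev >= built_for.minor_rev;
}
```

With the device query above (revision 1.1), binary_runs_on({1, 3}, {1, 1}) is false, which is the kind of mismatch that typically shows up as "invalid device function".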

Thank you for your quick and accurate answer, avidday; the architecture was indeed the problem.

So I changed the options in my makefile and it worked. Thank you again.

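For anyone with the same problem, the nvcc options that select the architecture are:

nvcc -arch=compute_13 -code=sm_13

  • arch specifies the virtual architecture, which can be compute_10, compute_11, etc.

  • code specifies the real architecture, which can be sm_10, sm_11, etc.

The values must not exceed the card’s compute capability: the example above targets 1.3, while the Quadro NVS 295 in the device query reports 1.1, which corresponds to compute_11 and sm_11.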
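
To make the mapping from revision numbers to flag values concrete, here is a small sketch that builds the two options from a major and minor revision. The function name nvcc_arch_flags is hypothetical; it only formats a string and does not call nvcc or the CUDA runtime.

```cpp
#include <string>

// Hypothetical helper: turn a compute capability (e.g. 1.1) into the matching
// -arch / -code options described above.
std::string nvcc_arch_flags(int major_rev, int minor_rev) {
    // Capability 1.1 becomes the suffix "11", 1.3 becomes "13", and so on.
    const std::string suffix =
        std::to_string(major_rev) + std::to_string(minor_rev);
    return "-arch=compute_" + suffix + " -code=sm_" + suffix;
}
```

For the card in the device query, nvcc_arch_flags(1, 1) gives "-arch=compute_11 -code=sm_11". In a real program the two numbers could be read from the major and minor fields of cudaDeviceProp after calling cudaGetDeviceProperties.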