Running CUDA Program on Cluster doesn't work


I just started developing CUDA applications. I compiled the code example posted below with Nsight and the NVIDIA CUDA Toolkit, versions 5.0 and 4.1.28. Running it on my own device (GeForce GTX 580M) is no problem.

But running it on a cluster (Tesla C2075, installed CUDA version 4.2.9) leads to a problem with the cudaMalloc and cudaMemcpy functions.

The error occurring is: “device kernel image is invalid”

How can we solve this?

// example1.cpp : Defines the entry point for the console application.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>   // for strcmp
#include <cuda.h>
#include <cutil.h>

// Kernel that executes on the CUDA device
__global__ void square_array(float *a, int N)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < N) a[idx] = a[idx] * a[idx];
}

// main routine that executes on the host
int main(int argc, char *argv[])
{
  float *a_h, *a_d, *b_h, *b_d;  // Pointer to host & device arrays
  int N = atoi(argv[1]);
  size_t size = N * sizeof(float);
  a_h = (float *)malloc(size);        // Allocate array on host
  b_h = (float *)malloc(size);        // Allocate array on host
  CUDA_SAFE_CALL(cudaMalloc((void **) &a_d, size));   // Allocate array on device
  CUDA_SAFE_CALL(cudaMalloc((void **) &b_d, size));   // Allocate array on device
  // Initialize host array and copy it to CUDA device
  for (int i=0; i<N; i++) a_h[i] = (float)i;
  for (int i=0; i<N; i++) b_h[i] = (float)i;
  CUDA_SAFE_CALL(cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice));
  CUDA_SAFE_CALL(cudaMemcpy(b_d, b_h, size, cudaMemcpyHostToDevice));
  // Do calculation on device:
  int block_size = atoi(argv[2]);
  int n_blocks = N/block_size + (N%block_size == 0 ? 0:1);
  square_array <<< n_blocks, block_size >>> (a_d, N);
  // Retrieve result from device and store it in host array
  CUDA_SAFE_CALL(cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost));
  CUDA_SAFE_CALL(cudaMemcpy(b_h, b_d, sizeof(float)*N, cudaMemcpyDeviceToHost));

	// Print results
	if ((argc == 4) && !(strcmp(argv[3], "-p"))) {
		int i;
		printf("The results are:\n");
		for (i = 0; i < N; i++) {
			printf("a[%d] = %f\n", i, a_h[i]);
		}
	}

  // Cleanup
  free(a_h); free(b_h);
  cudaFree(a_d); cudaFree(b_d);
  return 0;
}

I think the GTX 580M (compute capability 2.1) and the C2075 (compute capability 2.0) are both Fermi-class devices, so the problem does not seem to be an architecture mismatch at the binary code level.
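If you want to confirm the compute capability on each machine rather than guess, a small standalone query like the following (a minimal sketch using the standard cudaGetDeviceProperties call from the CUDA runtime API) can rule an architecture mismatch in or out:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int dev = 0; dev < count; dev++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // prop.major / prop.minor are the compute capability,
        // e.g. 2.0 for a Tesla C2075, 2.1 for a GTX 580M
        printf("Device %d: %s, compute capability %d.%d\n",
               dev, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```

Compile and run this on both the development machine and a cluster node; if the cluster device's compute capability is at least as high as the architecture you compiled for, binary compatibility is not the issue.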

You seem to have multiple version mismatches in your software stacks, though. Each CUDA version comes with its own version of the CUDA runtime, so when you installed CUDA 5.0 it should have put version number 5.0.x on your machine. The easiest solution would be to install CUDA 5.0 everywhere, i.e. your development machine and the cluster. Make sure you also install a sufficiently recent driver. I believe the CUDA 5.0 installer already includes a matching driver.

If you cannot upgrade the software on the cluster (which appears to be running CUDA 4.1, and presumably has matching drivers installed), you probably won’t be able to run versions 4.2.x or 5.0.x on it since newer versions of the CUDA runtime also require newer drivers. In that case, you would want to build everything with CUDA 4.1. Older CUDA runtimes work just fine on newer drivers so running CUDA 4.1 on your development machine, which appears to have newer drivers, should be fine.