CUDA kernel never executed - SOLVED

Hi guys,

I am experiencing strange behaviour when trying to launch code on a Tesla C1060 and a Quadro FX3700. No matter which kernel I launch, the application returns the same data that was initialised on the host. In other words, the kernel has no effect on the data at all.

I am running Ubuntu 9.04 and also tested on 10.04 with the same result. I tried the latest drivers as well as the previous release, but no luck. All my code executes fine in our laboratory, where I have C1060s, also on Ubuntu 9.04.

Could this be a hardware issue with my PC? Perhaps something that could be changed in the BIOS?

As a simple example:

[codebox]#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

// Kernel that executes on the CUDA device
__global__ void square_array(float *a, int N)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx<N) a[idx] = a[idx] * a[idx];
}

// main routine that executes on the host
int main(void)
{
  float *a_h, *a_d;  // Pointer to host & device arrays
  const int N = 10;  // Number of elements in arrays
  size_t size = N * sizeof(float);
  a_h = (float *)malloc(size);        // Allocate array on host
  cudaMalloc((void **) &a_d, size);   // Allocate array on device

  // Initialize host array and copy it to CUDA device
  for (int i=0; i<N; i++) a_h[i] = (float)i;
  cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);

  // Do calculation on device:
  int block_size = 4;
  int n_blocks = N/block_size + (N%block_size == 0 ? 0:1);  // round up
  square_array <<< n_blocks, block_size >>> (a_d, N);

  // Retrieve result from device and store it in host array
  cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);

  // Print results
  for (int i=0; i<N; i++) printf("%d %f\n", i, a_h[i]);

  // Cleanup
  free(a_h); cudaFree(a_d);
  return 0;
}
[/codebox]

and the result is:

[codebox]0 0.000000
1 1.000000
2 2.000000
3 3.000000
4 4.000000
5 5.000000
6 6.000000
7 7.000000
8 8.000000
9 9.000000
[/codebox]

I checked for errors, but none were returned. I also tried to synchronise all threads after the kernel launch, but that made no difference.
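
For anyone reading along, here is a minimal sketch of what error checking around a launch can look like. The CHECK macro and the blocks_for helper are names made up for this example and are not part of CUDA. A kernel launch returns nothing itself, so a launch failure has to be picked up with cudaGetLastError(), and errors raised while the kernel runs only surface after a synchronising call. The sketch assumes a toolkit that has cudaDeviceSynchronize() (CUDA 4.0 or newer); older toolkits used cudaThreadSynchronize() for the same purpose.

[codebox]#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical helper macro: print the CUDA error string and abort on failure.
#define CHECK(call)                                       \
  do {                                                    \
    cudaError_t err_ = (call);                            \
    if (err_ != cudaSuccess) {                            \
      fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,  \
              cudaGetErrorString(err_));                  \
      exit(EXIT_FAILURE);                                 \
    }                                                     \
  } while (0)

// Hypothetical helper: number of blocks needed to cover n elements.
int blocks_for(int n, int block_size)
{
  return (n + block_size - 1) / block_size;
}

__global__ void square_array(float *a, int N)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < N) a[idx] = a[idx] * a[idx];
}

int main(void)
{
  const int N = 10;
  size_t size = N * sizeof(float);
  float a_h[N];
  float *a_d = NULL;
  for (int i = 0; i < N; i++) a_h[i] = (float)i;

  CHECK(cudaMalloc((void **)&a_d, size));
  CHECK(cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice));

  square_array<<<blocks_for(N, 4), 4>>>(a_d, N);
  CHECK(cudaGetLastError());       // errors from the launch itself
  CHECK(cudaDeviceSynchronize());  // errors raised while the kernel ran

  CHECK(cudaMemcpy(a_h, a_d, size, cudaMemcpyDeviceToHost));
  for (int i = 0; i < N; i++) printf("%d %f\n", i, a_h[i]);

  CHECK(cudaFree(a_d));
  return 0;
}
[/codebox]

If every call passes and the values still come back unsquared, the problem is more likely in the driver or toolkit setup than in the kernel code.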

This is just one example that does not return correct values. Strangely enough, the SDK compiled without problems and I can run most of the demos. However, some, like deviceQuery and a few others, do not show anything, and particles, mandelbrot and a few others freeze my system.
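
When deviceQuery prints nothing, one thing worth checking is whether the runtime can see the driver and any device at all. Below is a minimal sketch, assuming a toolkit recent enough to provide cudaDriverGetVersion() and cudaRuntimeGetVersion(). The version_major and version_minor helpers are hypothetical names for this example; they decode the 1000*major + 10*minor encoding those calls return.

[codebox]#include <stdio.h>
#include <cuda_runtime.h>

// Versions are encoded as 1000*major + 10*minor, e.g. 3020 means 3.2.
int version_major(int v) { return v / 1000; }
int version_minor(int v) { return (v % 100) / 10; }

int main(void)
{
  int driver = 0, runtime = 0, count = 0;

  cudaDriverGetVersion(&driver);    // stays 0 if no driver is installed
  cudaRuntimeGetVersion(&runtime);
  printf("driver supports CUDA %d.%d\n",
         version_major(driver), version_minor(driver));
  printf("runtime is CUDA %d.%d\n",
         version_major(runtime), version_minor(runtime));
  if (driver < runtime)
    printf("driver is older than the runtime, an update is probably needed\n");

  cudaError_t err = cudaGetDeviceCount(&count);
  if (err != cudaSuccess) {
    printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
    return 1;
  }

  // List every device the runtime can see, with its compute capability.
  for (int i = 0; i < count; i++) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, i) == cudaSuccess)
      printf("device %d: %s, compute capability %d.%d\n",
             i, prop.name, prop.major, prop.minor);
  }
  return 0;
}
[/codebox]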

Could you please point me in the right direction? At the moment I am stuck :D

Thank you,

Martin Peniak


Solved the problem by updating to the latest development drivers. However, even after the upgrade some of my code would not work, and I found that the cause was using doubles:

ptxas /tmp/tmpxft_00000f1f_00000000-2_cuda_som.ptx, line 83; warning : Double is not supported. Demoting to float

I assumed the doubles were simply demoted to float and all would be good. However, that is not what happens: the code only works after I manually changed every double to float.
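
One likely explanation, stated as an assumption because the failing host code is not shown here: the demotion only affects the device code. If the host still allocates and copies arrays of double (8 bytes per element) while the kernel now reads them as float (4 bytes per element), the kernel sees garbage. Double precision needs compute capability 1.3 or higher and, on toolkits of that era, compiling with -arch=sm_13. The Tesla C1060 is a 1.3 device, while the Quadro FX 3700 is 1.1. The sketch below checks the device and uses a single typedef so host and device always agree on the element type; real_t, USE_DOUBLE and supports_double are hypothetical names for this example.

[codebox]#include <stdio.h>
#include <cuda_runtime.h>

// One typedef shared by host and device code, so element sizes always match.
// Build with -DUSE_DOUBLE -arch=sm_13 (or newer) to switch to doubles.
#ifdef USE_DOUBLE
typedef double real_t;
#else
typedef float real_t;
#endif

// Double precision first appeared with compute capability 1.3.
bool supports_double(int major, int minor)
{
  return major > 1 || (major == 1 && minor >= 3);
}

__global__ void square_array(real_t *a, int N)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < N) a[idx] = a[idx] * a[idx];
}

int main(void)
{
  cudaDeviceProp prop;
  if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
    printf("no usable CUDA device\n");
    return 1;
  }
  if (sizeof(real_t) == sizeof(double) &&
      !supports_double(prop.major, prop.minor)) {
    printf("%s (compute %d.%d) cannot run double kernels\n",
           prop.name, prop.major, prop.minor);
    return 1;
  }

  const int N = 10;
  real_t a_h[N];
  real_t *a_d = NULL;
  for (int i = 0; i < N; i++) a_h[i] = (real_t)i;

  // Every size below is derived from real_t, never from a hard-coded type.
  cudaMalloc((void **)&a_d, N * sizeof(real_t));
  cudaMemcpy(a_d, a_h, N * sizeof(real_t), cudaMemcpyHostToDevice);
  square_array<<<(N + 3) / 4, 4>>>(a_d, N);
  cudaMemcpy(a_h, a_d, N * sizeof(real_t), cudaMemcpyDeviceToHost);

  for (int i = 0; i < N; i++) printf("%d %f\n", i, (double)a_h[i]);
  cudaFree(a_d);
  return 0;
}
[/codebox]

Error checks are left out here to keep the sketch short.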

Martin Peniak
PhD candidate in AI, The University of Plymouth