For-Loop is not executed


I have a problem with a simple for-loop:

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

// Kernel: a single thread writes 0..3 into a[] and prints each iteration.
__global__ void test(int *a) {
  for (int i = 0; i < 4; i++) {
    a[i] = i;
    printf("%d %d\n", threadIdx.x, i);
  }
}

int main(int argc, char **argv) {
  int *a, *b;
  cudaMalloc((void **) &a, 4 * sizeof(int));   // device buffer
  b = (int *) malloc(4 * sizeof(int));         // host buffer
  test <<< 1,1 >>>(a);                         // one block, one thread
  cudaMemcpy(b, a, 4 * sizeof(int), cudaMemcpyDeviceToHost);
  for (int i = 0; i < 4; i++) printf("%d ", b[i]);
  return 0;
}
Basically, the kernel runs on one thread and executes a for loop that runs from 0 to 3, writing the loop index into a global array on each iteration. Each iteration also prints the thread index and the loop index.

I compiled this code with toolkit 4.2 using
nvcc -gencode=arch=compute_20,code=compute_20 -gencode=arch=compute_20,code=sm_20 -lcudart

The output should be

0 0
0 1
0 2
0 3
0 1 2 3

but instead I get

0 0 
0 0 0 0

I tried the program on a GTX 580 and a GTX 550 Ti. The result is the same. Can anybody explain that to me? I thought about this for a long time, but I just don’t get it.


Edit: I changed the code to include the correct kernel call.

Have you called “test(a)” as a CUDA kernel? I mean:

test <<< 1,1 >>>(a);

It works for me.

Yes, the kernel call is test <<< 1,1 >>>(a). That did not show up in the code. Sorry.

Sorry then. This code works for me on a GTX 580, Win7, 32/64 bits. Toolkit 4.2.

0 0
0 1
0 2
0 3
0 1 2 3

To get to the bottom of the failures, I would suggest adding error status checks after every CUDA API call and after every kernel launch.
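
As a rough illustration of what such checks could look like for the code in the question, here is a minimal sketch. CUDA_CHECK is a hypothetical helper macro introduced only for this example (it is not part of the toolkit); cudaGetLastError, cudaDeviceSynchronize and cudaGetErrorString are standard runtime API calls. The host buffer is a stack array here purely to keep the sketch short.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical helper (an assumption for this sketch, not a toolkit macro):
// prints the error string with file/line and exits if a runtime API call
// returns anything other than cudaSuccess.
#define CUDA_CHECK(call)                                               \
  do {                                                                 \
    cudaError_t err_ = (call);                                         \
    if (err_ != cudaSuccess) {                                         \
      fprintf(stderr, "CUDA error at %s:%d: %s\n", __FILE__, __LINE__, \
              cudaGetErrorString(err_));                               \
      exit(EXIT_FAILURE);                                              \
    }                                                                  \
  } while (0)

// Same kernel as in the question.
__global__ void test(int *a) {
  for (int i = 0; i < 4; i++) {
    a[i] = i;
    printf("%d %d\n", threadIdx.x, i);
  }
}

int main(void) {
  int *a = NULL;
  int b[4] = {0};

  CUDA_CHECK(cudaMalloc((void **)&a, 4 * sizeof(int)));

  test<<<1, 1>>>(a);
  CUDA_CHECK(cudaGetLastError());       // reports errors from the launch itself
  CUDA_CHECK(cudaDeviceSynchronize());  // reports errors raised while the kernel ran

  CUDA_CHECK(cudaMemcpy(b, a, 4 * sizeof(int), cudaMemcpyDeviceToHost));
  for (int i = 0; i < 4; i++) printf("%d ", b[i]);
  printf("\n");

  CUDA_CHECK(cudaFree(a));
  return 0;
}
```

The check after the launch is split in two because a kernel launch is asynchronous: cudaGetLastError catches an invalid launch, while a failure during execution only surfaces at the next synchronizing call.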

Thank you for the replies. The problem was caused by the printf in the kernel loop. I ran this example on Linux with toolkit 4.2 and a GTX580. Removing the printf solved the issue.