Hello!
I have a problem with a simple for-loop:
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

// Each thread writes 0..3 into a[] and prints its thread index and the loop counter.
__global__ void test(int *a) {
    for (int i = 0; i < 4; i++) {
        a[i] = i;
        printf("%d %d\n", threadIdx.x, i);
    }
}

int main(int argc, char **argv) {
    int *a, *b;
    cudaMalloc((void **) &a, 4 * sizeof(int));   // device array
    b = (int *) malloc(4 * sizeof(int));         // host copy
    test <<< 1,1 >>>(a);                         // one block, one thread
    cudaMemcpy(b, a, 4 * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < 4; i++) printf("%d ", b[i]);
    printf("\n");
    cudaFree(a);
    free(b);
    return 0;
}
Basically, the kernel runs on a single thread and executes a for loop from 0 to 3, writing each index into a global array and printing the thread index and the loop counter on every iteration.
I compiled this code with toolkit 4.2 using
nvcc -gencode=arch=compute_20,code=compute_20 -gencode=arch=compute_20,code=sm_20 -lcudart
The output should be
0 0
0 1
0 2
0 3
0 1 2 3
but instead I get
0 0
0 0 0 0
I tried the program on a GTX 580 and a GTX 550 Ti, and the result is the same on both. Can anybody explain this to me? I have thought about this for a long time, but I just don't get it.
Volker
Edit: I changed the code to include the correct kernel call.
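One way to narrow this down might be to check the return code of every CUDA call, since the program above ignores them all. Below is a minimal sketch of the same test with error checking. The `CHECK` macro and the `run_checked` function are hypothetical helpers that are not part of the code above, and the kernel here takes the loop bound `n` as an extra parameter.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical helper (not in the original post): on failure, print the
// error with its source location and return it from the enclosing function.
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,    \
                    cudaGetErrorString(err_));                    \
            return err_;                                          \
        }                                                         \
    } while (0)

// Same kernel as in the post, with the loop bound passed in as n.
__global__ void test(int *a, int n) {
    for (int i = 0; i < n; i++) {
        a[i] = i;
        printf("%d %d\n", threadIdx.x, i);
    }
}

// Runs the kernel on one thread and copies n ints back into host_out.
// Sketch only: on an error it returns early without freeing dev.
cudaError_t run_checked(int *host_out, int n) {
    int *dev = NULL;
    CHECK(cudaMalloc((void **) &dev, n * sizeof(int)));
    test<<<1, 1>>>(dev, n);
    CHECK(cudaGetLastError());       // errors reported at launch time
    CHECK(cudaDeviceSynchronize());  // errors raised while the kernel runs
    CHECK(cudaMemcpy(host_out, dev, n * sizeof(int), cudaMemcpyDeviceToHost));
    CHECK(cudaFree(dev));
    return cudaSuccess;
}
```

Calling `run_checked(b, 4)` from `main` in place of the `cudaMalloc`, kernel launch, `cudaMemcpy` and `cudaFree` calls would print the first failing call, if there is one, to stderr.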