waiting on variable from host fails

When I run this code:

[codebox]__global__ void kernel(int *arr) {
    int i;
    for(i = 0; i < 10; i++) {
        while(arr[i] == 0) ;
    }
}

int main() {
    int *d_arr;
    int arr[10] = {0,0,0,0,0,0,0,0,0,0};
    cudaMalloc((void **)&d_arr, sizeof(int) * 10);
    cudaMemcpy(d_arr, arr, sizeof(int) * 10, cudaMemcpyHostToDevice);
    kernel<<<1,256>>>(d_arr);
    cudaThreadSynchronize();
    return 0;
}
[/codebox]

the program seems to exit instantly, although I expected an infinite loop on the device, with the host waiting for it. Even adding printf("%s\n", cudaGetErrorString(cudaGetLastError())); at the end of the host code reports no error. Does anyone know of an explanation for this?

Thanks for any help or suggestions.

I haven't tried compiling this myself to be sure, but have you looked at the PTX for this kernel? The compiler is very good at removing dead code, by which I essentially mean code that does not contribute to a write to global memory. In that sense, this entire kernel is "dead", so I wouldn't be too surprised if the compiler was just making the whole thing a no-op.

And of course it's also possible that the answer is even simpler: one of these functions could be returning a CUDA error that you're not catching. Add error checks (at least in debug mode) to be sure.
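For what it's worth, here is a rough sketch of what checking every call could look like. The CUDA_CHECK macro, run_checked() and noop_kernel are names made up for illustration, not part of the CUDA API; only the cuda* calls themselves are real runtime functions. It also uses cudaDeviceSynchronize() on the assumption that a newer toolkit is available, since cudaThreadSynchronize() is deprecated there.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a message if a runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Placeholder kernel, only here so there is something to launch.
__global__ void noop_kernel() {}

// Hypothetical wrapper: returns 0 if every call succeeded.
int run_checked() {
    int *d_arr = nullptr;
    CUDA_CHECK(cudaMalloc((void **)&d_arr, sizeof(int) * 10));
    CUDA_CHECK(cudaMemset(d_arr, 0, sizeof(int) * 10));

    noop_kernel<<<1, 256>>>();
    CUDA_CHECK(cudaGetLastError());       // catches launch errors
    CUDA_CHECK(cudaDeviceSynchronize());  // catches errors during execution

    CUDA_CHECK(cudaFree(d_arr));
    return 0;
}

int main() {
    return run_checked();
}
```

Checking cudaGetLastError() right after the launch and then the result of the synchronize separately helps tell a bad launch configuration apart from a failure while the kernel was running.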

Thanks for the reply, that was exactly the issue. I added a write to the loop and now the device code doesn't return. Thanks very much!
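In case it helps anyone who finds this later, below is a minimal sketch of a kernel whose loop has a visible effect, so the compiler cannot discard it. This is not necessarily the exact change made above: the volatile qualifier, the second output pointer, and the names wait_kernel and count_flags_on_device are my own assumptions for illustration. The host passes non-zero flags so the example terminates; passing all zeros should reproduce the hang, because nothing changes the flags while the kernel runs.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// volatile forces a fresh load of flags[i] on every pass through the while
// loop; the final store to seen_out gives the kernel an observable result.
__global__ void wait_kernel(volatile int *flags, int *seen_out) {
    int seen = 0;
    for (int i = 0; i < 10; i++) {
        while (flags[i] == 0) { }  // spin until the flag is non-zero
        seen++;
    }
    seen_out[0] = seen;            // write to global memory
}

// Hypothetical host helper: copies 10 flags to the device, runs the kernel
// with a single thread, and returns the count the kernel wrote back.
int count_flags_on_device(const int *host_flags) {
    int *d_flags = nullptr;
    int *d_seen = nullptr;
    int seen = -1;

    cudaMalloc((void **)&d_flags, sizeof(int) * 10);
    cudaMalloc((void **)&d_seen, sizeof(int));
    cudaMemcpy(d_flags, host_flags, sizeof(int) * 10, cudaMemcpyHostToDevice);

    wait_kernel<<<1, 1>>>(d_flags, d_seen);
    cudaDeviceSynchronize();

    cudaMemcpy(&seen, d_seen, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_flags);
    cudaFree(d_seen);
    return seen;
}

int main() {
    // All flags are non-zero here so the kernel finishes.
    int flags[10] = {1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
    printf("flags seen: %d\n", count_flags_on_device(flags));
    return 0;
}
```

Error checking is left out here to keep the sketch short. Be careful with the all-zero case on a GPU that also drives a display, since a kernel that never returns may be killed by the driver's watchdog.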