The GPU can only operate on data that resides in device memory. You need to allocate device copies of your A, B, and C arrays and copy data to and from them as needed (cudaMalloc and cudaMemcpy are the important functions here).
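Roughly, the allocate / copy in / launch / copy out pattern could look like the sketch below. The helper name addOnDevice, the d_ and h_ prefixes, and the array size are placeholders I made up for illustration, not anything from your code, and error checking is kept to a minimum for brevity:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Same kernel as in the programming guide: one thread per element.
__global__ void VecAdd(const float* A, const float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

// Hypothetical helper: adds two host arrays of length n on the GPU.
// n must not exceed the maximum number of threads per block.
cudaError_t addOnDevice(const float* h_A, const float* h_B, float* h_C, int n)
{
    size_t bytes = n * sizeof(float);
    float *d_A = NULL, *d_B = NULL, *d_C = NULL;

    // Allocate device versions of the three arrays.
    cudaMalloc((void**)&d_A, bytes);
    cudaMalloc((void**)&d_B, bytes);
    cudaMalloc((void**)&d_C, bytes);

    // Copy the inputs from host memory to device memory.
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    // The kernel only ever sees device pointers.
    VecAdd<<<1, n>>>(d_A, d_B, d_C);

    // Copy the result back; this call waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    return err;
}

int main()
{
    const int N = 256;  // assumed size for this sketch
    float A[N], B[N], C[N];
    for (int i = 0; i < N; ++i) {
        A[i] = (float)i;
        B[i] = 2.0f * i;
    }

    cudaError_t err = addOnDevice(A, B, C, N);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("C[10] = %f\n", C[10]);
    return 0;
}
```

The key point is that A, B, and C in main never reach the kernel directly; only the pointers returned by cudaMalloc do.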
Your kernel launch actually failed when the GPU tried to dereference host pointers, but since the code never checks for errors, the failure went unnoticed.
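One way to make such failures visible is to query the runtime right after the launch. This is only a sketch: the Fill kernel and the launchFill helper are made-up names for illustration. cudaGetLastError reports problems detected at launch time, and cudaDeviceSynchronize (called cudaThreadSynchronize in older toolkits) waits for the kernel and reports errors raised while it ran, such as an invalid pointer access:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel used only to have something to launch.
__global__ void Fill(float* out, float value)
{
    out[threadIdx.x] = value;
}

// Hypothetical helper: launches Fill and returns the first error seen.
cudaError_t launchFill(float* d_out, int n, float value)
{
    Fill<<<1, n>>>(d_out, value);

    // Errors detected when the launch is issued (e.g. bad configuration).
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) return err;

    // Errors raised while the kernel was running.
    return cudaDeviceSynchronize();
}

int main()
{
    const int n = 64;
    float* d_out = NULL;

    cudaError_t err = cudaMalloc((void**)&d_out, n * sizeof(float));
    if (err == cudaSuccess) err = launchFill(d_out, n, 1.0f);

    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaFree(d_out);
    return 0;
}
```

Without the two checks after the launch, the program would carry on silently even if the kernel never ran.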
To see a simple example which includes cudaMalloc and cudaMemcpy, see this article from Dr. Dobbs:
Thank you very much, seibert. Actually, a friend of mine who is pretty good at CUDA told me about that a couple of hours after I made the post. Eventually we decided that the best thing for me was to continue reading the programming manual that comes with CUDA (and in fact, it clarified those aspects as I went on reading).
The problem is that I stopped at the first example in the manual:
// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    // Kernel invocation
    VecAdd<<<1, N>>>(A, B, C);
}
and tried to modify and run it without knowing much about CUDA. My fault :">