New to CUDA, simple beginnings not working

I am trying a very simple CUDA program, but I can’t get the kernel function to manipulate the data. The kernel currently just sets every array element to 5, and I want to see whether I get that back. I don’t. I am using Visual Studio 2008 on a Dell Duo core machine, and CUDA reports a Quadro NVS 290 with SM 1.1 and 16 multiprocessors. Any ideas why the kernel function doesn’t load 5 into all the array elements?

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Host version. Its body was not included in the post;
// adding 1 to each element is an assumption.
void incrementArrayOnHost(float *a, int N)
{
    for (int i = 0; i < N; i++)
        a[i] = a[i] + 1.0f;
}

__global__ void incrementArrayOnDevice(float *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)    // guard against threads past the end of the array
        a[idx] = 5.0f;
}

int main(void)
{
    float *a_h, *b_h; // pointers to host memory
    float *a_d;       // pointer to device memory
    int i, N = 12;
    size_t size = N * sizeof(float);
    // allocate arrays on host
    a_h = (float *)malloc(size);
    b_h = (float *)malloc(size);
    // allocate array on device
    cudaMalloc((void **) &a_d, size);
    // initialization of host data
    for (i = 0; i < N; i++)
        a_h[i] = (float)i;
    // copy data from host to device
    cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
    // do calculation on host
    incrementArrayOnHost(a_h, N);

    for (i = 0; i < N; i++)
        printf("%.0f, ", a_h[i]);
    printf("\n");
    // do calculation on device:
    // Part 1 of 2. Compute execution configuration
    int blockSize = 12;
    int nBlocks = 1; // N/blockSize + (N%blockSize == 0?0:1);
    // Part 2 of 2. Call incrementArrayOnDevice kernel
    incrementArrayOnDevice <<< nBlocks, blockSize >>> (a_d, N);
    // Retrieve result from device and store in b_h
    cudaMemcpy(b_h, a_d, size, cudaMemcpyDeviceToHost);
    // check results
    for (i = 0; i < N; i++)
        printf("%.0f, ", b_h[i]);
    printf("\n");
    // assert(a_h[i] == b_h[i]);
    // cleanup
    free(a_h); free(b_h); cudaFree(a_d);
    return 0;
}
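The commented-out expression for nBlocks rounds N / blockSize up, so that a partly filled last block is still launched when N is not a multiple of blockSize. A minimal sketch of that calculation, where `blocksFor` is a hypothetical helper name and not something from the program above:

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: smallest number of blocks such that
// nBlocks * blockSize >= n (integer division rounded up).
// Equivalent to n/blockSize + (n%blockSize == 0 ? 0 : 1) for positive values.
__host__ __device__ inline int blocksFor(int n, int blockSize)
{
    return (n + blockSize - 1) / blockSize;
}
```

Because the last block may contain more threads than remaining elements, a kernel launched with this configuration would need an `idx < N` check before writing to the array.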


I tested your code on a GTX 295 and a Tesla C1060, and it works on both.

My platform is Windows XP 64-bit, VC2005, driver 190.38, CUDA 2.3.

What are the values of b_h on your system?

Thanks for the reply. b_h on my system is 0, 1, 2, …, 11. I expect it to come back as all 5’s. Since it works on your system, my guess is that something is wrong in my CUDA configuration. By the way, my OS is 32-bit Windows XP. I am able to make CUDA calls; that is how I saw that I had 16 multiprocessors, a Quadro NVS 290, and SM 1.1. So calls into the CUDA libraries work, but I have trouble when I call a kernel function, even though the same code works for you. I wonder if I am missing some fundamental point here. Since CUDA told me that I have 16 multiprocessors and SM 1.1, I assumed that I could use CUDA’s kernel functions; maybe that assumption is wrong. If you have any further thoughts about this, I would appreciate hearing them. In the meantime I am going to try to unload and reload the CUDA environment and see if there is something I missed there.
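A kernel launch that fails does not stop the program by itself: unless the error is checked, the device buffer can simply keep the values that were copied in, which would match the 0, 1, 2, …, 11 output described above. A minimal sketch of checking for that, assuming a recent CUDA runtime (`cudaDeviceSynchronize` is the newer name; a CUDA 2.3 installation would use `cudaThreadSynchronize` instead). The names `fillKernel`, `ok`, and `fillOnDevice` are hypothetical and only for illustration:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for this sketch: each thread writes `value` into one element.
__global__ void fillKernel(float *a, int n, float value)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        a[idx] = value;
}

// Hypothetical helper: print a message on failure and report success as a bool.
static bool ok(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Fills out[0..n-1] with `value` on the GPU.
// Returns true only if every CUDA step reported success.
bool fillOnDevice(float *out, int n, float value)
{
    float *a_d = NULL;
    size_t size = n * sizeof(float);
    if (!ok(cudaMalloc((void **)&a_d, size), "cudaMalloc"))
        return false;

    int blockSize = 64;
    int nBlocks = (n + blockSize - 1) / blockSize;  // round up
    fillKernel<<<nBlocks, blockSize>>>(a_d, n, value);

    // cudaGetLastError reports launch problems (bad configuration, no usable
    // kernel image); the synchronize call reports errors raised while running.
    bool good = ok(cudaGetLastError(), "kernel launch")
             && ok(cudaDeviceSynchronize(), "kernel execution")
             && ok(cudaMemcpy(out, a_d, size, cudaMemcpyDeviceToHost), "cudaMemcpy");

    cudaFree(a_d);
    return good;
}
```

If a call such as `fillOnDevice(b_h, 12, 5.0f)` returned false with a message on stderr, that would suggest a problem with the installation, driver, or build settings rather than with the kernel code.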

Thanks for your help, I reloaded the CUDA environment and that solved it.

Hi Pat,

I have the same problem when running a simple example on Windows XP. Could you give me more details on how you reloaded the CUDA environment? Many thanks.