Error when running a simple CUDA example on Windows XP

Hi there,
I am new to CUDA. I tried a simple CUDA example posted on the
web. The code is supposed to square each element of an array. It
compiles successfully, but it does not compute the squares.
Instead, it just prints each element of the array unchanged:

0 0.000
1 1.000

9 9.000

I have checked the return values of cudaMalloc and cudaMemcpy, and they
all returned cudaSuccess. The code is copied below. Could anyone help me
figure out what the problem is? Many thanks.

System information:
Quadro NVS 140M (has passed the bandwidthTest and deviceQuery tests)
Windows XP
Microsoft Visual C++ Express

// example1.cpp : Defines the entry point for the console application.
//

#include <stdio.h>
#include <stdlib.h> // malloc, free
#include <cuda.h>

// Kernel that executes on the CUDA device
__global__ void square_array(float *a, int N)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx<N) a[idx] = a[idx] * a[idx];

}

// main routine that executes on the host
int main(void)
{

float *a_h, *a_d; // Pointer to host & device arrays
const int N = 10; // Number of elements in arrays
size_t size = N * sizeof(float);
a_h = (float *)malloc(size); // Allocate array on host

// Allocate array on device
if(cudaMalloc((void **) &a_d, size) != cudaSuccess) printf("Error in cudaMalloc\n");

// Initialize host array and copy it to CUDA device

for (int i=0; i<N; i++) a_h[i] = (float)i;
if(cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice) != cudaSuccess)
printf("Error in memcpy\n");

// Do calculation on device:
int block_size = 4;

int n_blocks = N/block_size + (N%block_size == 0 ? 0:1);

square_array <<< n_blocks, block_size >>> (a_d, N);

// Retrieve result from device and store it in host array

if(cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost) !=
cudaSuccess) printf("Error in memcpy\n");

// Print results

for (int i=0; i<N; i++) printf("%d %f\n", i, a_h[i]);

// Cleanup

free(a_h); cudaFree(a_d);

return 0;
}

I tested your code on my platform (WinXP64, VC2005, driver 190.38, CUDA 2.3),
and it works on both a GTX 295 and a Tesla C1060.

Can you show how you compile your code?
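
For comparison, here is a minimal host-only sketch of what the launch is expected to compute. It reuses the index formula and the rounding-up block count from the posted code, but replaces the GPU launch with two plain loops, so it needs no CUDA toolkit. The names blocks_for and square_array_host are made up for this illustration and are not part of the CUDA API. It assumes the same N = 10 and block_size = 4 as in the question.

```cpp
#include <vector>

// Same rounding-up division as in the posted main():
// enough blocks of block_size threads to cover n elements.
int blocks_for(int n, int block_size) {
    return n / block_size + (n % block_size == 0 ? 0 : 1);
}

// Host-only stand-in (hypothetical helper) for
//   square_array<<<n_blocks, block_size>>>(a, n);
// The outer loop plays the role of blockIdx.x, the inner loop
// the role of threadIdx.x, and block_size the role of blockDim.x.
void square_array_host(std::vector<float>& a, int block_size) {
    const int n = static_cast<int>(a.size());
    const int n_blocks = blocks_for(n, block_size);
    for (int block = 0; block < n_blocks; ++block) {
        for (int thread = 0; thread < block_size; ++thread) {
            const int idx = block * block_size + thread;
            // Same guard as in the kernel: the last block has
            // threads whose index falls past the end of the array.
            if (idx < n) a[idx] = a[idx] * a[idx];
        }
    }
}
```

With N = 10 and block_size = 4 this gives 3 blocks (12 thread indices, 2 of them rejected by the guard), and filling the vector with 0..9 before the call leaves 0, 1, 4, ..., 81 in it. Since the index arithmetic holds up on the host, output that comes back unchanged from the GPU suggests the kernel never ran, which is why the compile command matters.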