Programming on Quadro FX 380: array size issue

Hi

I wrote a simple program that runs an add kernel on a Quadro FX 380 GPU.
I found that once my float array size exceeds 512, it no longer calculates correctly.
Could anyone help me solve the issue?

It also seems my device code can only handle the float datatype; the double type doesn't work on it.

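#include <stdlib.h>   // malloc, free
#include <tchar.h>    // _tmain, _TCHAR

// Each thread adds one pair of elements; the index is the thread's position in its block.
__global__ void simpleThreadAdd( float* A, float* B, float* C )
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}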

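int _tmain(int argc, _TCHAR* argv[])
{
    const int N = 256*4;   // 1024 elements; results go wrong once N exceeds 512
    size_t size = N * sizeof(float);

    float PI = 3.1415926f;

    // Host buffers
    float* A = (float*)malloc(size);
    float* B = (float*)malloc(size);
    float* C = (float*)malloc(size);

    for ( int i = 0; i < N; ++i )
    {
        A[i] = 1.0f * i;
        B[i] = PI;
    }

    // Device buffers
    float* d_A;
    cudaMalloc((void**)&d_A, size);
    float* d_B;
    cudaMalloc((void**)&d_B, size);
    float* d_C;
    cudaMalloc((void**)&d_C, size);

    // Copy the inputs to the device before launching the kernel
    cudaMemcpy( d_A, A, size, cudaMemcpyHostToDevice );
    cudaMemcpy( d_B, B, size, cudaMemcpyHostToDevice );

    // One block of N threads
    simpleThreadAdd<<<1,N>>>(d_A, d_B, d_C);

    // Copy the result back to the host
    cudaError_t err = cudaMemcpy( C, d_C, size, cudaMemcpyDeviceToHost );

    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    free(A);
    free(B);
    free(C);
    return (err == cudaSuccess) ? 0 : 1;
}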
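
Both symptoms (the array size limit and the trouble with double) can be checked against what the card reports about itself. The following is only a rough sketch using `cudaGetDeviceProperties` from the CUDA runtime. The helper names `supportsDouble` and `printDeviceLimits` are made up for illustration, and the 1.3 threshold reflects the documented requirement for native double precision rather than anything measured on this card.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Native double precision needs compute capability 1.3 or higher.
bool supportsDouble(int major, int minor)
{
    return major > 1 || (major == 1 && minor >= 3);
}

// Hypothetical helper: prints the limits of the given device.
// Returns false if the properties could not be read.
bool printDeviceLimits(int device)
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return false;
    }
    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("native double support: %s\n",
           supportsDouble(prop.major, prop.minor) ? "yes" : "no");
    return true;
}
```

Calling `printDeviceLimits(0)` at the start of the program, before any kernel launch, should show whether a block of 1024 threads and double arithmetic are within what the device offers.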

Update: I realized N can't exceed 512 here. The kernel is launched as a single block, and this GPU allows at most 512 threads per block. All good.
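
For arrays larger than one block can cover, the usual approach is to spread the work over several blocks and compute a global index from `blockIdx`, `blockDim` and `threadIdx`. Below is a minimal sketch of that idea, not a drop-in replacement for the code above. The names `blockThreadAdd`, `blocksFor` and `addOnDevice` are invented for this example, and the block size of 256 is an assumption chosen to stay under the 512 limit.

```cuda
#include <cuda_runtime.h>

// Each thread handles one element; the index combines block and thread IDs.
__global__ void blockThreadAdd(const float* A, const float* B, float* C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)   // the last block may contain threads past the end of the array
        C[i] = A[i] + B[i];
}

// Number of blocks needed to cover n elements (rounds up).
int blocksFor(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Hypothetical host wrapper: copies the inputs over, launches the kernel,
// and copies the result back. Returns the first error it sees.
cudaError_t addOnDevice(const float* A, const float* B, float* C, int n)
{
    size_t size = n * sizeof(float);
    float *d_A = NULL, *d_B = NULL, *d_C = NULL;
    cudaMalloc((void**)&d_A, size);
    cudaMalloc((void**)&d_B, size);
    cudaMalloc((void**)&d_C, size);

    cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);

    const int threadsPerBlock = 256;   // assumed value, below the 512 limit
    blockThreadAdd<<<blocksFor(n, threadsPerBlock), threadsPerBlock>>>(d_A, d_B, d_C, n);

    // A rejected launch configuration is reported here.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);

    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    return err;
}
```

With N = 1024 this launches 4 blocks of 256 threads. Checking `cudaGetLastError()` right after the launch is also how the original `<<<1,N>>>` failure could have been spotted, since a launch with too many threads per block is rejected without any other visible sign.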