nan in simple vector addition

I am new to CUDA and struggling with the following:
In a very simple CUDA program I want to add two vectors A and B of variable size.
Vector A contains sin(i)*sin(i), vector B contains cos(i)*cos(i). Finally I calculate
the mean of the resulting sum vector (which should be 1.0).
Whenever I increase the size to n = 70,000,000 floats, allocating the memory on the
graphics card succeeds without error, but the final result is "nan". Each CUDA call
is checked for cudaSuccess.
I am working on Ubuntu 12.04, cuda-5.0, GeForce GTS 450, driver version 304.64

Any ideas what goes wrong?
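For reference, the setup described above looks roughly like the following sketch (the identifiers are illustrative, not the original code):

```cuda
// Minimal sketch of the program described above (hypothetical names).
__global__ void vecAdd(const float *A, const float *B, float *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                  // guard threads past the end of the vectors
        C[i] = A[i] + B[i];     // sin^2(i) + cos^2(i) == 1.0f
}
```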

You should have about 1 GB of RAM available on your GPU, so 70 million floats should fit.

Did you check for the watchdog timer problem? Did you stop your X server and try without it?
Do you use shared memory?
How do you launch your kernel? Have you run cuda-memcheck on it?

How do I check the watchdog timer problem?
I have not stopped the X server yet, but I checked all CUDA calls for cudaSuccess.
I allocated memory via malloc and cudaMalloc and copied with cudaMemcpy.
cuda-memcheck gives:
========= Program hit error 9 on CUDA API call to cudaLaunch
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/nvidia-current-updates/ [0x262eb0]
========= Host Frame:/usr/local/cuda-5.0/lib64/ (cudaLaunch + 0x242) [0x2f592]
========= Host Frame:./vector_cuda [0x1165]
========= Host Frame:./vector_cuda [0xe72]
========= Host Frame:/lib/x86_64-linux-gnu/ (__libc_start_main + 0xed) [0x2176d]
========= Host Frame:./vector_cuda [0xf55]

========= ERROR SUMMARY: 1 error

In addition to checking the status of each CUDA API call, did you check error status after invoking the CUDA kernel? The CUDA error 9 reported by cuda-memcheck would appear to indicate an invalid launch configuration:

     * This indicates that a kernel launch is requesting resources that can
     * never be satisfied by the current device. Requesting more shared memory
     * per block than the device supports will trigger this error, as will
     * requesting too many threads or blocks. See ::cudaDeviceProp for more
     * device limitations.
    cudaErrorInvalidConfiguration         =      9,

What were the grid and block dimensions used to launch the kernel?
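Note that a kernel launch does not return a status itself, so a configuration error like this one only surfaces on a later API call. A typical check right after the launch looks like this (a sketch; the variable names are placeholders, not your code):

```cuda
vecAdd<<<DimGrid, DimBlock>>>(dA, dB, dC, n);    // hypothetical launch

cudaError_t err = cudaGetLastError();            // catches invalid launch configuration
if (err != cudaSuccess)
    fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));

err = cudaDeviceSynchronize();                   // catches errors during kernel execution
if (err != cudaSuccess)
    fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
```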

It depends on the block size k. When I use k = 256, I get an error after the kernel execution, but
for e.g. n = 50,000,000 the final result is correct, whereas for n = 30,000,000 it isn't (mean = 0.5)!
For k = 1024 it's again correct (without an error), but for n = 70,000,000 an error comes up.

dim3 DimGrid((n-1)/k + 1, 1, 1);
dim3 DimBlock(k, 1, 1);

Are the grid and block dimensions suboptimal? How can I do better?

You have to take into account the maximum size of your grid. I think the limit of your device is 65,535 blocks per grid dimension and 1,024 threads per block. You can check those limits with the cudaGetDeviceProperties function.

(50,000,000 - 1) / 256 + 1 > 65,535
(70,000,000 - 1) / 1024 + 1 > 65,535, but 67,000,000 should be fine at 1,024 threads per block.
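You can query these limits at runtime like so (a sketch, assuming device 0):

```cuda
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);                   // properties of device 0

printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
printf("max grid size (x):     %d\n", prop.maxGridSize[0]);  // 65,535 on Fermi-class GPUs
```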

Moreover, there is a reduction example in the CUDA SDK; maybe you should use it to compute the sum of each vector.
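An alternative that sidesteps the grid-size limit entirely is a grid-stride loop, where a fixed, small grid covers any n (a sketch, not your original kernel):

```cuda
// Each thread processes indices i, i + stride, i + 2*stride, ...
// so the grid never needs to grow with n.
__global__ void vecAddStride(const float *A, const float *B, float *C, int n)
{
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        C[i] = A[i] + B[i];
}

// Launched with a grid well under the limit, e.g.:
// vecAddStride<<<1024, 256>>>(dA, dB, dC, n);
```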

That makes sense, thank you.
But now I switched from floats to doubles and get another error:
I can (cuda)allocate 36,000,000 doubles and get the right result,
but for n = 37,000,000 it fails. Again it's the kernel (1,024 threads/block, n/1024 blocks; that should be fine and works with floats) that returns an error, not cudaMalloc.
According to cudaMemGetInfo there are ~800 MB free,
but I don't know how much of that is really usable for CUDA.

OK, found my error: of course I allocated three vectors of 37,000,000 doubles each, and 3 × 37,000,000 × 8 bytes = 888 MB is more than the ~800 MB available.