nan in simple vector addition

I am new to CUDA and struggling with the following:
In a very simple CUDA program I want to add two vectors A and B of variable size.
Vector A contains sin(i)*sin(i), vector B contains cos(i)*cos(i). Finally I calculate
the mean of the resulting sum vector (which should be 1.0).
Whenever I increase the size to n = 70,000,000 floats, allocating the memory on the
graphics card succeeds without error, but the final result is "nan". Each CUDA call
is checked for cudaSuccess.
I am working on Ubuntu 12.04, cuda-5.0, GeForce GTS 450, driver version 304.64

Any ideas what goes wrong?
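For reference, the setup described above looks roughly like the following sketch (the identifiers are illustrative, not the original code):

```cuda
// Minimal sketch of the program described above (hypothetical names).
__global__ void vecAdd(const float *A, const float *B, float *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                  // guard threads past the end of the vectors
        C[i] = A[i] + B[i];     // sin^2(i) + cos^2(i) == 1.0f
}
```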

You should have about 1 GB of RAM available on your GPU, so 70 million floats should fit.

Did you check for the watchdog timer problem? Did you stop your X server and try without it?
Do you use shared memory?
How do you launch your kernel? Have you run cuda-memcheck on it?

How do I check the watchdog timer problem?
I have not stopped the X server yet, but I checked all CUDA calls for cudaSuccess.
I allocated memory via malloc and cudaMalloc and copied with cudaMemcpy.
cuda-memcheck gives:
========= Program hit error 9 on CUDA API call to cudaLaunch
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/nvidia-current-updates/ [0x262eb0]
========= Host Frame:/usr/local/cuda-5.0/lib64/ (cudaLaunch + 0x242) [0x2f592]
========= Host Frame:./vector_cuda [0x1165]
========= Host Frame:./vector_cuda [0xe72]
========= Host Frame:/lib/x86_64-linux-gnu/ (__libc_start_main + 0xed) [0x2176d]
========= Host Frame:./vector_cuda [0xf55]

========= ERROR SUMMARY: 1 error

In addition to checking the status of each CUDA API call, did you check error status after invoking the CUDA kernel? The CUDA error 9 reported by cuda-memcheck would appear to indicate an invalid launch configuration:

     * This indicates that a kernel launch is requesting resources that can
     * never be satisfied by the current device. Requesting more shared memory
     * per block than the device supports will trigger this error, as will
     * requesting too many threads or blocks. See ::cudaDeviceProp for more
     * device limitations.
    cudaErrorInvalidConfiguration         =      9,

What were the grid and block dimensions used to launch the kernel?
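Note that a kernel launch does not return a status itself, so a configuration error like this one only surfaces on a later API call. A typical check right after the launch looks like this (a sketch; the variable names are placeholders, not your code):

```cuda
vecAdd<<<DimGrid, DimBlock>>>(dA, dB, dC, n);    // hypothetical launch

cudaError_t err = cudaGetLastError();            // catches invalid launch configuration
if (err != cudaSuccess)
    fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));

err = cudaDeviceSynchronize();                   // catches errors during kernel execution
if (err != cudaSuccess)
    fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
```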

It depends on the block size k. When I use k = 256, I get an error after the kernel execution, but
for e.g. n = 50,000,000 the final result is correct, whereas for n = 30,000,000 it isn't (mean = 0.5)!
For k = 1024 it's again correct (without an error), but for n = 70,000,000 an error comes up.

dim3 DimGrid((n-1)/k + 1, 1, 1);
dim3 DimBlock(k, 1, 1);

Are the grid and block dimensions suboptimal? How can I do better?

You have to take into account the maximum size of your grid. I think the limit of your device is 65,535 blocks per grid dimension and 1,024 threads per block. You can check those limits with the cudaGetDeviceProperties function.

(50,000,000 - 1) / 256 + 1 > 65,535
(70,000,000 - 1) / 1024 + 1 > 65,535, but 67,000,000 should be fine at 1,024 threads per block.
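You can query these limits at runtime like so (a sketch, assuming device 0):

```cuda
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);                   // properties of device 0

printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
printf("max grid size (x):     %d\n", prop.maxGridSize[0]);  // 65,535 on Fermi-class GPUs
```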

Moreover, there is a reduction example in the CUDA SDK; maybe you should use it to compute the sum of each vector.
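An alternative that sidesteps the grid-size limit entirely is a grid-stride loop, where a fixed, small grid covers any n (a sketch, not your original kernel):

```cuda
// Each thread processes indices i, i + stride, i + 2*stride, ...
// so the grid never needs to grow with n.
__global__ void vecAddStride(const float *A, const float *B, float *C, int n)
{
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        C[i] = A[i] + B[i];
}

// Launched with a grid well under the limit, e.g.:
// vecAddStride<<<1024, 256>>>(dA, dB, dC, n);
```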

That makes sense, thank you.
But now I switched from floats to doubles and get another error:
I can (cuda)allocate 36,000,000 doubles and get the right result,
but for n = 37,000,000 it fails. Again it's the kernel (1,024 threads/block, n/1024 blocks; that should be fine and works with floats) that returns an error, not cudaMalloc.
According to cudaMemGetInfo there are ~800 MB free,
but I don't know how much of that is really usable for CUDA.

OK, found my error: of course I allocated three vectors of 37,000,000 doubles each, and 3 × 37,000,000 × 8 bytes = 888 MB is more than the ~800 MB available.