Random NaN result from CUDA algorithm

Hi, I'm solving a non-linear iterative problem using non-linear conjugate gradient. In each iteration, some of the computations are done on GPUs; the GPU functions I'm using are the CUDA FFT real-to-complex/complex-to-real transforms and the cublasDnrm2 / cublasSnrm2 functions.
The problem is that I get NaNs in some of the intermediate results at a certain iteration, after which everything becomes NaN, and the iteration at which the NaNs first appear varies randomly.
Sometimes the whole code runs to completion without producing any NaNs.

So I was wondering if anyone has ever run into this kind of problem.
I'm working on Ubuntu using a Tesla S1070 with CUDA 2.0.

Thanks

If the code is producing variable results or failing when run on the same input data, it is usually a sign that you are using some uninitialized memory somewhere, and the variation (or appearance of NaNs) is related to the contents of the uninitialized memory at run time.
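A simple habit that rules this out is to clear every device buffer immediately after allocation, so that any NaN must come from the computation itself rather than from stale memory. A minimal sketch (not specific to your code):

// Hypothetical sketch: cudaMalloc does not clear memory, so make the contents deterministic.
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    const size_t n = 1 << 20;
    float *d_buf = NULL;

    if (cudaMalloc((void **)&d_buf, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_buf, 0, n * sizeof(float));   // deterministic starting contents

    /* ... launch the kernels that read and write d_buf ... */

    cudaFree(d_buf);
    return 0;
}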

Hi mate,

I am doing a somewhat similar project, non-linear and time-dependent, and I am also having this NaN problem. I believe that you implemented the code correctly, i.e. it works on the CPU side (assuming you built a comparable C version of your code). Well, my solution to this was to split the algorithm into several kernels and play a bit of Lego with them. With a couple of Lego pieces, as I remember from the past, you can build an almost infinite number of structures, so my advice is to go down to the guts and make sure every stage is NNaN, i.e. Non-NaN, and afterwards increment and “optimize” your implementation.
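To make that concrete, something like the following (a rough, hypothetical sketch, untested) can be run on each intermediate device array between stages to find which kernel produces the first NaN:

// Hypothetical sketch: count NaNs in a device array between pipeline stages.
__global__ void countNaN(const float *data, int n, int *nanCount)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && isnan(data[i]))
        atomicAdd(nanCount, 1);
}

// Host side (error checking omitted), run after each stage:
// int *d_count;
// cudaMalloc((void **)&d_count, sizeof(int));
// cudaMemset(d_count, 0, sizeof(int));
// countNaN<<<(n + 255) / 256, 256>>>(d_stageOutput, n, d_count);
// int h_count = 0;
// cudaMemcpy(&h_count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
// if (h_count > 0) { /* this stage introduced the first NaNs */ }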

The idea given by avidday to initialize arrays is OKish and I like it. I will consider it for future developments.

For my algorithm, I found very interesting and sobering results and hope to publish them one day, so good luck!

Cheers,

Lx

PS

There are some researchers working on CG GPU implementations at the moment…

Hi, avidday:

Thanks for the advice, I'm working on it.

As an added tip, running your code in emulation mode with valgrind can be a good way of finding where uninitialised variables are read. Of course, finding out why said variables were uninitialised can still take a while…

I am having a similar problem. My program returns NaN results occasionally. Have you guys figured out your problem? Maybe I can understand where to look for problems from your experience.

You can run the following command to check if initialization is an issue.

cuda-memcheck --tool initcheck <executable>

I would like to ask a question: does the host wait until a kernel launch finishes before moving on to the next steps? If not, then a problem could occur if the host uses a variable that has not yet been assigned a valid value from the device.

No. The host code launches the kernel and then proceeds to the next command.

https://devblogs.nvidia.com/even-easier-introduction-cuda/

Use a synchronization function like cudaDeviceSynchronize to halt host code until the GPU is finished working.
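To illustrate with a minimal sketch (hypothetical kernel, not your code):

// Hypothetical sketch: a kernel launch returns control to the host immediately.
__global__ void scale(float *x, int n, float a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

// Host side:
// scale<<<(n + 255) / 256, 256>>>(d_x, n, 2.0f);   // enqueued, returns immediately
// cudaDeviceSynchronize();                         // block until the kernel has finished
// cudaMemcpy(h_x, d_x, n * sizeof(float), cudaMemcpyDeviceToHost);
//
// Note: a cudaMemcpy on the default stream also waits for preceding device work,
// but the explicit synchronization makes the ordering (and error reporting) unambiguous.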

Thanks! Is synchronization implicitly used in cuSOLVER and cuBLAS functions? If not, then maybe I should explicitly use it between calls to cuSOLVER and cuBLAS functions.

https://stackoverflow.com/questions/22988733/cublas-synchronization-best-practices
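In short, as I understand it: with the default CUBLAS_POINTER_MODE_HOST, routines that return a scalar (such as cublasDnrm2) block until the result has been written to the host variable, while routines that write their output to device memory (such as cublasDgemm) only enqueue work and return immediately, so you need a synchronization (or a synchronizing cudaMemcpy) before using that output on the host. A rough sketch:

// Hypothetical sketch of the two behaviours (error checking omitted).
// cublasHandle_t handle;  cublasCreate(&handle);
// cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_HOST);   // the default mode
//
// double nrm = 0.0;
// cublasDnrm2(handle, n, d_x, 1, &nrm);      // blocking: nrm is valid right after the call
//
// cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
//             &alpha, d_A, m, d_B, k, &beta, d_C, m);   // asynchronous
// cudaDeviceSynchronize();                   // wait before reading d_C from the host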

Thanks! I read this yesterday; I just wanted confirmation.

I just located the bug:

CUSOLVER_CALL(cusolverDnCgeqrf(cusolverH,numNod+NUMCHIEF,numNod,A_d,numNod+NUMCHIEF,
        tau_d,workspace_d,lwork,deviceInfo_d));
CUDA_CALL(cudaMemcpy(&deviceInfo,deviceInfo_d,sizeof(int),cudaMemcpyDeviceToHost));
if(deviceInfo!=0) {
    printf("QR decomposition failed.\n");
    return EXIT_FAILURE;
}
CUDA_CALL(cudaMemcpy(A,A_d,(numNod+NUMCHIEF)*numNod*sizeof(cuFloatComplex),cudaMemcpyDeviceToHost));
HOST_CALL(CheckNanInMat(A,numNod+NUMCHIEF,numNod,numNod+NUMCHIEF));
CUDA_CALL(cudaDeviceSynchronize());

//B = (Q^H)*B
CUSOLVER_CALL(cusolverDnCunmqr(cusolverH,CUBLAS_SIDE_LEFT,CUBLAS_OP_C,numNod+NUMCHIEF,numSrc,
        numNod,A_d,numNod+NUMCHIEF,tau_d,B_d,numNod+NUMCHIEF,workspace_d,lwork,deviceInfo_d));
CUDA_CALL(cudaMemcpy(&deviceInfo,deviceInfo_d,sizeof(int),cudaMemcpyDeviceToHost));
if(deviceInfo!=0) {
    printf("QR decomposition failed.\n");
    return EXIT_FAILURE;
}
CUDA_CALL(cudaMemcpy(B,B_d,(numNod+NUMCHIEF)*numSrc*sizeof(cuFloatComplex),cudaMemcpyDeviceToHost));
HOST_CALL(CheckNanInMat(B,numNod,numSrc,numNod+NUMCHIEF));

The step B = (Q^H)*B fails occasionally. All inputs are the same among different runs.
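One thing I would double-check (just a guess from the snippet, since I cannot see how lwork was computed): the workspace size queried for cusolverDnCgeqrf is not necessarily large enough for cusolverDnCunmqr, which has its own cusolverDnCunmqr_bufferSize query. The usual pattern is to query both and allocate the larger of the two, roughly:

// Hypothetical sketch: size the workspace for both the factorization and the Q^H * B step.
int lworkGeqrf = 0, lworkUnmqr = 0;

CUSOLVER_CALL(cusolverDnCgeqrf_bufferSize(cusolverH,numNod+NUMCHIEF,numNod,
        A_d,numNod+NUMCHIEF,&lworkGeqrf));
CUSOLVER_CALL(cusolverDnCunmqr_bufferSize(cusolverH,CUBLAS_SIDE_LEFT,CUBLAS_OP_C,
        numNod+NUMCHIEF,numSrc,numNod,A_d,numNod+NUMCHIEF,tau_d,
        B_d,numNod+NUMCHIEF,&lworkUnmqr));

int lwork = (lworkGeqrf > lworkUnmqr) ? lworkGeqrf : lworkUnmqr;
CUDA_CALL(cudaMalloc((void**)&workspace_d,lwork*sizeof(cuFloatComplex)));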