CUDA HPL fails the residual check with NaN

I have a problem using CUDA HPL.

The residual check always fails with nan when my N or NB is large.

The detailed specification of my machines is:

nodes : 5
OS : Centos 6.3
CPU : Intel Xeon E5-2670*2 per node
GPU : M2090
Mem : 96GB per node
Infiniband : Mellanox QDR

And the software I use:

MPI : MVAPICH2-1.7 (I also tried openmpi-1.4.5 and Intel MPI 4.1, but they failed too.)
BLAS : Intel MKL 11.0
CUDA HPL : hpl-2.0_FERMI_v13
Compiler : Intel compiler

================================================================================
N : 140000
NB : 768 896 1024 1152 1280
PMAP : Row-major process mapping
P : 1
Q : 3
PFACT : Right
NBMIN : 8
NDIV : 2
RFACT : Right
BCAST : 2ringM
DEPTH : 1
SWAP : Mix (threshold = 128)
L1 : no-transposed form
U : transposed form
EQUIL : yes
ALIGN : 8 double precision words

================================================================================
T/V N NB P Q Time Gflops

WR13R2R8 140000 768 1 3 1328.07 1.377e+03

||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0034287 … PASSED

T/V N NB P Q Time Gflops

WR13R2R8 140000 896 1 3 1282.66 1.426e+03

||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0036567 … PASSED

T/V N NB P Q Time Gflops

WR13R2R8 140000 1024 1 3 1130.56 1.618e+03

||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= nan … FAILED

T/V N NB P Q Time Gflops

WR13R2R8 140000 1152 1 3 1124.15 1.627e+03

||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= nan … FAILED

T/V N NB P Q Time Gflops

WR13R2R8 140000 1280 1 3 1127.96 1.622e+03

||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= nan … FAILED

Finished 5 tests with the following results:
2 tests completed and passed residual checks,
3 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
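
For readers unfamiliar with the check printed above: the number on each residual line is the scaled residual spelled out in the output, and a run is reported as PASSED when that number is below the threshold configured in HPL.dat. The sketch below evaluates the same formula in plain Python on a tiny 2x2 system. The helper names (`inf_norm_vec`, `inf_norm_mat`, `scaled_residual`, `passes`) are hypothetical, the threshold of 16.0 is an assumed typical value, and `sys.float_info.epsilon` is only a stand-in for whatever machine epsilon HPL actually uses. It is meant to show why one NaN in the computed solution is enough to produce `nan … FAILED`: NaN propagates through the arithmetic, and any comparison against NaN is false.

```python
import math
import sys


def inf_norm_vec(v):
    # Python's max() can silently skip NaN depending on element order,
    # so propagate NaN explicitly.
    if any(math.isnan(x) for x in v):
        return math.nan
    return max(abs(x) for x in v)


def inf_norm_mat(a):
    # Infinity norm of a matrix: the largest absolute row sum.
    return inf_norm_vec([sum(abs(x) for x in row) for row in a])


def scaled_residual(a, x, b):
    # ||Ax-b||_oo / (eps * (||A||_oo * ||x||_oo + ||b||_oo) * N)
    n = len(b)
    r = [sum(a[i][j] * x[j] for j in range(n)) - b[i] for i in range(n)]
    eps = sys.float_info.epsilon  # assumption: stand-in for HPL's epsilon
    denom = eps * (inf_norm_mat(a) * inf_norm_vec(x) + inf_norm_vec(b)) * n
    return inf_norm_vec(r) / denom


def passes(resid, threshold=16.0):
    # Assumed threshold; a NaN residual compares false and therefore fails.
    return resid < threshold


a = [[4.0, 1.0], [2.0, 3.0]]
b = [1.0, 2.0]
x_good = [0.1, 0.6]        # solves the system up to rounding
x_bad = [0.1, math.nan]    # one corrupted entry in the solution

print(scaled_residual(a, x_good, b), passes(scaled_residual(a, x_good, b)))
print(scaled_residual(a, x_bad, b), passes(scaled_residual(a, x_bad, b)))
```

In other words, a `nan` on the residual line says that at least one non-finite value ended up in the result. It does not say where in the factorization that value first appeared.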

It seems that NB below 1024 is OK when N is 140000.

But when I use 5 nodes and set N to 230000 and NB to 768, it fails too.

I kept testing and found that when N is 230000, only NB below 512 passes.
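
As a side note for anyone sizing similar runs, the two problem sizes mentioned here can be put in terms of host memory. HPL factors a dense N x N matrix of 8-byte doubles, so the footprint follows directly from N. The sketch below is only arithmetic, not a diagnosis: the function names are hypothetical, it assumes the P=1, Q=3 run used three nodes with one process each, and it treats the quoted 96GB per node as 96 GiB.

```python
def matrix_gib(n):
    # Size of an N x N matrix of 8-byte double precision values, in GiB.
    return n * n * 8 / 2**30


def fraction_of_host_memory(n, nodes, gib_per_node=96):
    # Share of the combined host memory taken by the matrix alone.
    # Assumption: 96 GiB usable per node, matrix spread evenly over nodes.
    return matrix_gib(n) / (nodes * gib_per_node)


for n, nodes in [(140000, 3), (230000, 5)]:
    print(n, nodes, round(matrix_gib(n), 1),
          round(fraction_of_host_memory(n, nodes), 2))
```

Under those assumptions the first run uses about half of the combined host memory and the 5-node run a little over 80 percent. That is useful context when comparing the two cases, although it does not by itself explain the NaN.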

It is driving me crazy!

Can somebody tell me what's going on?

This thread is rather old, but to help those who arrive here from a search engine: you are probably looking for [url]https://devtalk.nvidia.com/default/topic/541732/nan-and-cuda_error_launch_failed-with-huge-hpl-/[/url]

Be advised that NB values larger than 2048 will not be accepted by NV HPL.