Nvidia HPL - GPU freezes with larger problem sizes (matrix still smaller than memory)

Hi, I am using Nvidia HPL to run some benchmarks. It works well with smaller problem sizes (N) like 9600 or 15000, but the GPU freezes if I use a larger problem size like 20000. The estimated memory footprint with N=20000 is 20000 * 20000 * 8 bytes = 2.98 GiB, which is smaller than both the main memory size (16 GB) and the GPU memory size (6 GB).
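For reference, the footprint arithmetic above can be checked with a small sketch (it only counts the N x N double-precision matrix; HPL needs some additional workspace on top of that, so treat the derived upper bound on N as optimistic):

```python
# Rough HPL memory estimate -- a sketch that counts only the N x N
# double-precision coefficient matrix, not HPL's extra workspace.
def hpl_matrix_bytes(n):
    """Bytes needed for the N x N matrix of 8-byte doubles."""
    return n * n * 8

def max_n_for_memory(mem_bytes):
    """Largest N whose matrix alone fits in mem_bytes."""
    return int((mem_bytes / 8) ** 0.5)

GIB = 1024 ** 3
print(hpl_matrix_bytes(20000) / GIB)   # ~2.98 GiB, matching the estimate above
print(max_n_for_memory(6 * GIB))       # optimistic N upper bound for one 6 GB GPU
```

So N=20000 should indeed be comfortably within a single 6 GB card, even before the work is spread across the Q=3 process grid.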

Environment

  • Intel(R) Xeon(R) CPU E5506 x 2
  • Tesla C2070 x 4
  • Ubuntu 14.04
  • Cuda 7.5

The nvidia-smi command also hangs after HPL has run.

HPL.dat file

HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
20000 Ns
1             # of NBs
1024 768 NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
1        Ps
3        Qs
16.0         threshold
1            # of panel fact
0 1 2        PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
2 8          NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
0 1 2        RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
0 2          BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1 0          DEPTHs (>=0)
1            SWAP (0=bin-exch,1=long,2=mix)
192          swapping threshold
1            L1 in (0=transposed,1=no-transposed) form
1            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)

dmesg info

[ 1423.979936]  [<ffffffffc15fa429>] nv_map_dma_map_scatterlist+0x99/0x110 [nvidia]
[ 1423.980023]  [<ffffffffc15fa8ef>] nv_dma_map_pages+0x19f/0x380 [nvidia]
[ 1423.980110]  [<ffffffffc15ff91b>] ? os_mem_set+0x1b/0x30 [nvidia]
[ 1423.980196]  [<ffffffffc15fac70>] nv_dma_map_alloc+0xd0/0x200 [nvidia]
[ 1423.980282]  [<ffffffffc15e308f>] _nv012676rm+0x1ff/0x2b0 [nvidia]
[ 1423.980373]  [<ffffffffc15ad0e9>] ? _nv010785rm+0x879/0x9b0 [nvidia]
[ 1423.980519]  [<ffffffffc140bb45>] ? _nv011383rm+0x155/0x220 [nvidia]
[ 1423.980605]  [<ffffffffc15f0e51>] ? _nv012569rm+0x231/0x5c0 [nvidia]
[ 1423.980690]  [<ffffffffc15f0ccc>] ? _nv012569rm+0xac/0x5c0 [nvidia]
[ 1423.980775]  [<ffffffffc15f0c74>] ? _nv012569rm+0x54/0x5c0 [nvidia]
[ 1423.980869]  [<ffffffffc15984f8>] ? _nv000804rm+0x3128/0x37a0 [nvidia]
[ 1423.980963]  [<ffffffffc15984b3>] ? _nv000804rm+0x30e3/0x37a0 [nvidia]
[ 1423.981056]  [<ffffffffc1595513>] ? _nv000804rm+0x143/0x37a0 [nvidia]
[ 1423.981153]  [<ffffffffc1571f74>] ? _nv002284rm+0xbb4/0x3990 [nvidia]
[ 1423.981240]  [<ffffffffc15dd811>] ? _nv000719rm+0x281/0x440 [nvidia]
[ 1423.981327]  [<ffffffffc15dd997>] ? _nv000719rm+0x407/0x440 [nvidia]
[ 1423.981413]  [<ffffffffc15de0f8>] ? _nv000696rm+0x728/0x7d0 [nvidia]
[ 1423.981498]  [<ffffffffc15e8153>] ? rm_ioctl+0x73/0x100 [nvidia]
[ 1423.981590]  [<ffffffffc15f650d>] ? nvidia_ioctl+0x13d/0x430 [nvidia]
[ 1423.981675]  [<ffffffffc15f4cff>] ? nvidia_frontend_ioctl+0x2f/0x70 [nvidia]
[ 1423.981761]  [<ffffffffc15f4d5d>] ? nvidia_frontend_unlocked_ioctl+0x1d/0x30 [nvidia]

Does anyone have an idea what might be causing this?

Thanks,
– LZ

Update: the problem was solved by disabling the IOMMU.
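For anyone hitting the same thing, a sketch of how the IOMMU is typically disabled on Ubuntu via a kernel boot parameter (assumptions: an Intel VT-d platform, so the flag is intel_iommu=off; on AMD it would be amd_iommu=off; the exact GRUB variable name may differ on your install):

```shell
# Check whether an IOMMU is active -- look for DMAR/IOMMU lines
# (the exact messages vary by kernel version).
dmesg | grep -i -e DMAR -e IOMMU

# Add intel_iommu=off to the kernel command line in /etc/default/grub,
# e.g. GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_iommu=off",
# then regenerate the GRUB config and reboot.
sudo nano /etc/default/grub
sudo update-grub
sudo reboot
```

This is consistent with the dmesg trace above: the hang is inside the driver's DMA-mapping path (nv_map_dma_map_scatterlist / nv_dma_map_pages), which is exactly where IOMMU remapping is involved.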