NVIDIA driver problem causes GPU server to freeze with very high load

CUDA version: 7.5
NVIDIA graphics driver version: 364.19
The problem: qhost (grid engine) or top shows a very high load (1K~2K), and the machine freezes.

The error log is:
Jul 6 10:15:33 gs01 kernel: [ 4330.336550] [drm] [nvidia-drm] [GPU ID 0x00008b00] Loading driver
Jul 6 10:15:33 gs01 kernel: [ 4330.336643] [drm] [nvidia-drm] [GPU ID 0x00009000] Loading driver
Jul 6 10:15:33 gs01 kernel: [ 4330.336716] [drm] [nvidia-drm] [GPU ID 0x00009100] Loading driver
Jul 6 10:15:33 gs01 kernel: [ 4330.336783] [drm] [nvidia-drm] [GPU ID 0x00009400] Loading driver
Jul 6 10:15:33 gs01 kernel: [ 4330.336870] [drm] [nvidia-drm] [GPU ID 0x00009500] Loading driver
Jul 6 12:26:37 gs01 kernel: [12189.808822] perf interrupt took too long (2512 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
Jul 6 13:41:28 gs01 kernel: [16678.923424] nvidia-uvm: Loaded the UVM driver in lite mode, major device number 247
Jul 6 13:44:10 gs01 kernel: [16840.505166] INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 1.237 msecs
Jul 6 13:45:22 gs01 kernel: [16912.514907] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 1.641 msecs
Jul 6 13:50:55 gs01 kernel: [17245.673603] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 1.651 msecs
Jul 6 13:50:55 gs01 kernel: [17245.673605] perf interrupt took too long (14122 > 5000), lowering kernel.perf_event_max_sample_rate to 25000
Jul 6 13:52:22 gs01 kernel: [17332.234411] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 1.655 msecs
Jul 6 14:25:29 gs01 kernel: [19318.125307] perf interrupt took too long (14041 > 10000), lowering kernel.perf_event_max_sample_rate to 12500
Jul 6 16:04:05 gs01 kernel: [25230.607312] PGD 0
Jul 6 16:04:05 gs01 kernel: [25230.608356] Oops: 0000 [#1] SMP
Jul 6 16:04:05 gs01 kernel: [25230.609392] Modules linked in: nvidia_uvm(PO) nvidia_drm(PO) nvidia_modeset(PO) nvidia(PO) nfsv3 rpcsec_gss_krb5 nfsv4 dns_resolver LeoFS(PO) cpufreq_conservative cpufreq_powersave cpufreq_stats cpufreq_userspace LeoNET(PO) nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc x86_pkg_temp_thermal intel_powerclamp xfs intel_rapl coretemp kvm_intel kvm crc32_pclmul ast aesni_intel aes_x86_64 ttm iTCO_wdt lrw gf128mul glue_helper ablk_helper evdev cryptd mei_me mei iTCO_vendor_support lpc_ich mfd_core drm_kms_helper drm pcspkr i2c_i801 shpchp processor acpi_power_meter thermal_sys ipmi_si ipmi_msghandler wmi acpi_pad tpm_tis tpm button fuse autofs4 hid_generic usbhid hid ext4 crc16 mbcache jbd2 sg sd_mod crc_t10dif crct10dif_generic crct10dif_pclmul crct10dif_common ahci libahci bnx2x mdio igb libcrc32c libata crc32c_generic i2c_algo_bit crc32c_intel i2c_core ehci_pci xhci_hcd ehci_hcd dca ptp scsi_mod usbcore pps_core usb_common [last unloaded: nvidia]
Jul 6 16:04:05 gs01 kernel: [25230.618554] CPU: 0 PID: 8156 Comm: nnet3-train Tainted: P O 3.16.0-4-amd64 #1 Debian 3.16.7-ckt25-2+deb8u2
Jul 6 16:04:05 gs01 kernel: [25230.619755] Hardware name: Sugon W780-G20/S7079GM2NR-N, BIOS V7.101 08/20/2015
Jul 6 16:04:05 gs01 kernel: [25230.620950] task: ffff88203c5eaca0 ti: ffff88203ea64000 task.ti: ffff88203ea64000
Jul 6 16:04:05 gs01 kernel: [25230.622144] RIP: 0010:[] [] _nv000168rm+0x58/0x210 [nvidia]
Jul 6 16:04:05 gs01 kernel: [25230.623443] RSP: 0018:ffff88203ea67bb0 EFLAGS: 00010246
Jul 6 16:04:05 gs01 kernel: [25230.624645] RAX: 0000000000000000 RBX: 00000000c1d004d0 RCX: ffff88103e5ffaa8
Jul 6 16:04:05 gs01 kernel: [25230.625842] RDX: ffff88103e5ffaa4 RSI: ffff88103e5ffa88 RDI: 00000000c1d004d0
Jul 6 16:04:05 gs01 kernel: [25230.627041] RBP: ffff88103e5ffa98 R08: 0000000000000001 R09: 0000000000000020
Jul 6 16:04:05 gs01 kernel: [25230.628232] R10: ffff88103e5ffa40 R11: ffffffffa2104dc0 R12: ffff881f57bd6008
Jul 6 16:04:05 gs01 kernel: [25230.629409] R13: 00000000c1d004d0 R14: 000000005c000094 R15: 0000000000000001
Jul 6 16:04:05 gs01 kernel: [25230.630594] FS: 00002b40f24c61c0(0000) GS:ffff88107f800000(0000) knlGS:0000000000000000
Jul 6 16:04:05 gs01 kernel: [25230.631776] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 6 16:04:05 gs01 kernel: [25230.632944] CR2: 000000000000000c CR3: 0000000001813000 CR4: 00000000001407f0
Jul 6 16:04:05 gs01 kernel: [25230.634118] Stack:
Jul 6 16:04:05 gs01 kernel: [25230.635273] 00000000c1d004d0 ffffffffa250bd80 00000000c1d004d0 ffff880f47903808
Jul 6 16:04:05 gs01 kernel: [25230.636464] 0000000000000001 ffffffffa20ae58d ffff881f5155b4c8 ffffffffa20a9b01
Jul 6 16:04:05 gs01 kernel: [25230.637648] ffff880f47903810 ffffffffa250bd80 000000000000004d ffff88103e5ffb08
Jul 6 16:04:05 gs01 kernel: [25230.638824] Call Trace:
Jul 6 16:04:05 gs01 kernel: [25230.640054] [] ? _nv000216rm+0x2d/0x60 [nvidia]
Jul 6 16:04:05 gs01 kernel: [25230.641299] [] ? _nv003930rm+0x2c1/0x350 [nvidia]
Jul 6 16:04:05 gs01 kernel: [25230.642527] [] ? _nv017180rm+0x314/0x390 [nvidia]
Jul 6 16:04:05 gs01 kernel: [25230.643745] [] ? _nv017184rm+0x6a/0xb0 [nvidia]
Jul 6 16:04:05 gs01 kernel: [25230.644960] [] ? _nv000746rm+0x95/0x2940 [nvidia]
Jul 6 16:04:05 gs01 kernel: [25230.646158] [] ? _nv000710rm+0x1b09/0x1bc0 [nvidia]
Jul 6 16:04:05 gs01 kernel: [25230.647337] [] ? _nv016605rm+0x52/0x80 [nvidia]
Jul 6 16:04:05 gs01 kernel: [25230.648514] [] ? _nv000142rm+0x4b/0x60 [nvidia]
Jul 6 16:04:05 gs01 kernel: [25230.649666] [] ? _nv010120rm+0x115/0x1a0 [nvidia]
Jul 6 16:04:05 gs01 kernel: [25230.650820] [] ? _nv000862rm+0x153/0x260 [nvidia]
Jul 6 16:04:05 gs01 kernel: [25230.651944] [] ? rm_shutdown_adapter+0xc8/0xf0 [nvidia]
Jul 6 16:04:05 gs01 kernel: [25230.653017] [] ? nv_close_device+0x121/0x140 [nvidia]
Jul 6 16:04:05 gs01 kernel: [25230.654090] [] ? nvidia_close+0xda/0x2e0 [nvidia]
Jul 6 16:04:05 gs01 kernel: [25230.655151] [] ? nvidia_frontend_close+0x27/0x50 [nvidia]
Jul 6 16:04:05 gs01 kernel: [25230.656163] [] ? __fput+0xca/0x1d0
Jul 6 16:04:05 gs01 kernel: [25230.657165] [] ? task_work_run+0x8c/0xb0
Jul 6 16:04:05 gs01 kernel: [25230.658162] [] ? do_exit+0x2b1/0xa50
Jul 6 16:04:05 gs01 kernel: [25230.659150] [] ? signal_wake_up_state+0x1a/0x30
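
For anyone triaging a log like this, here is a rough Python sketch that pulls the key fields out of the Oops: the crashing process, the faulting symbol and module from the RIP line, the fault address from CR2, and the call trace frames. The function name summarize_oops and the regular expressions are illustrative assumptions written against the line format shown in this log; other kernel versions may format these lines differently.

```python
import re

# Patterns for fields that appear in the Oops report above (assumed format).
OOPS_RE = re.compile(r"Oops: ([0-9a-f]+) \[#(\d+)\]")
TASK_RE = re.compile(r"PID: (\d+) Comm: (\S+)")
RIP_RE = re.compile(r"RIP: .*?(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(\w+)\])?")
CR2_RE = re.compile(r"CR2: ([0-9a-f]+)")
FRAME_RE = re.compile(r"\? (\w+)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(\w+)\])?")


def summarize_oops(lines):
    """Return a dict with the key fields of the first Oops in the given log lines."""
    summary = {"frames": []}
    for line in lines:
        m = OOPS_RE.search(line)
        if m:
            summary.setdefault("error_code", m.group(1))
        m = TASK_RE.search(line)
        if m:
            summary.setdefault("pid", int(m.group(1)))
            summary.setdefault("comm", m.group(2))
        m = RIP_RE.search(line)
        if m:
            summary.setdefault("rip_symbol", m.group(1))
            summary.setdefault("rip_module", m.group(2))
        m = CR2_RE.search(line)
        if m:
            # CR2 holds the faulting address on x86.
            summary.setdefault("fault_address", int(m.group(1), 16))
        m = FRAME_RE.search(line)
        if m:
            # The module is None for frames in the core kernel (e.g. __fput).
            summary["frames"].append((m.group(1), m.group(2)))
    return summary


# A few lines copied from the log above.
sample = [
    "Jul 6 16:04:05 gs01 kernel: [25230.608356] Oops: 0000 [#1] SMP",
    "Jul 6 16:04:05 gs01 kernel: [25230.618554] CPU: 0 PID: 8156 Comm: nnet3-train Tainted: P O 3.16.0-4-amd64 #1 Debian 3.16.7-ckt25-2+deb8u2",
    "Jul 6 16:04:05 gs01 kernel: [25230.622144] RIP: 0010:[] [] _nv000168rm+0x58/0x210 [nvidia]",
    "Jul 6 16:04:05 gs01 kernel: [25230.632944] CR2: 000000000000000c CR3: 0000000001813000 CR4: 00000000001407f0",
    "Jul 6 16:04:05 gs01 kernel: [25230.654090] [] ? nvidia_close+0xda/0x2e0 [nvidia]",
    "Jul 6 16:04:05 gs01 kernel: [25230.658162] [] ? do_exit+0x2b1/0xa50",
]
info = summarize_oops(sample)
print(info["comm"], info["rip_symbol"], info["rip_module"], hex(info["fault_address"]))
```

On the lines above this reports nnet3-train faulting in _nv000168rm inside the nvidia module, with CR2 = 0xc. Together with the nvidia_close and do_exit frames in the call trace, that is consistent with a NULL pointer dereference in the driver while the process was exiting and closing the device, though the log alone does not prove the root cause.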