GTX1070 + 410.78 Driver, Xid 31

I’m experiencing problems with the NVIDIA 410.78 driver on a GTX 1070.
When I launch some CUDA programs (the bandwidth test in this example, and also the GPU burn utility), the driver crashes with the following error:

[ 47.125164] NVRM: GPU at PCI:0000:87:00: GPU-4198f5a8-a9e6-3fa9-ba96-bf591b3a658c
[ 47.125169] NVRM: GPU Board Serial Number:
[ 47.125174] NVRM: Xid (PCI:0000:87:00): 31, Ch 00000003, engmask 00000110, intr 10000000

I ran cuda-memcheck; the report is the following:

========= CUDA-MEMCHECK
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: GeForce GTX 1070
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			9890.4

CUDA error at bandwidthTest.cu:626 code=30(cudaErrorUnknown) "cudaHostAlloc((void **)&h_idata, memSize, (wc) ? cudaHostAllocWriteCombined : 0)" 
========= Program hit cudaErrorUnknown (error 30) due to "unknown error" on CUDA API call to cudaHostAlloc. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x3572d3]
=========     Host Frame:eth-pcalc-pcie-bandwidth-test [0x43949]
=========     Host Frame:eth-pcalc-pcie-bandwidth-test [0x49ff]
=========     Host Frame:eth-pcalc-pcie-bandwidth-test [0x44aa]
=========     Host Frame:eth-pcalc-pcie-bandwidth-test [0x4399]
=========     Host Frame:eth-pcalc-pcie-bandwidth-test [0x42e5]
=========     Host Frame:eth-pcalc-pcie-bandwidth-test [0x41d5]
=========     Host Frame:eth-pcalc-pcie-bandwidth-test [0x3924]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xe7) [0x21b97]
=========     Host Frame:eth-pcalc-pcie-bandwidth-test [0x3139]
=========
========= ERROR SUMMARY: 1 error

The name is different, but the binary is exactly the bandwidth test from the CUDA samples.

The problem is reproducible on several systems when the X server is not running.
Other systems with the same OS, kernel, and driver, but with 2 GPUs, don’t show the same behavior.

Attaching the nvidia bug report. Can you please advise?

nvidia-bug-report.log.gz (1010 KB)

Please enable nvidia-persistenced to start on boot and check if that resolves the issue.
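
For reference, on a systemd-based distribution this might look like the sketch below. It assumes the unit is named nvidia-persistenced, which can differ between distributions and driver packages; the function name enable_persistenced is only illustrative.

```shell
# Enable the persistence daemon at boot and start it immediately.
# Assumption: the systemd unit is called nvidia-persistenced.
enable_persistenced() {
    sudo systemctl enable --now nvidia-persistenced
    # Report whether the daemon is currently running.
    systemctl is-active nvidia-persistenced
}

# Usage:
#   enable_persistenced
```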

I confirm the resolution: after enabling nvidia-persistenced the issue is gone. Many thanks!

I spoke too soon.
Now the system sometimes works correctly, while at other times the issue appears again.
After a reboot, if I run the bandwidth test in a loop, I see half of the expected performance, though it sometimes increases:

[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: GeForce GTX 1070
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			5079.9

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			6121.1

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			191817.4

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: GeForce GTX 1070
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			10870.8

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			6090.4

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			191934.6

Result = PASS

Looking at the PCIe link speed, I see it switching back and forth between 2.5 GT/s and 8 GT/s.
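
One way to watch the link speed is to read the PCIe attributes that Linux exposes in sysfs. The following is a minimal sketch: the function name pcie_link_speed and its optional second argument are hypothetical, and the device address 0000:87:00.0 is assumed from the PCI:0000:87:00 shown in the kernel log.

```shell
# Print the current and maximum PCIe link speed of one device.
# $1: PCI address as it appears under sysfs, e.g. 0000:87:00.0 (assumed)
# $2: optional sysfs root, only useful for testing (hypothetical parameter)
pcie_link_speed() {
    dev_dir="${2:-/sys/bus/pci/devices}/$1"
    printf 'current=%s max=%s\n' \
        "$(cat "$dev_dir/current_link_speed")" \
        "$(cat "$dev_dir/max_link_speed")"
}

# Usage: poll once per second while the bandwidth test is running.
#   while sleep 1; do pcie_link_speed 0000:87:00.0; done
```

GPUs commonly lower the link speed while idle to save power, so the readings are most meaningful when taken under load.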

I stopped the bandwidth test loop and ran it again, and it crashed with:

[CUDA Bandwidth Test] - Starting...
Running on...

cudaGetDeviceProperties returned 30
-> unknown error
CUDA error at bandwidthTest.cu:242 code=30(cudaErrorUnknown) "cudaSetDevice(currentDevice)"

and in dmesg I see:

[  123.278619] NVRM: GPU at PCI:0000:87:00: GPU-4198f5a8-a9e6-3fa9-ba96-bf591b3a658c
[  123.278634] NVRM: GPU Board Serial Number: 
[  123.278641] NVRM: Xid (PCI:0000:87:00): 31, Ch 00000003, intr 10000000. MMU Fault: ENGINE CE2 HUBCLIENT_CE1 faulted @ 0x1_00190000. Fault is of type FAULT_PTE ACCESS_TYPE_WRITE
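
To pull only the Xid events out of a longer kernel log, a small filter along these lines may help. It is a sketch that assumes the log lines follow the "NVRM: Xid (PCI:...): <code>," layout shown above; the function name xid_events is hypothetical.

```shell
# Read kernel log lines on stdin and print "<PCI address> <Xid code>"
# for every NVRM Xid line; all other lines are ignored.
xid_events() {
    sed -n 's/.*NVRM: Xid (PCI:\([^)]*\)): \([0-9][0-9]*\),.*/\1 \2/p'
}

# Usage (reading the kernel log may require root):
#   dmesg | xid_events
```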

nvidia-bug-report.log.gz (630 KB)

Please create a new nvidia-bug-report.log and attach it.

Attached the new report.
P.S. In order to install cuda-memcheck, I updated the driver version to 418.56.

You attached the old one.

Oh, sorry! I replaced it now with the new one.

Best regards,
Filippo.

You have a desktop installed but no monitor connected, so the X server starts on boot, the nvidia driver exits instantly, and systemd restarts it in a loop. Please either prevent the X server from starting using

sudo systemctl disable display-manager

or configure the X driver to start without monitor by adding

Option "AllowEmptyInitialConfiguration" "true"

to the device section of your xorg.conf, then reboot and check.
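
As an illustration, a complete device section carrying that option could look like the fragment printed by the sketch below. The Identifier value, the function name nvidia_device_section, and the drop-in path in the usage comment are assumptions; an existing xorg.conf may already contain a device section that should be edited instead.

```shell
# Print an example xorg.conf device section with the option set.
# Assumption: the Identifier "nvidia-gpu" is arbitrary and only illustrative.
nvidia_device_section() {
    cat <<'EOF'
Section "Device"
    Identifier "nvidia-gpu"
    Driver     "nvidia"
    Option     "AllowEmptyInitialConfiguration" "true"
EndSection
EOF
}

# Usage (the file name below is an assumption):
#   nvidia_device_section | sudo tee /etc/X11/xorg.conf.d/20-nvidia.conf
```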

Hi Generix,

Thanks for your support, I definitely confirm the resolution of the issue.

Best,
Filippo.

Hi Generix,

I’m now running the same OS on a similar system, with an E3 processor and an NVIDIA GTX 1050 Ti, and I’m experiencing the same issue.
The X server configuration is the one you suggested and the nvidia persistence daemon is enabled.

Attached the nvidia bug report.

Can you please advise?
Thanks.
nvidia-bug-report.log.gz (891 KB)

Those are completely different errors. What kind of workloads were you running when they appeared?
If you have some long-running CUDA kernels, you might hit a timeout when running X at the same time. If you don’t need graphics output, just disable the X server; otherwise try adding

Option "Interactive" "0"

to the device section of your xorg.conf.