I’m experiencing problems with the NVIDIA 410.78 driver on a GTX 1070.
When I launch some CUDA programs (the bandwidth test in the example below, and also the GPU burn utility), the driver crashes with the following error:
========= CUDA-MEMCHECK
[CUDA Bandwidth Test] - Starting...
Running on...
Device 0: GeForce GTX 1070
Quick Mode
Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 9890.4
CUDA error at bandwidthTest.cu:626 code=30(cudaErrorUnknown) "cudaHostAlloc((void **)&h_idata, memSize, (wc) ? cudaHostAllocWriteCombined : 0)"
========= Program hit cudaErrorUnknown (error 30) due to "unknown error" on CUDA API call to cudaHostAlloc.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x3572d3]
========= Host Frame:eth-pcalc-pcie-bandwidth-test [0x43949]
========= Host Frame:eth-pcalc-pcie-bandwidth-test [0x49ff]
========= Host Frame:eth-pcalc-pcie-bandwidth-test [0x44aa]
========= Host Frame:eth-pcalc-pcie-bandwidth-test [0x4399]
========= Host Frame:eth-pcalc-pcie-bandwidth-test [0x42e5]
========= Host Frame:eth-pcalc-pcie-bandwidth-test [0x41d5]
========= Host Frame:eth-pcalc-pcie-bandwidth-test [0x3924]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xe7) [0x21b97]
========= Host Frame:eth-pcalc-pcie-bandwidth-test [0x3139]
=========
========= ERROR SUMMARY: 1 error
The binary name is different, but it is exactly the bandwidth test from the CUDA samples.
The problem is reproducible on several systems when the X server is not running.
Other systems with the same OS, kernel, and driver but with 2 GPUs don’t show the same behavior.
I’m attaching the NVIDIA bug report. Can you please advise?
I spoke too soon.
Now the system sometimes works correctly, and sometimes the issue appears again.
After a reboot, if I run the bandwidth test in a loop, I see about half of the expected performance, though it sometimes increases:
[CUDA Bandwidth Test] - Starting...
Running on...
Device 0: GeForce GTX 1070
Quick Mode
Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 5079.9
Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 6121.1
Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 191817.4
Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
[CUDA Bandwidth Test] - Starting...
Running on...
Device 0: GeForce GTX 1070
Quick Mode
Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 10870.8
Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 6090.4
Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 191934.6
Result = PASS
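To compare runs like the two above without reading them by eye, the output can be parsed. This is a minimal sketch that assumes the output format shown in this thread; the function names `parse_bandwidth` and `is_degraded` and the 0.75 tolerance are hypothetical choices for illustration, not part of the CUDA samples.

```python
import re

# Section headers printed by bandwidthTest (as seen in the output above),
# mapped to short labels. The labels are hypothetical names for this sketch.
_SECTIONS = {
    "Host to Device Bandwidth": "h2d",
    "Device to Host Bandwidth": "d2h",
    "Device to Device Bandwidth": "d2d",
}

# A result row looks like "<transfer size in bytes> <bandwidth in MB/s>".
_ROW = re.compile(r"^\s*(\d+)\s+(\d+(?:\.\d+)?)\s*$")


def parse_bandwidth(output: str) -> dict:
    """Return {label: MB/s} for each section found in one bandwidthTest run."""
    results = {}
    current = None
    for line in output.splitlines():
        # Remember which section we are in when a header line appears.
        for header, label in _SECTIONS.items():
            if line.strip().startswith(header):
                current = label
        match = _ROW.match(line)
        if match and current is not None:
            results[current] = float(match.group(2))
    return results


def is_degraded(results: dict, expected_h2d: float, tolerance: float = 0.75) -> bool:
    """True if host-to-device bandwidth is below tolerance * expected_h2d.

    A run with no host-to-device row (for example, one that crashed early)
    is also reported as degraded.
    """
    return results.get("h2d", 0.0) < tolerance * expected_h2d
```

With the numbers above, a run reporting 5079.9 MB/s host to device would be flagged against an expected 10870.8 MB/s, while the second run would not.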
Looking at the PCIe link speed, I see it switching between 2.5 GT/s and 8 GT/s.
I stopped the bandwidth test loop and ran it again, and it crashed with:
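One way to watch the link speed while the test runs is to read the PCI device's sysfs attributes. This is only a sketch: the address `0000:87:00.0` is an assumption (the dmesg output in this thread shows `PCI:0000:87:00`, and function `.0` is a guess), and the helper names are hypothetical. Note that the link commonly drops to a lower speed when the GPU is idle, so a reading taken under load is more informative.

```python
from pathlib import Path


def parse_link_speed(text: str) -> float:
    """Extract the numeric GT/s value from a sysfs string such as '8.0 GT/s PCIe'."""
    return float(text.split()[0])


def read_link_speeds(bdf: str = "0000:87:00.0") -> tuple:
    """Return (current, maximum) link speed in GT/s for one PCI device.

    The default bus address is an assumption; replace it with the address
    reported for your GPU.
    """
    dev = Path("/sys/bus/pci/devices") / bdf
    current = parse_link_speed((dev / "current_link_speed").read_text())
    maximum = parse_link_speed((dev / "max_link_speed").read_text())
    return current, maximum
```

Calling `read_link_speeds()` in a loop next to the bandwidth test would show whether the slow runs coincide with the link staying at 2.5 GT/s.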
[CUDA Bandwidth Test] - Starting...
Running on...
cudaGetDeviceProperties returned 30
-> unknown error
CUDA error at bandwidthTest.cu:242 code=30(cudaErrorUnknown) "cudaSetDevice(currentDevice)"
and in dmesg I see:
[ 123.278619] NVRM: GPU at PCI:0000:87:00: GPU-4198f5a8-a9e6-3fa9-ba96-bf591b3a658c
[ 123.278634] NVRM: GPU Board Serial Number:
[ 123.278641] NVRM: Xid (PCI:0000:87:00): 31, Ch 00000003, intr 10000000. MMU Fault: ENGINE CE2 HUBCLIENT_CE1 faulted @ 0x1_00190000. Fault is of type FAULT_PTE ACCESS_TYPE_WRITE
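For watching a loop of runs, the Xid number can be pulled out of dmesg lines like the one above. This is a minimal sketch that assumes the line format shown here; `parse_xid` is a hypothetical helper, not an NVIDIA tool. NVIDIA's Xid documentation lists Xid 31 as a GPU memory page fault.

```python
import re

# Matches lines such as:
#   NVRM: Xid (PCI:0000:87:00): 31, Ch 00000003, intr 10000000. MMU Fault: ...
_XID = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+),\s*(?P<detail>.*)"
)


def parse_xid(line: str):
    """Return {'pci', 'xid', 'detail'} for an NVRM Xid line, or None if no match."""
    match = _XID.search(line)
    if match is None:
        return None
    return {
        "pci": match.group("pci"),
        "xid": int(match.group("xid")),
        "detail": match.group("detail").strip(),
    }
```

Feeding each line of `dmesg` output through `parse_xid` and keeping the non-`None` results gives a list of faults with their PCI address and Xid number.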
You have a desktop installed but no monitor connected, so the X server starts on boot, the NVIDIA driver exits immediately, and systemd restarts it in a loop. Please either prevent the X server from starting with
sudo systemctl disable display-manager
or configure the X driver to start without a monitor by adding
Option "AllowEmptyInitialConfiguration" "true"
to the Device section of your xorg.conf, then reboot and check.
I’m now running the same OS on a similar system (E3 processor and NVIDIA GTX 1050 Ti) and I’m experiencing the same issue.
The X server configuration is the one you suggested and the nvidia persistence daemon is enabled.
Those are completely different errors. What kind of workloads were you running when they appeared?
If you have some long-running CUDA kernels, you might hit a timeout when running X at the same time. If you don’t need graphics output, just disable the X server; otherwise try adding