4x RTX 4090 (PL 360W) Training Rig Crashes - PCIe Errors & Insufficient Power Supply?

  • Setup: Lenovo ThinkStation PX, 4x RTX 4090 (Power Limit set to 360W), Linux (e.g., Ubuntu 22.04), NVIDIA Driver [insert version, e.g., 550.54.14]

  • Issue: The rig crashes suddenly after 24 to 240 hours of continuous deep learning training. A forced reboot is required, and there are no prior warnings (GPU temperature: 70-85℃). I recently increased the GPU power limit to 360W, and I suspect an insufficient power supply might be the cause.

  • Key Kernel Log Errors (after reboot):

    [    0.000000] x86/split lock detection: #AC: crashing the kernel on kernel split_locks and warning on user-space split_locks
    [    5.315691] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
    [    5.315692] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
    [    5.315695] {1}[Hardware Error]:   section_type: PCIe error, port_type: legacy PCI end point
    [    5.315697] {1}[Hardware Error]:   device_id: 0000:71:00.0, vendor_id: 0x1b21, device_id: 0x2142
    [    5.315699] {1}[Hardware Error]:   device_id: 0000:72:00.0, vendor_id: 0x1b21, device_id: 0x2142
    [    5.315708] {1}[Hardware Error]:   device_id: 0000:00:1a.0, vendor_id: 0x8086, device_id: 0x1bb4
    [    5.315709] {1}[Hardware Error]:   class_code: 060400, bridge: secondary_status: 0x0000
    [   10.665201] nvidia: module verification failed: signature and/or required key missing - tainting kernel
    
  • Help Needed: Could the crashes be caused by an insufficient power supply after raising the GPU power limit to 360W? Could they also be related to PCIe bandwidth or compatibility issues? How can I fix this through kernel parameters, driver settings, hardware adjustments, or a lower power limit?
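
As a reading aid for the kernel log above, here is a minimal Python sketch that groups the devices named in the `[Hardware Error]` lines by PCI vendor. The function name `devices_in_hardware_errors` and the small `KNOWN_VENDORS` table are illustrative assumptions; the regular expression only targets the condensed `device_id: ..., vendor_id: ..., device_id: ...` form shown in this post and may need adjusting for other log layouts.

```python
import re
from collections import defaultdict

# A few PCI vendor IDs relevant to this log (not a complete table).
KNOWN_VENDORS = {0x10DE: "NVIDIA", 0x1B21: "ASMedia", 0x8086: "Intel"}

# Matches the condensed form used in the log excerpt, for example:
#   device_id: 0000:71:00.0, vendor_id: 0x1b21, device_id: 0x2142
DEVICE_RE = re.compile(
    r"device_id:\s*(?P<bdf>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]),"
    r"\s*vendor_id:\s*0x(?P<vendor>[0-9a-fA-F]{4}),"
    r"\s*device_id:\s*0x(?P<device>[0-9a-fA-F]{4})"
)


def devices_in_hardware_errors(log_text):
    """Return {vendor_name: [(pci_address, device_id_hex), ...]}."""
    found = defaultdict(list)
    for line in log_text.splitlines():
        # Only look at APEI hardware error records.
        if "[Hardware Error]" not in line:
            continue
        m = DEVICE_RE.search(line)
        if m is None:
            continue
        vendor = int(m.group("vendor"), 16)
        name = KNOWN_VENDORS.get(vendor, f"unknown (0x{vendor:04x})")
        found[name].append((m.group("bdf"), "0x" + m.group("device").lower()))
    return dict(found)
```

Applied to the excerpt above, this should list only ASMedia (0x1b21) and Intel (0x8086) devices and nothing with the NVIDIA vendor ID 0x10de, so these corrected PCIe errors do not by themselves point at the GPUs.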

Hi @wjxnow, thanks for reporting this issue.
Could you capture a bug report the next time you see the crash and upload it here?
You can run sudo nvidia-bug-report.sh right after a crash to capture it.

nvidia-bug-report.log (8.5 MB)

Hi there,

Thank you so much for your prompt reply and clear guidance!

I’ve followed your instructions: after the crash occurred, I ran sudo nvidia-bug-report.sh to capture the complete bug report, and the file (named nvidia-bug-report.log) has been uploaded here as requested.

Please let me know if you need any further information (like additional system logs, test results, or details about my hardware setup), and I’ll be happy to assist promptly.

Thanks again for your support!

Hi @wjxnow, sorry for the late reply.
I don’t see any NVIDIA driver related error logs in the bug report that you uploaded.
Could you capture the bug report immediately after you see the issue again?
Also, I see that the currently installed driver version is 550.78. Could you try the latest driver? Thanks.
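
For anyone who wants to check a log for NVIDIA driver errors themselves, here is a minimal Python sketch. It assumes driver faults show up in the kernel log as `NVRM: Xid (...)` lines, which is the usual form; the function names are illustrative and the pattern is not an exhaustive list of driver messages.

```python
import re

# Assumed shape of a driver fault line in the kernel log:
#   NVRM: Xid (PCI:<address>): <code>, <details>
XID_RE = re.compile(r"NVRM: Xid \((?P<dev>[^)]*)\): (?P<code>\d+)")


def find_xid_events(log_text):
    """Return a list of (device, xid_code, full_line) tuples."""
    events = []
    for line in log_text.splitlines():
        m = XID_RE.search(line)
        if m:
            events.append((m.group("dev"), int(m.group("code")), line.strip()))
    return events


def has_nvidia_driver_errors(log_text):
    """True if at least one Xid line is present in the text."""
    return bool(find_xid_events(log_text))
```

The text can come from the bug report file or, if the systemd journal is persistent, from the kernel log of the previous boot (`journalctl -k -b -1`). An empty result matches the observation above that the uploaded report contains no driver errors.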

Thank you for your reply. I will generate the bug report immediately when the issue occurs again. Also, I will update to the latest NVIDIA driver right away and check if the problem persists. Thanks for your guidance.
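
Because the crash forces a reboot, the state just before the failure is otherwise lost. One option is to keep a running record of per-GPU power draw and temperature on disk. Below is a minimal Python sketch, assuming `nvidia-smi` is on the PATH; the file name `gpu_telemetry.log`, the 5-second interval, and the function names are hypothetical choices, not anything the driver provides.

```python
import csv
import io
import os
import subprocess
import time

# Standard nvidia-smi query fields.
QUERY = "index,power.draw,power.limit,temperature.gpu"


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output into a list of dicts."""
    rows = []
    for rec in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(rec) != 4:
            continue
        try:
            rows.append({
                "index": int(rec[0]),
                "power_w": float(rec[1]),
                "limit_w": float(rec[2]),
                "temp_c": int(rec[3]),
            })
        except ValueError:
            # Skip rows with unavailable values such as "[N/A]".
            continue
    return rows


def total_power_w(rows):
    """Sum the instantaneous power draw across all parsed GPUs."""
    return sum(r["power_w"] for r in rows)


def log_forever(path="gpu_telemetry.log", interval_s=5.0):
    """Append one timestamped sample per interval and force it to disk."""
    while True:
        out = subprocess.run(
            ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
        rows = parse_gpu_csv(out)
        stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
        with open(path, "a") as fh:
            fh.write(f"{stamp} total={total_power_w(rows):.1f}W {rows}\n")
            fh.flush()
            os.fsync(fh.fileno())  # so the last samples survive a hard crash
        time.sleep(interval_s)
```

Running `log_forever()` in a separate terminal or service during training leaves a record that ends at the last sample before the crash. With four GPUs capped at 360W, the combined GPU limit is 1440W, so the logged total can be compared against that figure and the power supply rating.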
