Unable to connect to DFP-0-3, kernel deadlock

Whenever I try to attach my custom EDID in Xorg to DFP-0|1|2|3 on my GRID card, Xorg deadlocks at 100% CPU usage and that’s it. The whole system cannot be rebooted, DISPLAY=:0 xrandr freezes, and nvidia-smi freezes. I tried every driver, and I think I noticed a thread here saying this functionality is not supported without a GRID license or something.

I looked at GRID licensing, but it does not mention anything about this DFP custom EDID issue.

Basically, all I want is to be able to attach my custom EDID to DFP-0, DFP-1, DFP-2, and DFP-3 without the whole system freezing.

Can anyone help me?

Without logs, config files, or any info about the card used, not really.
So I’ll start with a bit of guessing. Since you’re talking about ‘grid’, maybe those are Tesla M models. Do you know about gpumodeswitch?
On Keplers, this wasn’t necessary and attaching EDIDs worked OOTB, see:
https://devtalk.nvidia.com/default/topic/991240/

The GPU reports that it is in graphics mode. It’s pretty much an M10, and I say ‘pretty much’ for fear that the answer will be that my card is unsupported :(

root@x3:~# lspci -n | grep 10de
83:00.0 0300: 10de:13bd (rev a2)
84:00.0 0300: 10de:13bd (rev a2)
85:00.0 0300: 10de:13bd (rev a2)
86:00.0 0300: 10de:13bd (rev a2)
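
A side note on the addresses above, since it is easy to trip over: lspci prints the bus number in hexadecimal, while the BusID option in xorg.conf expects decimal values. A minimal Python sketch of the conversion follows. The helper name lspci_to_busid is made up for illustration, and an optional PCI domain prefix is simply ignored here.

```python
import re

# Matches "83:00.0" or "0000:83:00.0" as printed by lspci (all fields hexadecimal).
_LSPCI_ADDRESS = re.compile(
    r"(?:[0-9a-fA-F]{4}:)?([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])"
)


def lspci_to_busid(address: str) -> str:
    """Convert an lspci address (hex) into an Xorg BusID string (decimal).

    Hypothetical helper for illustration; the PCI domain, if present, is dropped.
    """
    match = _LSPCI_ADDRESS.fullmatch(address.strip())
    if match is None:
        raise ValueError(f"unrecognized PCI address: {address!r}")
    bus, device, function = (int(part, 16) for part in match.groups())
    return f"PCI:{bus}:{device}:{function}"


if __name__ == "__main__":
    for addr in ("83:00.0", "84:00.0", "85:00.0", "86:00.0"):
        print(addr, "->", lspci_to_busid(addr))
```

For the four devices listed, this gives PCI:131:0:0 through PCI:134:0:0.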

Might VT-d need to be enabled? Or something in the BIOS?

I PMed you the logs. Thank you for taking a look if you get around to it.

EDIT: This is the thread I found: https://gridforums.nvidia.com/default/topic/729/m6-rhel-6-6-unable-to-connect-connect-edid-monitor-to-valid-display-devices/ but in my case I don’t get any errors. I get a kernel lockup.

The problem in your case is multi-GPU vs. multi-head. You’re trying to run one screen on each GPU and pop them together like they’re heads. That doesn’t work without a multi-GPU config, which is either something you don’t want (Xinerama, no 3D accel) or NVIDIA-specific (Mosaic, don’t know if that still works).
The normal setup would be to connect several fake monitors to one GPU and then maybe have one xorg.conf and X server per GPU.
But if the info from the thread you found applies to the M10 as well, you’re out of luck without a vGPU setup.
To test, you should start with an xorg.conf from scratch that references only one GPU with one screen and one monitor and no other options, see if that works, and then try to add another monitor.
Another option, depending on your use case, is to use just one huge monitor.
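
To make the "one xorg.conf per GPU, several fake monitors per GPU" idea more concrete, here is a rough Python sketch that renders such a config. It is only an illustration under assumptions: the function name and the reduced set of sections are mine, and the list syntax (comma-separated ConnectedMonitor, semicolon-separated CustomEDID pairs) is how I understand the NVIDIA driver README. It says nothing about whether the driver accepts this on a GRID card.

```python
from typing import Sequence


def render_xorg_conf(bus_id: str, edid_path: str, outputs: Sequence[str]) -> str:
    """Render a minimal single-GPU xorg.conf with an EDID forced on each output.

    Hypothetical helper: only Device and Screen sections are emitted.
    """
    # ConnectedMonitor takes a comma-separated list of display device names.
    connected = ", ".join(outputs)
    # CustomEDID takes semicolon-separated "<display>:<file>" pairs.
    custom_edid = "; ".join(f"{name}:{edid_path}" for name in outputs)
    lines = [
        'Section "Device"',
        '    Identifier     "Device0"',
        '    Driver         "nvidia"',
        f'    BusID          "{bus_id}"',
        f'    Option         "ConnectedMonitor" "{connected}"',
        f'    Option         "CustomEDID" "{custom_edid}"',
        "EndSection",
        "",
        'Section "Screen"',
        '    Identifier     "Screen0"',
        '    Device         "Device0"',
        "    DefaultDepth    24",
        "EndSection",
        "",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # One config per GPU; the bus IDs are the decimal forms of the lspci output.
    for bus_id in ("PCI:131:0:0", "PCI:132:0:0"):
        print(render_xorg_conf(bus_id, "/etc/X11/1920x1080.bin", ["DFP-0", "DFP-1"]))
```

Each rendered file could then be handed to its own X server with Xorg's -config option, one display number per GPU.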

Thanks. For the 1 monitor, 1 screen xorg.conf from scratch, I tried leaving only 1 screen, 1 device, and 1 monitor in the server layout. Attached to DFP-0, the same freeze happens. Maybe I need to clear it out fully, even the unused options? Is there a way to disable the NVIDIA auto-configuration that seems to automatically find cards when X starts?

The one huge monitor approach is interesting, but I am not sure how to create that. In my use case I need the screens to be separate and have 3D accel (so no need for Xinerama/TwinView/etc.). I was wondering whether it would be possible to create one large display/screen, for example 16k x 16k, from 1 output and then split it into 1920x1080 screens :0.1, :0.2, etc. But I do not think Xorg offers this option.

Please post just the new xorg.conf, I’ll take a look at it.

Section "ServerLayout"
    Identifier     "Layout0"
    Screen      0  "Screen0" 0 0
    Option         "Xinerama" "0"
EndSection

Section "Files"
EndSection

Section "ServerFlags"
    Option "AutoAddDevices" "false"
EndSection

Section "Monitor"
    Identifier     "Monitor0"
    VendorName     "Unknown"
    ModelName      "CRT-0"
    HorizSync       0.0 - 0.0
    VertRefresh     0.0
EndSection

Section "Device"
    Identifier     "Device0"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    Option         "ConnectedMonitor" "DFP-0"
    Option         "CustomEDID" "DFP-0:/etc/X11/U2713HM.edid"
    BusID          "PCI:131:0:0"
EndSection

Section "Screen"
    Identifier     "Screen0"
    Device         "Device0"
    Monitor        "Monitor0"
    DefaultDepth    24
    Option         "Stereo" "0"
    Option         "SLI" "Off"
    Option         "MultiGPU" "Off"
    Option         "BaseMosaic" "off"
    SubSection     "Display"
        Depth       24
    EndSubSection
EndSection

If I change

Option         "ConnectedMonitor" "DFP-0"
Option         "CustomEDID" "DFP-0:/etc/X11/U2713HM.edid"

to

Option         "ConnectedMonitor" "CRT-0"
Option         "CustomEDID" "CRT-0:/etc/X11/U2713HM.edid"

Everything works without the deadlock and xrandr shows modes (otherwise xrandr hangs because of the deadlock in Xorg). But this card only has 1 CRT-0 and 4 DFP-[N].

Modesetting rules for DFPs are more strict than for CRTs. Probably

HorizSync       0.0 - 0.0
VertRefresh     0.0

is overriding the values in the EDID. Delete those and retry with DFP.
Since you’re attaching an EDID, you can most likely go without a Monitor section at all.

Since I don’t know what the EDID contains, you can also go with single-resolution EDIDs:
https://github.com/akatrevorjay/edid-generator
It has .bin files for common resolutions, and edid-generator can build bins from modelines, which you can create using e.g. cvt -r.
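
Before attaching any EDID file, it can help to check that it is a well-formed 128-byte base block. Below is a minimal Python sketch that decodes a few fixed fields; the function name and the choice of fields are mine. The sample bytes are the ones the driver dumps for the 1920x1080 EDID in the Xorg log further down in this thread.

```python
EDID_HEADER = bytes.fromhex("00ffffffffffff00")

# Raw EDID bytes as printed by the NVIDIA driver in this thread's Xorg log.
SAMPLE_EDID = bytes.fromhex("""
    00 ff ff ff ff ff ff 00  31 d8 00 00 00 00 00 00
    05 16 01 03 6d 32 1c 78  ea 5e c0 a4 59 4a 98 25
    20 50 54 00 00 00 d1 c0  01 01 01 01 01 01 01 01
    01 01 01 01 01 01 02 3a  80 18 71 38 2d 40 58 2c
    45 00 f4 19 11 00 00 1e  00 00 00 ff 00 4c 69 6e
    75 78 20 23 30 0a 20 20  20 20 00 00 00 fd 00 3b
    3d 42 44 0f 00 0a 20 20  20 20 20 20 00 00 00 fc
    00 4c 69 6e 75 78 20 46  48 44 0a 20 20 20 00 05
""")


def describe_edid(block: bytes) -> dict:
    """Decode a few fields of an EDID base block (hypothetical helper)."""
    if len(block) < 128:
        raise ValueError("an EDID base block is 128 bytes long")
    base = block[:128]
    # Manufacturer ID: three 5-bit letters (1 = 'A') packed big-endian in bytes 8-9.
    packed = (base[8] << 8) | base[9]
    manufacturer = "".join(
        chr(((packed >> shift) & 0x1F) + ord("A") - 1) for shift in (10, 5, 0)
    )
    return {
        "header_ok": base[:8] == EDID_HEADER,
        # All 128 bytes must sum to 0 modulo 256.
        "checksum_ok": sum(base) % 256 == 0,
        "manufacturer": manufacturer,
        "version": f"{base[18]}.{base[19]}",
        # Bit 7 of byte 20 distinguishes digital (1) from analog (0) input.
        "digital_input": bool(base[20] & 0x80),
        # First detailed timing: pixel clock in units of 10 kHz, little-endian.
        "pixel_clock_mhz": int.from_bytes(base[54:56], "little") / 100,
    }


if __name__ == "__main__":
    for key, value in describe_edid(SAMPLE_EDID).items():
        print(f"{key}: {value}")
```

For the sample block the decoded values agree with what the driver prints (LNX, version 1.3, 148.50 MHz). The block is also flagged as an analog input, matching the "Input Type: Analog" line in the log. Whether that matters when it is attached to a DFP is not something this thread establishes.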

You can also add

Option "ModeDebug" "true"

to the Screen section and then see in Xorg.0.log which modes are evaluated, which are used, and which are invalid and why.

So I tried this

Section "ServerLayout"
    Identifier     "Layout0"
    Screen      0  "Screen0" 0 0
    Option         "Xinerama" "0"
EndSection

Section "Files"
EndSection

Section "ServerFlags"
    Option "AutoAddDevices" "false"
EndSection

Section "Device"
    Identifier     "Device0"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    Option         "ConnectedMonitor" "DFP-0"
    Option         "CustomEDID" "DFP-0:/etc/X11/1920x1080.bin"
    BusID          "PCI:134:0:0"
EndSection

Section "Screen"
    Identifier     "Screen0"
    Device         "Device0"
    DefaultDepth    24
    Option         "Stereo" "0"
    Option         "SLI" "Off"
    Option         "MultiGPU" "Off"
    Option         "BaseMosaic" "off"
    Option         "ModeDebug" "true"
    SubSection     "Display"
        Depth       24
    EndSubSection
EndSection

I used the 1920x1080.bin I got from edid-generator, and also tried my U2713HM.edid.

[   781.044] (II) NVIDIA(0): NVIDIA GPU GRID M40 (GM107GL-A) at PCI:134:0:0 (GPU-0)
[   781.044] (--) NVIDIA(0): Memory: 4194304 kBytes
[   781.044] (--) NVIDIA(0): VideoBIOS: 82.07.6d.00.0b
[   781.044] (II) NVIDIA(0): Detected PCI Express Link width: 16X
[   781.045] (--) NVIDIA(GPU-0): CRT-0: disconnected
[   781.045] (--) NVIDIA(GPU-0): CRT-0 Name Aliases:
[   781.045] (--) NVIDIA(GPU-0):   CRT
[   781.045] (--) NVIDIA(GPU-0):   CRT-0
[   781.045] (--) NVIDIA(GPU-0):   DPY-0
[   781.045] (--) NVIDIA(GPU-0):   VGA-0
[   781.045] (--) NVIDIA(GPU-0):   VGA-0
[   781.045] (--) NVIDIA(GPU-0): CRT-0: 400.0 MHz maximum pixel clock
[   781.045] (--) NVIDIA(GPU-0):
[   781.045] (--) NVIDIA(GPU-0): LNX Linux FHD (DFP-0): connected
[   781.045] (--) NVIDIA(GPU-0): LNX Linux FHD (DFP-0): Internal TMDS
[   781.045] (--) NVIDIA(GPU-0): LNX Linux FHD (DFP-0) Name Aliases:
[   781.045] (--) NVIDIA(GPU-0):   DFP
[   781.045] (--) NVIDIA(GPU-0):   DFP-0
[   781.045] (--) NVIDIA(GPU-0):   DPY-1
[   781.045] (--) NVIDIA(GPU-0):   DVI-I-0
[   781.045] (--) NVIDIA(GPU-0):   DPY-EDID-64866c46-1f56-ed22-cce3-db6d0656cceb
[   781.045] (--) NVIDIA(GPU-0):   DVI-I-0
[   781.045] (--) NVIDIA(GPU-0): LNX Linux FHD (DFP-0): 165.0 MHz maximum pixel clock
[   781.045] (--) NVIDIA(GPU-0):
[   781.045] (--) NVIDIA(GPU-0): --- EDID for LNX Linux FHD (DVI-I-0) ---
[   781.045] (--) NVIDIA(GPU-0): EDID Version                 : 1.3
[   781.045] (--) NVIDIA(GPU-0): Manufacturer                 : LNX
[   781.045] (--) NVIDIA(GPU-0): Monitor Name                 : LNX Linux FHD
[   781.045] (--) NVIDIA(GPU-0): Product ID                   : 0x0000
[   781.045] (--) NVIDIA(GPU-0): 32-bit Serial Number         : 0x00000000
[   781.045] (--) NVIDIA(GPU-0): Serial Number String         : Linux #0
[   781.045] (--) NVIDIA(GPU-0): Manufacture Date             : 2012, week 5
[   781.045] (--) NVIDIA(GPU-0): DPMS Capabilities            : Standby Suspend Active Off
[   781.045] (--) NVIDIA(GPU-0): Input Type                   : Analog
[   781.045] (--) NVIDIA(GPU-0): Prefer first detailed timing : Yes
[   781.046] (--) NVIDIA(GPU-0): Supports GTF                 : No
[   781.046] (--) NVIDIA(GPU-0): Maximum Image Size           : 500 mm x 280 mm
[   781.046] (--) NVIDIA(GPU-0): Valid HSync Range            : 66.0 kHz - 68.0 kHz
[   781.046] (--) NVIDIA(GPU-0): Valid VRefresh Range         : 59.0 Hz - 61.0 Hz
[   781.046] (--) NVIDIA(GPU-0): EDID maximum pixel clock     : 150.0 MHz
[   781.046] (--) NVIDIA(GPU-0):
[   781.046] (--) NVIDIA(GPU-0): Standard Timings:
[   781.046] (--) NVIDIA(GPU-0):   1920 x 1080 @ 60 Hz
[   781.046] (--) NVIDIA(GPU-0):
[   781.046] (--) NVIDIA(GPU-0): Detailed Timings:
[   781.046] (--) NVIDIA(GPU-0):   1920 x 1080 @ 60 Hz
[   781.046] (--) NVIDIA(GPU-0):     Pixel Clock      : 148.50 MHz
[   781.046] (--) NVIDIA(GPU-0):     HRes, HSyncStart : 1920, 2008
[   781.046] (--) NVIDIA(GPU-0):     HSyncEnd, HTotal : 2052, 2200
[   781.046] (--) NVIDIA(GPU-0):     VRes, VSyncStart : 1080, 1084
[   781.046] (--) NVIDIA(GPU-0):     VSyncEnd, VTotal : 1089, 1125
[   781.046] (--) NVIDIA(GPU-0):     H/V Polarity     : +/+
[   781.046] (--) NVIDIA(GPU-0):     Image Size       : 500 mm x 281 mm
[   781.046] (--) NVIDIA(GPU-0):     RGB 444 bpcs     : 8
[   781.046] (--) NVIDIA(GPU-0):
[   781.046] (--) NVIDIA(GPU-0): --- End of EDID for LNX Linux FHD (DVI-I-0) ---
[   781.046] (--) NVIDIA(GPU-0):
[   781.046] (--) NVIDIA(GPU-0):
[   781.046] (--) NVIDIA(GPU-0): Raw EDID bytes:
[   781.046] (--) NVIDIA(GPU-0):
[   781.046] (--) NVIDIA(GPU-0):   00 ff ff ff ff ff ff 00  31 d8 00 00 00 00 00 00
[   781.046] (--) NVIDIA(GPU-0):   05 16 01 03 6d 32 1c 78  ea 5e c0 a4 59 4a 98 25
[   781.046] (--) NVIDIA(GPU-0):   20 50 54 00 00 00 d1 c0  01 01 01 01 01 01 01 01
[   781.046] (--) NVIDIA(GPU-0):   01 01 01 01 01 01 02 3a  80 18 71 38 2d 40 58 2c
[   781.046] (--) NVIDIA(GPU-0):   45 00 f4 19 11 00 00 1e  00 00 00 ff 00 4c 69 6e
[   781.046] (--) NVIDIA(GPU-0):   75 78 20 23 30 0a 20 20  20 20 00 00 00 fd 00 3b
[   781.046] (--) NVIDIA(GPU-0):   3d 42 44 0f 00 0a 20 20  20 20 20 20 00 00 00 fc
[   781.046] (--) NVIDIA(GPU-0):   00 4c 69 6e 75 78 20 46  48 44 0a 20 20 20 00 05
[   781.046] (--) NVIDIA(GPU-0):
[   781.046] (--) NVIDIA(GPU-0): DFP-1: disconnected
[   781.046] (--) NVIDIA(GPU-0): DFP-1: Internal TMDS
[   781.046] (--) NVIDIA(GPU-0): DFP-1 Name Aliases:
[   781.046] (--) NVIDIA(GPU-0):   DFP
[   781.046] (--) NVIDIA(GPU-0):   DFP-1
[   781.046] (--) NVIDIA(GPU-0):   DPY-2
[   781.046] (--) NVIDIA(GPU-0):   DVI-I-1
[   781.046] (--) NVIDIA(GPU-0):   DVI-I-1
[   781.046] (--) NVIDIA(GPU-0): DFP-1: 165.0 MHz maximum pixel clock
[   781.046] (--) NVIDIA(GPU-0):
[   781.046] (--) NVIDIA(GPU-0): DFP-2: disconnected
[   781.046] (--) NVIDIA(GPU-0): DFP-2: Internal TMDS
[   781.046] (--) NVIDIA(GPU-0): DFP-2 Name Aliases:
[   781.046] (--) NVIDIA(GPU-0):   DFP
[   781.046] (--) NVIDIA(GPU-0):   DFP-2
[   781.046] (--) NVIDIA(GPU-0):   DPY-3
[   781.046] (--) NVIDIA(GPU-0):   DVI-I-2
[   781.046] (--) NVIDIA(GPU-0):   DVI-I-2
[   781.046] (--) NVIDIA(GPU-0): DFP-2: 165.0 MHz maximum pixel clock
[   781.046] (--) NVIDIA(GPU-0):
[   781.046] (--) NVIDIA(GPU-0): DFP-3: disconnected
[   781.046] (--) NVIDIA(GPU-0): DFP-3: Internal TMDS
[   781.046] (--) NVIDIA(GPU-0): DFP-3 Name Aliases:
[   781.046] (--) NVIDIA(GPU-0):   DFP
[   781.046] (--) NVIDIA(GPU-0):   DFP-3
[   781.046] (--) NVIDIA(GPU-0):   DPY-4
[   781.046] (--) NVIDIA(GPU-0):   DVI-I-3
[   781.046] (--) NVIDIA(GPU-0):   DVI-I-3
[   781.046] (--) NVIDIA(GPU-0): DFP-3: 165.0 MHz maximum pixel clock
[   781.046] (--) NVIDIA(GPU-0):
[   781.046] (II) NVIDIA(GPU-0):
[   781.046] (II) NVIDIA(GPU-0): --- Building ModePool for LNX Linux FHD (DFP-0) ---
[   781.046] (II) NVIDIA(GPU-0):   Validating Mode "1920x1080_60":
[   781.046] (II) NVIDIA(GPU-0):     Mode Source: EDID
[   781.046] (II) NVIDIA(GPU-0):     1920 x 1080 @ 60 Hz
[   781.046] (II) NVIDIA(GPU-0):       Pixel Clock      : 148.50 MHz
[   781.046] (II) NVIDIA(GPU-0):       HRes, HSyncStart : 1920, 2008
[   781.046] (II) NVIDIA(GPU-0):       HSyncEnd, HTotal : 2052, 2200
[   781.046] (II) NVIDIA(GPU-0):       VRes, VSyncStart : 1080, 1084
[   781.046] (II) NVIDIA(GPU-0):       VSyncEnd, VTotal : 1089, 1125
[   781.046] (II) NVIDIA(GPU-0):       Sync Polarity    : +H +V
[   781.046] (II) NVIDIA(GPU-0):     Viewport                 1920x1080+0+0
[   781.046] (II) NVIDIA(GPU-0):       Horizontal Taps        1
[   781.046] (II) NVIDIA(GPU-0):       Vertical Taps          1
[   781.046] (II) NVIDIA(GPU-0):       Base SuperSample       x1
[   781.046] (II) NVIDIA(GPU-0):       Base Depth             32
[   781.046] (II) NVIDIA(GPU-0):       Distributed Rendering  1
[   781.046] (II) NVIDIA(GPU-0):       Overlay Depth          32
[   781.046] (II) NVIDIA(GPU-0):     Mode "1920x1080_60" is valid.
[   781.046] (II) NVIDIA(GPU-0):
[   781.047] (II) NVIDIA(GPU-0):   Validating Mode "1920x1080_60":
[   781.047] (II) NVIDIA(GPU-0):     Mode Source: EDID
[   781.047] (II) NVIDIA(GPU-0):     1920 x 1080 @ 60 Hz
[   781.047] (II) NVIDIA(GPU-0):       Pixel Clock      : 148.50 MHz
[   781.047] (II) NVIDIA(GPU-0):       HRes, HSyncStart : 1920, 2008
[   781.047] (II) NVIDIA(GPU-0):       HSyncEnd, HTotal : 2052, 2200
[   781.047] (II) NVIDIA(GPU-0):       VRes, VSyncStart : 1080, 1084
[   781.047] (II) NVIDIA(GPU-0):       VSyncEnd, VTotal : 1089, 1125
[   781.047] (II) NVIDIA(GPU-0):       Sync Polarity    : +H +V
[   781.047] (II) NVIDIA(GPU-0):     Viewport                 1920x1080+0+0
[   781.047] (II) NVIDIA(GPU-0):       Horizontal Taps        1
[   781.047] (II) NVIDIA(GPU-0):       Vertical Taps          1
[   781.047] (II) NVIDIA(GPU-0):       Base SuperSample       x1
[   781.047] (II) NVIDIA(GPU-0):       Base Depth             32
[   781.047] (II) NVIDIA(GPU-0):       Distributed Rendering  1
[   781.047] (II) NVIDIA(GPU-0):       Overlay Depth          32
[   781.047] (II) NVIDIA(GPU-0):     Mode "1920x1080_60" is valid.
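
With ModeDebug enabled the log gets long quickly, so a small script can help pull out just the verdicts. The sketch below is Python and only relies on the two line shapes visible in the log above ("Validating Mode ..." and "Mode ... is valid."). Anything validated without a matching "is valid" line is reported as unconfirmed rather than guessing at the driver's rejection wording.

```python
import re

VALIDATING = re.compile(r'Validating Mode "([^"]+)"')
IS_VALID = re.compile(r'Mode "([^"]+)" is valid\.')


def summarize_modes(log_text: str) -> list:
    """Return (mode name, confirmed valid?) pairs in log order.

    Hypothetical helper; it only recognizes the line formats shown in this thread.
    """
    results = []
    for line in log_text.splitlines():
        started = VALIDATING.search(line)
        if started:
            # A new validation block begins; assume unconfirmed until told otherwise.
            results.append([started.group(1), False])
            continue
        verdict = IS_VALID.search(line)
        if verdict and results and results[-1][0] == verdict.group(1):
            results[-1][1] = True
    return [(name, ok) for name, ok in results]


SAMPLE_LOG = """\
[   781.046] (II) NVIDIA(GPU-0):   Validating Mode "1920x1080_60":
[   781.046] (II) NVIDIA(GPU-0):     Mode Source: EDID
[   781.046] (II) NVIDIA(GPU-0):       Pixel Clock      : 148.50 MHz
[   781.046] (II) NVIDIA(GPU-0):     Mode "1920x1080_60" is valid.
"""

if __name__ == "__main__":
    for name, ok in summarize_modes(SAMPLE_LOG):
        print(name, "valid" if ok else "not confirmed valid")
```

In practice the input would be the contents of Xorg.0.log instead of the embedded sample.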

Same exact deadlock. The whole system is alive but zombied. Nothing works, I can’t kill Xorg, and Xorg sits at 100% on one CPU core.

[77451.852814] nvidia-modeset: WARNING: GPU:2: Lost display notification (0:0x00000000); continuing.
[77484.768514] NVRM: GPU at PCI:0000:86:00: GPU-4bfabc64-9458-38dd-b6a0-287523e9fa01
[77484.768558] NVRM: GPU Board Serial Number: 0121415042015
[77484.768565] NVRM: Xid (PCI:0000:86:00): 16, Head 00000000 Count 00000000
[77492.769009] NVRM: Xid (PCI:0000:86:00): 16, Head 00000000 Count 00000001
[77500.769532] NVRM: Xid (PCI:0000:86:00): 16, Head 00000000 Count 00000002
[77508.770067] NVRM: Xid (PCI:0000:86:00): 16, Head 00000000 Count 00000003
[77516.770573] NVRM: Xid (PCI:0000:86:00): 16, Head 00000000 Count 00000004
[77524.771095] NVRM: Xid (PCI:0000:86:00): 16, Head 00000000 Count 00000005
[77532.771618] NVRM: Xid (PCI:0000:86:00): 16, Head 00000000 Count 00000006
[77540.772128] NVRM: Xid (PCI:0000:86:00): 16, Head 00000000 Count 00000007
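
The Xid lines above repeat at a steady pace, which is easier to see once they are summarized. A minimal Python sketch, assuming the dmesg line format shown above (the function name is made up); it counts events per device and Xid code and reports the gaps between consecutive events:

```python
import re
from collections import defaultdict

# Example line: "[77484.768565] NVRM: Xid (PCI:0000:86:00): 16, Head 00000000 Count 00000000"
XID_LINE = re.compile(r"\[\s*(\d+\.\d+)\] NVRM: Xid \(PCI:([0-9a-fA-F:]+)\): (\d+),")


def summarize_xids(dmesg_text: str) -> dict:
    """Group Xid events by (PCI device, Xid code).

    Hypothetical helper; returns the event count and the gaps in seconds
    between consecutive events for each group.
    """
    timestamps = defaultdict(list)
    for line in dmesg_text.splitlines():
        match = XID_LINE.search(line)
        if match:
            stamp, device, code = match.groups()
            timestamps[(device, int(code))].append(float(stamp))
    summary = {}
    for key, stamps in timestamps.items():
        gaps = [later - earlier for earlier, later in zip(stamps, stamps[1:])]
        summary[key] = {"count": len(stamps), "gaps": gaps}
    return summary


SAMPLE_DMESG = """\
[77484.768565] NVRM: Xid (PCI:0000:86:00): 16, Head 00000000 Count 00000000
[77492.769009] NVRM: Xid (PCI:0000:86:00): 16, Head 00000000 Count 00000001
[77500.769532] NVRM: Xid (PCI:0000:86:00): 16, Head 00000000 Count 00000002
"""

if __name__ == "__main__":
    for (device, code), info in summarize_xids(SAMPLE_DMESG).items():
        gaps = ", ".join(f"{gap:.1f}s" for gap in info["gaps"])
        print(f"{device} Xid {code}: {info['count']} events, gaps: {gaps}")
```

On the lines posted here the events arrive about 8 seconds apart. NVIDIA's Xid documentation is the place to look up what a given code means.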

OK, the config generally works now, but now the driver bug has surfaced:
XID 16
Recent NVIDIA drivers have bugs regarding modesetting, and it keeps getting worse. It’s hard to tell which drivers support the M10; it’s never mentioned in the lists.
Please check whether you can use the 361, 364, or 367 driver. I think the problems began around that time.

I tried 340 from the Ubuntu 16.04.3 repos and this one from NVIDIA for 367:
http://us.download.nvidia.com/XFree86/Linux-x86_64/367.57/NVIDIA-Linux-x86_64-367.57.run

Same thing. Should I go and try the 361 and 364 drivers? I thought 340 would rule that out.

I can’t find 361 and 364, that’s why. I found this page (NVIDIA: World Leader in Artificial Intelligence Computing), but the drivers there are all x86.

Here are all released drivers:
https://http.download.nvidia.com/XFree86/Linux-x86_64/

Hi

I had the same issue, running Red Hat 6. The X process takes 100% CPU, along with irq/nvidia"nn".
With the most recent versions of the driver (384 or 387) the X server behaved better, as it concluded that it was not receiving IRQs and killed itself. It was at least possible to access the machine via the network, reboot it, or whatever.

But in the end the issue was due to the cable I used (I connect to DVI screens, so I need an adapter): if I used an active DP → DVI dual-link adapter, the screen was DFP-2, and if I used a passive DP → DVI adapter, the screen was DFP-0.

I finally found this with nvidia-settings on a working screen (I’m in a multi-seat X config with 3 graphics cards) after plugging in my non-working one.

I imagined DFP-n referred to the physical DP slot, but I was wrong.

I needed to force the connected monitor because at the time I deployed the X config, cabling was not finished (and the screens were not connected) and the settings chosen by the driver were wrong. So I needed the ConnectedMonitor and CustomEDID options.

Hope it can help.

This might be helpful if it’s a problem with the EDID I’m using, or a problem that occurs when NOT supplying an EDID and letting NVIDIA pick what’s connected to the display. These cards are GRID cards and they have no physical output ports. But xrandr and Xorg both report 1 CRT and 4 DFP output ports (it is just impossible to physically connect anything to them).

Also, this seems like a serious driver bug to me if it’s happening to you as well and you have actual physical ports. In my case I thought I was getting the deadlock because of the lack of physical ports.

I also say serious bug because the whole system goes zombie and you cannot do anything, not even reboot.