Two monitors connected to two Quadro RTX don't work in any capacity

The installer log is for 460.84; I didn’t see that it had been replaced by 465.31. People change things all the time between generating a bug report log and getting stack traces like these, so it’s always best to confirm.

I’ll do some quick analysis and file a bug.

Edit: Filed internal bug number 3323148.

Thank you! You have my email. Please include me on the bug.

Thanks!

Aaron, there is a specific feature in nvbugs to include external people. I won’t see any private (normal) messages in the internal thread, only those that are explicitly marked as ‘public’, and they will be marked with a red frame to warn the participants. Please add me to the bug email list.

I don’t have permission to do that but I’ll note it in the bug.

Hi vaihoheso,

We have set up a TRX40 system for debugging this and can recreate the page faults in #10.

In our testing there were no page faults if amd_iommu=off is appended to the kernel command line or if IOMMU is disabled in the SBIOS. X comes up as expected. Can you please confirm whether you continue to see crashes on your system even after IOMMU is disabled?
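To check which of these states a machine actually booted in, the kernel command line can be inspected. This is only a minimal Python sketch, not part of the driver or of any NVIDIA tool: the helper name `amd_iommu_disabled` is hypothetical, and it looks only for the literal `amd_iommu=off` token mentioned above. It cannot tell whether IOMMU was disabled in the SBIOS instead.

```python
from pathlib import Path


def amd_iommu_disabled(cmdline: str) -> bool:
    """Return True if the literal token amd_iommu=off is on a kernel command line.

    Hypothetical helper for illustration only.
    """
    # Kernel parameters are separated by whitespace, so compare whole tokens.
    return "amd_iommu=off" in cmdline.split()


if __name__ == "__main__":
    # On Linux, /proc/cmdline holds the command line of the running kernel.
    proc_cmdline = Path("/proc/cmdline")
    if proc_cmdline.exists():
        present = amd_iommu_disabled(proc_cmdline.read_text())
        print("amd_iommu=off present:", present)
```

Running it before and after changing the boot parameters shows whether the option really reached the kernel.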

For the page faults in #13, can you please provide more details? I found two options under system BIOS - OC → CPU Features → 1) IOMMU 2) SVM Mode. Are you disabling both SVM mode (AMD virtualization) and IOMMU?

We are restarting X to try and recreate the page fault in #13. Can you please provide the steps leading up to this issue? Please capture the nvidia-bug-report if you hit this again. If it hangs, please capture the kernel logs after the crash. An nvidia-bug-report captured after rebooting from the crash would help as well.

Thank you

We probably have different versions of BIOS; on mine, IOMMU was tucked away deep in a different menu. Setting it to disabled didn’t make GPU Mosaic work. X started, but only on one screen, with the driver obviously not working correctly: X doesn’t erase the background, leaving traces of moving windows.

Logs will follow.

With IOMMU disabled, X finally started on two monitors, I don’t know why, I am using the same configuration file. Maybe it was just colder in the room in the evening, and the push buffer initialization managed to pull through the RM’s timeout.

However, it doesn’t look like SLI mosaic. It’s not one big virtual screen. It’s two separate desktops.

Install any distro (I tried 6), install two Quadro cards linked with an NvLink bridge, connect a monitor to each, set up SLI Mosaic in nvidia-settings. I didn’t do anything special besides that.

Thank you for the verification. This matches our results on testing with IOMMU disabled (disabled in the System BIOS or with amd_iommu=off added to the kernel command line). The failures in #10 are under active investigation.

Have you run into #13 again after disabling IOMMU? Can you please report it here if you do?

Can you please attach the bug report for this scenario and a reference screenshot showing the two separate desktops?

There are no page faults or messages that push buffer initialization timed out with IOMMU disabled.

Is there a reason you want me to publish logs and screenshots on a public forum? You have my email, so let’s exchange private screenshots and logs over email. If you don’t know my email, ask in the nvbugs thread or ping Ian Williams.

Sorry about that. Please send them out to linux-bugs@nvidia.com. Thanks.

I’m having this same issue, I think. I have two RTX 4000s, one with an 8K PBP monitor attached to it, the other with four 4Ks, on an AMD Threadripper 3975WX. The same card, monitor, and X configuration works just fine on my Xeon W system. On the Threadripper system, I can start X okay on just one GPU, but as soon as I enable BaseMosaic, X fails to start and reports the same “Failed to initialize DMA” and “Failed to allocate push buffer” messages that vaihoheso reports. Passing amd_iommu=off to the kernel resolves the problem, but I would rather not have to do this. I’m currently running 470.42, but I’ve tried several recent versions with no difference in behavior. Please keep me posted if you come up with a resolution.
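For anyone comparing logs, a small filter can pull out just the two failure messages quoted in this thread. This is a rough Python sketch under assumptions: `find_failures` is a hypothetical helper, it matches only the two message fragments quoted above, and it assumes nothing about the rest of each log line or about where the log is stored.

```python
# The two message fragments quoted in this thread. Nothing else about the
# log format is assumed.
FAILURE_MARKERS = (
    "Failed to initialize DMA",
    "Failed to allocate push buffer",
)


def find_failures(log_text: str) -> list[str]:
    """Return the log lines that contain either quoted failure message.

    Hypothetical helper for illustration only.
    """
    return [
        line.strip()
        for line in log_text.splitlines()
        if any(marker in line for marker in FAILURE_MARKERS)
    ]
```

Pass it the contents of whichever X log the system writes, for example `find_failures(open(path, errors="replace").read())`, where `path` is wherever that log lives on the machine in question.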

If you have to turn off the IOMMU, this means your mainboard, or more precisely your BIOS, is not fit for an SLI/BaseMosaic setup. It blocks the GPUs from talking to each other. The only other workaround is to fiddle with the PCIe slot control config bits to turn off access control.
The driver can’t do anything about it.
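If "access control" here means PCIe Access Control Services (ACS), which is an assumption about what the poster has in mind, the current state of those bits can at least be inspected before anything is changed. The Python sketch below assumes that `lspci -vvv` prints a line starting with `ACSCtl:` followed by flag names suffixed with `+` (enabled) or `-` (disabled). The helper name `parse_acs_ctl` is hypothetical, and the sketch only reads text. It does not modify any configuration bits.

```python
import re


def parse_acs_ctl(lspci_text: str) -> list[dict[str, bool]]:
    """Extract the flags from every 'ACSCtl:' line in lspci-style text.

    Hypothetical helper. The line format is an assumption about how
    'lspci -vvv' prints the ACS control register.
    """
    results = []
    for line in lspci_text.splitlines():
        label, sep, rest = line.strip().partition(":")
        if sep and label == "ACSCtl":
            # Each flag is a name followed by '+' (enabled) or '-' (disabled).
            flags = {
                name: sign == "+"
                for name, sign in re.findall(r"(\w+)([+-])", rest)
            }
            results.append(flags)
    return results
```

Feeding it the saved output of `lspci -vvv` (run as root so the capabilities are visible) gives one dictionary per port that reports ACS, which makes it easier to see which bridges have any redirect flag enabled.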

The error message “failed to allocate push buffer” is about communication between the CPU and just one GPU. Also, the GPUs communicate via NvLink, not PCIe.

Even if this were true, the driver could detect the problem at an earlier stage and report to the user that SLI Mosaic is impossible, instead of timing out waiting for a response from the push buffer.

@vaihoheso, unlike you I’m just using BaseMosaic, not SLI. RTX 4000 doesn’t support SLI. So I am indeed communicating over the PCIe bus. That said, I find it impossible to believe that my motherboard can’t handle this: I’m running one of AMD’s flagship line of CPUs on the single most widely-used motherboard for its socket type. Lots of people who buy systems like this use them for heavy virtualization. It’s ridiculous that they would sell such a board with an IOMMU that can’t handle something this basic. And even if that somehow were the case, I’d expect the Nvidia driver to handle the failure more gracefully than this.

Anyway, I’ve looked through my UEFI settings for anything that might be breaking communication and come up empty. The one setting I found that looked like it could cause trouble, “BME DMA Mitigation”, is already disabled by default. I tried changing the “Mmio Above 4G Limit” from “Auto” to the maximum setting of 43 bits, without effect.

In both cases “failed to allocate push buffer” means communication between CPU and GPU over PCIe.

I absolutely agree. It is ridiculous. What is even more frustrating is a stonewall regarding any technical details of the issue. NVIDIA changed a lot since I was there. NVIDIA used to be much more open and “human”. Compare it with NVIDIA’s response to the GTX 970 memory issue in 2015. Jonah himself came forward, publicly explaining the details.

There’s no “stonewall” in any way, since this is simply about the capabilities of the IOMMU in general, how it works, and how BIOS vendors configure it. Mosaic needs p2p communication; virtualization needs device isolation.
Robert Crovella of Nvidia explained this a long time ago, check this thread:
https://forums.developer.nvidia.com/t/multi-gpu-peer-to-peer-access-failing-on-tesla-k80/39748/10

Let me repeat.

  1. “Failed to allocate push buffer” means that communication between the CPU and a GPU failed. It has zero to do with GPUs communicating via PCIe. This message means that the basic initialization of a GPU failed long before any p2p communication between GPUs could have been attempted.

  2. In the case of SLI Mosaic, the GPUs communicate over the NvLink bus, which has nothing to do with PCIe and/or IOMMU.

  3. Even if what you are saying were true, the driver could have handled it gracefully and reported the reason to the user, instead of requiring users to knock on every possible support door for months.