Error DMAR: DRHD: handling fault status reg xxx & [DMA Write] Request device [18:00.0] fault addr

Hi,

I ran into a “DMAR fault” problem when running the TensorFlow benchmarks (see Ref1 & Ref2).

Ref1: [SOLVED] DMAR errors related to Intel graphics / Kernel & Hardware / Arch Linux Forums
Ref2: [SOLVED] "kernel: DMAR: DRHD: handling fault status reg 3" / Kernel & Hardware / Arch Linux Forums

The following error occurs when the number of GPUs requested (--num_gpus) is 2 or more.

Command used
→ python tf_cnn_benchmarks.py --num_gpus=2 --batch_size=32 --model=resnet152 --data_format=NCHW --variable_update=replicated --use_fp16 --num_epochs=1

[ 351.621348] DMAR: DRHD: handling fault status reg 202
[ 351.621351] DMAR: [DMA Write] Request device [18:00.0] fault addr 8f139000 [fault reason 05] PTE Write access is not set
[ 356.334510] dmar_fault: 20595 callbacks suppressed
[ 356.334511] DMAR: DRHD: handling fault status reg 2
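
For anyone triaging a log like the one above, here is a minimal sketch that pulls the requesting device, fault address and fault reason out of each DMAR fault line and counts faults per device. The names parse_dmar_faults and faults_per_device are hypothetical helpers, and the regular expression assumes the line layout shown in this post; other kernel versions may format these lines differently.

```python
import re
from collections import Counter

# Assumed layout, taken from the log in this post:
# DMAR: [DMA Write] Request device [18:00.0] fault addr 8f139000 [fault reason 05] PTE Write access is not set
FAULT_RE = re.compile(
    r"DMAR: \[(?P<op>DMA \w+)\] Request device \[(?P<dev>[0-9a-fA-F:.]+)\] "
    r"fault addr (?P<addr>[0-9a-fA-Fx]+) \[fault reason (?P<reason>[0-9a-fA-Fx]+)\] (?P<text>.*)"
)


def parse_dmar_faults(log_text):
    """Return one dict per DMAR fault line found in log_text (hypothetical helper)."""
    faults = []
    for line in log_text.splitlines():
        match = FAULT_RE.search(line)
        if match:
            faults.append(match.groupdict())
    return faults


def faults_per_device(faults):
    """Count parsed faults by the PCI address of the requesting device."""
    return Counter(fault["dev"] for fault in faults)
```

The text to parse could come from dmesg output saved to a file. Note that the kernel rate-limits these messages (“callbacks suppressed”), so the counts are a lower bound.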

The solution described in Ref1 & Ref2 is:
→ IOMMU is enabled by default, and it is not as “defaultable” as initially hoped.
→ Set the kernel parameter “intel_iommu=off”.

But I currently have a requirement that “IOMMU must be On”.
Is there any other solution that can overcome the “DMAR fault” problem when IOMMU is not set to off?

Platform = Intel
Environment = Ubuntu 16.04
Nvidia Driver = 410.78
Cuda Version = 10.0
GPU card1: GTX 1070 *1
GPU card2: GTX 1070 *1
GPU card3: GTX 1050 *1
Docker image = tensorflow: 1.13.1-gpu-py3

Try disabling access control on the pcie ports:
[url]https://devtalk.nvidia.com/default/topic/883054/cuda-programming-and-performance/multi-gpu-peer-to-peer-access-failing-on-tesla-k80-/1[/url]

Hi generix,

According to this discussion, https://devtalk.nvidia.com/default/topic/883054/cuda-programming-and-performance/multi-gpu-peer-to-peer-access-failing-on-tesla-k80-/1 ,
the conclusion is to “turn off” the ACS CAPID attribute.

But on our platform the ACS CAPID attribute is “Read Only”, so it cannot be turned off.

Is there any other suggested solution?

Thanks a lot!

ACSCap is of course read-only since those are the capabilities of the slot. ACSCtl (control) has to be turned off.

Hi generix,
First of all, thank you for your kind response.

Based on your information, I asked the vendor and got the following response:

Hi, after internal discussion,
the CPU- SKX is designed to have ACS with extended CAP_ID= 0x0D meaning when the system PO,
the ACS capability is there already without BIOS programming. As for the PCH PCIe RP, the attribute is RWO (read/write one).
As a result, you could access ACSCTRL: Access Control Services Control Register to turn on/off the features based on your need instead of removing this CAPID from extend capability list.
Please let us know if any. Thanks

The conclusion I drew from the reply above is:
“ACS can be turned off through ACSCtl on the PCH root port, but not on the CPU root port.”

Are there any other solutions available?
Thank you!

I think you’re misinterpreting the answer. It says exactly the same as I did: the ACS capability of the root port is read-only, so you’ll have to write to the ACS control register. You posed the wrong question to the vendor.

The capability register is telling what it can do.
The control register is controlling/telling what it actually does.

You can see from this post
[url]https://devtalk.nvidia.com/default/topic/883054/cuda-programming-and-performance/multi-gpu-peer-to-peer-access-failing-on-tesla-k80-/post/4766413/#4766413[/url]
that the ACSCtrl register is manipulated.
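
To make the capability/control distinction concrete, here is a minimal sketch that reads the text printed by lspci -vvv and reports, per device, which ACS features are currently enabled in ACSCtl (flags ending in “+”), ignoring the read-only ACSCap line. The function name enabled_acs_controls is hypothetical, and the parsing assumes the usual lspci layout: the device address at the start of an unindented line, and an indented “ACSCtl:” line under the Access Control Services capability.

```python
import re

# A device header starts in column 0 with the PCI address, e.g. "00:01.0 PCI bridge: ..."
DEVICE_RE = re.compile(r"^([0-9a-fA-F:.]+) ")
# The control register line is indented, e.g. "\t\tACSCtl:\tSrcValid+ TransBlk- ..."
ACSCTL_RE = re.compile(r"^\s+ACSCtl:\s*(.*)$")


def enabled_acs_controls(lspci_text):
    """Map each device address to the ACSCtl flags shown as enabled ('+').

    Hypothetical helper: it only reads the ACSCtl line; ACSCap is skipped
    because it describes what the port can do, not what it is doing.
    """
    result = {}
    current = None
    for line in lspci_text.splitlines():
        header = DEVICE_RE.match(line)
        if header:
            current = header.group(1)
            continue
        ctl = ACSCTL_RE.match(line)
        if ctl and current is not None:
            flags = ctl.group(1).split()
            result[current] = [flag[:-1] for flag in flags if flag.endswith("+")]
    return result
```

Feed it the output of lspci -vvv (run as root so the extended capabilities are visible). Any port that still lists enabled flags is a candidate for clearing ACSCtl, as done in the linked post.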