Error DMAR: DRHD: handling fault status reg xxx & [DMA Write] Request device [18:00.0] fault addr

hiyatsai · May 2, 2019, 2:27am

Hi,

I encountered a “DMAR fault” problem when running the TensorFlow benchmarks (see Ref1 and Ref2).

Ref1: [SOLVED] DMAR errors related to Intel graphics / Kernel & Hardware / Arch Linux Forums
Ref2: [SOLVED] "kernel: DMAR: DRHD: handling fault status reg 3" / Kernel & Hardware / Arch Linux Forums

The following error occurs when the benchmark is run with 2 or more GPUs.

Command used:
→ python tf_cnn_benchmarks.py --num_gpus=2 --batch_size=32 --model=resnet152 --data_format=NCHW --variable_update=replicated --use_fp16 --num_epochs=1

[ 351.621348] DMAR: DRHD: handling fault status reg 202
[ 351.621351] DMAR: [DMA Write] Request device [18:00.0] fault addr 8f139000 [fault reason 05] PTE Write access is not set
[ 356.334510] dmar_fault: 20595 callbacks suppressed
[ 356.334511] DMAR: DRHD: handling fault status reg 2
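
To see which device is faulting and why, the log lines above can be pulled apart with a small script. This is a minimal sketch that assumes the exact line format shown in this thread (other kernel versions print DMAR faults differently); the function name `parse_dmar_faults` and the `SAMPLE` string are made up for illustration.

```python
import re

# Matches only the format seen in this thread, e.g.
# "DMAR: [DMA Write] Request device [18:00.0] fault addr 8f139000 [fault reason 05] ..."
# Other kernel versions may print extra fields, so treat this as an assumption.
FAULT_RE = re.compile(
    r"DMAR: \[DMA (?P<op>Read|Write)\] "
    r"Request device \[(?P<bdf>[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])\] "
    r"fault addr (?P<addr>[0-9a-fA-F]+) "
    r"\[fault reason (?P<reason>\d+)\] (?P<text>.*)"
)


def parse_dmar_faults(log_text):
    """Return one dict per DMAR fault line found in log_text."""
    faults = []
    for line in log_text.splitlines():
        m = FAULT_RE.search(line)
        if m:
            faults.append({
                "op": m.group("op"),              # "Read" or "Write"
                "bdf": m.group("bdf"),            # PCI bus:device.function
                "addr": int(m.group("addr"), 16), # faulting DMA address
                "reason": m.group("reason"),      # kept as text, exactly as logged
                "text": m.group("text").strip(),  # human-readable reason
            })
    return faults


# Two lines copied from the log in this post.
SAMPLE = (
    "[ 351.621348] DMAR: DRHD: handling fault status reg 202\n"
    "[ 351.621351] DMAR: [DMA Write] Request device [18:00.0] fault addr 8f139000 "
    "[fault reason 05] PTE Write access is not set\n"
)

if __name__ == "__main__":
    for fault in parse_dmar_faults(SAMPLE):
        print(fault["bdf"], fault["op"], hex(fault["addr"]), fault["text"])
```

The `bdf` value (18:00.0 here) is the device to look up with lspci when checking which GPU or bridge issued the blocked write.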

The solution described in Ref1 and Ref2 is:
→ IOMMU is now enabled by default, and it’s not as “defaultable” as initially hoped.
→ Set the kernel parameter “intel_iommu=off”.

But I currently have a requirement that the IOMMU must stay on.
Is there any other way to get rid of the “DMAR fault” without turning the IOMMU off?

Platform = Intel
Environment = Ubuntu 16.04
Nvidia Driver = 410.78
Cuda Version = 10.0
GPU card1: GTX 1070 *1
GPU card2: GTX 1070 *1
GPU card3: GTX 1050 *1
Docker image = tensorflow: 1.13.1-gpu-py3

generix · May 2, 2019, 6:58am

Try disabling access control on the pcie ports:
https://devtalk.nvidia.com/default/topic/883054/cuda-programming-and-performance/multi-gpu-peer-to-peer-access-failing-on-tesla-k80-/1

hiyatsai · October 21, 2019, 3:10am

Hi generix,

I read this discussion: https://devtalk.nvidia.com/default/topic/883054/cuda-programming-and-performance/multi-gpu-peer-to-peer-access-failing-on-tesla-k80-/1

The conclusion there is to “turn off” the ACS CAPID attribute.

But on our platform the ACS CAPID attribute is read-only, so it cannot be turned off.

Is there any other suggested solution?

Thanks a lot!

generix · October 21, 2019, 9:41am

ACSCap is of course read-only since those are the capabilities of the slot. ACSCtl (control) has to be turned off.

hiyatsai · October 25, 2019, 2:43am

Hi generix,
First of all, thank you for your kind response.

Based on your information, I asked the vendor and got the following response:

Hi, after internal discussion,
the CPU- SKX is designed to have ACS with extended CAP_ID= 0x0D meaning when the system PO,
the ACS capability is there already without BIOS programming. As for the PCH PCIe RP, the attribute is RWO (read/write one).
As a result, you could access ACSCTRL: Access Control Services Control Register to turn on/off the features based on your need instead of removing this CAPID from extend capability list.
Please let us know if any. Thanks

The conclusion I drew from this reply is:
"ACS can be turned off through ACSCtl on the PCH root port, but not on the CPU root port."

Are there any other solutions available?
Thank you!

generix · October 25, 2019, 10:11am

I think you’re misinterpreting the answer. It says exactly the same as I did: the ACS capability of the root port is read-only, so you’ll have to write to the ACS control register instead. You’ve posed the wrong question to the vendor.

The capability register is telling what it can do.
The control register is controlling/telling what it actually does.

You can see from this post
https://devtalk.nvidia.com/default/topic/883054/cuda-programming-and-performance/multi-gpu-peer-to-peer-access-failing-on-tesla-k80-/post/4766413/#4766413
that the ACSCtrl register is manipulated.
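
To make the capability/control distinction concrete, here is a minimal sketch that reads the `ACSCap:` and `ACSCtl:` lines as printed by `lspci -vvv` and reports which ACS features are currently switched on. The helper names and the `SAMPLE` text are hypothetical and hand-written in lspci's `Name+`/`Name-` flag style; they are not output from the poster's machine.

```python
import re


def parse_flags(line, label):
    """Parse 'LABEL: Name+ Name- ...' into {name: bool}; {} if label is absent."""
    prefix = label + ":"
    idx = line.find(prefix)
    if idx < 0:
        return {}
    flags = {}
    for token in line[idx + len(prefix):].split():
        m = re.fullmatch(r"(\w+)([+-])", token)
        if m:
            flags[m.group(1)] = (m.group(2) == "+")
    return flags


def acs_summary(lspci_text):
    """Return (supported, enabled) ACS feature names for one device's lspci text."""
    supported, enabled = [], []
    for line in lspci_text.splitlines():
        # ACSCap: what the port is able to do (read-only).
        supported += [n for n, on in parse_flags(line, "ACSCap").items() if on]
        # ACSCtl: what the port is actually doing right now (writable).
        enabled += [n for n, on in parse_flags(line, "ACSCtl").items() if on]
    return supported, enabled


# Hand-written illustrative sample in lspci's flag format (an assumption,
# not taken from the system discussed in this thread).
SAMPLE = (
    "ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-\n"
    "ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-\n"
)

if __name__ == "__main__":
    supported, enabled = acs_summary(SAMPLE)
    print("supported:", supported)
    print("enabled:  ", enabled)
```

An empty `enabled` list would mean ACS is off on that port even though `supported` is not empty. Clearing the control bits is usually done with setpci as root; NVIDIA's NCCL troubleshooting guide gives `setpci -s <BDF> ECAP_ACS+0x6.w=0000` for this, but check it against your own platform before relying on it.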
