DriveAGX Thor: ptp4l fails on mgbe3_0 when DriveWorks sample_camera starts streaming video

DRIVE OS Version: 7.0.3

Issue Description:

After power-on and reboot the system starts without issue, and the mgbe3_0 interface acquires time sync from an external gPTP master device connected to the DriveAGX Thor unit.

Monitoring ptp4l shows normal behavior: the port stays in SLAVE state and synchronizes to the external master without issue.

Subsequently, launching the DriveWorks sample_camera with a simple rig file that enables a single camera causes the log output of nv_ptp4l_slave_mgbe3_0.service to show a failure: the port transitions to FAULTY state and synchronization stops.

Mar 24 20:51:52 tegra-ubuntu ptp4l[2460]: [376.257] failed to step clock: No space left on device
Mar 24 20:51:52 tegra-ubuntu ptp4l[2460]: [376.507] failed to step clock: No space left on device
Mar 24 20:51:52 tegra-ubuntu ptp4l[2460]: [376.616] missing timestamp on transmitted peer delay request
Mar 24 20:51:52 tegra-ubuntu ptp4l[2460]: [376.616] port 1 (mgbe3_0): SLAVE to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)

Shortly afterwards recovery is attempted:

Mar 24 20:52:08 tegra-ubuntu ptp4l[2460]: [392.908] port 1 (mgbe3_0): FAULTY to SLAVE on INIT_COMPLETE
Mar 24 20:52:09 tegra-ubuntu ptp4l[2460]: [393.121] failed to step clock: No space left on device

Subsequently it fails again:

Mar 24 20:52:09 tegra-ubuntu ptp4l[2460]: [393.373] failed to step clock: No space left on device
Mar 24 20:52:09 tegra-ubuntu ptp4l[2460]: [393.626] failed to step clock: No space left on device
Mar 24 20:52:09 tegra-ubuntu ptp4l[2460]: [393.752] rms 2905807845 max 2905808269 freq +12335 +/-  69
Mar 24 20:52:09 tegra-ubuntu ptp4l[2460]: [393.879] failed to step clock: No space left on device
Mar 24 20:52:09 tegra-ubuntu ptp4l[2460]: [393.972] missing timestamp on transmitted peer delay request
Mar 24 20:52:09 tegra-ubuntu ptp4l[2460]: [393.972] port 1 (mgbe3_0): SLAVE to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)

This cycle repeats indefinitely, and continues even after sample_camera is stopped. The only recovery is to shut down, power off, and restart the system.

Is this a known issue with 7.0.3?

I do not have an explanation for why the error message “No space left on device” occurs.

Inspecting the free space on mounted drives does not show any full filesystems, so I don’t really know what “device” that error might refer to. Why that error would be returned from an operation on an Ethernet device or its associated PHC is a mystery to me.

I suppose it could be related to this issue triggering ptp4l to try to slew or step the clock by some huge amount, as implied by the reported rms stats:

Mar 24 21:05:16 tegra-ubuntu ptp4l[2460]: [1177.048] missing timestamp on transmitted peer delay request
Mar 24 21:05:16 tegra-ubuntu ptp4l[2460]: [1177.048] port 1 (mgbe3_0): SLAVE to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)
Mar 24 21:05:32 tegra-ubuntu ptp4l[2460]: [1193.252] port 1 (mgbe3_0): FAULTY to SLAVE on INIT_COMPLETE
Mar 24 21:05:32 tegra-ubuntu ptp4l[2460]: [1193.307] rms 2905813907 max 2905815525 freq +12169 +/-  45
Mar 24 21:05:32 tegra-ubuntu ptp4l[2460]: [1193.420] failed to step clock: No space left on device
Mar 24 21:05:32 tegra-ubuntu ptp4l[2460]: [1193.645] failed to step clock: No space left on device
Mar 24 21:05:33 tegra-ubuntu ptp4l[2460]: [1193.871] failed to step clock: No space left on device
Mar 24 21:05:33 tegra-ubuntu ptp4l[2460]: [1194.095] failed to step clock: No space left on device
Mar 24 21:05:33 tegra-ubuntu ptp4l[2460]: [1194.252] missing timestamp on transmitted peer delay request
Mar 24 21:05:33 tegra-ubuntu ptp4l[2460]: [1194.252] port 1 (mgbe3_0): SLAVE to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)
Mar 24 21:05:49 tegra-ubuntu ptp4l[2460]: [1210.540] port 1 (mgbe3_0): FAULTY to SLAVE on INIT_COMPLETE
Mar 24 21:05:49 tegra-ubuntu ptp4l[2460]: [1210.634] failed to step clock: No space left on device
Mar 24 21:05:50 tegra-ubuntu ptp4l[2460]: [1210.859] failed to step clock: No space left on device
Mar 24 21:05:50 tegra-ubuntu ptp4l[2460]: [1211.084] failed to step clock: No space left on device
Mar 24 21:05:50 tegra-ubuntu ptp4l[2460]: [1211.197] rms 2905815503 max 2905815525 freq +12208 +/-  27
Mar 24 21:05:50 tegra-ubuntu ptp4l[2460]: [1211.309] failed to step clock: No space left on device
Mar 24 21:05:50 tegra-ubuntu ptp4l[2460]: [1211.534] failed to step clock: No space left on device
Mar 24 21:05:50 tegra-ubuntu ptp4l[2460]: [1211.612] missing timestamp on transmitted peer delay request
Mar 24 21:05:50 tegra-ubuntu ptp4l[2460]: [1211.612] port 1 (mgbe3_0): SLAVE to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)
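The rms figures ptp4l reports are in nanoseconds, so the values in these logs work out to an offset of roughly 2.9 seconds, far too large for ptp4l to slew, which is consistent with it repeatedly attempting a clock step. A quick conversion (sketch):

```python
# ptp4l reports offset statistics in nanoseconds; the rms values in the
# log above (~2.9e9 ns) amount to an offset of about 2.9 s, which ptp4l
# will try to correct with a clock step rather than a frequency slew.
rms_ns = 2905815503  # value taken from the log above
offset_s = rms_ns / 1e9
print(offset_s)  # about 2.906 seconds
```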

Please suggest how to resolve or troubleshoot this issue.

And to add some additional information:

If I run nvsipl_camera headless (i.e. omit -d 1) to stream from that very same camera, the symptom does not occur.

If I run sample_camera headless with --offscreen=2, the symptom still occurs.

I also note that when streaming is started with sample_camera, the console logs an SMMU fault and then outputs a long series of gibberish lines:

��[  312.064741] [TS:319910265234] cdi-mgr sipl_devblk_12: cdi_mgr_gpio_config: unknown gpio idx 7
[  343.179727] [TS:351025251319] cdi-mgr sipl_devblk_12: cdi_mgr_gpio_config: unknown gpio idx 7
[  419.675205] [TS:427520729146] cdi-mgr sipl_devblk_12: cdi_mgr_gpio_config: unknown gpio idx 7
[  504.089346] [TS:511934869491] cdi-mgr sipl_devblk_12: cdi_mgr_gpio_config: unknown gpio idx 7
��!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!! SMMU FAULT OCURRED !!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Event queue Interrupt triggered on Smmu Instance: 0x0
Translation Fault occurred:
StreamId: 0x2e01
Write access
Stage 2 fault occurred
Failed to fetch the Input Address
Input Address is: 0xffbec20000
IPA: 0xffbec20000
STAG: 0x0
Stall: 0x0
Data
Unprivileged
NSIPA: 0x0

Tegra Report Error
Reporter ID: : 0x8287
Error code: : 0x20002e01
Error attribute: : 0x0
Error report timestamp: : 0x7761ca2f31
���MC Fault - EMEM address decode error
Error Status: 0x100100bd
Client ID: 0xbd
Clien��fata��t Na��l er��me :��ror �� csw��cpu:��_vif��0 0
[… many further lines scrambled by the interleaved console mux omitted; the readable fragments repeat the SMMU fault banner, the translation-fault report for StreamId 0x2e01, the MC fault client name (csw_vifalconw), and per-CPU “fatal error cpu:N 0” lines …]

The gibberish stops when sample_camera is stopped.

But then the console is constantly logging this:

[… more scrambled interleaved console output omitted; the readable fragments show repeated “non fatal error cpu:N 1” lines and the nvethernet IVC write failures that follow …]
[  506.721970] [TS:514567493631] nvethernet a808b10000.ethernet: IVC write with len 1824 ret -28 cmd 5 ioctlcmd 42 failed
[  506.817972] [TS:514663496106] nvethernet a808b10000.ethernet: IVC write with len 1824 ret -28 cmd 5 ioctlcmd 42 failed
[  506.833968] [TS:514679491550] nvethernet a808b10000.ethernet: IVC write with len 1824 ret -28 cmd 5 ioctlcmd 42 failed
[… roughly twenty more identical “IVC write with len 1824 ret -28 cmd 5 ioctlcmd 42 failed” lines omitted …]
[  507.229979] [TS:515075502787] nvethernet a808b10000.ethernet: Failed to get TSC Struct info from registers
[  507.289969] [TS:515135492497] nvethernet a808b10000.ethernet: IVC write with len 1824 ret -28 cmd 5 ioctlcmd 42 failed
[… further identical IVC write failure lines omitted …]
[  507.402024] [TS:515247547459] nvethernet a808b10000.ethernet: IVC write with len 1824 ret -28 cmd 5 ioctlcmd 18 failed
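One observation on these logs: the ret -28 in the IVC write failures is -ENOSPC, the same errno whose strerror() text is exactly what ptp4l reported as “failed to step clock: No space left on device” (ptp4l prints the error string of the failing clock call). This suggests the ptp4l error is propagated up from these failing nvethernet IVC ioctls, not from any full filesystem. A minimal check:

```python
import errno
import os

# The nvethernet IVC writes fail with ret -28. Kernel drivers return
# negative errno values; -28 is -ENOSPC, whose strerror() text is exactly
# the message ptp4l reported: "No space left on device".
ret = -28
name = errno.errorcode[-ret]
text = os.strerror(-ret)
print(name, text)  # ENOSPC No space left on device
```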

What is the observation with --offscreen=1?

Is the external gPTP master device connected at the mgbe3_0 interface? Could you share the ptp4l and sample_camera commands used, and complete logs of your observations?


The symptom is the same.

So essentially the SMMU fault and the gibberish on the console occur if I use any DriveWorks-based tool to start the image sensor.

Yes. As per the Platform Time Synchronization section of the NVIDIA DriveOS 7.0.3 Linux SDK Developer Guide, for P3960 TS3 the gPTP master is upstream of R_SWITCH-1:P1 on J5:P3.

For what ptp4l? The ptp4l running on the DriveAGX is the DriveOS 7.0-provided systemd service unit.

To be clear, this is not a problem with ptp4l. The symptom observed with ptp4l is “collateral damage” caused by activating an image sensor with DriveWorks. I note it because some activity in the system triggers the reported SMMU fault, which cascades into a failure that impacts the mgbe3 Ethernet device.

When running nvsipl_camera which DOES NOT cause the symptom the command is:
nvsipl_camera -c V1SIM728MPRU4120ND1_5Des_CPHY_x4 -m "0 0 0 0x1000 0" -s

When running sample_camera which DOES cause the symptom the command is:
/usr/local/driveworks/bin/sample_camera --rig=valeo-imx728-x1.json --offscreen=2

And the rig file is:

{
  "rig": {
    "sensors": [
      {
        "name": "camera:cam2:link3",
        "protocol": "camera.gmsl",
        "parameter": "camera-name=V1SIM728MPRU4120ND1,interface=csi-ef,link=3,deserializer=MAX96712_Fusa_nv,CPHY-mode=1,disable-custintf=1,skip-eeprom=1,output-format=processed+raw,async-record=1,format=h264,fifo-size=4,file-buffer-size=4194304",
        "nominalSensor2Rig_FLU": {
          "roll-pitch-yaw": [
            0,
            0,
            0
          ],
          "t": [
            0,
            0,
            0
          ]
        }
      }
    ],
    "vehicle": {
      "valid": false
    }
  },
  "version": 2
}
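As an aside, the long `parameter` value in the rig file is a flat comma-separated key=value list; a small sketch (a hypothetical helper, not a DriveWorks API) makes it easy to diff such configurations:

```python
# Hypothetical helper (not part of DriveWorks): split the rig "parameter"
# string into a dict so camera configurations can be compared easily.
def parse_camera_params(param: str) -> dict:
    return dict(kv.split("=", 1) for kv in param.split(",") if kv)

params = parse_camera_params(
    "camera-name=V1SIM728MPRU4120ND1,interface=csi-ef,link=3,"
    "deserializer=MAX96712_Fusa_nv,CPHY-mode=1,output-format=processed+raw"
)
print(params["output-format"])  # -> 'processed+raw'
```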

What exactly beyond what I have posted above would you like? What constitutes “complete logs”?

Here is my best shot at capturing all of the outputs related to symptoms for this issue.

2026-04-25-sample_cam-smmu_fault-ptp4l_fail-test_logs.zip (232.3 KB)

It seems you can run tcu_muxer_raw_log_dump.py on the raw console UART log captured in the file tio-thor.txt from the attached archive to get the individual outputs in a less scrambled form.

I hope that the log output is clear enough for you to identify the issue and what causes the SMMU fault.

@SivaRamaKrishnaNV have I provided you the information you require?

If not, please restate anything more you require.

Thanks.

Dear @david.cattley ,
So there is no issue with the DW camera sample when running standalone, but the SMMU fault gets triggered only when a gPTP master is connected at the mgbe3_0 interface? Is my understanding correct?

That is not how I want you to interpret it (i.e. assigning some sort of importance to the fact that gPTP on mgbe3_0 has some “causal” role).

The gPTP instance on mgbe3_0 failing is just a collateral symptom.

I just went back and did the following:

  1. Cold-boot the system
  2. Stop all ptp4l and phc2sys services
  3. Run a simple test with sample_camera

The outcome is still that the SMMU fault occurs:

!!!
!!! SMMU FAULT OCURRED !!!
!!!
Event queue Interrupt triggered on Smmu Instance: 0x0
Translation Fault occurred:
StreamId: 0x2e01
Write access
Stage 2 fault occurred
Failed to fetch the Input Address
Input Address is: 0xffbec20000
IPA: 0xffbec20000
STAG: 0x0
Stall: 0x0
Data
Unprivileged
NSIPA: 0x0

Tegra Report Error
Reporter ID: : 0x8287
Error code: : 0x20002e01
Error attribute: : 0x0
Error report timestamp: : 0x41e9cc7980
MC Fault - EMEM address decode error
Error Status: 0x100100bd
Client ID: 0xbd
Client Name : csw_vifalconw
Client SwGroup : vifalcon
Secure : No
Access-Type : Write
Lower Address: 0xfffff000
Higher Address: 0xffff
!!!
!!! SMMU FAULT OCURRED !!!
!!!
Event queue Interrupt triggered on Smmu Instance: 0x0
Translation Fault occurred:
StreamId: 0x2e01
Write access
Stage 2 fault occurred
Failed to fetch the Input Address
Input Address is: 0xffbec20000
IPA: 0xffbec20000
STAG: 0x0
Stall: 0x0
Data
Unprivileged
NSIPA: 0x0

Tegra Report Error
Reporter ID: : 0x8287
Error code: : 0x20002e01
Error attribute: : 0x0
Error report timestamp: : 0x41f36a5e0b
Translation Fault occurred:
StreamId: 0x2e01
Write access
Stage 2 fault occurred
Failed to fetch the Input Address
Input Address is: 0xffbec20020
IPA: 0xffbec20000
STAG: 0x0
Stall: 0x0
Data
Unprivileged
NSIPA: 0x0

Tegra Report Error
Reporter ID: : 0x8287
Error code: : 0x20002e01
Error attribute: : 0x0
Error report timestamp: : 0x41f90e7700
Translation Fault occurred:
StreamId: 0x2e01
Write access
Stage 2 fault occurred
Failed to fetch the Input Address
Input Address is: 0xffbec20020
IPA: 0xffbec20000
STAG: 0x0
Stall: 0x0
Data
Unprivileged
NSIPA: 0x0


The fault occurring has nothing to do with “running” ptp4l.

So let me restate this linearly.

If I stream a camera with nvsipl_camera (i.e. just NvSIPL and YUV to display), there is no SMMU fault and no symptom is observed impacting ptp4l on mgbe3_0.

If I stream a camera with sample_camera or recorder (i.e. DriveWorks, which has a more complex pipeline, likely including an encoder, etc.), the SMMU fault is always triggered. After the SMMU fault is triggered, ptp4l on mgbe3_0 shows the collateral failure.

So ignore ptp4l and mgbe3_0 for now and explain the SMMU fault. Then consider whether the system's reaction to the SMMU fault might also include something that breaks the IOMMU mappings supporting mgbe3_0 in some way.

Dear @david.cattley ,
Thank you for the details. So the issue is with using the camera with DW.
Could you check
camera-name=V1SIM728MPRU4120ND1,interface=csi-ef,CPHY-mode=1,link=3,async-record=1,file-buffer-size=16777216,deserializer=MAX96712_Fusa_nv,disable-custintf=1,output-format=raw+processed,format=mp4,skip-eeprom=1,nito-file=/path/to/V1SIM728MPRU4120ND1.nito ?

Assuming the camera is connected to link 3 on the second port from the top.
Try changing links and ports (updating interface and link accordingly) to see if it works. Use --offscreen=1 to see if it works.

Try reflashing the target to see if it helps fix any system-state or SW issues.

@SivaRamaKrishnaNV

The TRM does not document enough about the address space or how it is managed for me to independently (without your available internal information) decode the SMMU fault and identify the engine/initiator that triggered the fault or the logical target of the access. I’m assuming it is something in the register space of some SoC function.

Sure, I can go on a snipe hunt varying parameters to camera.gmsl, but in the end you have a very reproducible fault that should not be happening. Understanding it would go a long way toward guiding this ongoing diagnosis.

Even some feedback on what you “suspect” might be happening would be helpful in guiding further variations in testing.

Ok, based on rooting around in the device tree sources, I only find “hits” for key information like the StreamID in the tegra264 QNX DTSI files. Nonetheless, this is rather interesting.

		iommu@810a000000 {
			smmu-instance-id = <0>;
			domains {
				mgbe0_domain0: mgbe0_domain0 {
					address-space = <&nios0_mgbe0_as0>;
					sid-list = <TEGRA_SID_MGBE0_VF0>;
				};
				mgbe1_domain0: mgbe1_domain0 {
					address-space = <&nios0_mgbe1_as0>;
					sid-list = <TEGRA_SID_MGBE1_VF0>;
				};
				mgbe2_domain0: mgbe2_domain0 {
					address-space = <&nios0_mgbe2_as0>;
					sid-list = <TEGRA_SID_MGBE2_VF0>;
				};
				mgbe3_domain0: mgbe3_domain0 {
					address-space = <&nios0_mgbe3_as0>;
					sid-list = <TEGRA_SID_MGBE3_VF0>;
				};
				mgbe0_domain1: mgbe0_domain1 {
					address-space = <&nios0_mgbe0_as1>;
					sid-list = <TEGRA_SID_MGBE0_VF1>;
				};
				mgbe1_domain1: mgbe1_domain1 {
					address-space = <&nios0_mgbe1_as1>;
					sid-list = <TEGRA_SID_MGBE1_VF1>;
				};
#ifndef NDAS_STORAGE_CONFIG
				mgbe2_domain1: mgbe2_domain1 {
					address-space = <&nios0_mgbe2_as1>;
					sid-list = <TEGRA_SID_MGBE2_VF1>;
				};
#endif
				mgbe3_domain1: mgbe3_domain1 {
					address-space = <&nios0_mgbe3_as1>;
					sid-list = <TEGRA_SID_MGBE3_VF1>;
				};
				nvethmgr_domain0: nvethmgr_domain0 {
					address-space = <&nios0_nvethmgr_as>;
					sid-list = <TEGRA_SID_MGBE0_VF4
							TEGRA_SID_MGBE1_VF4
							TEGRA_SID_MGBE2_VF4
							TEGRA_SID_MGBE3_VF4>;
				};
				vi0_iso_smmu_domain: vi0_smmu_domain {
					address-space = <&iso_vi_as>;
					sid-list = <0x2E01>;
				};
				vi1_iso_smmu_domain: vi1_smmu_domain {
					address-space = <&iso_vi_as>;
					sid-list = <0x2F01>;
				};
			};
			address-space-prop {
				nios0_mgbe0_as0: nios0_mgbe0_0 {
					iova-start = <0x0 0xE3000000>;
					iova-size = <0x0 0x3000000>;
				};
				nios0_mgbe1_as0: nios0_mgbe1_0 {
					iova-start = <0x0 0xE6000000>;
					iova-size = <0x0 0x4000000>;
				};
				nios0_mgbe2_as0: nios0_mgbe2_0 {
					iova-start = <0x0 0xEA000000>;
					iova-size = <0x0 0x3000000>;
				};
				nios0_mgbe3_as0: nios0_mgbe3_0 {
					iova-start = <0x0 0xED000000>;
					iova-size = <0x0 0x3000000>;
				};
				nios0_mgbe0_as1: nios0_mgbe0_1 {
					iova-start = <0x0 0xF2800000>;
					iova-size = <0x0 0x1400000>;
				};
				nios0_mgbe1_as1: nios0_mgbe1_1 {
					iova-start = <0x0 0xF3C00000>;
					iova-size = <0x0 0x1400000>;
				};
#ifndef NDAS_STORAGE_CONFIG
				nios0_mgbe2_as1: nios0_mgbe2_1 {
					iova-start = <0x0 0xF1400000>;
					iova-size = <0x0 0x1400000>;
				};
#endif
				nios0_mgbe3_as1: nios0_mgbe3_1 {
					iova-start = <0x0 0xF5000000>;
					iova-size = <0x0 0x1400000>;
				};
				nios0_nvethmgr_as: nios0_nvethmgr {
					iova-start = <0x0 0xF6400000>;
					iova-size = <0x0 0x9C00000>;
				};

				iso_vi_as: iso_vi {
					iova-start = <0x0 0x80000000>;
					iova-size = <0x4 0x3FFFFFFF>;
				};
			};
		};

SMMU instance 0 is managing both the offending initiator with SID 0x2E01:

				vi0_iso_smmu_domain: vi0_smmu_domain {
					address-space = <&iso_vi_as>;
					sid-list = <0x2E01>;
				};

and the mgbe3 domains.
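If I read the address-space cells correctly (assuming each `<high low>` cell pair forms a 64-bit value), the faulting IPA 0xffbec20000 lies well outside the iso_vi_as range, which would be consistent with the reported stage-2 translation fault on SID 0x2E01. A quick check, under that cell-interpretation assumption:

```python
# Sketch, assuming the DTSI <high low> cell pairs form 64-bit values.
# iso_vi_as from the QNX DTSI above:
start = (0x0 << 32) | 0x80000000   # iova-start
size  = (0x4 << 32) | 0x3FFFFFFF   # iova-size, roughly 17 GiB
end   = start + size               # 0x4bfffffff

fault_ipa = 0xFFBEC20000           # IPA from the SMMU fault report

# The faulting address is far beyond the VI ISO address space, which
# matches an access to an unmapped IPA faulting at stage 2.
in_range = start <= fault_ipa < end
print(hex(end), in_range)  # -> 0x4bfffffff False
```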

No idea what the actual function is behind the vi0 domain, but this node gives a clue that it has something to do with NVCAP:

		nvcap: devm_nvcap {
			cmd = "iolauncher --prewait /dev/nvmap --prewait /dev/nvhost-ctrl --prewait /dev/nvhost-syncpoint --priority 23 -U 2230:2230,2000,2110,2100,2120,10100,10140,3000,3450,45112,2200 -A nonroot,deny,fork,spawn -A nonroot,allow,interruptevent,public_channel,pathspace -A nonroot,allow,NvMap/Interfaces:1-7 --dt /bus@0/tegra-hsp@8189100000 --dt /rtcpu@81893d0000 -A nonroot,allow,able=SMMU/SID:0x2E01 -A nonroot,allow,able=SMMU/SID:0x2F01 -A nonroot,allow,able=SMMU4/SID:0x0C01 -A nonroot,allow,able=SMMU4/SID:0x0D01 -A nonroot,allow,able=SMMU4/SID:0x2B01 -A nonroot,allow,able_create devm-nvcap";
			sc7 = "callback";
			heartbeat = "no";
			oneshot = "no";
			register_dvms;
		};

Which I can only assume is related somehow to “NVMedia Capture” of some sort.

Of course these clues are from the QNX DTSI, and I don’t really find much on the Linux-specific side, but I presume the SID and IOMMU associations would not change based on OS.
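As a sanity check, the SMMU SIDs granted to that NVCAP service can be pulled out of the iolauncher command line with a quick regex sketch (operating on the quoted string above), confirming 0x2E01 is among them:

```python
import re

# Relevant fragment of the iolauncher "cmd" string from the devm_nvcap node.
cmd = ("-A nonroot,allow,able=SMMU/SID:0x2E01 -A nonroot,allow,able=SMMU/SID:0x2F01 "
       "-A nonroot,allow,able=SMMU4/SID:0x0C01 -A nonroot,allow,able=SMMU4/SID:0x0D01 "
       "-A nonroot,allow,able=SMMU4/SID:0x2B01")

# Extract every SID the service is allowed to use.
sids = re.findall(r"SID:(0x[0-9A-Fa-f]+)", cmd)
print(sids)  # -> ['0x2E01', '0x2F01', '0x0C01', '0x0D01', '0x2B01']
```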

Same result. SMMU fault with same signature.

Same result. SMMU fault with same signature.

@SivaRamaKrishnaNV

Ok, I figured out that vi0 is one of the MIPI-CSI input blocks in the SoC, with vi1 being the other.

The observation that the SMMU fault does not occur with nvsipl_camera but does occur when using DriveWorks has me leaning towards a DW-specific behavior in how the NvSci buffers/sync/streams are set up, as compared to the much simpler pipeline in nvsipl_camera.

@SivaRamaKrishnaNV

I am able to reproduce the symptom in exactly the same way on another Drive AGX Thor unit.

I reflashed the target and the symptom persists.

Moving to a different port (CAM1) with a different deserializer (MAX96724) made no difference. The symptom persists.

Interestingly, a colleague with an additional and different camera module, an Aumovio IMX728 (C2SIM728S3RU3030NC1), tested the scenario and found that the Valeo module (V1SIM728MPRU4120ND1) failed as it does for me, yet the Aumovio module did not show the symptom. This was tested on multiple ports as well.

We continue to explore more widely to try to isolate other possible factors, but this is quickly growing into a large scope.

The NV team has yet to comment on the SMMU fault root-cause analysis and explain the connection with the collateral impact on the MGBE.

As previously noted, this configuration causes the symptom (SMMU fault, MGBE ptp4l failure).

However, my colleague has discovered that one change in the parameter string makes the symptom disappear.

Changing output-format=raw+processed
to output-format=processed

results in the SMMU fault no longer being triggered.

The full parameter string which results in no SMMU Fault symptom with sample_camera is thus:

camera-name=V1SIM728MPRU4120ND1,interface=csi-ef,CPHY-mode=1,link=3,async-record=1,file-buffer-size=16777216,deserializer=MAX96712_Fusa_nv,disable-custintf=1,output-format=processed,format=mp4,skip-eeprom=1,nito-file=/usr/share/camera/V1SIM728MPRU4120ND1.nito

Note that our target goal is to use recorder to capture LRAW-format output, so we do need output-format=raw+processed to function correctly.

This observation suggests (to me, anyway) that the issue lies in how DW camera support sets up buffer handling at the output of VI so that the RAW image stream can be shared as input to the ISP (for processed output) and accessed directly (for raw output).