I get mc-err on Jetson Xavier NX

HelloNewJAPAN · November 30, 2020, 8:13am

Hi,

My English isn’t so good so feel free to ask me if there is anything unclear.

I get mc-err on Jetson Xavier NX.

I get the following error in syslog when running with TensorRT + DLA + FP16.
The program runs to completion without stopping, even though the error is logged.

jetson kernel: [ 1498.949690] __arm_smmu_context_fault: 5952 callbacks suppressed
jetson kernel: [ 1498.949714] t19x-arm-smmu 12000000.iommu: Unhandled context fault: smmu0, iova=0x1fc7200000, fsynr=0x200012, cb=16, sid=81(0x51 - NVDLA0), pgd=275aaa003, pud=275aaa003, pmd=0, pte=0
jetson kernel: [ 1498.958813] t19x-arm-smmu 12000000.iommu: Unhandled context fault: smmu1, iova=0x1fc7201000, fsynr=0x100012, cb=16, sid=81(0x51 - NVDLA0), pgd=275aaa003, pud=275aaa003, pmd=0, pte=0
jetson kernel: [ 1498.963648] t19x-arm-smmu 12000000.iommu: Unhandled context fault: smmu0, iova=0x1fc769fd80, fsynr=0x200012, cb=16, sid=81(0x51 - NVDLA0), pgd=275aaa003, pud=275aaa003, pmd=0, pte=0
jetson kernel: [ 1498.968510] t19x-arm-smmu 12000000.iommu: Unhandled context fault: smmu1, iova=0x1fc7903000, fsynr=0x100012, cb=16, sid=81(0x51 - NVDLA0), pgd=275aaa003, pud=275aaa003, pmd=0, pte=0
jetson kernel: [ 1498.973440] t19x-arm-smmu 12000000.iommu: Unhandled context fault: smmu0, iova=0x1fc7b84840, fsynr=0x200012, cb=16, sid=81(0x51 - NVDLA0), pgd=275aaa003, pud=275aaa003, pmd=0, pte=0
jetson kernel: [ 1498.978228] t19x-arm-smmu 12000000.iommu: Unhandled context fault: smmu1, iova=0x1fc7df3000, fsynr=0x100012, cb=16, sid=81(0x51 - NVDLA0), pgd=275aaa003, pud=275aaa003, pmd=0, pte=0
jetson kernel: [ 1498.983176] mc-err: vpr base=0:c6000000, size=20, ctrl=3, override:(a01a8340, fcee10c1, 1, 0)
jetson kernel: [ 1498.987412] mc-err: (255) csw_dla0wra: MC request violates VPR requirements
jetson kernel: [ 1498.991446] mc-err:   status = 0x0ff740c0; addr = 0xffffffff00; hi_adr_reg=008
jetson kernel: [ 1498.995095] mc-err:   secure: yes, access-type: write
jetson kernel: [ 1498.998695] mc-err: mcerr: unknown intr source intstatus = 0x00000000, intstatus_1 = 0x00000000

・Device : Jetson Xavier NX (Dev kit)
・For INT8 + DLA, no error is output.
・Only certain models will print mc-err.

What is the cause of this error?

Thank you in advance.

Regards,

AastaLLL · November 30, 2020, 9:31am

Hi,

Thanks for reporting this.
We are checking this internally and will share more information with you later.

AastaLLL · December 1, 2020, 4:18am

Hi,

The mc-err error indicates that the DLA is accessing memory that is either not mapped or that it is not allowed to access.
Please check whether you are managing the buffers correctly.
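To see which addresses the DLA faulted on, the "Unhandled context fault" lines in syslog can be parsed. The following is a minimal sketch, not an official tool: the function name `parse_fault` and the regular expression are illustrative, and treating `pte=0` as "no page mapped at that address" is an assumption based on the usual meaning of the page-table fields in these messages.

```python
import re

# Matches the "Unhandled context fault" lines printed by the t19x-arm-smmu driver,
# in the format shown in the logs of this thread.
FAULT_RE = re.compile(
    r"Unhandled context fault: (?P<smmu>smmu\d+), "
    r"iova=(?P<iova>0x[0-9a-f]+), .*?"
    r"sid=\d+\((?P<sid>0x[0-9a-f]+) - (?P<client>\w+)\), .*?"
    r"pte=(?P<pte>[0-9a-f]+)"
)


def parse_fault(line):
    """Return a dict describing one SMMU fault line, or None if the line does not match."""
    m = FAULT_RE.search(line)
    if m is None:
        return None
    return {
        "smmu": m.group("smmu"),
        "iova": int(m.group("iova"), 16),       # faulting I/O virtual address
        "client": m.group("client"),            # e.g. NVDLA0
        # Assumption: a zero page-table entry means nothing is mapped at this address.
        "unmapped": int(m.group("pte"), 16) == 0,
    }
```

Feeding each line of /var/log/syslog through `parse_fault` and collecting the non-`None` results gives the list of faulting addresses per client.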

Thanks.

HelloNewJAPAN · January 27, 2021, 5:17am

Hi,
I always appreciate your consideration.

We have an update on this issue and will report it.

I have confirmed that this issue occurs when running the concat layer on the DLA.

I have created a simple ONNX model for testing.
I will share this ONNX model.

concat_sigmoid.onnx (1.4 KB)

I get the following error in /var/log/syslog when running with TensorRT + DLA + FP16.

jetson kernel: [ 2504.063215] t19x-arm-smmu 12000000.iommu: Unhandled context fault: smmu0, iova=0x1ffebf9000, fsynr=0x200013, cb=16, sid=81(0x51 - NVDLA0), pgd=275a7b003, pud=275a7b003, pmd=216985003, pte=0
jetson kernel: [ 2504.063558] t19x-arm-smmu 12000000.iommu: Unhandled context fault: smmu1, iova=0x1ffebf8000, fsynr=0x100013, cb=16, sid=81(0x51 - NVDLA0), pgd=275a7b003, pud=275a7b003, pmd=216985003, pte=0
jetson kernel: [ 2504.064216] mc-err: vpr base=0:c6000000, size=20, ctrl=3, override:(a01a8340, fcee10c1, 1, 0)
jetson kernel: [ 2504.064433] mc-err: (255) csw_dla0wra: MC request violates VPR requirements
jetson kernel: [ 2504.064622] mc-err:   status = 0x0ff740c0; addr = 0xffffffff00; hi_adr_reg=008
jetson kernel: [ 2504.064776] mc-err:   secure: yes, access-type: write
jetson kernel: [ 2504.064881] mc-err: mcerr: unknown intr source intstatus = 0x00000000, intstatus_1 = 0x00000000

What kind of things do you think could be the cause?

As of this moment, we have confirmed the following.

: This only happens when running the concat layer in DLA.
: This issue does not occur when running with INT8.
: Sometimes the program stops at runtime and will not proceed.
: If you run it on a GPU, this error will not occur.

I have asked a few questions about other issues in a different topic from this one.

This is an issue that can be solved by simply running the concat layer on the GPU.
As such, it has a lower priority than the questions I have in other topics.
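One way to apply that workaround with the TensorRT Python API is sketched below, under stated assumptions: the helper name `pin_concat_layers_to_gpu` is hypothetical, `network` and `config` are an already-populated `INetworkDefinition` and `IBuilderConfig`, and the `tensorrt` module is passed in as `trt` so the sketch can be imported on machines without TensorRT. It is not taken from the poster's code.

```python
def pin_concat_layers_to_gpu(trt, network, config, dla_core=0):
    """Default every layer to the DLA in FP16, but force concatenation layers onto the GPU.

    trt     -- the imported `tensorrt` module
    network -- a parsed INetworkDefinition
    config  -- the IBuilderConfig used to build the engine
    Returns the names of the layers that were pinned to the GPU.
    """
    config.set_flag(trt.BuilderFlag.FP16)
    config.default_device_type = trt.DeviceType.DLA
    config.DLA_core = dla_core

    pinned = []
    for i in range(network.num_layers):
        layer = network.get_layer(i)
        if layer.type == trt.LayerType.CONCATENATION:
            # Per-layer override of the default device type.
            config.set_device_type(layer, trt.DeviceType.GPU)
            pinned.append(layer.name)
    return pinned
```

Call it after parsing the ONNX model and before building the engine, for example `pin_concat_layers_to_gpu(tensorrt, network, config)`.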

This is off topic, but I have asked many questions about DLA, and this is the last one I have at this time.

We are constantly indebted to you for your diligence and skill in handling these matters.

Best regards,

AastaLLL · February 4, 2021, 6:18am

Hi,

Can you reproduce this issue with trtexec directly?
Thanks for sharing the model. We are going to test this and get back to you later.

Thanks.

HelloNewJAPAN · February 8, 2021, 6:56am

Hi,
Yes, this issue is reproduced when running in trtexec.

All the logs I submitted are the output results when run with trtexec.
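The exact trtexec invocation is not given in this thread. As a rough sketch, a command for an FP16 run on DLA core 0 could be assembled as below. The helper name `trtexec_command`, the default binary path, and the choice of flags are assumptions for illustration; `--onnx`, `--useDLACore`, `--fp16`, and `--allowGPUFallback` are standard trtexec options.

```python
import shlex


def trtexec_command(onnx_path, dla_core=0, allow_gpu_fallback=False,
                    trtexec="/usr/src/tensorrt/bin/trtexec"):
    """Build the argument list for an FP16 trtexec run on one DLA core.

    The default binary path is an assumption about a typical JetPack install.
    """
    cmd = [trtexec, f"--onnx={onnx_path}", f"--useDLACore={dla_core}", "--fp16"]
    if allow_gpu_fallback:
        # Lets TensorRT place layers the DLA cannot run on the GPU instead.
        cmd.append("--allowGPUFallback")
    return cmd


if __name__ == "__main__":
    # Print a copy-pasteable command line for the shared test model.
    print(shlex.join(trtexec_command("concat_sigmoid.onnx")))
```

The returned list can be passed directly to `subprocess.run` on the Jetson, with /var/log/syslog watched in a second terminal.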

Thanks again for your help.

AastaLLL · February 23, 2021, 7:27am

Hi,

Thanks for your help.

We can reproduce this issue internally and are now checking with our internal team.
We will get back to you once we get feedback.

Thanks.

AastaLLL · March 22, 2021, 8:23am

Hi,

Thanks for your patience.

The error is caused by a limitation of the DLA in JetPack 4.5.
We don’t support the concat operation with input channels that are not multiples of 16 (for FP16) or 32 (for INT8).
Since your model has a channel size of 1, it causes an issue when accessing memory.
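The channel rule above can be checked before building an engine. This is a minimal sketch of that arithmetic only; the function and table names are hypothetical, and the caller is assumed to have already read the channel count of each concat input from the model.

```python
# Required channel multiple per precision, as stated for the DLA in JetPack 4.5.
REQUIRED_CHANNEL_MULTIPLE = {"fp16": 16, "int8": 32}


def misaligned_concat_inputs(input_channels, precision):
    """Return the channel counts that are not a multiple of the required value.

    input_channels -- channel count of each input to one concat layer
    precision      -- "fp16" or "int8"
    An empty result means the concat satisfies the rule.
    """
    multiple = REQUIRED_CHANNEL_MULTIPLE[precision]
    return [c for c in input_channels if c % multiple != 0]
```

For an input with a channel size of 1, `misaligned_concat_inputs([1], "fp16")` returns `[1]`, which flags the layer as one to keep off the DLA on this release.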

Support for this is added in our next DLA library.
We have also confirmed that no mc-err occurs when running your model on the next JetPack.

Please wait for our update for the new software.
Thanks.

HelloNewJAPAN · March 24, 2021, 7:48am

Hi,

We appreciate your help resolving the problem.

Thanks.