What does "nvgpu: sm machine check err. gpc_id(0), tpc_id(0), offset(0)" mean?

I am using a 64 GB Jetson AGX Orin with the following version:
Linux version 5.10.104-tegra (root@78acacd053b2) #1 SMP PREEMPT Mon Oct 16 12:40:33 UTC 2023 tztek_version: [product:geac91, hardware:510jx0_r2_0_64G, jetpack:jp5.1.1_ga, soft:v1.0.0_GT2, type:release]

When I run my camera V4L2 fetch process, it works fine at first, but then the kernel suddenly prints:
nvgpu: 17000000.ga10b nvgpu_gr_intr_handle_sm_exception:365 [ERR] sm machine check err. gpc_id(0), tpc_id(0), offset(0)
After that, the camera can't get frames anymore and VI prints timeout logs. Here is my error.log:
13_31_42_error.log (3.4 MB)
Could you please tell me what this log means?
Was the Orin chip in an abnormal state at that time?

It looks like NVCSI/VI was unable to capture and retried to recover.
Please get the trace log for more information.

sudo bash -c "echo 1 > /sys/kernel/debug/tracing/tracing_on"
sudo bash -c "echo 30720 > /sys/kernel/debug/tracing/buffer_size_kb"
sudo bash -c "echo 1 > /sys/kernel/debug/tracing/events/tegra_rtcpu/enable"
sudo bash -c "echo 1 > /sys/kernel/debug/tracing/events/freertos/enable"
sudo bash -c "echo 3 > /sys/kernel/debug/camrtc/log-level"
sudo bash -c "echo 1 > /sys/kernel/debug/tracing/events/camera_common/enable"
sudo bash -c "echo > /sys/kernel/debug/tracing/trace"
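
Once the issue has been reproduced with tracing enabled, the trace buffer can be saved to a file and attached here. A minimal sketch, assuming debugfs is mounted at /sys/kernel/debug as in the commands above; the helper name `save_trace` and the output file name are hypothetical:

```shell
# Hypothetical helper: copy the kernel trace buffer to a file so it can be attached.
# The default path assumes debugfs is mounted at /sys/kernel/debug.
save_trace() {
    trace_file="${1:-/sys/kernel/debug/tracing/trace}"
    out_file="${2:-camera_trace.log}"
    cat "$trace_file" > "$out_file"
}

# On the board, after the capture failure has occurred (reading the trace needs root):
#   sudo bash -c "cat /sys/kernel/debug/tracing/trace" > camera_trace.log
```

Saving the trace right after the failure reduces the chance that later events overwrite the relevant entries in the buffer.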

Hi ShaneCCC,
Thanks for the reply.
But "sm machine check err" was also reported on another of my Orin boards, where it caused an immediate reboot:

So what I actually want to know is what this nvgpu error report means. Does it indicate a GPU hardware problem?


The better way here is to provide a method to reproduce this issue.

Please try to reproduce your issue on an NV devkit and share with us how you did that. You can also cross-check other modules to see whether this is specific to one module only.

Hi WayneWWW,
It can't be reproduced on the NV devkit; it only happens on our Orin board, which we bought from a vendor.
BTW, I suspect it may be influenced by power. Is there any way to monitor the power status continuously?

You can use the tegrastats utility to monitor the system status: Tegrastats Utility — Jetson Linux Developer Guide documentation (nvidia.com)
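
For continuous power monitoring, one option is to log tegrastats to a file and then keep only the power-rail readings so they can be compared against the time of the nvgpu error. This is a rough sketch, not an official procedure: the helper name `power_rails` is hypothetical, and it assumes rails are printed as `<RAIL_NAME> <current>mW/<average>mW`, which may differ between modules and releases.

```shell
# Hypothetical helper: keep only the power-rail readings from tegrastats lines on stdin.
# Assumes each rail is printed as "<RAIL_NAME> <current>mW/<average>mW";
# rail names and the exact format may vary by module and JetPack release.
power_rails() {
    grep -oE '[A-Z][A-Z0-9_]* [0-9]+mW/[0-9]+mW'
}

# On the board (not run here): sample once per second into a log file,
# stop with Ctrl-C after the error appears, then filter the log:
#   sudo tegrastats --interval 1000 --logfile /tmp/tegrastats.log
#   power_rails < /tmp/tegrastats.log
```

Note that tegrastats only reports sampled rail power, so a short supply glitch between samples would probably not show up in this log.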

Thanks kayccc,
Could you also tell me in which scenarios the GPU exception interrupt is triggered? Or is there any document that describes this in detail?