Jetson Xavier NX GPU lib Report Error

Hi Wayne,

This is the full dts file. Please help me review it.
tegra194-p3668-all-p3509-0000.dts (305 KB)

Thanks

The way to disable display looks good.

Hi,

When the NX SOM is in our carrier board, we use the UART debug port. After the boot login prompt appears, the kernel crashes like this:

jetson-nx login: [ 105.411753] nvgpu: 17000000.gv11b gk20a_ptimer_isr:53 [ERR] FECS_ERRCODE 0xbadf1020
[ 105.412248] nvgpu: 17000000.gv11b gp10b_priv_ring_decode_error_code:83 [ERR] client timeout
[ 105.413060] nvgpu: 17000000.gv11b gk20a_ptimer_isr:64 [ERR] PRI timeout: ADR 0x00000000 WRITE DATA 0x00000000
[ 105.428090] nvgpu: 17000000.gv11b gk20a_ptimer_isr:64 [ERR] PRI timeout: ADR 0x00504728 READ DATA 0x00000000
[ 105.430658] nvgpu: 17000000.gv11b gk20a_ptimer_isr:53 [ERR] FECS_ERRCODE 0xbadf1020
[ 105.431700] nvgpu: 17000000.gv11b gp10b_priv_ring_decode_error_code:83 [ERR] client timeout
[ 105.432638] ------------[ cut here ]------------
[ 105.432683] kernel BUG at /home/wayclouds/work/jetson_nx_sdk/source/public/kernel/nvgpu/drivers/gpu/nvgpu/hal/fuse/fuse_gm20b_fusa.c:54!
[ 105.432717] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[ 105.432741] Modules linked in: lzo_rle lzo_compress zram
[ 105.433149] nvgpu: 17000000.gv11b gk20a_ptimer_isr:64 [ERR] PRI timeout: ADR 0x00000000 WRITE DATA 0x00000800
[ 105.433581] snd_soc_tegra210_ope snd_soc_tegra186_asrc snd_soc_tegra186_arad snd_soc_tegra210_iqc snd_soc_tegra186_dspk snd_soc_tegra210_mvc snd_soc_tegra210_admaif snd_soc_tegra_pcm snd_soc_tegra210_mixer snd_soc_tegra210_dmic ofpart snd_soc_tegra210_afc snd_soc_tegra210_adx snd_soc_tegra210_sfc cmdlinepart snd_soc_tegra210_amx snd_soc_tegra210_i2s qspi_mtd mtd aes_ce_blk crypto_simd cryptd aes_ce_cipher ghash_ce sha2_ce sha256_arm64 sha1_ce snd_soc_spdif_tx loop snd_soc_tegra_machine_driver pwm_fan leds_gpio max77620_thermal snd_soc_tegra210_adsp snd_soc_tegra_utils snd_soc_simple_card_utils tegra_bpmp_thermal snd_soc_tegra210_ahub nvadsp tegra210_adma userspace_alert spi_tegra210_qspi spi_tegra114 nvgpu binfmt_misc ina3221 ip_tables x_tables
[ 105.486089] CPU: 2 PID: 1383 Comm: Xorg Not tainted 5.10.65-tegra #1
[ 105.492313] Hardware name: Unknown NVIDIA Jetson Xavier NX Developer Kit/NVIDIA Jetson Xavier NX Developer Kit, BIOS r34.1-975eef6 05/16/2022
[ 105.505464] pstate: 60400009 (nZCv daif +PAN -UAO -TCO BTYPE=--)
[ 105.512682] pc : gm20b_fuse_status_opt_tpc_gpc+0x88/0xa0 [nvgpu]
[ 105.518904] lr : gm20b_fuse_status_opt_tpc_gpc+0x28/0xa0 [nvgpu]
[ 105.523242] sp : ffff80001266ba80
[ 105.526906] x29: ffff80001266ba80 x28: ffff5a930e0ca070
[ 105.532172] x27: ffff5a9306c84f08 x26: 0000000000000000
[ 105.537947] x25: ffffa1dc026b7800 x24: ffffa1dc026696e8
[ 105.543197] x23: 0000000000000006 x22: ffff5a93109a9900
[ 105.548953] x21: 0000000000000000 x20: ffff5a9315700000
[ 105.554037] x19: 0000000000000000 x18: 0000000000000000
[ 105.559642] x17: 0000000000000000 x16: ffffa1dc458f038c
[ 105.564908] x15: 0000000000000243 x14: 00000000000004ac
[ 105.570667] x13: 0000000000000003 x12: 0000000000000500
[ 105.575822] x11: 0000000000000001 x10: 0000000000020000
[ 105.581626] x9 : 0000000000040000 x8 : 0000000000080000
[ 105.586847] x7 : 0000000000000000 x6 : 00000000bad00100
[ 105.592631] x5 : 0000000000022430 x4 : ffffa1dc026d1f28
[ 105.597790] x3 : 000000000000004e x2 : ffffa1dc45a76d10
[ 105.603394] x1 : a55acc4f569e0c00 x0 : 0000000000000000
[ 105.608463] Call trace:
[ 105.612345] gm20b_fuse_status_opt_tpc_gpc+0x88/0xa0 [nvgpu]
[ 105.617956] gm20b_gr_config_get_gpc_tpc_mask+0x40/0x70 [nvgpu]
[ 105.624050] nvgpu_gr_fs_state_init+0x2f4/0x380 [nvgpu]
[ 105.629133] nvgpu_gr_obj_ctx_alloc_golden_ctx_image+0x2d8/0x5e0 [nvgpu]
[ 105.635762] nvgpu_gr_obj_ctx_alloc+0x2e4/0x4f0 [nvgpu]
[ 105.640599] nvgpu_gr_setup_alloc_obj_ctx+0x168/0x400 [nvgpu]
[ 105.646188] gk20a_channel_ioctl+0xe8c/0x1200 [nvgpu]
[ 105.650135] __arm64_sys_ioctl+0xa8/0xf0
[ 105.653662] el0_svc_common.constprop.0+0x7c/0x1c0
[ 105.658697] do_el0_svc+0x34/0xa0
[ 105.661856] el0_svc+0x1c/0x30
[ 105.664831] el0_sync_handler+0xa8/0xb0
[ 105.668842] el0_sync+0x16c/0x180
[ 105.672025] Code: 97ff3762 a94153f3 a8c27bfd d65f03c0 (d4210000)
[ 105.677981] ---[ end trace 5c5db81732599848 ]---
[ 105.682580] Kernel panic - not syncing: Oops - BUG: Fatal exception
[ 105.688810] SMP: stopping secondary CPUs
[ 105.693037] Kernel Offset: 0x21dc358c0000 from 0xffff800010000000
[ 105.699283] PHYS_OFFSET: 0xffffa56e00000000
[ 105.703397] CPU features: 0x8240002,03002a30
[ 105.707590] Memory Limit: none
[ 105.789573] ---[ end Kernel panic - not syncing: Oops - BUG: Fatal exception ]---

But the same SOM works normally in the dev kit, which confuses me.

Yes, that suggests this may still be a hardware design issue. Our team is checking whether we can give any hints.

Wayne, thanks for your help. If you have any suspicions about the direction, please let me know as soon as possible.

Thank you again

Hi Wayne,

We checked the board design and suspect a problem with the GPU voltage. As a next step, we want to test with a fixed frequency and a fixed voltage. How do we configure that?

Thanks

You can set max_freq and min_freq under the path below to the same value, and the GPU frequency will be fixed at that value.

/sys/class/devfreq/57000000.gpu

Sorry, the previous comment is for the Jetson Nano.

For the NX, it should be 17000000.gv11b under /sys/class/devfreq.
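
The suggestion above can be sketched in Python as follows. This is a minimal sketch, not an official tool: the helper names `available_freqs` and `pin_gpu_freq` are made up for illustration, and it assumes the node exposes the standard devfreq attributes `available_frequencies`, `min_freq` and `max_freq` (values in Hz). Writing to these files needs root.

```python
from pathlib import Path

# Node name taken from the reply above; adjust it if your board differs.
DEVFREQ_DIR = Path("/sys/class/devfreq/17000000.gv11b")


def available_freqs(devfreq_dir=DEVFREQ_DIR):
    """Return the frequency steps (in Hz) that devfreq advertises."""
    text = (Path(devfreq_dir) / "available_frequencies").read_text()
    return sorted(int(f) for f in text.split())


def pin_gpu_freq(freq_hz, devfreq_dir=DEVFREQ_DIR):
    """Write the same value to min_freq and max_freq to fix the GPU clock."""
    devfreq_dir = Path(devfreq_dir)
    if freq_hz not in available_freqs(devfreq_dir):
        raise ValueError(f"{freq_hz} Hz is not an advertised frequency step")
    current_min = int((devfreq_dir / "min_freq").read_text())
    # Order the two writes so that min_freq never exceeds max_freq in between
    # (a precaution; some kernels may accept either order).
    if freq_hz >= current_min:
        order = ("max_freq", "min_freq")
    else:
        order = ("min_freq", "max_freq")
    for name in order:
        (devfreq_dir / name).write_text(str(freq_hz))
    return freq_hz
```

For example, running `pin_gpu_freq(available_freqs()[0])` as root would lock the GPU to its lowest advertised step. Values written to sysfs are runtime settings, so they have to be reapplied after each boot.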

Thank you, we will try it.

Hi,
After rebooting the board, the modification is lost. Can you tell me another way to make sure the modification is saved in the firmware?

Hi,

You can try adding the lines below to your bpmp dtb. Convert the bpmp dtb back to dts and check the "regulators" section.
vdd-cpu has REGULATOR_RAIL_VDD_MCPU, REGULATOR_RAIL_VDD_BCPU and REGULATOR_RAIL_VDD_GPU

regulator-min-microvolt = <850000>;
regulator-max-microvolt = <900000>;

This will limit the power rail to the range 850 mV to 900 mV.
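
The dtb-to-dts round trip mentioned above can be done with `dtc`, the device tree compiler. The sketch below only wraps that call. The file names `bpmp.dtb`, `bpmp.dts` and `bpmp-fixed.dtb` are placeholders, not the real names in your L4T tree, and the helpers `dtc_cmd` and `convert` are hypothetical. It assumes `dtc` is installed on the host; `-I`, `-O` and `-o` are its input-format, output-format and output-file options.

```python
import subprocess
from pathlib import Path

# Map file suffixes to the format names that dtc expects for -I / -O.
FORMATS = {".dtb": "dtb", ".dts": "dts"}


def dtc_cmd(src, dst):
    """Build the dtc command line for converting src into dst."""
    src, dst = Path(src), Path(dst)
    return ["dtc", "-I", FORMATS[src.suffix], "-O", FORMATS[dst.suffix],
            "-o", str(dst), str(src)]


def convert(src, dst):
    """Run dtc and raise CalledProcessError if the conversion fails."""
    subprocess.run(dtc_cmd(src, dst), check=True)


# Typical round trip (placeholder file names):
#   convert("bpmp.dtb", "bpmp.dts")        # decompile
#   ... edit the regulator limits in bpmp.dts ...
#   convert("bpmp.dts", "bpmp-fixed.dtb")  # recompile
```

The rebuilt dtb still has to be flashed to the module before it takes effect. That step depends on the L4T release and is not shown here.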

Hi Wayne,

We found something strange about the crash. We capture the log from the UART debug port through a USB-to-UART adapter board connected to the PC. In that setup, the board sometimes boots successfully and we can log in. But if the adapter board is plugged into a USB hub, and the hub is connected to the PC, the system crashes at boot every time.

Thanks

Is that because of your change in the bpmp dtb?

No, we did not change the bpmp dtb.

Then this may be a separate issue, not related to this GPU error.

But in this case the kernel crash log is the same as for the GPU lib crash error.

It would be better to share the log directly. Descriptions alone are not very helpful.

The log is also the same as in the previous attachment.
mecu boot (63.6 KB)

Hi,
Please check whether the power supply (voltage and current) to the board is stable. From the description, it looks like it always crashes when the USB hub is present. The additional USB hub may draw some current, which probably triggers the instability.

Hi, we have solved the problem. It was a hardware design error on the NX SOM carrier board: the input voltage was not sufficient.

Thanks for your help
