17000000.gpu ga10b_pbdma_handle_intr_0_acquire:646 [ERR] semaphore acquire timeout!

Test environment: JetPack 6.3 (R36.3)
Test module: Jetson AGX Orin 64GB/32GB

Upon booting into the Ubuntu desktop, the following error occurred.
Kernel log:

tegra login: [ 56.609727] nvgpu: 17000000.gpu ga10b_pbdma_handle_intr_0_acquire:646 [ERR] semaphore acquire timeout!
[ 56.609797] ga10b Channel Status - chip ga10b
[ 56.609799] ga10b ---------------------------
[ 56.609802] ga10b 508-ga10b, TSG: 3, pid 2512, thread name gnome-initial-s, refs: 2, deterministic: no, domain name: (no domain)
[ 56.609804] ga10b channel status: in use idle not busy
[ 56.609807] ga10b RAMFC: TOP: 80000020040080a0 PUT: 0020040080a0 GET: 0020040080a0 FETCH: 000000000000 HEADER: 2140006c COUNT: 00000000 SEMAPHORE: addr 002004010000 payload 0000000000000000 execute 00100001
[ 56.609810] ga10b
[ 56.609812] ga10b 509-ga10b, TSG: 2, pid 2092, thread name gnome-shell, refs: 8, deterministic: no, domain name: (no domain)
[ 56.609813] ga10b channel status: in use on_pbdma, pbdma_busy busy
[ 56.609815] ga10b RAMFC: TOP: 8000001ffd15d290 PUT: 001ffd15d338 GET: 001ffd15d290 FETCH: 000000000000 HEADER: 20001b04 COUNT: 11110003 SEMAPHORE: addr 001ffd350010 payload 0000000000000001 execute 02181000
[ 56.609817] ga10b
[ 56.609818] ga10b 510-ga10b, TSG: 1, pid 1635, thread name Xorg, refs: 2, deterministic: no, domain name: (no domain)
[ 56.609820] ga10b channel status: in use idle not busy
[ 56.609821] ga10b RAMFC: TOP: 8000002004033258 PUT: 002004033258 GET: 002004033258 FETCH: 000000000000 HEADER: 2140006c COUNT: 00000000 SEMAPHORE: addr 002004320000 payload 0000000000000000 execute 00000001
[ 56.609823] ga10b
[ 56.609824] ga10b 511-ga10b, TSG: 0, pid 1635, thread name Xorg, refs: 2, deterministic: no, domain name: (no domain)
[ 56.609825] ga10b channel status: in use on_eng not busy
[ 56.609827] ga10b RAMFC: TOP: 800000200404f168 PUT: 00200404f168 GET: 00200404f168 FETCH: 000000000000 HEADER: 2140006c COUNT: 00000000 SEMAPHORE: addr 002004020000 payload 0000000000000000 execute 00100001
[ 56.609828] ga10b
[ 56.609832] ga10b PBDMA Status - chip ga10b
[ 56.609833] ga10b -------------------------
[ 56.609835] ga10b pbdma 0:
[ 56.609838] ga10b id: 2 - [tsg] next_id: - -1 [channel] | status: valid
[ 56.609844] ga10b PBDMA_PUT 0000001ffd15d338 PBDMA_GET 0000001ffd15d290
[ 56.609850] ga10b GP_PUT 0000060b GP_GET 000005f0 FETCH 000005f0 HEADER 20001b04
[ 56.609855] ga10b HDR 200406c0 SHADOW0 0404f140 SHADOW1 00002820
[ 56.609857] ga10b pbdma 1:
[ 56.609858] ga10b id: -1 - [channel] next_id: - -1 [channel] | status: invalid
[ 56.609864] ga10b PBDMA_PUT 000000357f79aeb8 PBDMA_GET 000000b990082164
[ 56.609869] ga10b GP_PUT 00000000 GP_GET 59e055e0 FETCH 00000000 HEADER c103dd1c
[ 56.609873] ga10b HDR 76f52f74 SHADOW0 209e8cf7 SHADOW1 07b620a6
[ 56.609876] ga10b pbdma 2:
[ 56.609877] ga10b id: -1 - [channel] next_id: - -1 [channel] | status: invalid
[ 56.609882] ga10b PBDMA_PUT 00000047abe42464 PBDMA_GET 0000001c92fd4a50
[ 56.609888] ga10b GP_PUT 00000000 GP_GET da042b5f FETCH 00000000 HEADER c19797a0
[ 56.609892] ga10b HDR 87eefebc SHADOW0 ff7fa845 SHADOW1 7f256e82
[ 56.609894] ga10b pbdma 3:
[ 56.609895] ga10b id: -1 - [channel] next_id: - -1 [channel] | status: invalid
[ 56.609901] ga10b PBDMA_PUT 00000085177a8a08 PBDMA_GET 00000092b7b46f98
[ 56.609906] ga10b GP_PUT 00000000 GP_GET 1cb2e9b3 FETCH 00000000 HEADER a1147fb4
[ 56.609910] ga10b HDR b256a688 SHADOW0 e733120a SHADOW1 0177b600
[ 56.609912] ga10b pbdma 4:
[ 56.609913] ga10b id: -1 - [channel] next_id: - -1 [channel] | status: invalid
[ 56.609918] ga10b PBDMA_PUT 00000059c4313460 PBDMA_GET 000000ff3dbc3f20
[ 56.609924] ga10b GP_PUT 00000000 GP_GET ab757b8f FETCH 00000000 HEADER 018684b4
[ 56.609928] ga10b HDR 23ca344a SHADOW0 c93e7ce2 SHADOW1 a405abc9
[ 56.609930] ga10b pbdma 5:
[ 56.609931] ga10b id: -1 - [channel] next_id: - -1 [channel] | status: invalid
[ 56.609936] ga10b PBDMA_PUT 0000004c72e877f0 PBDMA_GET 00000047fd087518
[ 56.609942] ga10b GP_PUT 00000000 GP_GET 8b90b544 FETCH 00000000 HEADER 614571d8
[ 56.609946] ga10b HDR 23fd72f1 SHADOW0 7fef42a2 SHADOW1 6af0600d
[ 56.609946] ga10b
[ 56.609952] ga10b ga10b eng 0:
[ 56.609953] ga10b id: 0 (tsg), next_id: -1 (channel), ctx status: valid
[ 56.609954] ga10b
[ 56.609957] ga10b ga10b eng 1:
[ 56.609958] ga10b id: 0 (tsg), next_id: -1 (channel), ctx status: valid
[ 56.609959] ga10b
[ 56.609962] ga10b ga10b eng 2:
[ 56.609963] ga10b id: -1 (channel), next_id: -1 (channel), ctx status: invalid
[ 56.609964] ga10b
[ 56.609967] ga10b ga10b eng 3:
[ 56.609968] ga10b id: -1 (channel), next_id: -1 (channel), ctx status: invalid
[ 56.609969] ga10b
[ 56.609972] ga10b ga10b eng 4:
[ 56.609973] ga10b id: -1 (channel), next_id: -1 (channel), ctx status: invalid
[ 56.609974] ga10b
[ 56.609977] ga10b ga10b eng 5:
[ 56.609978] ga10b id: -1 (channel), next_id: -1 (channel), ctx status: invalid
[ 56.609979] ga10b
[ 56.609980] ga10b
[ 56.609981] nvgpu: 17000000.gpu ga10b_pbdma_report_error:330 [ERR] pbdma_intr_0(0)= 0x04000000
[ 56.609987] nvgpu: 17000000.gpu nvgpu_cic_mon_report_err_safety_services:97 [ERR] Error reporting is not supported in this platform
[ 56.609995] nvgpu: 17000000.gpu nvgpu_set_err_notifier_locked:143 [ERR] error notifier set to 24 for ch 509 owned by gnome-shell

Please see if this patch helps.

diff --git a/drivers/gpu/nvgpu/common/gr/gr_config.c b/drivers/gpu/nvgpu/common/gr/gr_config.c
index c86bba6..42cf687 100644
--- a/drivers/gpu/nvgpu/common/gr/gr_config.c
+++ b/drivers/gpu/nvgpu/common/gr/gr_config.c
@@ -686,8 +667,9 @@
 	 * logical id.
 	 */
 	config->gpc_tpc_physical_id_map = nvgpu_kzalloc(g,
-			nvgpu_safe_mult_u32((size_t)config->gpc_count,
-				sizeof(u32 *)));
+			nvgpu_safe_mult_u32(
+				nvgpu_safe_cast_u64_to_u32((size_t)config->max_gpc_count),
+				(u32)sizeof(u32 *)));
 	if (config->gpc_tpc_physical_id_map == NULL) {
 		nvgpu_err(g, "alloc gpc_tpc_physical_id_map failed");
 		goto clean_up_gpc_rop_config;
@@ -696,6 +678,11 @@
 	for (gpc_index = 0; gpc_index < config->gpc_count; gpc_index++) {
 		gpc_phys_id = nvgpu_grmgr_get_gr_gpc_phys_id(g,
 				cur_gr_instance, (u32)gpc_index);
+		if (gpc_phys_id >= config->max_gpc_count) {
+			nvgpu_err(g, "gpc_phys_id: %u is >= max_gpc_count: %u",
+					gpc_phys_id, config->max_gpc_count);
+			goto clean_up_gpc_tpc_physical_id_map_alloc_fail;
+		}
 		config->gpc_tpc_physical_id_map[gpc_phys_id] =
 			nvgpu_kzalloc(g, nvgpu_safe_mult_u32(
 				config->max_tpc_per_gpc_count, sizeof(u32)));


I have already applied this patch, but the problem persists.

Actually, I don’t know what you are doing here, because you shared a patch that we didn’t ask you to apply…

NvGPU

[NvGPU] slab-out-of-bounds in nvgpu_gr_config_init
https://forums.developer.nvidia.com/t/kernel-crashes-on-specific-orin-nx-modules/329611/17

Otherwise, please tell me which patch corresponds to the link you provided.

The link already points to the one under “Known Issue on r36.4.7”.

I noticed that the NvGPU issue looked similar to mine, so I applied this patch. I haven’t found any other candidate patches.

I’m not sure whether it would be clearer if I said this in Chinese… Please just apply the patch I attached.

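The link you posted goes to the top of the page and does not clearly point to a specific patch package. The patch I applied is the only one under nvgpu that looks similar to my problem.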

This one mentions it… it is included in that fix.

The link I posted is not the top of the page… I am asking you to look at the “Known Issue on r36.4.7” section.

I can’t download the attachment.

Does it need authorization?

No, I don’t think there is any extra authorization needed.