gk20a error

Hello,
We connected two monitors to the K1’s HDMI port and eDP port and ran a hardware reset test.
Normally both monitors display the Ubuntu desktop, but on one occasion neither monitor displayed anything (this happened once in 300 tests).
The dmesg output for this case is recorded in the attached file log1116.txt.
It seems there is a GPU error that prevents the kernel from booting successfully.
Is this a known issue, and how can we fix it?

Thanks!
log1116.txt (106 KB)

Hi Tim2016,

Please add some debug prints to the following function to see whether the refcnt is being incremented incorrectly:
int gk20a_pmu_enable_elpg(struct gk20a *g)

We will also investigate it internally. Thanks.
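
For readers who want to see what such a debug print might look like, here is a minimal userspace sketch. It is not the driver code: the struct, the field name elpg_refcnt, and the model_* functions are assumptions made for illustration, and fprintf stands in for whatever kernel logging call you would actually use. The idea is only to log the count on every call and flag a value that balanced enable/disable calls should never produce.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for the driver's PMU state; the field name
 * elpg_refcnt is an assumption for illustration only. */
struct pmu_model {
	int elpg_refcnt;
};

/* Models the shape of an enable call: bump the count, log it, and
 * report a count above 1, which balanced calls should never reach. */
static int model_enable_elpg(struct pmu_model *pmu)
{
	pmu->elpg_refcnt++;
	fprintf(stderr, "enable_elpg: refcnt=%d\n", pmu->elpg_refcnt);
	if (pmu->elpg_refcnt > 1) {
		fprintf(stderr, "enable_elpg: possible refcnt mismatch\n");
		return -1;
	}
	return 0;
}

/* Models the matching disable call: drop the count and log it. */
static int model_disable_elpg(struct pmu_model *pmu)
{
	pmu->elpg_refcnt--;
	fprintf(stderr, "disable_elpg: refcnt=%d\n", pmu->elpg_refcnt);
	return 0;
}
```

In this model a count of 1 means "enabled", each disable lowers it, and an enable without a matching disable pushes it above 1 and is reported.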

Hello,
I have added a print message in gk20a_pmu_enable_elpg().
The kernel now prints the message I added when it boots successfully.
Is that OK?
Do you want to check whether the kernel prints the message I added when the boot fails, as in log1116.txt?

Thanks!

Yes,

Please print the message when the error occurs.

Hi Tim2016,

Any update on this? Is there any way to quickly reproduce the issue?

Hello,
It hasn’t happened again since I posted this topic.
The probability seems very low.
We will let you know if we have any update.

Thanks!

Hello,
The two monitors failed to display again after 225 tests.
The output captured over RS-232 is recorded in the attached file “log1124.txt”.
“20171123.gk20a_pmu_enable_elpg()…” is the message I added in gk20a_pmu_enable_elpg().
The last message is “gk20a gk20a.0: gr_gk20a_wait_idle: timeout, ctxsw busy : 0, gr busy : 1”, and no further messages were printed for 10 minutes.
Then I pressed a key on the PC keyboard, and the shell prompt “ubuntu@tegra-ubuntu:~$” appeared, as at line 883 of “log1124.txt”.

Thanks!

log1124.txt (52.1 KB)
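
Since the failure is rare, it may help to check captured logs for the timeout message automatically instead of reading each one by hand. The following is a small sketch under that assumption: the function names are made up for this example, and the only thing taken from the thread is the message text itself.

```c
#include <assert.h>
#include <string.h>

/* Signature taken from the last message reported in the failing boot. */
static const char *const IDLE_TIMEOUT_SIG = "gr_gk20a_wait_idle: timeout";

/* Returns 1 if the given log line contains the GPU idle-timeout message. */
static int is_gr_idle_timeout(const char *line)
{
	return line != NULL && strstr(line, IDLE_TIMEOUT_SIG) != NULL;
}

/* Counts occurrences of the message in a NUL-terminated log buffer. */
static int count_gr_idle_timeouts(const char *log)
{
	int hits = 0;
	const char *p = log;
	size_t sig_len = strlen(IDLE_TIMEOUT_SIG);

	while (p != NULL && (p = strstr(p, IDLE_TIMEOUT_SIG)) != NULL) {
		hits++;
		p += sig_len; /* continue searching after this match */
	}
	return hits;
}
```

A test harness could read each serial capture into a buffer and treat a nonzero count as a failed boot.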

Hi Tim2016,

Sorry for this error. Do you have a script for this test, or a hardware method to trigger the reboot?

BTW, after hitting the error, is it resolved if you reboot again?

Hello,
I just press the hardware reset key to reset the system.
After hitting the error, it is resolved when I reboot again.

Thanks!

Hi Tim2016,

Can this error be reproduced on our devkit? Or are you using a custom board?

Hello,
I am using a custom board.

Thanks!

Could you reproduce this issue on the devkit?

Hi,
Please try the following patch to see if the error still occurs.

diff --git a/drivers/gpu/nvgpu/gk20a/fifo_gk20a.c b/drivers/gpu/nvgpu/gk20a/fifo_gk20a.c
index fe29beb..2d48114 100644
--- a/drivers/gpu/nvgpu/gk20a/fifo_gk20a.c
+++ b/drivers/gpu/nvgpu/gk20a/fifo_gk20a.c
@@ -3,7 +3,7 @@
  *
  * GK20A Graphics FIFO (gr host)
  *
- * Copyright (c) 2011-2015, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2011-2016, NVIDIA CORPORATION.  All rights reserved.
  *
  * This program is free software; you can redistribute it and/or modify it
  * under the terms and conditions of the GNU General Public License,
@@ -1072,12 +1072,15 @@
 			   " deferring channel recovery to channel free");
 		/* clear interrupt */
 		gk20a_writel(g, fifo_intr_mmu_fault_id_r(), fault_id);
-		return verbose;
+		goto exit_enable;
 	}
 
 	/* resetting the engines and clearing the runlists is done in
 	   a separate function to allow deferred reset. */
 	fifo_gk20a_finish_mmu_fault_handling(g, fault_id);
+
+exit_enable:
+	gk20a_pmu_enable_elpg(g);
 	return verbose;
 }
 
diff --git a/drivers/gpu/nvgpu/gk20a/gr_gk20a.h b/drivers/gpu/nvgpu/gk20a/gr_gk20a.h
index 526eefb..b973338 100644
--- a/drivers/gpu/nvgpu/gk20a/gr_gk20a.h
+++ b/drivers/gpu/nvgpu/gk20a/gr_gk20a.h
@@ -365,7 +365,10 @@
 		int err = 0; \
 		if (support_gk20a_pmu()) \
 			err = gk20a_pmu_disable_elpg(g); \
-		if (err) return err; \
+		if (err) { \
+			gk20a_pmu_enable_elpg(g); \
+			return err; \
+		} \
 		err = func; \
 		if (support_gk20a_pmu()) \
 			gk20a_pmu_enable_elpg(g); \

Hello WayneWWW,
Thanks! We will try this patch.

Hello WayneWWW,
We applied your patch and still see the gk20a problem (it happened once in 700 tests).
The dmesg output is recorded in the attached file hdminodisplay_reset.txt.

Thanks!
hdminodisplay_reset.txt (92.9 KB)

Hi Tim2016,

There is one more patch that fixes this issue.

Please also try it together with the previous one.

---

diff --git a/drivers/gpu/nvgpu/gk20a/gr_gk20a.h b/drivers/gpu/nvgpu/gk20a/gr_gk20a.h
index 526eefb..838f877 100644
--- a/drivers/gpu/nvgpu/gk20a/gr_gk20a.h
+++ b/drivers/gpu/nvgpu/gk20a/gr_gk20a.h
@@ -363,11 +363,15 @@
 #define gr_gk20a_elpg_protected_call(g, func) \
 	({ \
 		int err = 0; \
-		if (support_gk20a_pmu()) \
+		if (support_gk20a_pmu(g->dev) && g->elpg_enabled) { \
 			err = gk20a_pmu_disable_elpg(g); \
-		if (err) return err; \
+			if (err) { \
+				gk20a_pmu_enable_elpg(g); \
+				return err; \
+			} \
+		} \
 		err = func; \
-		if (support_gk20a_pmu()) \
+		if (support_gk20a_pmu(g->dev) && g->elpg_enabled) \
 			gk20a_pmu_enable_elpg(g); \
 		err; \
 	})

Hello WayneWWW,
I get some compile errors when I use this patch:
/home/tim/pcpartner/temp/tk1_r21-4/kernel/drivers/gpu/nvgpu/gk20a/gk20a.c: In function ‘gk20a_intr_thread_stall’:
/home/tim/pcpartner/temp/tk1_r21-4/kernel/drivers/gpu/nvgpu/gk20a/gr_gk20a.h:366:7: error: too many arguments to function ‘support_gk20a_pmu’
if (support_gk20a_pmu(g->dev) && g->elpg_enabled) {
^
/home/tim/pcpartner/temp/tk1_r21-4/kernel/drivers/gpu/nvgpu/gk20a/gk20a.c:565:3: note: in expansion of macro ‘gr_gk20a_elpg_protected_call’
gr_gk20a_elpg_protected_call(g, gk20a_gr_isr(g));
^
In file included from /home/tim/pcpartner/temp/tk1_r21-4/kernel/drivers/gpu/nvgpu/gk20a/gk20a.c:51:0:
/home/tim/pcpartner/temp/tk1_r21-4/kernel/drivers/gpu/nvgpu/gk20a/gk20a.h:537:19: note: declared here
static inline int support_gk20a_pmu(void)
^
In file included from /home/tim/pcpartner/temp/tk1_r21-4/kernel/drivers/gpu/nvgpu/gk20a/channel_gk20a.h:36:0,
from /home/tim/pcpartner/temp/tk1_r21-4/kernel/drivers/gpu/nvgpu/gk20a/fifo_gk20a.h:24,
from /home/tim/pcpartner/temp/tk1_r21-4/kernel/drivers/gpu/nvgpu/gk20a/gk20a.h:40,
from /home/tim/pcpartner/temp/tk1_r21-4/kernel/drivers/gpu/nvgpu/gk20a/gk20a.c:51:
/home/tim/pcpartner/temp/tk1_r21-4/kernel/drivers/gpu/nvgpu/gk20a/gr_gk20a.h:374:7: error: too many arguments to function ‘support_gk20a_pmu’
if (support_gk20a_pmu(g->dev) && g->elpg_enabled)
^
/home/tim/pcpartner/temp/tk1_r21-4/kernel/drivers/gpu/nvgpu/gk20a/gk20a.c:565:3: note: in expansion of macro ‘gr_gk20a_elpg_protected_call’
gr_gk20a_elpg_protected_call(g, gk20a_gr_isr(g));
^
In file included from /home/tim/pcpartner/temp/tk1_r21-4/kernel/drivers/gpu/nvgpu/gk20a/gk20a.c:51:0:
/home/tim/pcpartner/temp/tk1_r21-4/kernel/drivers/gpu/nvgpu/gk20a/gk20a.h:537:19: note: declared here
static inline int support_gk20a_pmu(void)

Should I modify support_gk20a_pmu() too or …

Thanks!
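
The compiler note shows that in this kernel tree support_gk20a_pmu() is declared with no parameters, while the patch calls it with g->dev, so the patch was presumably written against a tree with a different signature. A minimal sketch of a guard that matches the zero-argument declaration is shown below. The function body and the struct are placeholders invented for this example, since neither is shown in the thread.

```c
#include <assert.h>

/* Declaration as reported by the compiler note for this kernel tree. */
static inline int support_gk20a_pmu(void)
{
	return 1; /* placeholder body: the real function is not shown here */
}

/* Hypothetical minimal stand-in for struct gk20a. */
struct gk20a_model {
	int elpg_enabled;
};

/* Guard written against the zero-argument signature. Writing
 * support_gk20a_pmu(g->dev) here instead would reproduce the
 * "too many arguments to function" error. */
static int elpg_guard(const struct gk20a_model *g)
{
	return support_gk20a_pmu() && g->elpg_enabled;
}
```

Adapting the call sites in the patch this way avoids touching the declaration in gk20a.h, though whether that is the right choice depends on the rest of the tree.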

Hi Tim2016,

We have a combination of those two patches. Please try the following.

---

diff --git a/drivers/gpu/nvgpu/gk20a/fifo_gk20a.c b/drivers/gpu/nvgpu/gk20a/fifo_gk20a.c
index fe29beb..6b88dfd 100644
--- a/drivers/gpu/nvgpu/gk20a/fifo_gk20a.c
+++ b/drivers/gpu/nvgpu/gk20a/fifo_gk20a.c
@@ -3,7 +3,7 @@
  *
  * GK20A Graphics FIFO (gr host)
  *
- * Copyright (c) 2011-2015, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2011-2016, NVIDIA CORPORATION.  All rights reserved.
  *
  * This program is free software; you can redistribute it and/or modify it
  * under the terms and conditions of the GNU General Public License,
@@ -963,7 +963,8 @@
 	g->fifo.deferred_reset_pending = false;
 
 	/* Disable ELPG */
-	gk20a_pmu_disable_elpg(g);
+	if (support_gk20a_pmu() && g->elpg_enabled)
+		gk20a_pmu_disable_elpg(g);
 
 	/* If we have recovery in progress, MMU fault id is invalid */
 	if (g->fifo.mmu_fault_engines) {
@@ -1072,12 +1073,16 @@
 			   " deferring channel recovery to channel free");
 		/* clear interrupt */
 		gk20a_writel(g, fifo_intr_mmu_fault_id_r(), fault_id);
-		return verbose;
+		goto exit_enable;
 	}
 
 	/* resetting the engines and clearing the runlists is done in
 	   a separate function to allow deferred reset. */
 	fifo_gk20a_finish_mmu_fault_handling(g, fault_id);
+
+exit_enable:
+	if (support_gk20a_pmu() && g->elpg_enabled)
+		gk20a_pmu_enable_elpg(g);
 	return verbose;
 }
 
diff --git a/drivers/gpu/nvgpu/gk20a/gr_gk20a.h b/drivers/gpu/nvgpu/gk20a/gr_gk20a.h
index 526eefb..0df9670 100644
--- a/drivers/gpu/nvgpu/gk20a/gr_gk20a.h
+++ b/drivers/gpu/nvgpu/gk20a/gr_gk20a.h
@@ -363,11 +363,14 @@
 #define gr_gk20a_elpg_protected_call(g, func) \
 	({ \
 		int err = 0; \
-		if (support_gk20a_pmu()) \
+		if (support_gk20a_pmu() && g->elpg_enabled) \
 			err = gk20a_pmu_disable_elpg(g); \
-		if (err) return err; \
+		if (err) { \
+			gk20a_pmu_enable_elpg(g); \
+			return err; \
+		} \
 		err = func; \
-		if (support_gk20a_pmu()) \
+		if (support_gk20a_pmu() && g->elpg_enabled) \
 			gk20a_pmu_enable_elpg(g); \
 		err; \
 	})

BTW, what are your criteria for this test? How many times do you run it?
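
As a rough guide to how many runs a pass criterion needs: if the failure shows up about once in 300 resets, a few hundred clean resets say very little. The sketch below estimates how many consecutive clean resets are needed before a failure at that rate would very likely have appeared. It assumes each reset is independent with a constant failure probability, which may not hold on real hardware, and the function name is made up for this example.

```c
#include <assert.h>

/* Smallest number of consecutive passing resets after which a failure
 * rate of 1-in-one_in would have been missed with probability at most
 * miss_prob. Assumes independent resets with a constant failure rate.
 * Returns -1 for arguments outside the meaningful range. */
static int clean_runs_needed(int one_in, double miss_prob)
{
	double p_all_pass = 1.0;
	double p_pass;
	int runs = 0;

	if (one_in < 2 || miss_prob <= 0.0 || miss_prob >= 1.0)
		return -1;

	p_pass = 1.0 - 1.0 / (double)one_in;
	while (p_all_pass > miss_prob) {
		p_all_pass *= p_pass; /* one more reset passes */
		runs++;
	}
	return runs;
}
```

For example, clean_runs_needed(300, 0.05) comes out at roughly 900 resets, so a run of 700 tests with one failure is consistent with the original failure rate.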

Hello WayneWWW,
We applied the combined patch but still have the same problem as in “hdminodisplay_reset.txt”.
The latest log is the attached file “log_1214.txt”.
Are our “fifo_gk20a.c” and “gr_gk20a.h” correct?

Thanks!

gr_gk20a.h (10.5 KB)
fifo_gk20a.c (51.3 KB)
log_1214.txt (269 KB)

Hi Tim2016,

I found that the error on the first line of your log is:

ubuntu@tegra-ubuntu:~$ [ 16.492480] tegradc tegradc.0: Display timing doesn’t meet restrictions.

May I ask whether this happens every time?