nvhost module error during high load test

Hello everyone.

While running a high-load test that uses the GStreamer library, my test program aborted and an error log was output.

From the error log, I believe a memory access error occurred in the kernel driver, so I modified the kernel driver based on the log, and the high-load test then passed.
However, I cannot be sure my patch is correct. Please check whether it is.

[ High load test sequence ]
Using BSP 28.3

  1. main process starts
  2. main process creates sub-processes (32 sub-processes created)
  3. each sub-process uses the GStreamer library (decodes a movie)
  4. main process stops and deletes the sub-processes
  5. sleep 10 seconds
  6. return to step 2 (run continuously for 48 hours)
  • After a while the main process suddenly aborted and the system error log below was output. A sketch of the test loop is shown after this list.
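
For reference, this is a minimal sketch of what such a test loop could look like; it is not the original test program. The gst-launch-1.0 pipeline and the /tmp/test.mp4 path are placeholders chosen for illustration only.

/*
 * Sketch of the test sequence above (NOT the original test program).
 * The decode pipeline and the input file path are placeholder assumptions.
 */
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define NPROC 32

int main(void)
{
    time_t start = time(NULL);

    /* step 6: keep cycling for 48 hours */
    while (time(NULL) - start < 48 * 3600) {
        pid_t pids[NPROC];
        int i;

        /* step 2: create 32 sub-processes */
        for (i = 0; i < NPROC; i++) {
            pids[i] = fork();
            if (pids[i] == 0) {
                /* step 3: each sub-process decodes a movie via GStreamer */
                execlp("gst-launch-1.0", "gst-launch-1.0",
                       "filesrc", "location=/tmp/test.mp4", "!",
                       "qtdemux", "!", "h264parse", "!",
                       "omxh264dec", "!", "fakesink", (char *)NULL);
                _exit(127); /* exec failed */
            }
        }

        sleep(30); /* let the decoders run for a while */

        /* step 4: stop and reap the sub-processes */
        for (i = 0; i < NPROC; i++) {
            kill(pids[i], SIGTERM);
            waitpid(pids[i], NULL, 0);
        }

        sleep(10); /* step 5 */
    }
    return 0;
}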

[ modify summary ]
・Protect the nvhost_module_update_rate path in nvhost_module_finalize_poweron with mutex_lock(&client_list_lock).
The call into nvhost_module_update_rate probably needs to be protected by client_list_lock, but this code path does not take the lock.
Since the same call is protected in other functions, I think this is a mistake. (An illustrative sketch of the suspected race follows the diff below.)

--- a/kernel/nvhost/drivers/video/tegra/host/nvhost_acm.c
+++ b/kernel/nvhost/drivers/video/tegra/host/nvhost_acm.c
@@ -1354,6 +1354,7 @@ static int nvhost_module_finalize_poweron(struct device *dev)
                goto out;
 
        /* set default EMC rate to zero */
+       mutex_lock(&client_list_lock);
        if (pdata->bwmgr_handle) {
                for (i = 0; i < NVHOST_MODULE_MAX_CLOCKS; i++) {
                        if (nvhost_module_emc_clock(&pdata->clocks[i])) {
@@ -1362,6 +1363,7 @@ static int nvhost_module_finalize_poweron(struct device *dev)
                        }
                }
        }
+       mutex_unlock(&client_list_lock);
 
        /* enable module interrupt if support available */
        if (pdata->module_irq)
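
To make the reasoning behind the lock clearer, here is an illustrative sketch (not the actual nvhost code; the struct and function names are hypothetical) of the failure mode the log suggests: list_del() poisons an entry's pointers with LIST_POISON values (0xdead0000000001xx on arm64, which matches the faulting address dead000000000118 and registers x5/x6 in the oops), so a concurrent walk of the client list that does not hold client_list_lock can dereference a poisoned pointer and crash.

/* Illustrative sketch of the race, not nvhost driver code. */
#include <linux/list.h>
#include <linux/mutex.h>

struct client {
    struct list_head node;
    unsigned long rate;
};

static LIST_HEAD(client_list);
static DEFINE_MUTEX(client_list_lock);

/* Writer: removes a client under the lock. list_del() poisons
 * node.next/node.prev with LIST_POISON1/2. */
static void remove_client(struct client *c)
{
    mutex_lock(&client_list_lock);
    list_del(&c->node);
    mutex_unlock(&client_list_lock);
}

/* Reader: if this walk ran WITHOUT client_list_lock, it could follow a
 * just-poisoned pointer and fault at LIST_POISON1 + offset, which is the
 * kind of address seen in the oops. Holding the lock, as the proposed
 * patch does around the rate-update path, prevents that. */
static unsigned long sum_rates(void)
{
    struct client *c;
    unsigned long sum = 0;

    mutex_lock(&client_list_lock);
    list_for_each_entry(c, &client_list, node)
        sum += c->rate;
    mutex_unlock(&client_list_lock);
    return sum;
}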

[ error log ]

Unable to handle kernel paging request at virtual address dead000000000118
 pgd = ffffffc0ca10e000
 [dead000000000118] *pgd=00000001020f3003, *pud=00000001020f3003, *pmd=0000000000000000
 Internal error: Oops: 96000004 [#1] PREEMPT SMP
 Modules linked in: bluedroid_pm
 CPU: 5 PID: 10052 Comm: queue_back:src Not tainted #1
 Hardware name: quill (DT)
 task: ffffffc1cddb2580 ti: ffffffc111be4000 task.ti: ffffffc111be4000
 PC is at nvhost_module_update_rate+0xb4/0x2d0
 LR is at nvhost_module_runtime_resume+0x1d0/0x1f4
 pc : [<ffffffc0003b4ac0>] lr : [<ffffffc0003b665c>] pstate: 80000045
 sp : ffffffc111be7990
 x29: ffffffc111be7990 x28: ffffffc1ebf94700 
 x27: ffffffc0012c1478 x26: 0000000000000000 
 x25: ffffffc1eb8a0c00 x24: 0000000000000000 
 x23: 0000000000000000 x22: ffffffc0012c1540 
 x21: 0000000000000000 x20: ffffffc0012c1478 
 x19: 0000000000000001 x18: 0000000000000014 
 x17: 0000007f92a7fb20 x16: ffffffc00011a8fc 
 x15: 000a4a0b06071005 x14: 0000000000000000 
 x13: 0000000000000000 x12: 0000000000000005 
 x11: 0000000000000001 x10: 00000000000008b0 
 x9 : 0000000000000008 x8 : ffffffc0012c16d0 
 x7 : 0000000000000000 x6 : dead000000000108 
 x5 : dead000000000100 x4 : 0000000000000000 
 x3 : 000000000000004b x2 : 0000000000000000 
 x1 : 0000000000000001 x0 : 00000000000003e8 
 
 Process queue_back:src (pid: 10052, stack limit = 0xffffffc111be4020)
 Call trace:
 [<ffffffc0003b4ac0>] nvhost_module_update_rate+0xb4/0x2d0
 [<ffffffc0003b665c>] nvhost_module_runtime_resume+0x1d0/0x1f4
 [<ffffffc00057db00>] pm_generic_runtime_resume+0x28/0x38
 [<ffffffc000586f50>] pm_genpd_default_restore_state+0x40/0xac
 [<ffffffc0005883dc>] genpd_restore_dev.isra.10+0x1c/0x40
 [<ffffffc00058a27c>] pm_genpd_runtime_resume+0x10c/0x228
 [<ffffffc00057f880>] __rpm_callback+0x6c/0x94
 [<ffffffc00057f8c8>] rpm_callback+0x20/0x84
 [<ffffffc000580a94>] rpm_resume+0x374/0x614
 [<ffffffc000580d84>] __pm_runtime_resume+0x50/0x74
 [<ffffffc0003b54cc>] nvhost_module_busy+0x64/0x16c
 [<ffffffc0003c5548>] nvhost_ioctl_channel_submit+0x598/0x8b8
 [<ffffffc0003c612c>] nvhost_channelctl+0x448/0xd84
 [<ffffffc0001e6350>] do_vfs_ioctl+0x324/0x5e4
 [<ffffffc0001e6694>] SyS_ioctl+0x84/0x98
 [<ffffffc000084ff0>] el0_svc_naked+0x24/0x28
 ---[ end trace a28bff6370133ff3 ]---

Hello yasuhiro_yamamoto,

I don't think adding mutex lock/unlock calls to the nvhost driver like this is a good approach.
May I know what your use case is for the high-load test?
Also, do you plan to move to the latest release (i.e. JetPack-4.2.2) for development?
Thanks