Perf execution causes system hang

Hello Team,

I would like to share information about a system hang observed while using the perf tool on the r32.7.4 BSP kernel.

Issue description:

The BSP kernel built from the r32.7.4 release tag hangs the system when the perf kmem subcommand is executed. The issue is not seen with the r32.7.3 kernel.

Environment:

  • Hardware
    • Xavier NX Dev Kit
  • BSP
    • RFS is based on Jetson Linux r32.7.4
    • Kernel is based on r32.7.4 source with additional configs enabled.

Issue reproduction:

  1. Fetch the r32.7.4 kernel source.
  2. In addition to the default BSP kernel config, set the additional configs listed below.
CONFIG_IRQ_WORK=y
CONFIG_PERF_EVENTS=y
# CONFIG_DEBUG_PERF_USE_VMALLOC is not set
CONFIG_KPROBES=y
CONFIG_HAVE_HW_BREAKPOINT=y
CONFIG_HW_PERF_EVENTS=y
CONFIG_DEBUG_KERNEL=y
CONFIG_FTRACE=y
CONFIG_FTRACE_SYSCALLS=y
# CONFIG_LOCK_STAT_HISTOGRAM is not set
# CONFIG_FTRACE_STARTUP_TEST is not set
CONFIG_KALLSYMS_ALL=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_LOCKDEP=y
CONFIG_LOCK_STAT=y
# CONFIG_DEBUG_LOCKDEP is not set
  3. Build the kernel and flash it to the Xavier NX Dev Kit.
root@Nvidia-Xavier-1:~# cat /proc/version
Linux version 4.9.337-tegra (hoge@localhost) (gcc version 7.3.1 20180425 [linaro-7.3-2018.05 revision d29120a424ecfbc167ef90065c0eeb7f91977701] (Linaro GCC 7.3-2018.05) ) #1 SMP PREEMPT Fri Jan 19 10:17:36 IST 2024
root@Nvidia-Xavier-1:~# 
  4. Build the perf binary using the commands below from the kernel source directory.
$ cd tools/perf
$ make 
  5. Run the perf subcommand below on the target console.
  6. The system hangs and is rebooted by the watchdog after 120 secs. Please refer to the log snippet below.
root@Nvidia-Xavier-1:~# cat /sys/kernel/debug/tracing/events/kmem/kmem_cache_alloc_node/id
387
root@Nvidia-Xavier-1:~# 
root@Nvidia-Xavier-1:~/perf/bsp# ./perf --version
perf version 4.9.337
root@Nvidia-Xavier-1:~/perf/bsp# 
root@Nvidia-Xavier-1:~/perf/bsp# ./perf kmem -s bytes record sleep 11
[  287.994400] hw perfevents: unable to set irq affinity (irq=85, cpu=4)
[  287.994599] hw perfevents: unable to set irq affinity (irq=86, cpu=5)
[  287.995163] hw perfevents: unable to set irq affinity (irq=85, cpu=4)
[  287.995304] hw perfevents: unable to set irq affinity (irq=86, cpu=5)
ÿâ
[0000.024] W> RATCHET: MB1 binary ratchet value 4 is too large than ratchet level 2 from HW fuses.
[0000.033] I> MB1 (prd-version: 1.5.1.9-t194-41334769-73a9b7ef)
[0000.038] I> Boot-mode: Coldboot
[0000.041] I> Chip revision : A02P
[0000.044] I> Bootrom patch version : 15 (correctly patched)
[0000.049] I> ATE fuse revision : 0x200
[0000.053] I> Ram repair fuse : 0x0
[0000.056] I> Ram Code : 0x0

Perf issue log:
perf_kmem_issue_20240119.txt (136.2 KB)

Observation/Analysis:

  1. The perf kmem event “kmem:kmem_cache_alloc_node” has tracepoint ID 387.
  2. This tracepoint ID value of 387 is written to the NV_PMSELR_EL0 register as the decimal value “259” in the carmel_uncore_event_init function.
  3. All the CPU cores hang when the decimal value 259 is written to the NV_PMSELR_EL0 register in the set_unit function, which is called by the carmel_uncore_pmu_enable function.
  4. The issue could be due to carmel_uncore_pmu being initialized for tracepoint events.
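
To make the two numbers in the observations easier to compare, here is a minimal C sketch that only looks at their bit patterns. The helper name differing_bits is hypothetical and is not taken from the driver. The thread does not show the definitions of CARMEL_CONFIG_UNIT or CARMEL_CONFIG_EVENT, so this says nothing about how the driver actually derives 259 from 387; it only shows which bits differ.

```c
/* Values reported in the observations above. */
#define TRACEPOINT_ID 387u /* id of kmem:kmem_cache_alloc_node, 0x183 */
#define PMSELR_VALUE  259u /* value observed for NV_PMSELR_EL0, 0x103 */

/*
 * Hypothetical helper (not part of the driver): returns a mask of the
 * bits that differ between two values.
 */
static unsigned int differing_bits(unsigned int a, unsigned int b)
{
    return a ^ b;
}
```

Calling differing_bits(TRACEPOINT_ID, PMSELR_VALUE) yields 0x080: the two values agree in every bit except bit 7. That would be consistent with the tracepoint ID being decoded as bitfields, but this is an assumption until the macro definitions are checked.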

Workaround

  1. As a workaround, added a condition check in the carmel_uncore_event_init function that returns -ENOENT, to avoid setting an incorrect uncore unit and event for perf events other than hardware events.
diff --git a/drivers/platform/tegra/tegra19_perf_uncore.c b/drivers/platform/tegra/tegra19_perf_uncore.c
index 4ba7bf881..6aa8b1e74 100644
--- a/drivers/platform/tegra/tegra19_perf_uncore.c
+++ b/drivers/platform/tegra/tegra19_perf_uncore.c
@@ -235,6 +235,12 @@ static int carmel_uncore_event_init(struct perf_event *event)
        if (event->cpu == -1)
                return -ENOENT;

+       // Avoid initialization for event types other than hardware
+       // Because, the unit and event are incorrectly set leading to
+       // UNDEFINED behavior when handling tracepoint events
+       if (event->attr.type != PERF_TYPE_HARDWARE)
+               return -ENOENT;
+
        config_unit = CARMEL_CONFIG_UNIT(event->attr.config);
        config_event = CARMEL_CONFIG_EVENT(event->attr.config);

  2. The issue doesn’t occur after applying the above patch. Please refer to the log snippet below.
root@Nvidia-Xavier-1:~# cat /sys/kernel/debug/tracing/events/kmem/kmem_cache_alloc_node/id
387
root@Nvidia-Xavier-1:~# 
root@Nvidia-Xavier-1:~/perf/bsp# ./perf kmem -s bytes record sleep 11
[  548.228915] hw perfevents: unable to set irq affinity (irq=85, cpu=4)
[  548.229128] hw perfevents: unable to set irq affinity (irq=86, cpu=5)
[  548.229642] hw perfevents: unable to set irq affinity (irq=85, cpu=4)
[  548.229788] hw perfevents: unable to set irq affinity (irq=86, cpu=5)
[ perf record: Woken up 0 times to write data ]
[ perf record: Captured and wrote 149.705 MB perf.data (1526847 samples) ]
root@Nvidia-Xavier-1:~/perf/bsp# 
root@Nvidia-Xavier-1:~/perf/bsp# 
root@Nvidia-Xavier-1:~/perf/bsp# ./perf --version
perf version 4.9.337
root@Nvidia-Xavier-1:~/perf/bsp# 

Perf uncore fix log:
perf_kmem_issue_uncore_fix_20240119.txt (109.4 KB)
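
The effect of the type check in the workaround can be illustrated in user space with a minimal sketch. Everything prefixed with mock_ or MOCK_ is hypothetical and only stands in for the kernel's struct perf_event and the driver's carmel_uncore_event_init; the numeric type values mirror PERF_TYPE_HARDWARE (0) and PERF_TYPE_TRACEPOINT (2) from the perf uapi header. It is a sketch of the guard logic, not the driver code.

```c
#include <errno.h>
#include <stdint.h>

/* Mirrors the perf uapi type ids for the two event types used here. */
enum mock_perf_type {
    MOCK_PERF_TYPE_HARDWARE   = 0,
    MOCK_PERF_TYPE_TRACEPOINT = 2,
};

/* Hypothetical stand-in for the struct perf_event fields the check reads. */
struct mock_event {
    uint32_t type;   /* corresponds to event->attr.type */
    uint64_t config; /* corresponds to event->attr.config */
    int cpu;         /* corresponds to event->cpu */
};

/*
 * Sketch of the guarded init path: returns 0 when the event is accepted,
 * or -ENOENT to decline it so that it is not treated as an uncore event.
 */
static int mock_uncore_event_init(const struct mock_event *event)
{
    if (event->cpu == -1)
        return -ENOENT;

    /* Check added by the workaround: decline non-hardware event types. */
    if (event->type != MOCK_PERF_TYPE_HARDWARE)
        return -ENOENT;

    /* The real driver would decode the unit and event from config here. */
    return 0;
}
```

With this guard, an event of type MOCK_PERF_TYPE_TRACEPOINT with config 387 on cpu 4 is declined with -ENOENT before its config is ever decoded as a unit and event, while a hardware-type event on the same CPU is accepted. Many upstream PMU drivers instead compare event->attr.type against event->pmu->type; whether that is the better check for this driver is for NVIDIA's kernel team to confirm.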

Hi gowrisankar.kumar,

Please try the latest JetPack-5.1.2 on Xavier-NX.
Download link: https://developer.nvidia.com/embedded/jetson-linux-r3541

Unfortunately, that is not a helpful comment. This report is about a kernel-4.9 issue that occurs in an NVIDIA driver, drivers/platform/tegra/tegra19_perf_uncore.c (not in the vanilla kernel).

Could you please share it with NVIDIA’s kernel development team?

BTW, this topic has the tools tag, but the correct tag is kernel.

Thank you.

Hi,
Thanks for sharing. We will check it with our teams.
