Jetson TX2 (4GB) crashes after SError (continued.)

Continuing the discussion from Jetson TX2 (4GB) crashes after SError and Machine check error, as that thread was closed automatically.

We have some new information about this issue now: as you suggested we disabled the denver cores completely on some of our devices.

This indeed made the problem disappear.

However as I already said earlier: this is not a permanent solution.

  • So can you indicate next steps to further pinpoint the problem?
  • Are you aware of some issues that would explain this behavior?

Actually, that is the permanent solution, as we disable those 2 cores by default on purpose.

Are you serious? So you are selling a 6 core device of which only 4 cores are usable? Is this documented anywhere?

There are some explanations in the release note document.

5.15 Increased Kernel Launch Latency on Denver 2 Cores

And indeed the default software does not enable these 2 cores by default. You could refer to some posts for this issue before.

Thank you very much for these links. We don’t have issues with using the denver cores though. We have issues with occasional SError related to using the denver cores.

We are using the following commands to set up clocking and power management:

  • nvpmodel -m 5
  • jetson_clocks
  • echo 0 > /sys/kernel/debug/tegra_cpufreq/B_CLUSTER/cc3/enable

The last command originates from CPU Throttling more on 4.9 than 4.4 - #15 by cquast.
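To confirm that the commands above actually took effect, the standard Linux sysfs entries can be read back. This is only a minimal sketch: the function name report_cpu_state and its optional root argument are made up for illustration, and it assumes the usual cpufreq layout under /sys/devices/system/cpu.

```shell
# Sketch: print the online CPU range and the current frequency of each CPU.
# The optional first argument (hypothetical) overrides the sysfs root so the
# function can be pointed at a test directory; normally leave it unset.
report_cpu_state() {
    root=${1:-/sys/devices/system/cpu}

    echo "online: $(cat "$root/online")"

    for dir in "$root"/cpu[0-9]*; do
        [ -d "$dir" ] || continue
        freq_file="$dir/cpufreq/scaling_cur_freq"
        if [ -r "$freq_file" ]; then
            # scaling_cur_freq is reported in kHz.
            echo "${dir##*/}: $(cat "$freq_file") kHz"
        else
            echo "${dir##*/}: offline or no cpufreq"
        fi
    done
}
```

Calling report_cpu_state after nvpmodel -m 5 and jetson_clocks should list all six CPUs with a frequency if the configuration was applied as intended.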

The nvpmodel -m 5 command uses this custom configuration in /etc/nvpmodel.conf:

< POWER_MODEL ID=5 NAME=MAX_FREQ_ALL >
CPU_ONLINE CORE_1 1
CPU_ONLINE CORE_2 1
CPU_ONLINE CORE_3 1
CPU_ONLINE CORE_4 1
CPU_ONLINE CORE_5 1
CPU_A57 MIN_FREQ 2035200
CPU_A57 MAX_FREQ 2035200
CPU_DENVER MIN_FREQ 2035200
CPU_DENVER MAX_FREQ 2035200
GPU_POWER_CONTROL_ENABLE GPU_PWR_CNTL_EN on
GPU MIN_FREQ 0
GPU MAX_FREQ 1300500000
GPU_POWER_CONTROL_DISABLE GPU_PWR_CNTL_DIS auto
EMC MAX_FREQ 1866000000

Unlike the document you shared, we are not using taskset to move a task to the denver cores; we are using /sys/fs/cgroup/cpuset/ because we want to move single threads only. We first create a new cpuset for each denver core like this:

PATH_PREFIX=/sys/fs/cgroup/cpuset
mk_cpuset() {
    name=$1
    cpus=$2

    # Creating the directory makes the kernel populate it with the
    # cpuset pseudo-files that we write to below.
    mkdir -p "${PATH_PREFIX}/${name}"

    # Set the CPUs this cpuset may use.
    /bin/echo "$cpus" > "${PATH_PREFIX}/${name}/cpuset.cpus"

    # Set the memory node. cpuset.mems must be non-empty before
    # tasks can be attached to the cpuset.
    /bin/echo 0 > "${PATH_PREFIX}/${name}/cpuset.mems"

    # Disable scheduler load balancing across the CPUs in this cpuset.
    /bin/echo 0 > "${PATH_PREFIX}/${name}/cpuset.sched_load_balance"

    # The remaining pseudo-files keep their default values. There may be
    # other flags that could be set to improve performance.
}

mk_cpuset "verity-rt1" "1"
mk_cpuset "verity-rt2" "2"

Then we move threads into those cpusets, e.g.:

echo $tid > /sys/fs/cgroup/cpuset/verity-rt1/tasks
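Whether a thread really ended up where expected can be checked from the cpuset's tasks file and from the thread's Cpus_allowed_list in /proc. The following is a rough sketch; the helper names tid_in_cpuset and allowed_cpus and their optional root arguments are hypothetical and only there for illustration.

```shell
# Sketch: succeed if the given thread ID is listed in the cpuset's tasks file.
# Third argument (hypothetical) overrides the cpuset mount point for testing.
tid_in_cpuset() {
    tid=$1
    name=$2
    root=${3:-/sys/fs/cgroup/cpuset}
    grep -qx "$tid" "$root/$name/tasks"
}

# Sketch: print the CPUs the kernel allows the given thread to run on,
# as reported in the Cpus_allowed_list field of its status file.
# Second argument (hypothetical) overrides the proc mount point for testing.
allowed_cpus() {
    tid=$1
    proc=${2:-/proc}
    awk '/^Cpus_allowed_list:/ { print $2 }' "$proc/$tid/status"
}
```

For example, tid_in_cpuset "$tid" verity-rt1 && allowed_cpus "$tid" should print 1 for a thread that was moved into verity-rt1.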

Can you spot anything in this flow that would explain these SErrors?

Hi,

Basically, we cannot tell you why the SErrors happened just by looking through your commands.

If you want us to help check such issue, try to provide a method that can 100% reproduce your issue on latest BSP + NV devkit.

For TX2, the latest BSP is the rel-32.7.4 that was released recently.

And please be aware that we cannot guarantee when a fix will be ready. If some of the affected files are not open source, you may need to wait until the next release.
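When working toward a reproducible case, it may help to count how often the errors show up in the kernel log during a test run. This is a minimal sketch only: the function name count_serror_lines is made up, and matching on the strings "serror" and "machine check" is an assumption about how the messages appear in the log, not a confirmed format.

```shell
# Sketch: read kernel log text on stdin and print the number of lines that
# mention an SError or a machine check (case-insensitive substring match;
# the exact message format is an assumption).
count_serror_lines() {
    grep -c -i -E 'serror|machine check'
}
```

A typical use would be dmesg | count_serror_lines before and after a stress run, comparing the two counts.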
