TX2 clock speed/bandwidth detection failure above about 65 deg C

We are having a problem with units not capturing video properly when the internal
temperature exceeds about 65 degrees C. Comparing the boot logs of a working startup
and a failing one, the first error seems to be in detecting “iso emc max clk” and
“max iso bandwidth”.

The snapshots are from an sdiff between two boot logs with the timestamps
removed. A | marks a line that differs; < and > mark lines present on only
one side. You may have to scroll left and right to see both logs.
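For anyone wanting to reproduce this kind of comparison without sdiff, here is a minimal Python sketch. It assumes dmesg-style lines that start with a "[ seconds.micros]" prefix; the function names are made up for illustration and the pairing of changed lines is only an approximation of what sdiff does.

```python
import re
import sys
from difflib import SequenceMatcher

# Matches a dmesg-style prefix such as "[    0.946762] ".
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s?")


def strip_timestamps(lines):
    """Remove the leading "[ seconds.micros]" prefix from each log line."""
    return [TIMESTAMP.sub("", line.rstrip("\n")) for line in lines]


def side_by_side(good, bad):
    """Return (left, mark, right) rows: ' ' same, '|' changed, '<' or '>' one-sided."""
    rows = []
    matcher = SequenceMatcher(a=good, b=bad, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        left, right = good[i1:i2], bad[j1:j2]
        if tag == "equal":
            rows.extend((l, " ", r) for l, r in zip(left, right))
            continue
        # Pair changed lines in order; leftovers exist on one side only.
        for k in range(max(len(left), len(right))):
            l = left[k] if k < len(left) else None
            r = right[k] if k < len(right) else None
            if l is not None and r is not None:
                mark = "|"
            else:
                mark = "<" if r is None else ">"
            rows.append((l, mark, r))
    return rows


def format_rows(rows, width=60):
    """Render the rows as fixed-width text, truncating long lines."""
    out = []
    for l, mark, r in rows:
        left = (l or "")[:width]
        right = (r or "")[:width]
        out.append(f"{left:<{width}} {mark} {right}")
    return out


if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as f_good, open(sys.argv[2]) as f_bad:
        good_log = strip_timestamps(f_good)
        bad_log = strip_timestamps(f_bad)
    print("\n".join(format_rows(side_by_side(good_log, bad_log))))
```

Run it with the good log as the first argument and the bad log as the second.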

The first difference (good on the left, bad on the right) is that the iso emc max clk
and bw aren’t calculated properly. I believe this is the root of the problem.

la/ptsa driver initialized.                                                                             la/ptsa driver initialized.
    pre_t19x_iso_plat_init(): iso emc max clk=1866000KHz                                               |    pre_t19x_iso_plat_init(): iso emc max clk=0KHz
    pre_t19x_iso_plat_init(): max_iso_bw=26870400KB                                                    |    pre_t19x_iso_plat_init(): max_iso_bw=0KB
    NET: Registered protocol family 2                                                                       NET: Registered protocol family 2
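
One observation from the good log: 26870400 KB divided by 1866000 KHz is exactly 14.4. That would be consistent with max_iso_bw being derived from the EMC clock, in which case a clock read as 0 automatically gives a bandwidth of 0. The sketch below only checks the arithmetic. Reading 14.4 as 16 bytes per clock times 90% is a guess that happens to fit these two numbers, not something taken from the driver source, and bw_from_clk is a hypothetical helper.

```python
from fractions import Fraction

# Values copied from the good boot log above.
good_clk_khz = 1866000
good_bw_kb = 26870400

# Exact ratio between the two logged values (72/5, i.e. 14.4).
ratio = Fraction(good_bw_kb, good_clk_khz)


def bw_from_clk(clk_khz, factor=ratio):
    """Hypothetical helper: scale a clock in KHz by the observed ratio."""
    return int(clk_khz * factor)


print(float(ratio))
print(bw_from_clk(good_clk_khz), bw_from_clk(0))

# Assumption only: 16 bytes per clock at 90% also equals the observed ratio.
print(Fraction(16) * Fraction(9, 10) == ratio)
```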

It’s possible this causes some of the camera subsystem to fail initialization.

misc tegra_camera_ctrl: tegra_camera_isomgr_register: some fields not in DT.                            misc tegra_camera_ctrl: tegra_camera_isomgr_register: some fields not in DT.
    misc tegra_camera_ctrl: tegra_camera_isomgr_register tpg_max_iso = 3916800KBs                           misc tegra_camera_ctrl: tegra_camera_isomgr_register tpg_max_iso = 3916800KBs
    misc tegra_camera_ctrl: tegra_camera_isomgr_register isp_iso_bw=0, vi_iso_bw=2250000, max_bw=391        misc tegra_camera_ctrl: tegra_camera_isomgr_register isp_iso_bw=0, vi_iso_bw=2250000, max_bw=391
                                                                                                       >    pre_t19x_iso_plat_register(): iso bandwidth 3916800KB is not available, client tegra_camera_ctrl
                                                                                                       >    misc tegra_camera_ctrl: tegra_camera_isomgr_register: unable to register to isomgr
                                                                                                       >    misc tegra_camera_ctrl: tegra_camera_probe: failed to register CAMERA as isomgr client
                                                                                                       >    tegra_camera_platform: probe of tegra-camera-platform failed with error -12
    tegra-pcie 10003000.pcie-controller: probing port 2, using 1 lanes                                      tegra-pcie 10003000.pcie-controller: probing port 2, using 1 lanes

And later

input: tegra-hda HDMI/DP,pcm=7 as /devices/3510000.hda/sound/card0/input1                               input: tegra-hda HDMI/DP,pcm=7 as /devices/3510000.hda/sound/card0/input1
                                                                                                       >    pre_t19x_iso_plat_register(): iso bandwidth 24576KB is not available, client ape_adma
                                                                                                       >    tegra_isomgr_adma_register: Failed to register adma isomgr client. err=-22
    OPE platform probe                                                                                      OPE platform probe

Finally, the probes of isp and nvcsi fail, which I believe leads to a bad pointer and the abort:

isp 15600000.isp: initialized                                                                           isp 15600000.isp: initialized
                                                                                                       >    isp 15600000.isp: isp_probe: failed
                                                                                                       >    isp: probe of 15600000.isp failed with error -22
    nvcsi 150c0000.nvcsi: initialized                                                                       nvcsi 150c0000.nvcsi: initialized
                                                                                                       >    nvcsi: probe of 150c0000.nvcsi failed with error -22
    gpio tegra-gpio-aon wake29 for gpio=56(FF:0)                                                            gpio tegra-gpio-aon wake29 for gpio=56(FF:0)
    gpio tegra-gpio-aon wake67 for gpio=57(FF:1)                                                            gpio tegra-gpio-aon wake67 for gpio=57(FF:1)
    gpio tegra-gpio-aon wake68 for gpio=58(FF:2)                                                            gpio tegra-gpio-aon wake68 for gpio=58(FF:2)
    input: gpio-keys as /devices/gpio-keys/input/input2                                                     input: gpio-keys as /devices/gpio-keys/input/input2
    tegra-vi4 15700000.vi: initialized                                                                      tegra-vi4 15700000.vi: initialized
    tegra-vi4 15700000.vi: subdev 150c0000.nvcsi--8 bound                                              |    Unable to handle kernel read from unreadable memory at virtual address 00000000
    tegra-vi4 15700000.vi: subdev 150c0000.nvcsi--7 bound
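
The probe failures in the logs above report errors -12 and -22. Kernel drivers return negated errno values, so these can be decoded with Python's standard errno module. This is a small sketch that assumes it is run on a Linux host, where the numbers match the kernel's; describe_kernel_error is a name made up for illustration.

```python
import errno
import os


def describe_kernel_error(code):
    """Map a negative kernel return code such as -12 to (name, message)."""
    number = abs(code)
    name = errno.errorcode.get(number, "UNKNOWN")
    return name, os.strerror(number)


for code in (-12, -22):
    name, message = describe_kernel_error(code)
    print(code, name, message)
```

On Linux, 12 is ENOMEM and 22 is EINVAL. That fits the sequence in the logs: the bandwidth request is refused first, and later probes then fail with an invalid-argument error.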

Is there any reason why these clock and bandwidth values would be calculated as 0
when the temperature exceeds a specific value?

Note this only happens on some boards. A reset will NOT clear the condition;
a power cycle will.

Thanks,

Cary

hello cobrien,

Just to confirm, is this the same issue as Topic 1052337?
thanks

I believe this is the root cause of that issue.

Hi, to check whether the internal temperature is over the limit, please list the values of all thermal zones as described in topic https://devtalk.nvidia.com/default/topic/1032887

I was finally able to get a production unit to measure the temperatures.

I ran our application and some cpu loads (several parallel repeated ‘openssl speed aes-256-cbc’)
and was able to get the temperature up to the level where we see problems:

Mon May 20 11:24:09 EDT 2019
/sys/devices/virtual/thermal/thermal_zone3/temp 77
/sys/devices/virtual/thermal/thermal_zone1/temp 77
/sys/devices/virtual/thermal/thermal_zone6/temp 100
/sys/devices/virtual/thermal/thermal_zone4/temp 62
/sys/devices/virtual/thermal/thermal_zone2/temp 72
/sys/devices/virtual/thermal/thermal_zone0/temp 77
/sys/devices/virtual/thermal/thermal_zone7/temp 75
/sys/devices/virtual/thermal/thermal_zone5/temp 72
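
A listing like the one above can be collected with a short script. This is a sketch, not the script that produced that output: the sysfs path is the one shown in the listing, the function names are hypothetical, and the unit handling is a heuristic. The Linux thermal sysfs interface normally reports temp in millidegrees Celsius, while the values above are in whole degrees, so they were presumably already converted.

```python
from pathlib import Path


def read_thermal_zones(base="/sys/devices/virtual/thermal"):
    """Return {zone_name: (type, raw_temp)} for every thermal_zone* directory."""
    zones = {}
    for zone in sorted(Path(base).glob("thermal_zone*")):
        try:
            raw = int((zone / "temp").read_text().strip())
        except (OSError, ValueError):
            continue  # zone unreadable or not reporting a number
        type_file = zone / "type"
        kind = type_file.read_text().strip() if type_file.exists() else "unknown"
        zones[zone.name] = (kind, raw)
    return zones


def to_celsius(raw):
    """Heuristic (an assumption): treat values of 1000 or more as millidegrees."""
    return raw / 1000 if abs(raw) >= 1000 else float(raw)


if __name__ == "__main__":
    for name, (kind, raw) in read_thermal_zones().items():
        print(f"{name} {kind} {to_celsius(raw):.1f}")
```

Printing the type next to each value makes it easier to tell which sensor a given zone number corresponds to.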

And when we reset, we get the expected problem with 0 clock/0 bw

[    0.946762] la/ptsa driver initialized.
[    0.946812] pre_t19x_iso_plat_init(): iso emc max clk=0KHz
[    0.946847] pre_t19x_iso_plat_init(): max_iso_bw=0KB
[    0.948222] NET: Registered protocol family 2
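
When checking many boots or many units, the two pre_t19x_iso_plat_init() lines can be picked out of a saved boot log automatically. A minimal sketch, assuming the log text uses exactly the message format shown above; the function names are hypothetical.

```python
import re

# Matches the two init messages and captures the field name and its value.
ISO_LINE = re.compile(
    r"pre_t19x_iso_plat_init\(\): (iso emc max clk|max_iso_bw)=(\d+)(?:KHz|KB)"
)


def iso_init_values(log_text):
    """Return {field_name: value} for the iso init lines found in the log."""
    return {m.group(1): int(m.group(2)) for m in ISO_LINE.finditer(log_text)}


def boot_is_affected(log_text):
    """True if the init lines are present and at least one value is 0."""
    values = iso_init_values(log_text)
    return bool(values) and any(v == 0 for v in values.values())


if __name__ == "__main__":
    import sys
    print("affected" if boot_is_affected(sys.stdin.read()) else "ok or not found")
```

For example, the output of dmesg can be piped into it after each reset.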

And a kernel panic during setting up the video capture subsystem.

[    4.298402] mmc1: hw tuning done ...
[    4.299835] tegra-vi4 15700000.vi: initialized
[    4.301907] Unable to handle kernel read from unreadable memory at virtual address 00000000

...
[    4.302053] Call trace:
[    4.302057] [<ffffff8008aeff94>] v4l2_async_notifier_register+0x134/0x1a0
[    4.302066] [<ffffff8008b0cb80>] tegra_vi_graph_init+0x210/0x290
[    4.302071] [<ffffff8008b069e8>] tegra_vi_media_controller_init+0x180/0x1b8
[    4.302084] [<ffffff800854e830>] tegra_vi4_probe+0x240/0x360
[    4.302096] [<ffffff8008759780>] platform_drv_probe+0x60/0xc8
[    4.302100] [<ffffff8008756d48>] driver_probe_device+0xd0/0x3f8
[    4.302103] [<ffffff8008757194>] __driver_attach+0x124/0x128
[    4.302106] [<ffffff800875487c>] bus_for_each_dev+0x74/0xb0
[    4.302109] [<ffffff8008756540>] driver_attach+0x30/0x40
[    4.302111] [<ffffff8008754e40>] driver_attach_async+0x20/0x60

Turning on a fan pointed to the device clears it right up.

Hi, it looks like all thermal zones are normal, so there should be no limit on performance. Are you testing on a DevKit or a custom board? How many units did you test?

Could you please also check the strapping settings to confirm your design is the same as the reference board? You can check this against the Strapping chapter in the OEM DG.

hello cobrien,

please also check whether the DVFS table is enabled.
you can refer to the Clock Frequency and Power Management chapter for more details.
thanks

This is on a custom carrier. It doesn’t seem to happen with every carrier board, but
we only have a few at this point. I am going to repeat the tests using the NVidia
dev kit carrier, as well as with a different TX2 module.

Hi cobrien, any update on the test results with the DevKit?

I was finally able to do tests with one of our carrier boards and
the TX2 evaluation module from NVidia.

I ran multiple encryption tests in parallel to drive CPU utilization
and temperature up to around 60 deg C on most of the monitor points
and then reset the unit.
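
The heating step can be scripted so it is repeatable across boards. The sketch below keeps a fixed number of copies of a command running for a set time; the command used earlier in this thread was 'openssl speed aes-256-cbc'. The helper name, worker count and duration are placeholders rather than the values used in these tests.

```python
import subprocess
import time


def run_parallel_load(command, workers, duration_s):
    """Keep `workers` copies of `command` running for about `duration_s` seconds.

    Returns the number of runs that completed.
    """
    deadline = time.monotonic() + duration_s
    procs = [None] * workers
    completed = 0
    while time.monotonic() < deadline:
        for i, proc in enumerate(procs):
            # Start a worker, or restart one that has exited.
            if proc is None or proc.poll() is not None:
                if proc is not None:
                    completed += 1
                procs[i] = subprocess.Popen(
                    command,
                    stdout=subprocess.DEVNULL,
                    stderr=subprocess.DEVNULL,
                )
        time.sleep(0.1)
    # Let the last round of workers finish before returning.
    for proc in procs:
        if proc is not None:
            proc.wait()
            completed += 1
    return completed


if __name__ == "__main__":
    # Placeholder values: 4 workers for 10 minutes.
    run_parallel_load(["openssl", "speed", "aes-256-cbc"], workers=4, duration_s=600)
```

Once the thermal zones reach the target temperature, reset the unit and check the boot log for the 0 values.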

On our carrier board the 0 values for clock and bandwidth appeared:

[    0.946812] pre_t19x_iso_plat_init(): iso emc max clk=0KHz
[    0.946847] pre_t19x_iso_plat_init(): max_iso_bw=0KB

And the startup crashes continued.

On the TX2 Eval board, these errors DID NOT occur, and the system
started without crashing.

We haven’t yet established exactly what the problem is, but it does seem
to be with our carrier board. My guess is connection problems on the TX2
connector when it gets hot.

I’m going to mark this as the solution for now, since I don’t believe
there is an underlying problem with the TX2 module or firmware.

Thanks for your help

Cary

(Can’t see where to mark this as the solution).