Occasional WiFi disconnects: CFG80211-ERROR) wl_cfg80211_hang : In : chip crash eventing

We’re seeing some occasional WiFi disconnects on our TX2 boards (emphasis mine):

[12271.511539] <b>dhd_bus_rxctl: resumed on timeout, INT status=0x20800040</b>
[12271.518373] <b>dhd_bus_rxctl: rxcnt_timeout=1, rxlen=0</b>
[12271.523257] <b>dhd_check_hang: Event HANG send up due to  re=1 te=0 e=-110 s=2</b>
[12271.528946] serial-tegra 3110000.serial: RxData DMA copy to tty layer failed
[12271.537268] dhd_prot_ioctl : bus is down. we have nothing to do
[12271.543196] dhd_prot_ioctl : bus is down. we have nothing to do
[12271.543203] dhd_check_hang: Event HANG send up due to  re=1 te=0 e=-110 s=2
[12271.556079] CFG80211-ERROR) wl_cfg80211_get_station : 
[12271.556079] dhd_prot_ioctl : bus is down. we have nothing to do
[12271.567127] NOT assoc, error -1
[12271.570283] CFG80211-ERROR) wl_cfg80211_disconnect : Reason 3
[12271.576041] dhd_prot_ioctl : bus is down. we have nothing to do
[12271.581970] CFG80211-ERROR) wl_cfg80211_disconnect : error (-1)
[12271.709131] serial-tegra 3110000.serial: RxData DMA copy to tty layer failed
[12271.737499] dhd_prot_ioctl : bus is down. we have nothing to do
[12271.743422] CFGP2P-ERROR) wl_cfgp2p_bss_isup : 'cfg bss -C 0' failed: -1
[12271.750131] CFGP2P-ERROR) wl_cfgp2p_bss_isup : NOTE: this ioctl error is normal when the BSS has not been created yet.
[12271.760833] dhd_prot_ioctl : bus is down. we have nothing to do
[12271.766750] CFG80211-ERROR) wl_notifier_change_state : wlan0:error(-1)
[12271.773285] dhd_prot_ioctl : bus is down. we have nothing to do
[12271.779233] dhd_prot_ioctl : bus is down. we have nothing to do
[12271.796446] CFGP2P-ERROR) wl_cfgp2p_set_management_ie : vndr ie set error : -1
[12271.803736] dhd_prot_ioctl : bus is down. we have nothing to do
[12271.809659] CFG80211-ERROR) wl_dongle_down : WLC_DOWN error (-1)
[12271.895207] serial-tegra 3110000.serial: RxData DMA copy to tty layer failed
[12271.953571] wl_android_wifi_off in
[12271.956980] tegra_sysfs_off
[12271.959774] tegra_sysfs_rf_test_disable
[12271.963615] dhd_prot_ioctl : bus is down. we have nothing to do
[12271.969532] dhd_prot_ioctl : bus is down. we have nothing to do
[12271.975465] dhd_wl_ioctl_get_intiovar: get int iovar ampdu_hostreorder failed, ERR -1
[12271.995329] dhd_prot_ioctl : bus is down. we have nothing to do
[12272.001260] dhd_wl_ioctl_set_intiovar: set int iovar tlv failed, ERR -1
[12272.007949] Disabling wake69
[12272.008072] sdhci-tegra 3440000.sdhci: Tuning already done, restoring the best tap value : 20
[12272.021466] wifi_platform_set_power = 0
[12272.080282] serial-tegra 3110000.serial: RxData DMA copy to tty layer failed
[12272.225938] <b>CFG80211-ERROR) wl_cfg80211_hang : In : chip crash eventing</b>
[12272.246622] cfg80211: World regulatory domain updated:
[12272.251768] cfg80211:  DFS Master region: unset
[12272.256129] cfg80211:   (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp), (dfs_cac_time)
[12272.265347] serial-tegra 3110000.serial: RxData DMA copy to tty layer failed
[12272.272922] cfg80211:   (2402000 KHz - 2472000 KHz @ 40000 KHz), (N/A, 2000 mBm), (N/A)
[12272.280928] cfg80211:   (2457000 KHz - 2482000 KHz @ 20000 KHz, 92000 KHz AUTO), (N/A, 2000 mBm), (N/A)
[12272.290327] cfg80211:   (2474000 KHz - 2494000 KHz @ 20000 KHz), (N/A, 2000 mBm), (N/A)
[12272.298344] cfg80211:   (5170000 KHz - 5250000 KHz @ 80000 KHz, 160000 KHz AUTO), (N/A, 2000 mBm), (N/A)
[12272.307830] cfg80211:   (5250000 KHz - 5330000 KHz @ 80000 KHz, 160000 KHz AUTO), (N/A, 2000 mBm), (0 s)
[12272.317312] cfg80211:   (5490000 KHz - 5730000 KHz @ 160000 KHz), (N/A, 2000 mBm), (0 s)
[12272.325404] cfg80211:   (5735000 KHz - 5835000 KHz @ 80000 KHz), (N/A, 2000 mBm), (N/A)
[12272.333408] cfg80211:   (57240000 KHz - 63720000 KHz @ 2160000 KHz), (N/A, 0 mBm), (N/A)

We’re running a 4.4 kernel with the RT patch applied, plus a few extra patches (see the end of the post). Firmware version for the WiFi is:

Firmware version = wl0: Dec 12 2017 15:09:35 version 7.35.221.34 (r679642) FWID 01-e35dbe99

We’ve captured exactly this issue on a TX2 only once, but we have seen a couple of other cases with the same symptoms where no logs were available.

Some notes:

  1. The issue happened after the TX2 had been on for roughly 3.5 hours. No WiFi issues before that, and we've seen our TX2s with the same software running for much longer without issues.
  2. It was stationary when the issue happened, but had been moving around a few meters prior to that.
  3. It was connected to a single 5GHz Access Point, so should have no possibility of roaming.
  4. As can be seen, we were simultaneously seeing `serial-tegra 3110000.serial: RxData DMA copy to tty layer failed` errors. I'm assuming this is because one of our applications had gone into an error state and was no longer servicing the serial port that it otherwise consumes data from. I'm not sure whether this might affect the WiFi subsystem?
  5. After the error, I could log in over the serial console, and issue an ifdown wlan0 / ifup wlan0, which caused the WiFi to come back up.
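
Until the root cause is found, the manual recovery in note 5 could be automated. The following is only a minimal sketch under stated assumptions: the function names are hypothetical, it matches on the `wl_cfg80211_hang` log line shown above, and it expects kernel log lines on stdin.

```shell
#!/bin/sh
# Hypothetical watchdog sketch. None of these function names come from the
# bcmdhd driver or L4T; they are illustrative only.

# Return 0 if the given kernel log line is the bcmdhd "chip crash" report.
is_wifi_hang() {
    case "$1" in
        *"wl_cfg80211_hang"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Bounce the interface, mirroring the manual ifdown/ifup recovery.
recover_wifi() {
    iface="${1:-wlan0}"
    ifdown "$iface"
    ifup "$iface"
}

# Read kernel log lines from stdin and recover whenever a hang is reported.
watch_for_hang() {
    while IFS= read -r line; do
        if is_wifi_hang "$line"; then
            # Detach stdin so the recovery commands cannot eat log lines.
            recover_wifi wlan0 </dev/null
        fi
    done
}
```

Assuming a `dmesg` that supports following the log (the util-linux one does, a BusyBox build may not), it could be run as `dmesg --follow | watch_for_hang`. Note that this would also react to hang lines already in the buffer when it starts, and it only masks the problem rather than fixing it.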

Any ideas what the issue could be? - or any tips on how to debug further? We sadly don’t yet have a surefire way of reproducing the issue, but are working on it currently.

I can see also that we’re not using the newest firmware, but I haven’t been able to find anywhere to download that, nor any changelog?

Any help would be appreciated!

The patches we use that relate to WiFi are the following:

From https://devtalk.nvidia.com/default/topic/1047138/jetson-tx1/wifi-disconnect-problem-on-jetpack-3-3/2
------------------ drivers/net/wireless/bcmdhd/wl_cfg80211.c ------------------
index 9d3568d18421..8f5f11d28968 100644
@@ -9935,6 +9935,7 @@ wl_cfg80211_verify_bss(struct bcm_cfg80211 *cfg, struct net_device *ndev)
 	do {
 		bss = CFG80211_GET_BSS(wiphy, NULL, curbssid,
 			ssid->SSID, ssid->SSID_len);
+		cfg->wdev->ssid_len = ssid->SSID_len;
 		if (bss || (count > 5)) {
 			break;
 		}
From https://devtalk.nvidia.com/default/topic/1047319/jetson-tx2/disable-wifi-powersave
------------------- drivers/net/wireless/bcmdhd/dhd_linux.c -------------------
index a1aa56926ceb..89c3334660e9 100644
@@ -6154,6 +6154,7 @@ dhd_preinit_ioctls(dhd_pub_t *dhd)
 #endif 
 	}
 
+        dhd_slpauto_config(dhd, 0);
 	DHD_ERROR(("Firmware up: op_mode=0x%04x, MAC="MACDBG"\n",
 		dhd->op_mode, MAC2STRDBG(dhd->mac.octet)));
 	/* Set Country code  */
---------------------------- net/wireless/nl80211.c ----------------------------
index bf65f31bd55e..868eec3d8da4 100644
@@ -8659,8 +8659,15 @@ static int nl80211_set_power_save(struct sk_buff *skb, struct genl_info *info)
 
 	state = (ps_state == NL80211_PS_ENABLED) ? true : false;
 
+/*	This check has been commented out, to ignore the internally saved
+	power management state, and just always send the on or off command.
+	There seems to be something that can turn on power saving without it
+	being reflected in the internal state, so removing this allows to keep
+	periodically sending power_save off commands (using the userspace iw
+	utility), without turning it on inbetween. 
 	if (state == wdev->ps)
-		return 0;
+		return 0;*/
 
 	err = rdev_set_power_mgmt(rdev, dev, state, wdev->ps_timeout);
 	if (!err)

According to your description, it looks like you are using a rel-28 based release.

Would you mind moving to rel-32 to see if this issue is still there? Also, could you go back to pure jetpack + devkit and connect only to this 5 GHz AP to see if it can be reproduced?

Yes, that is correct. Our kernel is built on top of 28.2.1, with gcc-4.8.5. Note that we’re only using the L4T kernel, devicetree and U-Boot though - we’re building our own userspace and rootfs with ptxdist (it’s a system similar to Yocto).

Moving to rel-32 is sadly difficult, for a couple of reasons. First, we’re using an out-of-tree camera driver, and from what I know the driver is only available for the 4.4 kernel - but I’d need to check up on it. Second, that’d also mean fully revalidating a new kernel on our system.

Do you have any particular changenotes for rel-32 in mind that could affect this issue?

In any case though, before moving on to test any changes, I’ll need to devise a (semi)robust way of reproducing the issue first. Any hints on what could trigger such an error? - or debugging options to add?

Hi sfalsig,

Sorry that I am not able to give you a solid answer.
This issue may have been resolved in the firmware provided by the vendor, or in the kernel driver. Or it may not be resolved at all. We are not sure.

Rel-28.2.1 has been out for two years, and there is a large gap between it and rel-32.

Please try pure jetpack first and see if you can reproduce the issue. You could try rel-28.2.1 first, and then rel-32.3 or rel-28.3.x. We can only investigate this issue with pure jetpack + devkit.

In any case though, before moving on to test any changes, I’ll need to devise a (semi)robust way of reproducing the issue first. Any hints on what could trigger such an error? - or debugging options to add?

Your issue does not seem easy to reproduce. What I can suggest is enabling more verbose logging:

sudo -s
 echo 0x10801 > /sys/module/bcmdhd/parameters/dhd_msg_level
 echo 120 > /sys/module/bcmdhd/parameters/dhd_console_ms

Also, try to capture the packets with a sniffing tool such as Wireshark. Trying a different AP may be worth considering too.

And please provide the full dmesg instead of a partial one.
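
The two steps above (verbose bcmdhd logging, then a full dmesg capture) could be wrapped as follows. This is a sketch only: the function names and the output file name are assumptions, while the sysfs path and the two values are the ones given above.

```shell
#!/bin/sh
# Hypothetical helpers; only the sysfs path and values come from this thread.

# Enable verbose bcmdhd logging. Needs root on the target. The parameter
# directory can be overridden, which is mainly useful for dry runs.
enable_dhd_debug() {
    params="${1:-/sys/module/bcmdhd/parameters}"
    if [ ! -d "$params" ]; then
        echo "bcmdhd parameters not found: $params" >&2
        return 1
    fi
    echo 0x10801 > "$params/dhd_msg_level" &&
    echo 120 > "$params/dhd_console_ms"
}

# Save the complete kernel log to a file (timestamped name by default).
save_full_dmesg() {
    out="${1:-dmesg-$(date +%Y%m%d-%H%M%S).txt}"
    dmesg > "$out"
}
```

On the board this would be run as root: `enable_dhd_debug` before trying to reproduce the problem, then `save_full_dmesg` once the disconnect has happened. With the extra verbosity the kernel ring buffer may wrap sooner, so capturing early is probably wise.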

Thanks for the hints!

I actually managed to reproduce the issue once yesterday - it again seems to be linked to some camera and DMA issues. The full dmesg log is attached, showing the system running.

  • 0-21 seconds: Initial boot messages
  • 2114-5508 seconds: Our application starts streaming from a camera, runs for a minute or two, then stops again. This is repeated in 10-30 minute intervals.
  • 5904-6585 seconds: The application starts the camera as usual, but something seems to go wrong, and our application hangs - this in turn causes it to not service the serial port, which I assume is what is giving us the RxData DMA copy failed messages.
  • 6735-11427 seconds: After logging in to manually restart our application, the system runs fine again for a while.
  • 12216-12594 seconds: The application again starts the camera as usual, but has an error very similar to the previous case. Additionally, at 12271 seconds, the WiFi driver seems to time out, and crashes - the two times I have seen logs of the WiFi crash happening, it's coincided with the camera issues.
  • 13217-14423 seconds: I log in over the serial console, stop our application, and run ifdown wlan0 / ifup wlan0, which brings the WiFi back up, then do a poweroff of the board.
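
To check whether the WiFi hang really follows the serial DMA failures, the timeline above can be extracted from a saved log mechanically. The sketch below is an assumption-laden example: the function name and the 60 second default window are made up, and it assumes every line starts with a dmesg-style `[seconds.micros]` timestamp.

```shell
#!/bin/sh
# Hypothetical helper: for each WiFi hang line in a saved kernel log, report
# how many serial DMA failures were logged in the window just before it.
# Usage: correlate_hang LOGFILE [WINDOW_SECONDS]
correlate_hang() {
    log="$1"
    window="${2:-60}"
    awk -v w="$window" '
        # Turn the leading "[seconds.micros]" timestamp into a number.
        function ts(line,    s) {
            s = line
            sub(/^\[ */, "", s)
            sub(/\].*/, "", s)
            return s + 0
        }
        # Remember the time of every serial DMA failure.
        /RxData DMA copy to tty layer failed/ { dma[++n] = ts($0) }
        # On a hang, count the failures inside the preceding window.
        /wl_cfg80211_hang/ {
            t = ts($0); c = 0
            for (i = 1; i <= n; i++)
                if (dma[i] >= t - w && dma[i] <= t) c++
            printf "hang at %.6f s: %d serial DMA failures in previous %d s\n", t, c, w
        }
    ' "$log"
}
```

For example, `correlate_hang console-ramoops-0.txt 120` would look two minutes back from each hang. A hang with no preceding DMA failures would suggest the two are unrelated, so collecting this over several reproductions is more telling than a single log.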

I had another look at 28.3, and that might be a possibility for us - I’ll have a look at trying it out as soon as I have confirmed that I can reproduce the issue consistently.

By “pure jetpack + devkit”, do you mean that I need to use the nvidia-supplied rootfs also? That is sadly not going to be possible for us, as we would need to redo our entire set of libraries and toolchain for compiling our application.
console-ramoops-0.txt (512 KB)

Better to use an attachment for your log…

By “pure jetpack + devkit”, do you mean that I need to use the nvidia-supplied rootfs also? That is sadly not going to be possible for us, as we would need to redo our entire set of libraries and toolchain for compiling our application.

Unfortunately, we cannot help you debug if you cannot simplify the use case to devkit and jetpack.
Your use case and error look complicated. If even a camera is involved in crashing the WiFi, then we need the internal team to help check. The internal team only looks at such an issue if it can be reproduced on nvidia devkit + jetpack. In your case, there are too many unknown factors.

Argh - I was looking for the attachment button in the reply editor - but can see now that it only comes up after you post… Changed it…

I understand - I was hoping that someone had seen this type of issue before, and knew what could cause it, to help narrow in on the problem. I’ll keep trying to better reproduce it though. If I find a good way of doing so, I’ll see if I can manage to do it in devkit + jetpack too.