L4T R21.1 SATA issue?

Hi all,

I just upgraded my Jetson to L4T R21.1 and noticed that my SATA HDD started showing timeouts and frozen commands. It worked perfectly on R19.3, but I had to repartition it for use with R21.1 (since I have to boot from eMMC with u-boot, I mounted only /home, /var and /tmp on the HDD to save eMMC space), so one possibility is a hardware issue.

Has anybody seen similar hangs and freezes, i.e. is it a driver issue, or should I look into the hardware? I have seen some complaints here on the forum about newly introduced delays, so I am not sure…

Thanks!

PS I am running USB3, if it matters.

UPD: Confirming this is a bug: I just returned to R19.3 without touching the hardware, and the HDD works perfectly (I allocated a partition in the free space and ran e2fsck -c). Maybe the issue is caused by the sda1 partition starting not at sector 2048 but in the middle of the hard drive (about 160000000 sectors in).

I am not set up for testing with SATA on Jetson, but this is fairly common code on any architecture. What were the errors? You also mentioned USB3…is the drive connected via the SATA controller or via the USB3 controller?

Hi GrayDaemon,

Can you please provide a few more details:

  1. Vendor details of SATA HDD

    #lsscsi -v ("#apt-get install lsscsi" if not already installed)

  2. Provide kernel logs, i.e. attach the file “/var/log/dmesg”

Dear linuxdev,

The drive is connected to SATA. I mentioned USB3 because I had to change the original kernel arguments to enable it, compared with vanilla R21.1; no other changes were made.

Dear Madhava,

root@tegra-ubuntu:/var/log# lsscsi -v
[0:0:0:0] disk ATA WDC WD7500BPKX-2 01.0 /dev/sda
dir: /sys/bus/scsi/devices/0:0:0:0 [/sys/devices/platform/tegra-sata.0/ata1/host0/target0:0:0/0:0:0:0]

/var/log/dmesg: http://www.gsmpager.ru/temp/var-log-dmesg-21.1-from-filesystem

But there is console output that did not make it into the dmesg file:

http://www.gsmpager.ru/temp/dmesg-21.1-sata-errors.txt

The HDD surface seems to be OK: I allocated this space under R19.3 and ran e2fsck -c.

Below is fdisk output for reference, if needed.

root@tegra-ubuntu:/var/log# fdisk /dev/sda

Disk /dev/sda: 750.2 GB, 750156374016 bytes
255 heads, 63 sectors/track, 91201 cylinders, total 1465149168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x83dfffb5

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1       160002048   222916607    31457280   83  Linux
/dev/sda2       222916608   264859647    20971520   83  Linux
/dev/sda3       264859648   285831167    10485760   83  Linux
/dev/sda4       285831168   302608383     8388608   83  Linux
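
One way to rule the partition offset in or out is to check that each start sector lands on a physical-sector boundary. The sketch below uses the 512-byte logical and 4096-byte physical sector sizes that fdisk reports above; check_alignment is a made-up helper name, not an existing tool. By this arithmetic all four start sectors are multiples of 8 sectors (4096 bytes) and of 2048 sectors (1 MiB), so alignment by itself does not look like the cause.

```shell
# Hypothetical helper: report whether a partition start is aligned to the
# physical sector size and to the conventional 1 MiB boundary.
# Arguments: start sector, logical sector size, physical sector size (bytes).
check_alignment() {
    start=$1; logical=${2:-512}; physical=${3:-4096}
    offset=$((start * logical))          # byte offset of the partition start
    if [ $((offset % physical)) -eq 0 ]; then phys=aligned; else phys=MISALIGNED; fi
    if [ $((offset % 1048576)) -eq 0 ]; then mib=aligned; else mib=MISALIGNED; fi
    echo "start=$start physical=$phys 1MiB=$mib"
}

# Start sectors copied from the fdisk listing above.
for s in 160002048 222916608 264859648 285831168; do
    check_alignment "$s" 512 4096
done
```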

Try installing the package for SMART monitoring: smartmontools

Then run:

smartctl -a /dev/sda

Is anything there reported about errors or failures? If not, do you have another SATA cable to test with?
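
When reading the report, a handful of attributes usually say the most about media versus cable trouble. The filter below is only a sketch: smart_summary is a made-up name, it assumes the usual ten-column attribute table printed by smartctl -A (raw value in the tenth column), and attribute names vary by vendor, so the list may need adjusting for this drive.

```shell
# Hypothetical filter: read "smartctl -A" output on stdin and print the raw
# values of a few attributes commonly tied to media or cable problems.
# Assumes the standard table layout: ID NAME FLAG VALUE WORST THRESH TYPE
# UPDATED WHEN_FAILED RAW_VALUE.
smart_summary() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count)$/ {
        printf "%s raw=%s\n", $2, $10
    }'
}

# Usage (as root, with smartmontools installed):
#   smartctl -A /dev/sda | smart_summary
```

A rising UDMA_CRC_Error_Count generally points at the link (cable or connector), while reallocated or pending sectors point at the platters.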

The SMART report is at http://www.gsmpager.ru/temp/smart.txt ; it looks good to me.

Yes, I have other cables and can try them, but the point is that R19.3 works perfectly with the existing cable and HDD (I have to return to R19.3 immediately after the tests to continue my work); not a single error occurs. Also, I am using a SATA3 cable with latches (which should be good, at least by design).

The SMART error report shows the drive itself should be healthy. The link was running at SATA2 speed although the drive is capable of SATA3…this reduces the chance of a marginal cable causing issues, and your testing with the same cable pretty much confirms it isn’t a cable issue.

This looks to be more of a software issue, but a question remains as to whether the cause lies in the SATA drivers or in something preventing the SATA drivers from running when required. Many drivers have timeouts and treat it as an error if the device does not respond fast enough…this seems to be the case here. The subtle issue is whether the driver itself causes the problem, or whether something else starved the driver of the chance to run in time to respond fast enough.
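
To compare the two releases on the same footing, it may help to pull the negotiated link speed and a count of error events out of a saved kernel log or console capture from each. This is a sketch only: sata_log_summary is a made-up name, and the search strings are the substrings libata typically prints, so treat them as assumptions and adjust them to whatever the actual log contains.

```shell
# Hypothetical helper: summarise SATA link speed and error events in a log.
# Argument: path to a saved kernel log or console capture.
sata_log_summary() {
    log=${1:-/var/log/dmesg}
    echo "link speed lines:"
    grep -E 'ata[0-9]+: SATA link up' "$log"
    echo "error events:"
    # Assumed libata message fragments; edit to match the real log text.
    for pat in 'exception Emask' 'failed command' 'hard resetting link' 'timeout'; do
        printf '%s: %s\n' "$pat" "$(grep -c -- "$pat" "$log")"
    done
}

# Usage: sata_log_summary /path/to/saved-console-output.txt
```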

On current L4T, a look at /proc/interrupts will show that only CPU0 is handling hardware interrupts. (On an Intel x86 motherboard all CPUs are listed, but unless the IO-APIC is enabled, only CPU0 will show handling of interrupts. I do not know if Jetson requires something like an IO-APIC to distribute hardware interrupts over all CPUs instead of just CPU0, or if it is just the way the software is designed.) Software can use any CPU, but under the current design, drivers for hardware specifically require CPU0. These IRQs trigger the driver handling the SATA transfers. If an IRQ does not get serviced quickly enough, data will be lost, and this may show up as a SATA error. With so many things requiring CPU0, the system could be showing signs of “interrupt starvation”. With the test data so far, I do not know how to differentiate between a failure of the SATA driver and interrupt starvation.
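
To see how lopsided the interrupt load is, the per-CPU columns of /proc/interrupts can be totalled. The function below is a rough sketch (irq_per_cpu is a made-up name); it assumes the usual layout where the first line names the CPUs and each following line carries one count per CPU, and short summary rows such as "Err:" get folded into the CPU0 total.

```shell
# Hypothetical helper: total the interrupt counts per CPU.
# Argument: a file in /proc/interrupts format (defaults to the live one).
irq_per_cpu() {
    awk '
    NR == 1 { ncpu = NF; next }              # header row: CPU0 CPU1 ...
    {
        for (i = 1; i <= ncpu; i++) {
            v = $(i + 1)                     # field 1 is the IRQ label
            if (v ~ /^[0-9]+$/) total[i - 1] += v
        }
    }
    END { for (c = 0; c < ncpu; c++) printf "CPU%d %.0f\n", c, total[c] }
    ' "${1:-/proc/interrupts}"
}

# Usage: irq_per_cpu              # totals for the running system
#        grep -i sata /proc/interrupts   # just the SATA line, if it is named so
```

Running it before and after a burst of disk I/O on both releases would show whether the SATA interrupt count keeps climbing on CPU0 only.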

What comes to mind is first to find out if the SATA driver itself changed from R19.3 to R21.1. If so, there may be information on a known SATA bug which likely would be corrected in a later kernel (this would also be an issue on other architectures, not just ARMv7). If there is an interrupt starvation issue, it is possible changes to CPU performance settings between R19.3 and R21.1 have revealed a weakness in performance settings (and it is known that default CPU performance settings changed between R19 and R21).

Performance settings can be tested. I do not know all of the CPU performance setting differences, but it should be possible to find out which changes in /sys would give your R19 the R21 default performance settings; plus the corollary, how to use /sys in R21 to make it look like the R19 performance settings. If applying R19 settings within R21 makes R21 work correctly, or applying R21 settings within R19 causes R19 to fail, you will have a workaround and will also know that the issue “is” interrupt starvation. If CPU performance settings do not answer this, it does not necessarily mean the issue is not interrupt starvation; it only means the starvation is not caused by a CPU performance setting. These settings, though, are high on the list of suspects since they can directly change how fast CPU0 will service interrupts.
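
A starting point for that comparison is to snapshot the standard cpufreq files on each release and diff the two snapshots. This is a minimal sketch: dump_cpufreq is a made-up name, it reads only the generic Linux cpufreq entries under /sys/devices/system/cpu, and any Tegra-specific knobs that changed between R19 and R21 are not covered here.

```shell
# Hypothetical helper: print online state and generic cpufreq settings per CPU.
# Argument: sysfs CPU root (defaults to /sys/devices/system/cpu).
dump_cpufreq() {
    root=${1:-/sys/devices/system/cpu}
    for dir in "$root"/cpu[0-9]*; do
        [ -d "$dir" ] || continue
        cpu=${dir##*/}
        # cpu0 often has no "online" file; show "?" in that case.
        online=$(cat "$dir/online" 2>/dev/null || echo "?")
        for f in scaling_governor scaling_min_freq scaling_max_freq scaling_cur_freq; do
            path="$dir/cpufreq/$f"
            if [ -r "$path" ]; then
                printf '%s online=%s %s=%s\n' "$cpu" "$online" "$f" "$(cat "$path")"
            fi
        done
    done
}

# Usage: dump_cpufreq > cpufreq-R19.txt   (then the same on R21, and diff)
```

If the governors or frequency limits differ, writing the other release's value into the same file as root (for example into scaling_governor) is the generic cpufreq way to change it; whether that is the difference that matters here is still an open question.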

So here is a question for other people in the forums…what can be echoed to files in /sys to make R19 have the same CPU performance settings as R21? What settings in R21 can be changed via echo in /sys to make R21 look like R19 CPU performance? With this there is a way to continue testing.