SPI error on Jetson Nano

Hi all,

There seems to be an issue with SPI on my Jetson Nano: it stops working after {x} amount of time. The weird thing about this error is that it also seems to power cycle the Jetson Nano without doing a safe shutdown.

Any help with this error would be appreciated.

Error logs here
Oct 21 01:53:09 BPU kernel: [ 29.454835] CPU: 1 PID: 1352 Comm: irq/66-7000d400 Tainted: G D 4.9.140-tegra #1
Oct 21 01:53:09 BPU kernel: [ 29.454837] Hardware name: NVIDIA Jetson Nano Developer Kit (DT)
Oct 21 01:53:09 BPU kernel: [ 29.454839] task: ffffffc0f90af000 task.stack: ffffffc0f7ff4000
Oct 21 01:53:09 BPU kernel: [ 29.454847] PC is at kthread_data+0x24/0x30
Oct 21 01:53:09 BPU kernel: [ 29.454852] LR is at irq_thread_dtor+0x2c/0xd8
Oct 21 01:53:09 BPU kernel: [ 29.454855] pc : [] lr : [] pstate: 60400045
Oct 21 01:53:09 BPU kernel: [ 29.454856] sp : ffffffc0f7ff7960
Oct 21 01:53:09 BPU kernel: [ 29.454861] x29: ffffffc0f7ff7960 x28: ffffffc0f90af000
Oct 21 01:53:09 BPU kernel: [ 29.454865] x27: 0000000000000000 x26: 0000000000000000
Oct 21 01:53:09 BPU kernel: [ 29.454869] x25: ffffff800a09a267 x24: ffffff8009e65000
Oct 21 01:53:09 BPU kernel: [ 29.454872] x23: 00000000000001c0 x22: ffffff800a144090
Oct 21 01:53:09 BPU kernel: [ 29.454875] x21: 0000000000000000 x20: ffffffc0f90af000
Oct 21 01:53:09 BPU kernel: [ 29.454879] x19: ffffffc0f90af000 x18: 0000000000000001
Oct 21 01:53:09 BPU kernel: [ 29.454882] x17: 0000000000000000 x16: ffffffc0f7ff7e10
Oct 21 01:53:09 BPU kernel: [ 29.454885] x15: ffffffffffffffff x14: ffffff800a1491e0
Oct 21 01:53:09 BPU kernel: [ 29.454888] x13: ffffff800a148e39 x12: ffffff8009e84000
Oct 21 01:53:09 BPU kernel: [ 29.454892] x11: 0000000000000000 x10: ffffff800a148000
Oct 21 01:53:09 BPU kernel: [ 29.454895] x9 : 0000000000000000 x8 : ffffffc0f90af498
Oct 21 01:53:09 BPU kernel: [ 29.454898] x7 : ffffffc0f90af4a8 x6 : ffffffc0fefd75a0
Oct 21 01:53:09 BPU kernel: [ 29.454902] x5 : 0000000000000004 x4 : ffffffc0f90af81c
Oct 21 01:53:09 BPU kernel: [ 29.454905] x3 : ffffffc0f7ff7e10 x2 : 0000000000000000
Oct 21 01:53:09 BPU kernel: [ 29.454908] x1 : ffffff8008123a70 x0 : 0000000000000000
Oct 21 01:53:09 BPU kernel: [ 29.454909]
Oct 21 01:53:09 BPU kernel: [ 29.454912] Process irq/66-7000d400 (pid: 1352, stack limit = 0xffffffc0f7ff4000)
Oct 21 01:53:09 BPU kernel: [ 29.454914] Call trace:
Oct 21 01:53:09 BPU kernel: [ 29.454918] [] kthread_data+0x24/0x30
Oct 21 01:53:09 BPU kernel: [ 29.454923] [] task_work_run+0xbc/0xd8
Oct 21 01:53:09 BPU kernel: [ 29.454928] [] do_exit+0x2c4/0xa08
Oct 21 01:53:09 BPU kernel: [ 29.454934] [] bug_handler.part.2+0x0/0x88
Oct 21 01:53:09 BPU kernel: [ 29.454940] [] __do_kernel_fault.isra.1+0x144/0x218
Oct 21 01:53:09 BPU kernel: [ 29.454943] [] do_page_fault+0x60/0x518
Oct 21 01:53:09 BPU kernel: [ 29.454947] [] do_translation_fault+0x6c/0x80
Oct 21 01:53:09 BPU kernel: [ 29.454949] [] do_mem_abort+0x54/0xb0
Oct 21 01:53:09 BPU kernel: [ 29.454952] [] el1_da+0x24/0xbc
Oct 21 01:53:09 BPU kernel: [ 29.454958] [] handle_cpu_based_xfer+0x78/0x240
Oct 21 01:53:09 BPU kernel: [ 29.454961] [] tegra_spi_isr_thread+0x3c/0x48
Oct 21 01:53:09 BPU kernel: [ 29.454964] [] irq_thread_fn+0x30/0x80
Oct 21 01:53:09 BPU kernel: [ 29.454967] [] irq_thread+0x11c/0x1a8
Oct 21 01:53:09 BPU kernel: [ 29.454970] [] kthread+0xec/0xf0
Oct 21 01:53:09 BPU kernel: [ 29.454973] [] ret_from_fork+0x10/0x30
Oct 21 01:53:09 BPU kernel: [ 29.454976] ---[ end trace da5c6fc0d70bffba ]---
Oct 21 01:53:09 BPU kernel: [ 29.462590] Fixing recursive fault but reboot is needed!
Oct 21 01:53:19 BPU kernel: [ 39.282547] spi-tegra114 7000d400.spi: spi transfer timeout
Oct 21 01:53:19 BPU kernel: [ 39.288312] spi-tegra114 7000d400.spi: SPI_ERR: CMD_0: 0x43e01027, FIFO_STS: 0x02c00004
Oct 21 01:53:19 BPU kernel: [ 39.296350] spi-tegra114 7000d400.spi: SPI_ERR: DMA_CTL: 0x00000000, TRANS_STS: 0x40ff0014
Oct 21 01:53:19 BPU kernel: [ 39.304828] spi_master spi0: failed to transfer one message from queue
Oct 21 01:53:23 BPU kernel: [ 43.142486] usb 1-2.2: usb_suspend_both: status 0
Oct 21 01:53:23 BPU kernel: [ 43.147305] usb 1-2.4: usb_suspend_both: status 0

Could you give more detail on how to reproduce it?

Sorry about that, here’s a detailed explanation of what happened.

System:

  • JetPack 4.4 production version
  • B02 version dev board
  • Enabled SPI using jetson-io
  • Using python-spidev to communicate with an SPI sensor

I have a simple Python script that sends a command to the SPI sensor, waits 1 s, and then tries to read data back from it. The code works, in that I am able to read data from the sensor, but it seems to stop working after 2 hours and I get this error:

2020-10-15T03:09:07.265225293Z presien.ssb.ssb_command:send_receive_cmd - WARNING - SPI error: [Errno 5] Input/output error

I then put a try/except around the spidev functions, so that when I catch an input/output error I close the SPI connection and try to recreate it. Whenever I do so, I get the recursive fault error in syslog and the Nano reboots (it is not a safe reboot; it looks as if someone unplugged the power and plugged it back in).

That’s pretty much it, not exactly sure what is going on.

Could you try the spidev_test?

I did, and the test works. I even modified it by putting a while loop around the transfer() function to see whether SPI fails when it is continuously sending data from MOSI to MISO.
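For anyone who wants to run a similar soak test from Python rather than from spidev_test, here is a minimal sketch. It assumes MOSI is wired to MISO (loopback) and that the SPI object exposes py-spidev's xfer() method; the function name loopback_soak and its parameters are made up for illustration and are not part of any library.

```python
import time


def loopback_soak(spi, payload, iterations, delay_s=0.0):
    """Send `payload` repeatedly and count transfers whose echo differs.

    `spi` is any object with an xfer(list) method that returns the bytes
    received during the transfer, e.g. an opened spidev.SpiDev with MOSI
    wired to MISO. This helper is hypothetical, not part of py-spidev.
    """
    mismatches = 0
    for _ in range(iterations):
        # Pass a copy so the caller's payload is never altered,
        # whatever the binding does with its argument.
        received = spi.xfer(list(payload))
        if list(received) != list(payload):
            mismatches += 1
        if delay_s:
            time.sleep(delay_s)
    return mismatches


# On real hardware (bus 0 / device 0 are assumptions):
#   import spidev
#   spi = spidev.SpiDev()
#   spi.open(0, 0)
#   spi.max_speed_hz = 250000
#   print(loopback_soak(spi, [0xDE, 0xAD, 0xBE, 0xEF], 100000))
```

A non-zero return value, or an exception raised from xfer(), would point at the bus or driver rather than at the sensor protocol.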

I think you may need to narrow down why the Python code causes this problem.

Here’s the Python code that sends data to and receives data from a separate microcontroller. It runs on the Jetson Nano B02 dev kit with Python 3 and py-spidev v3.5, at an SPI clock speed of 250 kHz. The microcontroller, acting as the slave, requires the master to send it a command first, before data can be read from it.

    with self.spi_lock:
        try:
            msg = self.generate_msg(frame_type, func_master, data)
            self.spi.writebytes(msg)
            sleep(self.cmd_settings['cmd_res_delay'])
            ret_msg = self.spi.readbytes(20)
            result, data = self.check_return_data(frame_type, func_slave, ret_msg)
            if result:
                return data

            return None
        except Exception as e:
            log.warning(f"SPI error: {e}")
            self.close_device()
            self.setup_spi()
            return None

As you can see, I’m just sending and reading data. After running for a while, the SPI function throws an input/output error, and once I catch this error I try to close the SPI device and recreate it.

SPI error: [Errno 5] Input/output error

When I close the device using self.close_device() (basically just self.spi.close()), the kernel appears to just cut power to the Nano and reboot; this is logged in the kernel logs (refer to my initial post for the full kernel logs).

Oct 21 01:53:09 BPU kernel: [ 29.462590] Fixing recursive fault but reboot is needed!
Oct 21 01:53:19 BPU kernel: [ 39.282547] spi-tegra114 7000d400.spi: spi transfer timeout
Oct 21 01:53:19 BPU kernel: [ 39.288312] spi-tegra114 7000d400.spi: SPI_ERR: CMD_0: 0x43e01027, FIFO_STS: 0x02c00004
Oct 21 01:53:19 BPU kernel: [ 39.296350] spi-tegra114 7000d400.spi: SPI_ERR: DMA_CTL: 0x00000000, TRANS_STS: 0x40ff0014
Oct 21 01:53:19 BPU kernel: [ 39.304828] spi_master spi0: failed to transfer one message from queue

It is hard for me to guide you through reproducing the error, as I am communicating with a custom microcontroller. However, it seems to me that this is not a Python spidev issue and more of a kernel issue.

I’ve been scratching my head over this issue and I’m stuck, as I don’t have much knowledge of the Linux kernel.

@ShaneCCC ^

I have no idea for now. I will check whether the NV developers have any ideas about it.

After some debugging and reading through the spi_tegra.c kernel module and the py-spidev source code, I found that py-spidev implements writebytes()/readbytes() differently from xfer(): writebytes() and readbytes() go through the Linux “write” and “read” calls, while xfer() goes through “ioctl”.

Both approaches work fine at first. However, after running the writebytes()/readbytes() version for {x} hours, the transfer fails, an error appears in syslog, and the board is forced to reboot. Changing the code to use the “xfer” function solves this issue.
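In case it helps someone else, here is a rough sketch of what the command/response exchange looks like when both phases go through xfer(). It is not my exact code: the function name send_receive_xfer, its parameters, and the 0x00 filler byte are assumptions for illustration, and the injectable sleep argument exists only so the function can be exercised without waiting.

```python
import time


def send_receive_xfer(spi, msg, response_len=20, delay_s=1.0, sleep=time.sleep):
    """Send `msg`, wait, then clock in `response_len` bytes, all via xfer().

    `spi` is any object with py-spidev's xfer(list) method. This helper is
    hypothetical and only illustrates replacing writebytes()/readbytes().
    """
    # Write phase: the bytes received while sending the command are ignored.
    spi.xfer(list(msg))

    # Give the slave time to prepare its response.
    sleep(delay_s)

    # Read phase: SPI is full duplex, so reading means clocking out filler.
    # Using 0x00 as filler is an assumption; the slave must tolerate it.
    return list(spi.xfer([0x00] * response_len))


# On real hardware (bus 0 / device 0 are assumptions):
#   import spidev
#   spi = spidev.SpiDev()
#   spi.open(0, 0)
#   spi.max_speed_hz = 250000
#   reply = send_receive_xfer(spi, [0x01, 0x02], response_len=20)
```

The returned list can then be validated the same way as the data from readbytes(20) was before.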

I still don’t know the root cause of it, but it’s something to take note of.