SPI DMA transfer errors

I’m trying to set up the TX1 as an SPI slave device to read more or less continuously using the spi-tegra124-slave.c driver. This is a read only case for the TX1.

Aside from truncated transfers which result in error -EIO, which I’m calling “dropped transfers”, this seems to work well at nearly any clock rate and dma transfer size, so long as the master is idle for a millisecond or so on either side of each dma buffer sized transfer.

In the dropped transfer case, the tegra_spi_rx_dma_complete() callback never gets called by the dmaengine, but tegra_spi_isr_spi() does get called to handle the dma complete interrupt. This results in tegra_spi_handle_message() timing out while determining wait_status, resetting the spi controller, and returning -EIO.

If the master’s idle period is too short, these controller resets tend to result in more dropped transfers which cascade continually. Even worse, it sometimes results in bit shifted data which is effectively corrupt.
The rate of dropped transfers seems to decrease when the master’s idle period is increased, but even at a 25% data transfer duty cycle, they are still present, so that isn’t a reliable solution. Especially since we’re trying to maximize overall throughput.

My goal is to elliminate these dropped transfers entirely.

It would also be useful to know what’s causing them, and would be especially useful to be able to reliably guarantee recovery after a dropped transfer.

Can anyone shed some light on what’s going on here?

Hi
You can reference to below topic for your use case.

https://devtalk.nvidia.com/default/topic/982253/

Hi, thanks for the reply.

I’ve run across this thread in previous searches, but I’ve just scoured it again thouroughly to see if I could learn anything new from it.

I’ve reverted to an unmodified spi-tegra124-slave module and added the change recommended in your post on 01/18/2017 to get rid of the following error:

[16279.770648] spi-tegra124-slave 7000d600.spi: Pin group name for clock line is not defined.
[16279.779559] spi-tegra124-slave: probe of 7000d600.spi failed with error -22

My device tree looks like this:

spi1: spi@7000d600 {
		compatible = "nvidia,tegra124-spi-slave";
		reg = <0x0 0x7000d600 0x0 0x200>;
		interrupts = <0 82 0x04>;
		nvidia,dma-request-selector = <&apbdma 16>;
		iommus = <&smmu 14>;
		#address-cells = <1>;
		#size-cells = <0>;
		dmas = <&apbdma 16>, <&apbdma 16>;
		dma-names = "rx", "tx";
		nvidia,clk-parents = "pll_p", "clk_m";
		nvidia,rx-trigger-words = <8>;
		spi-max-frequency = <45000000>;
		status = "okay";

		prod-settings {
			#prod-cells = <3>;

			prod {
				prod = <0x4 0x0 0x16>;
			};

			prod_c_flash {
				status = "disabled";
				prod = <0x4 0xffffffc0 0x6>;
			};

			prod_c_loop {
				status = "disabled";
				prod = <0x4 0xfffff000 0x44b>;
			};
		};

		spi1_0 {
			compatible = "spisb";
			#address-cells = <1>;
			#size-cells = <0>;
			reg = <0>;
			spi-max-frequency = <45000000>;
		};
	};

In the second post from notthetup, he describes a problem that is nearly identical to mine. The rx-dma-err occurs frequently but not on every other transfer.

[  231.630569] spi-tegra124-slave 7000d600.spi: Rx len:2048 bpw:16 10000us 12000000Hz
[  231.630730] spi-tegra124-slave 7000d600.spi: The def 0x13f08000 and written 0x3f0902f
[  231.630906] spi-tegra124-slave 7000d600.spi: Starting rx dma for len:2048
[  231.631160] spi_master spi1: Before DMA EN
[  231.631359] spi_master spi1:   CMD[03f0902f]:  Sl M0 CS0 [HHHH] LSB MSb Rx  Pa 16b TRANS[00ff0400]:BSY I:255 B:1024
                FIFO[00400005]:RxF:0 TxE:64 Err[] RxSTA[E] TxSTA[E]DMA[00110000]:    RxTr:2 TxTr:2 B:1023
[  231.632766] spi_master spi1: @isr
[  231.632985] spi_master spi1:   CMD[03f0902f]:  Sl M0 CS0 [HHHH] LSB MSb Rx  Pa 16b TRANS[40ff0264]:RDY I:255 B:612
                FIFO[01400004]:RxF:2 TxE:64 Err[] RxSTA[] TxSTA[E]DMA[00110000]:    RxTr:2 TxTr:2 B:1023
[  231.633313] spi_master spi1: transferred[612] != requested[1024]
[  231.640306] spi_master spi1:   CMD[03f0902f]:  Sl M0 CS0 [HHHH] LSB MSb Rx  Pa 16b TRANS[00ff0264]:BSY I:255 B:612
                FIFO[01400004]:RxF:2 TxE:64 Err[] RxSTA[] TxSTA[E]DMA[00110000]:    RxTr:2 TxTr:2 B:1023
[  231.661299] spi_master spi1: rx-dma-err [status:01400004]
[  231.667122] spi_master spi1:   CMD[03f0902f]:  Sl M0 CS0 [HHHH] LSB MSb Rx  Pa 16b TRANS[00ff0264]:BSY I:255 B:612
                FIFO[01400004]:RxF:2 TxE:64 Err[] RxSTA[] TxSTA[E]DMA[00110000]:    RxTr:2 TxTr:2 B:1023
[  231.687115] spi_master spi1: after controller reset
[  231.687199] spi_master spi1:   CMD[13f08000]:  Sl M1 CS0 [HHHH] LSB MSb   Un 1b TRANS[00ff0000]:BSY I:255 B:0
                FIFO[00400005]:RxF:0 TxE:64 Err[] RxSTA[E] TxSTA[E]DMA[00110000]:    RxTr:2 TxTr:2 B:0
[  231.687364] spi_master spi1: failed to transfer one message from queue

If I add a few prints to the source code and enable tspi->variable_length_transfer like this:

--- a/drivers/spi/spi-tegra124-slave.c
+++ b/drivers/spi/spi-tegra124-slave.c
@@ -1209,7 +1209,7 @@ static int tegra_spi_start_transfer_one(struct spi_device *spi,
 			command1 |= SPI_CONTROL_MODE_3;
 
 		tegra_spi_writel(tspi, command1, SPI_COMMAND1);
-		tspi->variable_length_transfer = false;
+		tspi->variable_length_transfer = true;
 		if (cdata && cdata->variable_length_transfer)
 			tspi->variable_length_transfer = true;
 	} else {
@@ -1393,6 +1393,7 @@ static int tegra_spi_wait_on_message_xfer(struct tegra_spi_data *tspi)
 	/* wait for spi isr */
 	ret = wait_for_completion_interruptible_timeout(
 				&tspi->xfer_completion, timeout);
+  dev_dbg(tspi->dev, "ret = %d after wait_for_completion_interruptible_timeout spi isr\n", ret);
 	if (ret <= 0) {
 		/* interrupted-wait/late interrupt is considered spurious */
 		atomic_set(&tspi->isr_expected, 0);
@@ -1556,11 +1557,13 @@ static int tegra_spi_transfer_one_message(struct spi_master *master,
 		is_first_msg = false;
 		/* wait for spi-interrupt and handle transfer error */
 		ret = tegra_spi_wait_on_message_xfer(tspi);
+    dev_dbg(tspi->dev, "ret = %d after tegra_spi_wait_on_message_xfer\n", ret);
 		if (ret)
 			goto exit;
 		tegra_clk_pin_control(false, tspi);
 		/* unpack and copy data to client */
 		ret = tegra_spi_handle_message(tspi, xfer);
+    dev_dbg(tspi->dev, "ret = %d after tegra_spi_handle_message\n", ret);
 		if (ret < 0)
 			goto exit;
 
@@ -1759,7 +1762,8 @@ static int tegra_spi_probe(struct platform_device *pdev)
 	tspi->dma_req_sel = pdata->dma_req_sel;
 	tspi->clock_always_on = pdata->is_clkon_always;
 	tspi->rx_trig_words = pdata->rx_trig_words;
-	tspi->clk_pin = (char *)pdata->clk_pin;
+	tspi->clk_pin = NULL;
+  /*(char *)pdata->clk_pin;
 	if (tspi->clk_pin) {
 		tspi->clk_pin_state_enabled = true;
 		tspi->pctl_dev = pinctrl_get_dev_from_of_compatible
@@ -1781,6 +1785,7 @@ static int tegra_spi_probe(struct platform_device *pdev)
 		dev_err(&pdev->dev, "Pin group name for clock line is not defined.\n");
 		goto exit_free_master;
 	}
+  */
 
 	tspi->gpio_slave_ready = pdata->gpio_slave_ready;

I get this:

[ 2077.048954] spi-tegra124-slave 7000d600.spi: Rx len:2048 bpw:16 10000us 12000000Hz
[ 2077.048985] spi-tegra124-slave 7000d600.spi: The def 0x13f08000 and written 0x3f0902f
[ 2077.049014] spi-tegra124-slave 7000d600.spi: Starting rx dma for len:2048
[ 2077.049087] spi_master spi1: Before DMA EN
[ 2077.049107] spi_master spi1:   CMD[03f0902f]:  Sl M0 CS0 [HHHH] LSB MSb Rx  Pa 16b TRANS[00ff0400]:BSY I:255 B:1024
                FIFO[00400005]:RxF:0 TxE:64 Err[] RxSTA[E] TxSTA[E]DMA[00110000]:    RxTr:2 TxTr:2 B:1023
[ 2077.049846] spi_master spi1: @isr
[ 2077.049874] spi_master spi1:   CMD[03f0902f]:  Sl M0 CS0 [HHHH] LSB MSb Rx  Pa 16b TRANS[40ff0311]:RDY I:255 B:785
                FIFO[00c00004]:RxF:1 TxE:64 Err[] RxSTA[] TxSTA[E]DMA[00110000]:    RxTr:2 TxTr:2 B:1023
[ 2077.049939] spi-tegra124-slave 7000d600.spi: ret = 1000 after wait_for_completion_interruptible_timeout spi isr
[ 2077.049952] spi-tegra124-slave 7000d600.spi: ret = 0 after tegra_spi_wait_on_message_xfer
[ 2078.048484] spi_master spi1: rx-dma-timeout [wait:0]
[ 2078.048963] spi_master spi1:   CMD[03f0902f]:  Sl M0 CS0 [HHHH] LSB MSb Rx  Pa 16b TRANS[00ff0311]:BSY I:255 B:785
                FIFO[00c00004]:RxF:1 TxE:64 Err[] RxSTA[] TxSTA[E]DMA[00110000]:    RxTr:2 TxTr:2 B:1023
[ 2078.049320] spi_master spi1: after controller reset
[ 2078.049679] spi_master spi1:   CMD[13f08000]:  Sl M1 CS0 [HHHH] LSB MSb   Un 1b TRANS[00ff0000]:BSY I:255 B:0
                FIFO[00400005]:RxF:0 TxE:64 Err[] RxSTA[E] TxSTA[E]DMA[00110000]:    RxTr:2 TxTr:2 B:0
[ 2078.049904] spi-tegra124-slave 7000d600.spi: ret = -5 after tegra_spi_handle_message
[ 2078.050336] spi_master spi1: failed to transfer one message from queue

Which I think clearly illustrates exactly what’s going on. This is what’s baffling me, and I was unable to find a solution in the link you posted above.

Hi
Did you get above failed when try the loopback test?

I haven’t run the loopback test exactly as described here https://devtalk.nvidia.com/default/topic/982253/
I could probably set that up if it’s important.

If you’re just looking for proof that it’s working,
much of the time, the transfers are successful and look like this:

[  164.912998] spi-tegra124-slave 7000d600.spi: Rx len:2048 bpw:16 10000us 12000000Hz
[  164.913027] spi-tegra124-slave 7000d600.spi: The def 0x13f08000 and written 0x3f0902f
[  164.913048] spi-tegra124-slave 7000d600.spi: Starting rx dma for len:2048
[  164.913091] spi_master spi1: Before DMA EN
[  164.913105] spi_master spi1:   CMD[03f0902f]:  Sl M0 CS0 [HHHH] LSB MSb Rx  Pa 16b TRANS[00ff0400]:BSY I:255 B:1024
                FIFO[00400005]:RxF:0 TxE:64 Err[] RxSTA[E] TxSTA[E]DMA[00110000]:    RxTr:2 TxTr:2 B:1023
[  164.916830] spi_master spi1: @isr
[  164.916845] spi_master spi1:   CMD[03f0902f]:  Sl M0 CS0 [HHHH] LSB MSb Rx  Pa 16b TRANS[40ff0400]:RDY I:255 B:1024
                FIFO[00400005]:RxF:0 TxE:64 Err[] RxSTA[E] TxSTA[E]DMA[00110000]:    RxTr:2 TxTr:2 B:1023
[  164.916906] spi_master spi1: rx-dma-complete
[  164.916920] spi_master spi1:   CMD[03f0902f]:  Sl M0 CS0 [HHHH] LSB MSb Rx  Pa 16b TRANS[00ff0400]:BSY I:255 B:1024
                FIFO[00400005]:RxF:0 TxE:64 Err[] RxSTA[E] TxSTA[E]DMA[00110000]:    RxTr:2 TxTr:2 B:1023
[  164.916954] spi-tegra124-slave 7000d600.spi: ret = 1000 after wait_for_completion_interruptible_timeout spi isr
[  164.916961] spi-tegra124-slave 7000d600.spi: ret = 0 after tegra_spi_wait_on_message_xfer
[  164.916991] spi-tegra124-slave 7000d600.spi: ret = 0 after tegra_spi_handle_message
[  164.916997] spi-tegra124-slave 7000d600.spi: Profile: size=2048B time-from-spi-int=130401ns

The test setup is another microcontroller plugged in as master @ 12MHz with the chip select toggling off and on everly 16 bits. The transfer is continuous with 3.5ms gaps between each set of 2048 bytes. During the gaps the clock stops and the chip select is deasserted.

Hi Wander
Do you find the solution for your use case?

Unfortunately no, not completely. It has improved a little.

Currently 99.8% of transfers succeed when the master leaves a gap > 1ms between transfers @ 20MHz
(You might notice that that’s a terribly inefficient use of the bandwidth that should be available)

My working theory is that the transfers that fail are all due to the slave (Jetson TX1) not being ready to accept a transfer when the transfer begins. The less time spent in computation between receiving a DMA transfer and setting up the next one, the more likely it will succeed, it seems.

One of the most helpful changes to the spi-tegra124-slave driver that I’ve made was to minimize the timeout for the calls to wait_for_completion_interruptible_timeout(). Be default the timeout is 1000us, and has no effect unless a transfer fails. Minimizing it prevents cascading failures after the first one by allowing it to get right back to the next transfer instead of waiting for a completion that never gets set.

I’ve been busy with other parts of the system so I haven’t had much time to work on the SPI bus, but I would like to get it to work at full speed without requiring gaps between transfers.

The only thing I know to try is to make use of the Linux SPI transfer queue to further minimize processing between DMA transfers. So far I’ve been unable to figure out how to make use of the queue with the spi-tegra124-slave driver.

Hi Wander
Have you try to boost the system by run the jetson_clock.sh to try?

I have not. I was just trying it out. It looks like it disables all throttling and sets all clock speeds to maximum, so it could help.

It runs on the TX1 devkit board, but we’re using the J120 carrier board from Auvidea, and it won’t run there because it can’t access the fan speed. I know the fan is wired a little differently on the J120, but I’m not sure exactly how.

In order to test it, I’ll either have to modify the jetson_clock.sh script to access the fan speed differently, or modify the hardware and software to plug into the devkit board and test with a different SPI bus.

Hi Wander
Did you modify the Fan control? Just want to confirm boost the TX1 to check if the problem can be fixed.

I’m afraid I’ll need either help or more time to make that happen.

I think the issue is in the device tree because:
Using the same kernel and boot configuration,
when the Jetson is booted on the devkit board
/sys/kernel/debug/tegra_fan/target_pwm
exists,
when the Jetson is booted on the J120
it does not exist.

I think you can just comment the fan control to try.

Hi All,
Please take time to help to verify the patches from below link.