How to send a big array (>256 bytes) over SPI in DMA mode in spe-fw?


Hi, @ShaneCCC , @jachen , @kayccc
I want to send a large array in a single transfer, based on the official SPI2 demo in spe-fw. However, if the array is larger than 256 bytes, the SPE crashes. I think this may be related to the SPI FIFO size: according to the TRM, the SPI FIFO holds 256 bytes of data in packed mode with 8 bits_per_word. My code in l4t-rt/rt-aux-cpu-demo/app/spi-app.c is:

...........................

char data_to_send[1024];
char data_to_read[1024];

static portTASK_FUNCTION(spi_test_task, pvParameters)
{
    data_to_send[0]=0x01;
    data_to_send[1]=0x23;
    data_to_send[2]=0x45;
	int ret, count;
	struct tegra_spi_xfer xfer = {
		.flags = BIT(TEGRA_SPI_XFER_FIRST_MSG) |
			 BIT(TEGRA_SPI_XFER_LAST_MSG),
		.tx_buf = data_to_send,
		.rx_buf = data_to_read,
		.len = ARRAY_SIZE(data_to_read),
		.chip_select = 0,
		.tx_nbits = TEGRA_SPI_NBITS_SINGLE,
		.rx_nbits = TEGRA_SPI_NBITS_SINGLE,
		.bits_per_word = 8,
		.mode = TEGRA_SPI_MODE_1 | TEGRA_SPI_LSBYTE_FIRST,
	};

....................

}

According to the function tegra_spi_calculate_curr_xfer_param(tspi, xfer) in l4t-rt/freertos-common/code-common/spi-tegra.c, DMA mode is used when more than 64 FIFO words are sent. But DMA mode makes spe-fw crash. There must be an error in this function, but I don’t know how to debug it. I need your help.

If I set .spi_no_dma = true, PIO mode is used all the time and everything works.

Hello, Xu_Xu:
So you mean the SPE firmware crashes ONLY when the array is bigger than 256 bytes and DMA is enabled. In all other cases, such as array size <= 256 with DMA enabled, or array size > 256 without DMA, the SPE firmware works well and SPI can TX/RX data correctly, right?

When SPE firmware crashes, do you get any log from UART?

br
Chenjian

Hi, @jachen

Yes, only PIO mode works, regardless of the array size. According to tegra_spi_start_transfer_one(), if the array size is <= 256 bytes (that is, it does not exceed SPI_FIFO_DEPTH (64 words) in packed mode), PIO mode (CPU mode) is used whether DMA is enabled or not:

tegra_spi_start_transfer_one()
{
....................................
	if (total_fifo_words > SPI_FIFO_DEPTH) {
		ret = tegra_spi_start_dma_based_transfer(tspi, xfer);
	} else {
		ret = tegra_spi_start_cpu_based_transfer(tspi, xfer);
	}
}

If I send an array of <= 256 bytes with DMA enabled (.spi_no_dma = false), the SPE works without any errors or warnings, because it actually uses PIO mode. I think this also means that the GPC-DMA is initialized successfully. From my point of view, the function tegra_spi_start_dma_based_transfer(tspi, xfer) may have a bug, but I don’t know how to debug it properly. This problem really bothers me. I need your help.

In addition, if I set SPI_FIFO_DEPTH=32 and send a 256-byte array, I hit the same error.
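
The threshold behaviour described above can be reproduced with a little arithmetic. This is only a sketch: it assumes that packed mode with 8 bits_per_word puts four bytes into each 32-bit FIFO entry (which is what the 256-byte figure implies), and that .spi_no_dma simply forces the PIO path. fifo_words_for and uses_dma are hypothetical helpers, not functions from spi-tegra.c.

```c
#include <stdbool.h>
#include <stddef.h>

/* FIFO depth in 32-bit entries, as quoted in this thread. */
#define SPI_FIFO_DEPTH 64u

/* Hypothetical helper: FIFO entries needed for len_bytes, assuming
 * 8 bits_per_word in packed mode (4 bytes per 32-bit entry). */
static size_t fifo_words_for(size_t len_bytes)
{
	return (len_bytes + 3u) / 4u;
}

/* Hypothetical helper mirroring the quoted branch: DMA is picked only
 * when the transfer does not fit in the FIFO and DMA is not disabled. */
static bool uses_dma(size_t len_bytes, size_t fifo_depth, bool spi_no_dma)
{
	if (spi_no_dma)
		return false;	/* assumption: the flag forces PIO */
	return fifo_words_for(len_bytes) > fifo_depth;
}
```

Under these assumptions a 256-byte transfer needs exactly 64 entries and stays on the PIO path, a 512-byte transfer needs 128 entries and takes the DMA path, and lowering the depth to 32 pushes a 256-byte transfer onto the DMA path as well, which matches the cases reported here.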

No, I can’t get any log from UART, because this error causes the flashing process to hang. It happens when I re-flash the modified spe-fw partition with firmware that sends a bigger array (> 256 bytes) with DMA enabled (.spi_no_dma = false):

sudo ./flash.sh -r -k spe-fw jetson-xavier mmcblk0p1

[   7.9249 ] Sending bootloader and pre-requisite binaries
[   7.9275 ] tegrarcm_v2 --download blob blob.bin
[   7.9291 ] Applet version 01.00.0000
[   8.0074 ] Sending blob
[   8.0079 ] [................................................] 100%
[   8.8022 ] 
[   8.8066 ] tegrarcm_v2 --boot recovery
[   8.8084 ] Applet version 01.00.0000
[   8.8906 ] 
[   9.8957 ] tegrarcm_v2 --isapplet
[   9.8982 ] USB communication failed.Check if device is in recovery
[  10.0009 ] 
[  10.0070 ] tegrarcm_v2 --ismb2

I confirmed that the system had entered recovery mode before flashing by running the lsusb command:

Bus 002 Device 016: ID 0955:7019 NVidia Corp.

In addition, I ran this test in SPI loopback mode.

Hello, Xu_Xu:
Here are some tips for SPE debug.

  1. With issues like the one you hit (an SPI API call causing the flash to hang), don’t let the SPI routine run automatically.
  2. Trigger the SPI task from the host side instead, so that after the SPE FW starts it just sits idle.
    Take a look at “./rt-aux-cpu-demo/app/ivc-echo-task.c” and search for ivc_echo_task_process_ivc_messages. The SPI routine can be triggered by a message from the host side.
  3. Add some print code to the SPE FW and make sure the new FW works.
  4. With the above changes, the new SPE FW should flash without problems, and the SPE just sits in its idle loop.
  5. After the device is up with the new firmware, trigger the SPI routine and check whether any error logs appear on the console.
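
A minimal sketch of tips 1 and 2, written as plain C so it can be compiled off-target. The names spi_enable_requested, on_host_message and spi_task_step are hypothetical, and run_spi_transfer only stands in for the real tegra_spi_transfer() call.

```c
#include <stdbool.h>

/* Set by the host-message handler, read by the SPI test task. */
static volatile bool spi_enable_requested = false;

/* Counts started transfers; stands in for the real SPI call. */
static int transfers_started = 0;

static int run_spi_transfer(void)
{
	transfers_started++;
	return 0;
}

/* Hypothetical hook: call this from the IVC message handler. */
static void on_host_message(void)
{
	spi_enable_requested = true;
}

/* One iteration of the SPI task loop: do nothing until the host asks. */
static int spi_task_step(void)
{
	if (!spi_enable_requested)
		return 0;	/* no SPI activity at boot */
	return run_spi_transfer();
}
```

With this shape the firmware touches SPI only after the host has sent a message, so a faulty transfer path can no longer run during boot or flashing.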

Let me know if you make any progress.

br
ChenJian

Thank you. Your suggestion is very useful. I will give you feedback after testing.

Hi, @jachen
Following your suggestions, I get the following error log from the console:

one dma fifo size:128
--------------------------------------------------------------------------------
Exception: Data abort
DFAR: 0x00000000, DFSR: 0x00001c06
PC: 0x0c491b38
LR: 0x0c48fa84, SP:  0x0c49ff18, PSR: 0x6000001f
R0: 0x00000000, R1:  0x00000001, R2:  0x00000200
R3: 0x00000000, R4:  0x00000000, R5:  0x00000000
R6: 0x00000000, R7:  0x00000000, R8:  0x00000000                                
R9: 0x00000000, R10: 0x00000000, R11: 0x0c000001                                
R12: 0x0c485768                                                                 
--------------------------------------------------------------------------------

Here, one dma fifo size:128 is a log I added just before the call to tegra_spi_start_dma_based_transfer(tspi, xfer). It shows that the SPE-FW reaches this function, but a fatal Exception: Data abort occurs after entering it.

The following is my SPI transfer task function, where data_to_send[512] and data_to_read[512] are global arrays. The L4T release is r32.7.3 and the SPE-FW release is r32.6.1.
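
The DFSR value in the dump carries more detail than the exception name. The sketch below decodes it assuming the ARMv7 short-format DFSR layout (fault status in bit 10 plus bits 3:0, write-not-read in bit 11, external abort type in bit 12). That layout is an assumption for the Cortex-R5 in the SPE and should be checked against the Cortex-R5 TRM. dfsr_decode is a hypothetical helper.

```c
#include <stdbool.h>
#include <stdint.h>

struct dfsr_fields {
	uint32_t status;	/* 5-bit fault status: bit 10, then bits 3:0 */
	bool is_write;		/* bit 11: access was a write */
	bool ext;		/* bit 12: external abort type */
};

/* Split a DFSR value into its fields (assumed ARMv7 short format). */
static struct dfsr_fields dfsr_decode(uint32_t dfsr)
{
	struct dfsr_fields f;

	f.status = (((dfsr >> 10) & 0x1u) << 4) | (dfsr & 0xfu);
	f.is_write = ((dfsr >> 11) & 0x1u) != 0;
	f.ext = ((dfsr >> 12) & 0x1u) != 0;
	return f;
}
```

For 0x00001c06 this gives status 0x16 with the write bit set. In the ARMv7 fault status encoding, 0x16 is an asynchronous external abort, and for that kind of abort DFAR is not guaranteed to hold the faulting address. If the layout assumption holds, DFAR: 0x00000000 does not by itself prove a NULL pointer access, and the reported PC may be some distance after the offending store.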

char data_to_send[512];
char data_to_read[512];
extern int en_spi_flag;

static portTASK_FUNCTION(spi_test_task, pvParameters)
{
	// const uint8_t data_to_send[] = {0x01, 0x23, 0x45,0x67, 0x89, 0xab, 0xcd, 0xef};
	// uint8_t data_to_read[] = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0};
    data_to_send[0]=0x01;
    data_to_send[1]=0x23;
    data_to_send[2]=0x45;
	int ret, count;
	struct tegra_spi_xfer xfer = {
		.flags = BIT(TEGRA_SPI_XFER_FIRST_MSG) |
			 BIT(TEGRA_SPI_XFER_LAST_MSG),
		.tx_buf = data_to_send,
		.rx_buf = data_to_read,
		.len = ARRAY_SIZE(data_to_read),
		.chip_select = 0,
		.tx_nbits = TEGRA_SPI_NBITS_SINGLE,
		.rx_nbits = TEGRA_SPI_NBITS_SINGLE,
		.bits_per_word = 8,
		.mode = TEGRA_SPI_MODE_1 | TEGRA_SPI_LSBYTE_FIRST,
	};
	(void)pvParameters; /* unused */
        for (count = 0; count < SPI_TEST_RETRIES; count++) {
            memset(data_to_read,0,ARRAY_SIZE(data_to_read));
            if (en_spi_flag > 3)
            {
                ret = tegra_spi_transfer(&SPI_TEST_CONTROLLER, &xfer);
                if (ret)
                    printf("SPI TX/RX failed\r\n");
                else {
                    if (!memcmp(data_to_read, data_to_send,
                            ARRAY_SIZE(data_to_read)))
                        printf("SPI test successful\r\n");
                    else
                        printf("Received incorrect data\r\n");
                }
            }
            vTaskDelay(SPI_TEST_DELAY);
        }

	vTaskDelete(NULL);
}

Hello, Xu_Xu:
The error shows invalid memory access.
Can you run command like addr2line to check where the error happens (PC: 0x0c491b38)? It may access a NULL pointer.

br
ChenJian

After running addr2line, I get the following log:

tegra_spi_transfer
/home/***/nvidia/r3273agx/spe/l4t-rt/freertos-common/code-common/spi-tegra.c:1130

Based on this, I looked near line 1130 of spi-tegra.c:

int tegra_spi_transfer(struct tegra_spi_id *id, struct tegra_spi_xfer *xfer)
{
	/* struct tegra_spi_id assumed to be first field in tegra_spi_ctlr */
	struct tegra_spi_ctlr *tspi = (struct tegra_spi_ctlr *)id;
	int ret = 0;

	dbgprintf("SPI: tegra_spi_transfer\r\n");
	tspi->busy = true;
	if (xfer->flags & BIT(TEGRA_SPI_XFER_FIRST_MSG))
		ret = tegra_spi_start_transfer_one(tspi, xfer);
	else
		ret = tegra_spi_transfer_remain_message(tspi, xfer);
	if (ret < 0) {
		error_hook("SPI: cannot start transfer");
		goto exit;
	}
	ret = tegra_spi_wait_on_message_xfer(tspi);
	if (ret)
		goto exit;
	ret = tegra_spi_handle_message(tspi, xfer); /* line 1129 */
	if (ret) /* line 1130 */
		goto exit;
	if (tspi->cur_pos == xfer->len)
		goto exit;

.........................................
}

Does this mean that an error occurred in the function tegra_spi_handle_message(tspi, xfer)?

Hello,
You can cross check the addr2line result with ASM code to confirm that. It should be a load or store instruction.
A possible reason is that the memory is corrupted.
You can also add some print before the function is called, and confirm whether the pointer is still valid.
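
A small sketch of the pointer check suggested above. It uses a hypothetical demo_xfer struct that mirrors only the tx_buf, rx_buf and len fields seen in this thread, and demo_check_xfer is an illustrative helper, not part of spi-tegra.c.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the fields of struct tegra_spi_xfer used here. */
struct demo_xfer {
	const void *tx_buf;
	void *rx_buf;
	size_t len;
};

/* Print the pointers and report whether all of them are non-NULL. */
static bool demo_check_xfer(const char *tag, const void *ctlr,
			    const struct demo_xfer *xfer)
{
	printf("%s: ctlr=%p xfer=%p\r\n", tag, ctlr, (const void *)xfer);
	if (ctlr == NULL || xfer == NULL)
		return false;
	printf("%s: tx_buf=%p rx_buf=%p len=%lu\r\n", tag, xfer->tx_buf,
	       xfer->rx_buf, (unsigned long)xfer->len);
	return xfer->tx_buf != NULL && xfer->rx_buf != NULL;
}
```

Calling an equivalent check right before tegra_spi_handle_message(tspi, xfer) and again as its first statement shows whether the pointers are already bad on entry or change in between.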

br
ChenJian

Are you referring to memory hardware corruption?

No, not a hardware issue.
We should check from the SW side first.
Have you added any print code to check the pointer?

br
ChenJian

I added a printf before tegra_spi_handle_message():

    printf("transfer_wait_on_message,%d\r\n",ret);
	if (ret)
		goto exit;
	ret = tegra_spi_handle_message(tspi, xfer);

Then I get the following log:

--------------------------------------------------------------------------------
Exception: Data abort
DFAR: 0x00000000, DFSR: 0x00001c06
PC: 0x0c491b98
LR: 0x0c485420, SP:  0x0c4a0050, PSR: 0x6000001f
R0: 0x00000000, R1:  0x00000001, R2:  0x00000200                                
R3: 0x00000000, R4:  0x00000000, R5:  0x00000000                                
R6: 0x00000000, R7:  0x00000000, R8:  0x00000000                                
R9: 0x00000000, R10: 0x00000000, R11: 0x0c000001                                
R12: 0x0c48528c                                                                 
--------------------------------------------------------------------------------

PC: 0x0c491b98 also points to the same function, tegra_spi_handle_message(). But the “transfer_wait_on_message” message I added does not appear.

I’m sorry, I don’t know how to do this. Can you give me an example?

Hello,
Do you have an extra device for your test? If not, I can do some debugging locally. Please share the detailed test procedure.
In addition, can you upload here the spe.elf that produces exactly the following error?
Exception: Data abort
DFAR: 0x00000000, DFSR: 0x00001c06
(Since this is a public forum, please make sure that the ELF file does not contain any sensitive code, only debug code.)

Also, please share the original SPE package download link you are using.

br
ChenJian

Target device: AGX Xavier Devkit 16G.
Software: L4T r32.7.3, SPE-FW r32.6.1
Test procedure:
(1) I enabled the official IVC, CAN and SPI demos following the official tutorial. IVC, CAN and SPI (PIO mode only) work fine.
(2) Modified code in l4t-rt/rt-aux-cpu-demo/app/spi-app.c:

#define SPI_TEST_RETRIES	5000 
#define SPI_TEST_DELAY		2000 

char data_to_send[512]; // my add
char data_to_read[512]; // my add
extern int en_spi_flag; //set a flag to enable spi by the IVC

static portTASK_FUNCTION(spi_test_task, pvParameters)
{
	// const uint8_t data_to_send[] = {0x01, 0x23, 0x45,0x67, 0x89, 0xab, 0xcd, 0xef};
	// uint8_t data_to_read[] = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0};
    data_to_send[0]=0x01; // my add
    data_to_send[1]=0x23; // my add
    data_to_send[2]=0x45; // my add
	int ret, count;
	struct tegra_spi_xfer xfer = {
		.flags = BIT(TEGRA_SPI_XFER_FIRST_MSG) |
			 BIT(TEGRA_SPI_XFER_LAST_MSG),
		.tx_buf = data_to_send,
		.rx_buf = data_to_read,
		.len = ARRAY_SIZE(data_to_read),
		.chip_select = 0,
		.tx_nbits = TEGRA_SPI_NBITS_SINGLE,
		.rx_nbits = TEGRA_SPI_NBITS_SINGLE,
		.bits_per_word = 8,
		.mode = TEGRA_SPI_MODE_1 | TEGRA_SPI_LSBYTE_FIRST,
	};
	(void)pvParameters; /* unused */
        for (count = 0; count < SPI_TEST_RETRIES; count++) {
            memset(data_to_read,0,ARRAY_SIZE(data_to_read));
            if (en_spi_flag > 3) // my add
            {
                ret = tegra_spi_transfer(&SPI_TEST_CONTROLLER, &xfer);
                if (ret)
                    printf("SPI TX/RX failed\r\n");
                else {
                    if (!memcmp(data_to_read, data_to_send,
                            ARRAY_SIZE(data_to_read)))
                        printf("SPI test successful\r\n");
                    else
                        printf("Received incorrect data\r\n");
                }
            }
            vTaskDelay(SPI_TEST_DELAY);
        }

	vTaskDelete(NULL);
}

(3) Modified code in l4t-rt/rt-aux-cpu-demo/app/ivc-echo-task.c:

int en_spi_flag; //my add

static void ivc_echo_task_process_ivc_messages(struct ivc_echo_task_state *state)
{
    int ret;
    const char *rx_msg;
    bool non_contig_available;
    int count, i;

    en_spi_flag++; // my add

    for (;;)
    {
        xSemaphoreTake(state->ivc_sem, portMAX_DELAY);
        count = tegra_ivc_rx_get_contiguous_read_available(
            state->id->ivc_ch, &rx_msg, &non_contig_available);
        xSemaphoreGive(state->ivc_sem);
        printf("IVC read count: %d\r\n", count);
~~~~~~~~~~~~~~~~~~~~~~~~~
}

Apart from the above, I have not made any other changes to the code.

(4) I compile and flash the spe-fw partition by running:

make CROSS_COMPILE=arm-none-eabi- bin_t19x
sudo ./flash.sh -r -k spe-fw jetson-xavier mmcblk0p1

(5) After flashing and rebooting successfully, I run the following command three times (en_spi_flag > 3) to enable SPI:
sudo bash -c "echo 01234568 > /sys/devices/aon_echo/data_channel"

(6) The following error log appears:

--------------------------------------------------------------------------------
Exception: Data abort                                                           
DFAR: 0x00000000, DFSR: 0x00001c06                                              
PC: 0x0c491900                                                                  
LR: 0x0c48f8ac, SP:  0x0c4a5588, PSR: 0x6000001f                                
R0: 0x00000000, R1:  0x00000001, R2:  0x00000200                                
R3: 0x00000000, R4:  0x00000000, R5:  0x00000000                                
R6: 0x00000000, R7:  0x00000000, R8:  0x00000000                                
R9: 0x00000000, R10: 0x00000000, R11: 0x0c000001                                
R12: 0x0c485610                                                                 
--------------------------------------------------------------------------------

(7) I run the addr2line command:

arm-none-eabi-addr2line -e $SPE_TOP_PATH/rt-aux-cpu-demo/out/t19x/spe.elf -a -f 0x0c491900 0x0c48f8ac

0x0c491900
tegra_spi_transfer
/home/***/nvidia/r3273agx/spe/l4t-rt/freertos-common/code-common/spi-tegra.c:1127
0x0c48f8ac
tegra_spi_copy_spi_rxbuf_to_client_rxbuf
/home/***/nvidia/r3273agx/spe/l4t-rt/freertos-common/code-common/spi-tegra.c:374

Finally, my .elf file is attached:
spe.elf (33.1 MB)

I am looking forward to your test results. @jachen

Hello,
Thanks for the detailed steps. I can reproduce the issue locally. It may take time to debug. More updates later.

Please try following patch:

--- l4t-rt.orig/freertos-common/code-common/spi-tegra.c	2021-01-16 06:26:03.000000000 +0800
+++ l4t-rt/freertos-common/code-common/spi-tegra.c	2023-03-23 14:49:48.680732285 +0800
@@ -1062,7 +1062,7 @@
 static int tegra_spi_handle_message(struct tegra_spi_ctlr *tspi,
 					struct tegra_spi_xfer *xfer)
 {
-	uint8_t dma_status;
+	uint32_t dma_status;
 
 	dbgprintf("tegra_spi_handle_message\r\n");
 	if (!tspi->is_curr_dma_xfer) {

Locally, although the received data is still incorrect, it no longer crashes.
Thanks for finding that issue.
Let me know if it helps on your side.
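
The patch does not show how dma_status is used, so the following is only one plausible reading. If the variable is passed by address to a routine that stores a full 32-bit status, a uint8_t leaves three bytes of that store landing on whatever sits next to it on the stack. The sketch simulates this with a byte array instead of real stack memory so that it stays well defined. store_status32 and the frame layout are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical callee: writes a 32-bit status wherever it is pointed. */
static void store_status32(void *dst, uint32_t status)
{
	memcpy(dst, &status, sizeof(status));
}

/* Simulated stack frame: byte 0 plays the role of a uint8_t dma_status,
 * bytes 1..3 play the role of neighbouring data on the stack.
 * Returns how many neighbouring bytes the store changed. */
static int neighbours_clobbered(uint32_t status)
{
	uint8_t frame[4] = { 0x00, 0xAA, 0xAA, 0xAA };
	int changed = 0;

	store_status32(&frame[0], status);
	for (int i = 1; i < 4; i++)
		if (frame[i] != 0xAA)
			changed++;
	return changed;
}
```

With a uint32_t dma_status the same store fits inside the variable and nothing next to it is touched, which would be consistent with the crash disappearing after the patch.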

br
ChenJian