TX2 SPI Half Duplex Transfers Broken

I am trying to get a simple SPI protocol working on the TX2 (running L4T 28.2.1) and it is failing each time, somewhere in software. When I monitor the signals using a scope, everything looks correct, but I never get back the correct value in my RX buffer.

In order to communicate with the device, you send two bytes of instruction and read back a single byte. As a result, I am using the SPI_IOC_MESSAGE ioctl() to accomplish this.

In order to debug further, I set up a simple program using J21 on a development kit. Namely, I am expecting to see zeros read back, since there is nothing connected. I have verified that I have SPI set up properly (I edited the device tree, built spidev.ko, etc.) via the spidev_test.c loopback test. With MISO/MOSI tied together with a jumper, I get back the test pattern. With the jumper removed, my RX buffer is filled with zeros. This appears to eliminate the device tree and hardware as a potential culprit. However, it should be noted that spidev_test.c does a single full duplex transfer, while I need to do two half duplex transfers back to back in order to communicate properly.

Instead of seeing zeros in my RX buffer for the half duplex transfers, I see a strange, unexplained value. Namely, I always have part of my RX buffer corrupted and the number of corrupt bytes is equal to the number of bytes I sent in the previous transfer. For example, if I send two TX bytes and read one back, I will always get a weird value back. If I send two TX bytes and read out four bytes, rx_buf[0] and rx_buf[1] look good, but the remaining two bytes are set the some weird value. I have also tried cases where I only send one byte and read back many (say 32) bytes. In these cases, the last byte in the RX buffer is always set to some strange value, while the first (N - 1) bytes (where N is the number of bytes I read back) are always my expected value of 0x00. Note that the weird, unexplained values do not change for the lifetime of the program (e.g., if I read out 0xC0 as rx_buf[3], rx_buf[3] will always be set to 0xC0). Also I verified that data is being copied into the RX buffer each time since I set the RX buffer to a known test pattern prior to kicking off the ioctl().

Below is the code I am using to highlight the issue. Anyone can reproduce this bug as long as they have the standard dev kit set up for SPI on J21.

#include <assert.h>
#include <fcntl.h>
#include <linux/spi/spidev.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

static int do_xfer(const int fd,
                   const uint8_t* const tx_buf, const size_t num_tx_bytes,
                   uint8_t* const rx_buf, const size_t num_rx_bytes) {
  struct spi_ioc_transfer spi_xfers[2] = {{0}};
  // Most of the fields in the transfer structs are set to zero, so we use the
  // settings for speed and bits per word that were set when opening the SPI
  // device. CS will be held low between the two transfers. This allows us to
  // communicate with the device we want to talk to, since you first write an
  // instruction to it and then read back its response. For our particular
  // device, you normally write out two bytes and read back a single byte
  // (all while CS is held low). That said, this test program supports an
  // arbitrary number of bytes in either direction to highlight the issue we
  // are experiencing, where we always see the last N bytes in the RX buffer
  // set to some strange value (where N is the number of bytes that were sent in
  // the TX buffer). If N is greater than or equal to the number of RX bytes (as
  // is the case when we attempt to talk to our particular SPI device), the
  // entire RX buffer is set to some strange value.
  spi_xfers[0].tx_buf = (unsigned long)tx_buf;
  spi_xfers[0].len = num_tx_bytes;
  spi_xfers[1].rx_buf = (unsigned long)rx_buf;
  spi_xfers[1].len = num_rx_bytes;
  return ioctl(fd, SPI_IOC_MESSAGE(2), spi_xfers);
}

int main(int argc, char* argv[]) {
  int fd = open("/dev/spidev3.0", O_RDWR);  // J21 on the dev kit
  assert(fd != -1);

  static const uint8_t kSpiMode = 0;
  assert(ioctl(fd, SPI_IOC_WR_MODE, &kSpiMode) != -1);
  static const uint8_t kBitsPerWord = 8;
  assert(ioctl(fd, SPI_IOC_WR_BITS_PER_WORD, &kBitsPerWord) != -1);
  static const uint32_t kSpiClkHz = 500000;
  assert(ioctl(fd, SPI_IOC_WR_MAX_SPEED_HZ, &kSpiClkHz) != -1);

  // No matter how many transfers we perform, the "bad bytes" remain unchanged
  // throughout the lifetime of the program. For example, if we read 0xC0
  // instead of 0x00, we will always read 0xC0. Note that we do refresh the
  // RX buffer for each transfer, so these bytes are overwriting the test
  // pattern each and every time.
  static const int kNumXfers = 10;
  static const uint8_t kTxBuf[4] = { 0xAA, 0xBB, 0xCC, 0xDD };
  uint8_t rx_buf[4];  // Will fill this in later
  static const uint8_t kRxTestPattern[4] = { 0xDE, 0xAD, 0xBE, 0xEF };
  // Edit these variables to send/receive a different amount of data. Note that
  // the issue we are seeing is that the RX buffer will have its last N bytes
  // corrupted, where N is the number of bytes we sent (i.e., kTxBytes).
  static const size_t kTxBytes = 2;
  static const size_t kRxBytes = 1;

  for (int i = 0; i < kNumXfers; ++i) {
    // Set up a test pattern for each transfer in the RX buffer, so we can
    // monitor what bytes are changing as a result of the transfer.
    memcpy(rx_buf, kRxTestPattern, sizeof(rx_buf));
    assert(do_xfer(fd, kTxBuf, kTxBytes, rx_buf, kRxBytes) == (kTxBytes + kRxBytes));
    printf("RX BUFFER: ");
    // Always print the entire buffer to see what changed
    for (int j = 0; j < (sizeof(rx_buf) / sizeof(rx_buf[0])); ++j) {
      printf("0x%02X ", rx_buf[j]);
    }
    printf("\n");
  }
  
  close(fd);
}

Example output is as follows:

Send 2 bytes, Read 1 byte (remaining three bytes are the test pattern):
RX BUFFER: 0x1F 0xAD 0xBE 0xEF 
RX BUFFER: 0x1F 0xAD 0xBE 0xEF 
RX BUFFER: 0x1F 0xAD 0xBE 0xEF 
RX BUFFER: 0x1F 0xAD 0xBE 0xEF 
RX BUFFER: 0x1F 0xAD 0xBE 0xEF 
RX BUFFER: 0x1F 0xAD 0xBE 0xEF 
RX BUFFER: 0x1F 0xAD 0xBE 0xEF 
RX BUFFER: 0x1F 0xAD 0xBE 0xEF 
RX BUFFER: 0x1F 0xAD 0xBE 0xEF 
RX BUFFER: 0x1F 0xAD 0xBE 0xEF

Send 2 bytes, Read 4 bytes (test pattern overwritten each time):
RX BUFFER: 0x00 0x00 0xC1 0xFF 
RX BUFFER: 0x00 0x00 0xC1 0xFF 
RX BUFFER: 0x00 0x00 0xC1 0xFF 
RX BUFFER: 0x00 0x00 0xC1 0xFF 
RX BUFFER: 0x00 0x00 0xC1 0xFF 
RX BUFFER: 0x00 0x00 0xC1 0xFF 
RX BUFFER: 0x00 0x00 0xC1 0xFF 
RX BUFFER: 0x00 0x00 0xC1 0xFF 
RX BUFFER: 0x00 0x00 0xC1 0xFF 
RX BUFFER: 0x00 0x00 0xC1 0xFF

What’s even weirder is that every once in a while I do see that I get the expected value of 0x00 back.

I have looked through some of the kernel code so far (spi.c and spidev.c) and don’t see anything there that would explain what I am seeing. My next step is to look at spi-tegra114.c to see if this could be a problem with the SPI controller driver. I will report back if I find anything, but I would really appreciate if someone from NVIDIA would take a look at this code to see if there’s potentially a bug there that would explain what I am seeing.

I found the issue.

http://nv-tegra.nvidia.com/gitweb/?p=linux-4.4.git;a=commitdiff;h=9db175a3bc9f1764d5b446154c6e3f61c3120feb

Notice how rx_buf is advanced even when you aren’t doing an RX operation (i.e., u_tmp->rx_buf is NULL)?

I had made the invalid assumption that the version of spidev.c I was working with was the mainline version. Instead, it appears that the file was edited and not pushed to mainline. Based on the comments from the commit, I’m not entirely sure what problem the “fix”/change appears to be solving and in fact it has injected a bug that cost me a good amount of frustration and time (notice how my posts are being made on a weekend?).

What worries me is that this code was peer reviewed by two individuals (not to mention the original person editing the code) who didn’t catch this somewhat obvious bug. Also, this shows a lack of testing, since my use case (TX followed by RX) is fairly common (so common in fact, spi.c has a spi_write_then_read() function).

Also, why was this change never pushed to mainline? Presumably if the “fix” is “fixing” something, then the problem is an issue in mainline, and thus the patch should be pushed up so that others can benefit from the fix. I have also seen that mainline has a spi-tegra114.c file (the SPI controller driver) that differs from the one provided with the TX2, so it appears that NVIDIA is not pushing updated code to mainline Linux for some reason.