mPCIe Serial port card TX works but RX does not

With the original file I see this:
pcilib: sysfs_read_vpd: read failed: Input/output error

In the modified kernel I saw this instead:
lspci: Unable to load libkmod resources: error -12

The two are quite different. It looks like there is an I/O error in the original kernel, but you never get to see that on the modified kernel because something in the module loading fails first. The idea of modifying the kernel to get more debug output is correct, but something in the new build configuration or module content was incorrect.

Note that I am also assuming this code is not used during boot stages, so no module issue within an initrd is being considered; this is purely about the system once it has booted into the Linux kernel and is running from your actual “persistent storage” (eMMC, NVMe, SD card, and so on).

The writeb() function does not have a return value; it just writes one byte to a memory-mapped I/O address. In a driver that address is a kernel virtual address (typically obtained from ioremap() or pci_iomap()) which the MMU maps onto the device's physical register address. If you can modify the kernel to printk information before and after that point, and see if this is the specific failure point, then you are closer. You might need to add “printk.synchronous=1” to the APPEND key/value pair in “/boot/extlinux/extlinux.conf” so that printk output is not buffered and written to the logs out of order (normally it is buffered and not synchronous). You would be interested in the printk showing the address the write goes to (p + UART_EXAR_RXTRG); the value being written is hard coded, so the real question is what address this is failing to write to (if that is the failure point).
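To make the "print before and after the write" idea concrete, here is a minimal user-space sketch. It is only a mock: `fake_regs`, `DEMO_RXTRG_OFFSET`, and `traced_write8()` are hypothetical names made up for illustration, and a plain byte array stands in for the card's mapped registers. In the actual driver the same pattern would use printk() (or pr_info()) around writeb(), plus a readb() to read the register back.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the card's mapped register window. In the
 * kernel this would be the void __iomem * the driver mapped, and the
 * accesses below would be writeb()/readb(). */
static uint8_t fake_regs[256];

/* Placeholder offset for illustration only; take the real value from the
 * driver source (UART_EXAR_RXTRG). */
#define DEMO_RXTRG_OFFSET 0x0b

/* Log the offset and value before the write, then read the register back
 * and log what is actually stored there. Returns the value read back. */
static uint8_t traced_write8(uint8_t *base, unsigned int offset, uint8_t value)
{
    uint8_t readback;

    printf("before write: offset=0x%02x value=0x%02x\n",
           offset, (unsigned int)value);
    base[offset] = value;        /* kernel: writeb(value, base + offset) */
    readback = base[offset];     /* kernel: readb(base + offset) */
    printf("after write:  offset=0x%02x readback=0x%02x\n",
           offset, (unsigned int)readback);
    return readback;
}
```

Calling `traced_write8(fake_regs, DEMO_RXTRG_OFFSET, 32)` prints the two trace lines and returns the value read back. Printing the offset rather than the raw pointer is deliberate, since offsets can be compared directly between a working system and the failing one. Keep in mind that not every hardware register reads back exactly what was written, so a mismatch on real hardware is a hint rather than proof.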

Some interesting reference pages:
https://kernelnewbies.org/IoMemoryAccess
https://linux.die.net/man/1/systasks (search for writeb)

With the original file I see this:
pcilib: sysfs_read_vpd: read failed: Input/output error

VPD is an optional EEPROM that is not present on this card. This error is seen on all platforms and doesn’t affect the serial operation. The only way this is an issue is if NVIDIA added code that keys off of it for some reason.

The reads and writes are working fine. The offset in the printk refers to the offset within the first UART of the card. The issue is that the RX buffer never gets a character. When the LSR (offset 0x5) is read it reports no data in the receive FIFO (bit 0 is 0).
The trace shows that the UART is being set up correctly, as far as I can tell, and it is able to transmit to other machines. The RX does not work in loopback or when connected to another machine. I added the same trace to a working desktop system running a 6.8.0 kernel. There are some differences in the order of operations, but the end result is the same.
Our mutual customer reported successfully trying it with a 4.9 kernel on another desktop machine. The Jetson is the only system that this occurs on. We’ve been making this chip work with various hardware and OS platforms for fifteen plus years. The issue isn’t something as simple as I/O not working for the whole device.
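For anyone following along with the LSR reads in the trace, here is a small sketch of how the value can be decoded. The bit positions are the standard 16550 Line Status Register layout; the macro and function names are made up for this example and are not taken from the driver.

```c
#include <stdint.h>
#include <stdio.h>

/* Standard 16550 Line Status Register bits (LSR is at offset 0x5). */
#define DEMO_LSR_DR   0x01  /* bit 0: data ready, receive FIFO holds a character */
#define DEMO_LSR_OE   0x02  /* bit 1: overrun error */
#define DEMO_LSR_PE   0x04  /* bit 2: parity error */
#define DEMO_LSR_FE   0x08  /* bit 3: framing error */
#define DEMO_LSR_BI   0x10  /* bit 4: break interrupt */
#define DEMO_LSR_THRE 0x20  /* bit 5: transmit holding register empty */
#define DEMO_LSR_TEMT 0x40  /* bit 6: transmitter completely empty */

/* Nonzero when the receive FIFO reports at least one character. */
static int lsr_has_rx_data(uint8_t lsr)
{
    return (lsr & DEMO_LSR_DR) != 0;
}

/* Nonzero when any receive-side error or break flag is set. */
static int lsr_has_rx_error(uint8_t lsr)
{
    return (lsr & (DEMO_LSR_OE | DEMO_LSR_PE | DEMO_LSR_FE | DEMO_LSR_BI)) != 0;
}

/* Print a one-line summary of an LSR value for a trace log. */
static void lsr_print(uint8_t lsr)
{
    printf("LSR=0x%02x rx_data=%d rx_error=%d tx_idle=%d\n",
           (unsigned int)lsr, lsr_has_rx_data(lsr), lsr_has_rx_error(lsr),
           (lsr & DEMO_LSR_TEMT) != 0);
}
```

An LSR of 0x60 (THRE and TEMT set, everything else clear) means an idle transmitter with nothing received and no receive errors flagged, which is consistent with the symptom described above: bit 0 stays 0 even though TX works.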

This is something I could not answer. NVIDIA would have to reply as to whether the missing VPD matters.

Assuming VPD does not matter, then the original kernel is working and the only error is from the new kernel (at least so far as messages in logs are concerned; actual TX/RX could be a different story). Even so we come back to the only error seeming to be from libkmod. Regardless of the actual issue, that message is the only debug information we have, and I still recommend rebuilding all of the modules to work with the new kernel configuration before continuing.

Not many people have logic analyzers capable of working with PCIe, but I’d be very curious to sniff the PCIe traffic. I don’t think it is the PCIe which is the cause. The driver to the card itself is different from the PCIe logic and hardware.

Is the offset correct? PCIe seems to be valid and working correctly, and this leads back to the driver for the card itself. If you are certain that the writeb() is correctly writing to the RX buffer then it implies something in user space (not PCI) is failing, but from what you are saying it seems rather likely that user space is not the issue. Writing to the wrong offset would account for a valid PCIe link and a known working driver failing. Changing architecture from other systems could change what the required offset is. Does this driver work on other systems of this same architecture, e.g., an RPi? Or are you speaking of success on a PC architecture? I’m thinking that maybe the offset is causing the write to appear correct, but that the write is actually to some location other than the RX buffer (and to reiterate, that offset probably changes on different architectures).

We know the address being written is correct, if only because the TX works fine. We can send data using any protocol at any baud rate we’ve tested.

I have not programmed this particular device, so I cannot say for certain (especially since there is no debug output other than a libkmod error). All I can do is give some examples and possibilities. It was stated that there is an EEPROM option which is missing, and so this itself probably has no meaning. In that case then we have absolutely no error messages at all.

I’ve not used setpci (see “man setpci”), but possibly you could compare some of its output on a working system to this system. Even increasing debug verbosity could help, but this assumes it is the PCI which is failing. The actual driver for this UART is what I would be tempted to start adding printk() to. PCIe itself has a lot of error checking already built in, and the previous verbose lspci did not indicate any error, including AER.

Btw, the RX side of a UART won’t work under any of these conditions:

  • Speed setting is too far out of spec from the TX.
  • Flow control is not properly set (it is possible to tell the RX to not function until CTS/RTS is satisfied).
  • Permissions are incorrect.

It is unlikely that loopback would fail for most of that, but it is possible (not common) for RX and TX to have different settings. For example, RX could in fact be set to a different baud rate, and possibly have flow control enabled while TX flow control is not enabled. Or perhaps the CTS/RTS is simply not being correctly forwarded. All contrived possibilities, but from what we’ve seen so far I don’t think the PCIe is the point of failure.
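One way to rule those settings in or out from user space is to look at what termios reports for the port. This is only a sketch: `struct uart_settings` and `summarize_termios()` are names invented for this example, and it assumes a Linux/glibc system where CRTSCTS is available. It also checks CREAD, since a cleared CREAD flag disables the receiver entirely.

```c
#define _DEFAULT_SOURCE   /* exposes CRTSCTS on glibc */
#include <termios.h>

/* Hypothetical summary of the settings most likely to silence RX. */
struct uart_settings {
    speed_t in_speed;     /* value reported by cfgetispeed() */
    speed_t out_speed;    /* value reported by cfgetospeed() */
    int hw_flow;          /* 1 if CRTSCTS (RTS/CTS flow control) is enabled */
    int receiver_on;      /* 1 if CREAD is set, so the receiver is enabled */
};

/* Pull the RX-relevant settings out of a termios structure. */
static struct uart_settings summarize_termios(const struct termios *tio)
{
    struct uart_settings s;

    s.in_speed = cfgetispeed(tio);
    s.out_speed = cfgetospeed(tio);
    s.hw_flow = (tio->c_cflag & CRTSCTS) != 0;
    s.receiver_on = (tio->c_cflag & CREAD) != 0;
    return s;
}
```

To use it on the real port, open the device, fill a `struct termios` with tcgetattr(), and pass it in. If `hw_flow` is 1 while nothing is driving CTS, or `receiver_on` is 0, that alone could explain a silent RX. Note that the speeds are the Bxxx constants from termios.h, so compare them against those constants rather than treating them as plain numbers.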

We get back to adding printk() statements to the UART driver, especially to print out whenever CTS/RTS flow control is configured and what the current state is, along with noting that data has arrived (regardless of whether the UART itself knows this, we want to know if the driver sees that data).

The point about the IRQ not showing is interesting. However, consider that when the UART receives data it would trigger an IRQ to PCI transfer; if the UART does not actually receive data, or if the data is rejected or too far out of spec in clock speed, then there would never be an IRQ trigger. This really needs more debug data to solve, and right now all I can say is that the PCIe does not show an error. That printk() in the right place is worth its weight in gold (not sure how much a C function weighs :P ).
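A cheap way to check the IRQ side without touching the driver is to compare the card's counters in “/proc/interrupts” before and after sending data to the port. Below is a rough sketch of a helper that adds up the per-CPU counts on one line of that file; `irq_line_total()` is a name made up for this example, and finding the right line for the card is left to the reader.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Sum the per-CPU counters on one line of /proc/interrupts.
 * The line is expected to look like "label: count count ... chip name".
 * Returns -1 if the line has no "label:" prefix. */
static long long irq_line_total(const char *line)
{
    const char *p = strchr(line, ':');
    long long total = 0;

    if (!p)
        return -1;
    p++;
    for (;;) {
        char *end;
        unsigned long long n;

        while (*p == ' ' || *p == '\t')
            p++;
        if (!isdigit((unsigned char)*p))
            break;      /* reached the controller/device name columns */
        n = strtoull(p, &end, 10);
        /* A token like "2600000.foo" is a name, not a counter. */
        if (*end != ' ' && *end != '\t' && *end != '\n' && *end != '\0')
            break;
        total += (long long)n;
        p = end;
    }
    return total;
}
```

If the total for the card's line does not change while another machine is sending to the port, then no receive interrupt is reaching the CPU, which fits the case where the UART never sees (or rejects) the incoming data.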

EDIT: Does the PCIe card produce any /sys files for debugging? The manufacturer might need to be asked, or else if you know an address/ID, then there might be something you can find in “/sys”.
