mPCIe Serial port card TX works but RX does not

I am trying to get a four port mPCIe serial card to work in a Jetson TX2, but RX is not working. The card has been verified working on other machines running several kernel versions, and another card has been tried in the Jetson with the same results. The behavior is the same for loopback and connecting to another system via null modem.
The card uses a 16550 UART and the built in 8250_pci driver.

TX works, but RX never gets data.
When I add trace to the driver, I see interrupts occur on RX, but LSR never shows data ready. LSR sometimes shows Rx Break Error.
I verified the baud rate generated by the Jetson TX using an oscilloscope.
We tried “pci=nomsi” on the kernel cmdline.
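
For anyone reading the trace values alongside this description, here is a minimal sketch of how an LSR byte breaks down into flags. The bit masks are the standard 16550 ones (the same values as the UART_LSR_* constants in the Linux serial_reg.h header); the helper name `decode_lsr` and the sample values are made up for illustration and are not taken from the attached logs.

```python
# Line Status Register bit masks for a 16550-compatible UART. The values
# match the UART_LSR_* constants in the Linux header serial_reg.h.
LSR_BITS = [
    (0x01, "DR"),     # Data Ready: at least one received byte is waiting
    (0x02, "OE"),     # Overrun Error
    (0x04, "PE"),     # Parity Error
    (0x08, "FE"),     # Framing Error
    (0x10, "BI"),     # Break Interrupt: RX line held low longer than a frame
    (0x20, "THRE"),   # Transmit Holding Register Empty
    (0x40, "TEMT"),   # Transmitter Empty
    (0x80, "FIFOE"),  # At least one error is present in the RX FIFO
]


def decode_lsr(value):
    """Return the names of the LSR flags set in an 8-bit register value."""
    return [name for mask, name in LSR_BITS if value & mask]


if __name__ == "__main__":
    # Hypothetical trace values, chosen only to show the decoding.
    for lsr in (0x60, 0x61, 0x70):
        print(f"LSR=0x{lsr:02x}: {decode_lsr(lsr)}")
```

In these terms, the symptom described above is an LSR that never has DR (0x01) set after an RX interrupt and occasionally has BI (0x10) set.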

Sorry, I am not familiar with that card. Does it connect to the Jetson over PCIe, or over a UART?

Sorry. Here’s the card that is in the Jetson: mPCIe-COM-4SM Family - ACCES I/O Products

Here is the chip spec for the UART chip: https://www.diodes.com/datasheet/download/PI7C9X7958.pdf

The four ports get created and can send serial data to another machine just typing in screen, but they never receive in loopback or when connected to another machine.

– Jay

Hi,

If the card connects to the Jetson over PCIe and lspci can detect it, then there is nothing we can help with on our side. Whether that UART pin works or not depends on the vendor driver.

The devices work in any other Linux machine I put them in, and I see NVIDIA modified the 8250 driver which the card and the Jetson use when I diff the Linus 4.9 and Jetson kernel source.

I’ve tried disabling the serial console via command line, but wasn’t able to do so.

Just to be sure the PCIe is working, find the entry which applies when you run the command “lspci”. This will include an ID on the left for the slot, which will look something like “01:00.0” (the number will likely be different, but this shows the format). Then use sudo and get a verbose lspci (this command will log the result):
sudo lspci -s 01:00.0 -vvv 2>&1 | tee log_lspci.txt
(then attach the log_lspci.txt file to this forum)

Just to note, a typical issue for serial UARTs doing this is when one side uses flow control, but not the other side. Loopback rarely has that issue if the CTS and RTS wires are connected (it is possible for a serial UART’s TX and RX to have different settings, but that would usually only occur with intent). Another issue is if one UART has a different level than another (e.g., 1.8V and 3.3V), but this would not be the case for loopback. The above assumes an actual serial UART PHY pinout.

Once you go to RS-232 or RS-485, then the PHY changes. The requirements for voltages/levels are far different.

We don’t know if you are operating this as RS-232, RS-485, or if it uses headers for something like 3.3V UART PHY. We don’t know what speeds or other settings are involved, and the Jetson won’t have any control over this, but if the verbose lspci shows no error, then we can eliminate that as an issue. Considering there is probably a driver running in the kernel (on top of the PCI driver), there is also a possibility something in dmesg offers a clue (but to reiterate, this is unlikely to be anything the Jetson itself has control over).

If this is an actual serial UART PHY interface with something like 3.3V or 1.8V levels, and if it is not RS-232 or RS-485 (which have very very different voltage levels), you might be able to put an oscilloscope or logic analyzer on the CTS, RTS, TX, and RX; then see if CTS/RTS is triggering to know if flow control is running. Should CTS/RTS trigger in one direction of flow, but not the other, then it is likely a software setting; should the data on TX fail in one direction, then the issue would be something else.

log_lspci.txt (2.3 KB)

There are some inline replies below that we have worked out. Regarding which driver is in use: it is the 8250 driver located at ./drivers/tty/serial/8250/8250_pci.c in the Linux source tree. In current kernels the Pericom chip has been broken out to its own 8250_pericom.c file. Most of the interesting parts are in 8250_core and 8250_port.

In the past, when I have seen this type of user experience, it was because interrupts were not making it through the PCIe bridge, but in this case I see interrupts when I add trace to the driver. I also had to add code so the trace prints nothing when the serial console hits the same functions, so there is some overlap between the code paths used by the Pericom chip and ttyS0.

Just to note, a typical issue for serial UARTs doing this is when one side uses flow control, but not the other side.

Indeed! We’ve seen this many times, especially back when minicom defaulted to hardware flow control enabled. In this case we have carefully disabled flow control on both sides, in minicom, screen, and the other utilities we’ve tested with.
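
For readers who want to rule out flow control without relying on a terminal program's settings, the following is a minimal sketch that clears both software (XON/XOFF) and hardware (RTS/CTS) flow control on an open port using Python's standard termios module on Linux. The helper names `without_flow_control` and `disable_flow_control` are hypothetical, and nothing here is taken from the tools actually used in the tests described above.

```python
import termios

# Indexes into the attribute list returned by termios.tcgetattr().
IFLAG, CFLAG = 0, 2


def without_flow_control(attrs):
    """Return a copy of a termios attribute list with flow control cleared."""
    new = list(attrs)
    # Software flow control (XON/XOFF) lives in the input flags.
    new[IFLAG] &= ~(termios.IXON | termios.IXOFF | termios.IXANY)
    # Hardware flow control (RTS/CTS) lives in the control flags.
    new[CFLAG] &= ~termios.CRTSCTS
    return new


def disable_flow_control(fd):
    """Apply the cleared flags to an already open serial port descriptor."""
    attrs = termios.tcgetattr(fd)
    termios.tcsetattr(fd, termios.TCSANOW, without_flow_control(attrs))
```

Usage would be along the lines of opening the port with `os.open("/dev/ttyS1", os.O_RDWR | os.O_NOCTTY)` and passing the descriptor to `disable_flow_control`; the device path is only an example.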

Once you go to RS-232 or RS-485, then the PHY changes. The requirements for voltages/levels are far different.
We don’t know if you are operating this as RS-232, RS-485, or if it uses headers for something like 3.3V UART PHY.

The mPCIe-COM-4SM in question supports RS-232, RS-422, RS-485, and RS-485 4-wire. We’ve tested 232, 422, and 485. We design and manufacture dozens of similar cards from the ISA era through PC/104, PCI, PCMCIA, PCIe, mPCIe, M.2 and more, so we’re quite familiar.

We’ve tested at various baud rates from 9600 to 115200.
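
As a reference for checking these rates on a scope, here is a small sketch of the usual 16550 arithmetic: the divisor latch value is the UART input clock divided by 16 times the baud rate, and one bit on the wire lasts 1/baud seconds. The default clock of 1.8432 MHz is the classic PC UART value and is only an assumption here; it is not the clock of this card, and the function names are made up for illustration.

```python
def divisor_for(baud, uart_clk_hz=1_843_200):
    """Divisor latch value for a 16550-style UART: clk / (16 * baud).

    The default clock is the classic PC value, used only as an example.
    """
    return round(uart_clk_hz / (16 * baud))


def bit_time_us(baud):
    """Width of one bit on the wire, in microseconds."""
    return 1_000_000 / baud


def baud_error_percent(measured_bit_us, nominal_baud):
    """Relative error of a scope-measured bit width against a nominal rate."""
    measured_baud = 1_000_000 / measured_bit_us
    return 100.0 * (measured_baud - nominal_baud) / nominal_baud


if __name__ == "__main__":
    for baud in (9600, 19200, 38400, 57600, 115200):
        print(baud, divisor_for(baud), f"{bit_time_us(baud):.2f} us/bit")
```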

If this is an actual serial UART PHY interface with something like 3.3V or 1.8V levels, and if it is not RS-232 or RS-485 (which have very very different voltage levels), you might be able to put an oscilloscope or logic analyzer on the CTS, RTS, TX, and RX; then see if CTS/RTS is triggering to know if flow control is running. Should CTS/RTS trigger in one direction of flow, but not the other, then it is likely a software setting; should the data on TX fail in one direction, then the issue would be something else.

We can observe the Tx or Rx lines in 232, 422, and 485, being familiar with the signal levels and single-ended vs differential signalling in the various modes, given our decades of experience with serial peripherals.

The entirety of our dozen or more PCI Express bus-based serial designs use the Diodes (née Pericom) PCIe-to-8-port UART chip, the PI7C9X7458, or its 4-port sibling. This is a PCIe 1.0 single-lane chip that provides up to 8 serial ports at up to 12 MHz; 921600 baud trivially.

Our mutual customer that is encountering the symptom we’ve described wants to purchase quite a number of both the Jetson and our serial board; I hope we can resolve this issue in a timely manner – it has been many months already as we’ve tried various avenues.

Have you ever updated the kernel? What do you see from “uname -r”? What do you see from “head -n 1 /etc/nv_tegra_release” (incidentally, the L4T release is what actually gets flashed to a Jetson, and is merely Ubuntu with a new name when NVIDIA driver content is added)?

It is worth noting that the AER error pointer is NULL, and so if we look only at the PCIe socket and logic (including signal quality), then PCIe appears to be working as expected. This is not a PCIe issue. Read on for something which PCIe may reveal, but does not control.

The serial module itself seems to have loaded without error, but the libkmod error makes me wonder if something else is wrong with modules that don’t match that kernel. It seems the “serial” driver is loaded, and then the libkmod error shows up. This error does not necessarily cause a failure, but it does mean that if one module depends on another module, then it is possible that the dependency never loaded. If this is the case, then most likely it is from a kernel configuration error (an error in the sense that modules may be compiled to a different spec than the kernel loading them).

Related is this (maybe not important):
pcilib: sysfs_read_vpd: read failed: Input/output error
…but this might not be a problem if it is a missing optional feature rather than some failure (e.g., maybe the driver works with optional EEPROMs, and none is present). It is likely (but not guaranteed) that this can be ignored.

You might want to add the output from “lsmod”. Did you install the driver? If so, how did you compile it? I’m only looking at the PCI side, and not at the serial driver. Anything related to customizing the kernel such that old modules might not load, or similar, is of interest. Some modules may trigger loading other modules in a chain, and I don’t know if the chain has issues.

Incidentally, is the “serial” driver itself something new? The 16550A is everywhere and well supported. The UARTs in the Jetson itself can emulate any of 16450, 16550, or 16550A, and only the 16550A is recommended. I’m assuming this is the serial driver present in the kernel and not one you’ve developed.

There is some trivia you should know about the built-in UARTs of the TX2 (and every Jetson so far as I know; this might not matter at all, but if working on UARTs, then you should know this): The Jetson UART, historically, is available during boot stages. This was originally the U-Boot bootloader, and more recent L4T releases (see the above question for “head -n 1 /etc/nv_tegra_release”; R32.x is the newest for a TX2) began removing the actual U-Boot code and placing its equivalent in the NVIDIA boot software, more or less replacing U-Boot in pieces (but still “compatible” with U-Boot). There has always been a serial console log available during these boot stages via the integrated UART. The driver used in the stock U-Boot is all that has ever been used in those boot stages, and this driver treats the embedded UART as a 16550A. This driver is completely standard during boot stages.

Once boot is completed, and the Linux kernel loads, the 16550A emulation continues on the serial console (the device tree can change between 16450, 16550, and 16550A behavior, but is always 16550A for the NVIDIA content). However, once in Linux, there is now a DMA version of the driver which can be run (non-NVIDIA UARTs cannot use this DMA driver). In “/dev” you will find some “/dev/ttyS#” UARTs, and some “/dev/ttyTHS#” UART interfaces. Those which name as a ttyS# use legacy 16550A without DMA; those which use the ttyTHS# naming are the “Tegra High Speed” serial UART drivers, running on the same UART, simultaneously, but using DMA. It is not recommended to switch back and forth between them.

The reason this came about is to avoid putting the DMA version in the bootloader and to use the stock U-Boot. Then, as logging transitions from boot stage to kernel, remaining with this driver implies there is a continuity of services for serial console logging during the transition from boot to kernel without resetting the UART.

I mention all of this because I don’t know which drivers you are using, and I’m hoping that your driver is for a 16550A and not a 16550. It’s also useful to confirm the “serial” driver is not NVIDIA’s UART driver (I doubt it is, but if kernels are being reconfigured, who knows?). I don’t know if a 16550 would behave badly if one of the drivers interacts somehow with a 16550A driver. This is rather unlikely, but I thought you should keep it in mind (I still don’t know if you added the “serial” driver, or if it is part of the stock NVIDIA kernel, nor do I know if you had to rebuild the kernel with a new configuration).

Mainly, the question is why this occurs, as this might be an issue for your driver to do its job:
lspci: Unable to load libkmod resources: error -12

Writing this while away from my machine. I will be back to it in a week.

The version is 4.9. It is whatever ships with the last JetPack that supports the TX2.

The driver is built into the kernel because that is the way NVIDIA configured the kernel. The serial driver is not new. It has been a part of the Linux kernel for as long as I remember. It would not surprise me if it’s been there since the 2.6 days. You don’t need to assume anything as I already provided the path to the source files in the kernel tree as well as the chipspec. The driver has worked with this chip for a decade.

We have even had customers using this card in Jetson devices in the past, but I don’t have the information needed to figure out a version to compare the current one to.

The information provided regarding the Jetson's built-in serial drivers is interesting to me, and maybe I can do something with it. NVIDIA's modifications to the driver are my current theory, and I already mentioned that I see the NVIDIA port hitting some of the same functions in the driver as the Pericom ports. Do you know what the symptoms of mixing DMA and non-DMA would be? Do you know a way to disable the DMA feature of the driver and continue to use a legacy mode?

Regards,
– Jay

I’m not looking for just a kernel version. The output of “uname -r” contains something else related to configuration at the time of compile, so I need the actual output of “uname -r” which is (hopefully) slightly different than just 4.9. This information in part will tell us where the kernel expects modules to be found, and which modules are compatible. It is useful to know though that this is NVIDIA’s kernel and has never been changed or recompiled.

Assuming that the serial driver itself is part of the kernel, and that there is in no case any added module anywhere of any type, it simplifies things. If anything at all was compiled from source related to the kernel or a kernel module, this is very very important to know, and it changes a lot.

I don’t know all of the possible issues of mixing DMA and non-DMA on one particular UART. I could see the possibility of one driver interfering with the data that another driver uses, especially if both drivers attempt to use the UART simultaneously; alternating between drivers seems less likely to be an issue, but you could be especially concerned if for any reason one driver was set to a different spec/setting than the other. It is undefined what happens when two drivers work simultaneously on the same UART.

Any UART with a device name of the format “ttyTHS#” will be the DMA driver (the “Tegra High Speed” driver); any UART with the device name of the format “ttyS#” will be the legacy driver. Only the integrated UARTs of the Jetson can use the THS driver. The Pericom device would never need worry about this unless some integrated UART is talking to it.

If you simply use the ttyS# driver, then the ttyTHS# driver is ignored. One can use the device tree “status” field of a UART node to disable a driver (you just have to identify which UART you are talking about since the device tree refers to them by their physical address). Do note though that the numbers (the “#” of a tty) often differ between the DMA version (the THS driver) and the legacy version (the ttyS driver). For instance, there is no guarantee that /dev/ttyTHS1 is the same UART as /dev/ttyS1.
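
The naming rule described above can be sketched as a small classifier. This is only an illustration of the convention as stated in this post; the labels `tegra-hs-dma` and `legacy-8250` and the function name are made up, and the index returned for one family says nothing about which physical UART it is in the other family.

```python
import re

# Matches "ttyS<N>" and "ttyTHS<N>", with or without a leading "/dev/".
_TTY_RE = re.compile(r"^(?:/dev/)?tty(THS|S)(\d+)$")


def classify_tty(name):
    """Map a tty device name to (driver family, index), or None if neither."""
    m = _TTY_RE.match(name)
    if m is None:
        return None
    kind, index = m.group(1), int(m.group(2))
    # Labels are hypothetical shorthand for the two drivers described above.
    driver = "tegra-hs-dma" if kind == "THS" else "legacy-8250"
    return driver, index


if __name__ == "__main__":
    for dev in ("/dev/ttyS0", "/dev/ttyTHS1", "/dev/ttyUSB0"):
        print(dev, classify_tty(dev))
```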

Note that in boot stages no THS driver is loaded.

I will be back to my system next week and will look into information provided. Thank you for the insights.

I am going to say again that there is no need to assume. I have told you multiple times that it is the built in driver and provided the path to the source in the kernel. It is the kernel source that is provided by NVIDIA. The only modifications I made were to trace activity without changing functionality.

– Jay

The above can break module load. This is not a question of functionality; modules not specifically built against the exact configuration can break, and the configuration changed. There are extra steps in that case.

Hi, John Hentges here from ACCES; I’m Jay’s supervisor and Director of Software Engineering here at ACCES.
The serial driver is not a module; the NVIDIA kernel configuration compiles it directly into the kernel, not into a module. As Jay stated days ago:

No modules were added, and, when the problem was reported by our mutual customer, and when we first reproduced the problem at our facility, nothing was compiled from source. Jay has, since then, added diagnostic output to the code and hence recompiled, but not as a module.

Recall: Tx works fine, in any protocol: another computer can receive data from eg minicom. This indicates that, even if it were a module, it is loading.

The Pericom PI7C9X7458 Family of UARTs has an I/O address range that is strictly 8 contiguous 16550A compatible register maps. (The chip is more like a 16850, but that is also 16550A compatible.) The chip also has a MEM BAR with ~0x400 bytes per UART, including flat FIFO access and other interesting things. The Linux Kernel serial driver doesn’t touch the MEM BAR contents: it just uses the 16550A stuff.

Jay pointed to the drivers explicitly, here:

Please, our mutual customer (they buy the mPCIe-COM-4SM from us, and the Jetson TX2 from you) wants quantities that I am sure we’d both love to sell them. Let’s see if we can make real progress on this issue.

Every module has a binary interface. Changing the integrated features of the kernel itself can change and interfere with some modules loading. One would normally rebuild all modules as well when rebuilding the main kernel, which means existing modules are suspect unless all of them were built against this new kernel configuration.

Within kernel configuration is partial control over the location where modules will load. When you run the command “uname -r”, the prefix of this comes from the actual kernel version, and the suffix comes from the CONFIG_LOCALVERSION setting. As an example, if your output from “uname -r” shows as “4.9.140-tegra”, then the kernel release was 4.9.140, and CONFIG_LOCALVERSION was set to “-tegra”. Kernels look for modules at:
/lib/modules/$(uname -r)/kernel

The kernel will look nowhere else for those modules. If the modules found in that location were compiled against a different kernel, then the binary load can fail. Someone who adds or removes a feature only by adding or removing a module will not change the binary interface; between changing things only as modules and preserving the old CONFIG_LOCALVERSION (indirectly, uname -r), there is the implication that all modules will load and work correctly.
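
To make the “uname -r” breakdown above concrete, here is a minimal sketch that splits a release string into its numeric version and its suffix, and builds the module directory from it. The function names are hypothetical. Note also that the suffix can contain more than CONFIG_LOCALVERSION (for example an EXTRAVERSION from the top-level Makefile), so treat the split as a convenience rather than a definitive parse.

```python
import re

# Numeric kernel version (two or three dotted fields), then whatever follows.
_RELEASE_RE = re.compile(r"^(\d+\.\d+(?:\.\d+)?)(.*)$")


def split_release(release):
    """Split a `uname -r` string into (version, suffix)."""
    m = _RELEASE_RE.match(release.strip())
    if m is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return m.group(1), m.group(2)


def module_dir(release):
    """Directory searched for modules of the given kernel release."""
    return f"/lib/modules/{release.strip()}/kernel"


if __name__ == "__main__":
    rel = "4.9.140-tegra"  # the example release used in the post above
    print(split_release(rel))
    print(module_dir(rel))
```

An empty suffix means the running kernel was built without the “-tegra” local version, so it will look in a different module directory than the stock kernel does.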

As soon as the integrated (“=y” in config) features change, regardless of whether or not CONFIG_LOCALVERSION matches, then it becomes possible that older modules will have to be built against that new list of integrated features.

All I was wanting to know is if the “uname -r” was changed (NVIDIA uses “-tegra” on their configs), and if the integrated features differed. It turns out that integrated features do differ, but I was never able to find out if “uname -r” has changed. The part which inspired this question:
lspci: Unable to load libkmod resources: error -12

There are a lot of definitions of how code should behave when the code is correct and without error. I don’t know if the libkmod error has anything to do with this, but if modules and module loading are not valid, then this probably has to be addressed prior to one concluding that modules are not related to this. Were all modules rebuilt? Typically, if that occurs, then one would also intentionally change the CONFIG_LOCALVERSION since the “-tegra” configuration of the kernel itself has changed (e.g., name it after a change like setting it to “-serial”, purposely avoiding modules compiled against the original configuration). It is worth restating: There might not be a problem, but if there is a problem with any module load, then you cannot depend on the behavior which is supposed to occur from a module load.

I will also add that if an initrd is used, then some modules will exist within that initial ramdisk. Only modules related to boot and loading of the filesystem really matter there, but if there is any “early” issue, then one also has to consider whether the modules within the initrd need to be updated by using ones compiled against the new kernel configuration (due to the “=y” changes).

Do you know why this is occurring?
lspci: Unable to load libkmod resources: error -12


About all we can say so far is that the actual lspci shows PCIe is working as expected. Every PCI device (PCIe is just the PHY to PCI) has its own driver. From that information, since PCI is working, and does not have any error of its own (e.g., nothing indicates signal failure, nor checksum failures, and nothing shows in AER), the issue moves from PCI to either the driver for the device itself, or something in user space.

For drivers, if for example there used to be a modular driver for something like the serial, and then this became integrated directly into the kernel, and if the old module still tries to load, then there might be something undefined going on. But I don’t know yet if “uname -r” changed to avoid loading old modules, and I don’t know yet if something else is no longer compatible. It is, however, going to be either a driver issue or a user space issue and not a PCI issue.

As a contrived example, suppose a driver uses a different driver/feature to find checksums or to manage DMA. If the driver itself still works, and was integrated into the kernel, then one would think that all is well; however, if the changes cause some sort of issue for the checksum or DMA code, then you’re still going to get undefined behavior indirectly caused by the kernel change.

It is easier to just verify modules are set up correctly before debugging other issues. This is especially true when the drivers involved are well-known and normally reliable. That kmod error is the only clue we have, at least from kernel space. You might be able to get more information with something like a protocol analyzer on the input or output of the right locations on the mPCIe card.

Excellent, thank you. When Jay returns from his PTO (in a few days) he’ll provide the “uname -r” output.

However:

So I think we can rule out module-related issues as far as the “no Rx” symptom is concerned. (Though we will undoubtedly need to resolve any before considering any solution final for our mutual customer.)

Excellent, thank you. When Jay returns from his PTO (in a few days) he’ll provide the “uname -r” output.

root@ubuntu:~# uname -r
4.9.337
root@ubuntu:~# uname -a
Linux ubuntu 4.9.337 #63 SMP PREEMPT Mon Feb 10 20:13:19 PST 2025 aarch64 aarch64 aarch64 GNU/Linux

This “uname -r” indicates that not only has the base kernel changed, but also that 100% of all modules have to be recreated and installed. Note that a kernel always searches for modules here:
/lib/modules/$(uname -r)/kernel

Do you see significant content and subdirectories here?
/lib/modules/4.9.337/kernel/

If you still have content in “/lib/modules/...something-tegra/kernel/”, then this would be the old kernel configuration (NVIDIA uses “-tegra” for the CONFIG_LOCALVERSION).

This is of major importance: Were 100% of all modules also built against this new kernel and put in “/lib/modules/4.9.337/kernel/”? If so, did you start with a configuration such as the “tegra_defconfig” target, and then edit to make the configuration changes you wanted? The starting configuration method is important.

Here is the result when using the prebuilt kernel. The previous is from the kernel I was trying to debug with.

root@ubuntu:~# uname -r
4.9.337-tegra
root@ubuntu:~# uname -a
Linux ubuntu 4.9.337-tegra #1 SMP PREEMPT Mon Nov 4 23:40:52 PST 2024 aarch64 aarch64 aarch64 GNU/Linux

It is not a module. No module is part of the issue. It is built into the kernel and not a module because that is how NVIDIA configured the kernel. NVIDIA configured the kernel so it is not a module, and I didn’t change anything in the config. So it is not a module. A module is not what it is, and modules are not involved.

I got the config from /proc/config.gz of the prebuilt kernel.

Here is the link to the driver source inside the kernel tree on NVIDIA’s git server, at the tag used by JetPack 4.6.6:
https://nv-tegra.nvidia.com/r/gitweb?p=linux-4.9.git;a=blob;f=drivers/tty/serial/8250/8250_pci.c;h=a54dc0375ac681d5ddf53404e27ec66b426fd90b;hb=7f04f70641a9986271047bc060cfefaff3e8c0f7

It is possible that modules are not part of the issue. However, the only error given in the debug data was in fact a module issue which would normally never show up:
lspci: Unable to load libkmod resources: error -12

There is no other starting point as to why the RX port does not work. I don’t see any way to disprove or get past the kernel module being an indirect cause without fixing that first.

The URL to the driver only helps if you are going to modify the driver with something like a printk() to debug the RX. See:
https://docs.kernel.org/core-api/printk-basics.html

For example, you could add a printk() before and after this line (line 1682 of that file), stating what the byte is:
writeb(128, p + UART_EXAR_RXTRG);

There isn’t any other advice possible with the given debug information and only the libkmod error showing as a failure.

It is possible that modules are not part of the issue. However, the only error given in the debug data was in fact a module issue which would normally never show up:
lspci: Unable to load libkmod resources: error -12

jetson-lspci-builtin.txt (6.0 KB)
Attached is lspci output without the libkmod error using the built in kernel. User experience and oscilloscope measurements are the same as the debug kernel build.

There is no other starting point as to why the RX port does not work. I don’t see any way to disprove or get past the kernel module being an indirect cause without fixing that first.

The URL to the driver only helps if you are going to modify the driver with something like a printk() to debug the RX. See:
Message logging with printk — The Linux Kernel documentation

For example, you could add a printk() before and after this line (line 1682 of that file), stating what the byte is:
writeb(128, p + UART_EXAR_RXTRG);

Yes. This is what I was referring to when I said I added trace to the driver to see what was going on. As I said, the register reads and writes all look as they should for a working port.

–Jay

jetson-open-updated.txt (11.4 KB)

There isn’t any other advice possible with the given debug information and only the libkmod error showing as a failure.

It should be clear that the libkmod error is not the problem, and that the problem is not related to modules. This driver is built into the kernel. It has been a part of the Linus tree for a long time, and the hardware is widely used.