I have a PCIe-based frame grabber that I’m using with a TX1. When I start the TX1 with the PCIe card inserted, it runs for about 10 seconds and then hangs at the Ubuntu login screen. I have to keep rebooting until it comes up without hanging. This hang/crash happens very frequently. Here’s the kernel log I captured over the serial console.
At times it doesn’t even reach the login screen and hangs earlier, right after boot. In those cases the error message shown in the post below does not appear; the board just times out and reboots. Any suggestions on how to debug this would be helpful.
Is it possible for you to put this PCIe card in another Linux computer, and show the output for this from “sudo lspci -vvv” (this is a lot of output, you’d only care about the output specific to this card)?
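As a rough sketch of how to cut the “sudo lspci -vvv” output down to just this card, the snippet below splits the text into per-device blocks and keeps the ones whose header line mentions a search string. The function name `device_blocks` and the sample text are made up for illustration; the sample is not a real log from this board.

```python
def device_blocks(lspci_text, needle):
    """Return the lspci device blocks whose header line contains `needle`.

    Assumes the usual `lspci -vvv` layout: each device starts with an
    unindented header line, followed by indented detail lines.
    """
    blocks = []
    current = []
    for line in lspci_text.splitlines():
        if not line.strip():
            # Blank line: close the block in progress, if any.
            if current:
                blocks.append(current)
                current = []
        elif line[0].isspace():
            # Indented detail line belongs to the current device.
            if current:
                current.append(line)
        else:
            # Unindented line starts a new device block.
            if current:
                blocks.append(current)
            current = [line]
    if current:
        blocks.append(current)
    return ["\n".join(b) for b in blocks if needle.lower() in b[0].lower()]


# Illustrative sample only (hypothetical values, not captured output).
SAMPLE = """\
00:01.0 PCI bridge: Example Vendor Device 0fae (rev a1)
\tLnkSta:\tSpeed 2.5GT/s, Width x1

01:00.0 Multimedia controller: Intersil Techwell Device 6869 (rev 01)
\tLnkSta:\tSpeed 2.5GT/s, Width x1
\tKernel driver in use: tw6869
"""

for block in device_blocks(SAMPLE, "Techwell"):
    print(block)
```

In practice the text would come from the saved output of “sudo lspci -vvv” rather than from a hard-coded sample.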
This is L4T 24.1. Currently, I don’t have access to another computer. I can share the output as soon as it’s possible. Meanwhile, I can share the output of sudo lspci -vvv on my Jetson TX1 (when it boots and runs successfully). Here’s the relevant log:
So far from the soft lockup plus a successful lspci, it seems there may be a threading issue regarding the driver to the card. The card itself does not seem to use any standardized device class, so the driver is specific to that board (the device is not a generic/general/standard class so far as driver handling is concerned). According to this, PCI handed this device to driver “tw6869”. Beyond that I do not know what is going on in the driver. Was this a third party driver?
That might be possible. I am using this driver: https://github.com/FrankBau/tw6869
The driver doesn’t work out of the box: I kept getting a resource collision error in dmesg.
So I also had to apply a patch to the drivers/pci/quirks.c file to make it work:
It’ll be very difficult to know what’s going on without an actual device and the driver source as edited…even then it may not be easy. From what I’ve seen in the URL you gave, the driver author may be able to make a suggestion. Most likely the driver has been functioning on a typical x86_64 desktop distribution, so something slightly different in design might be needed for Jetson.
If lucky, the author will have worked on the code which causes the “Bad mode in Error handler detected, code 0xbf000002” message…he will possibly be able to go straight to the part of the code which generated this and be able to adjust. This may not “fix” the driver, but it would prevent the driver from the soft lockup and better information would be available for any other issues.
I couldn’t say for sure, but odds are high it is with the driver itself. The question is what base kernel version was the driver designed for? For example, if it is running normally on a 4.x version kernel, there’s a lot that might go wrong putting it in a 3.x kernel. Just going from x86_64 to aarch64 would have effects on many drivers even if they are from the same base kernel version.
The driver and the quirk work on an IMX6 board. I actually found that patch on the IMX6 forum, where they’re using the same driver. Here’s the relevant post, with the driver and patch link in the comments:
The IMX Linux kernel is probably 3.10.x, so it shouldn’t be a kernel version issue. And the architecture is also ARM-based.
Correct me if I’m wrong, but I think the IMX6 is 32-bit ARMv7, while the JTX1 is 64-bit ARMv8-a. Despite a lot of similarities, they are still different architectures (ARMv8 can enter a 32-bit compatibility mode to execute older 32-bit ARM code, but normal operation uses a different instruction set). There are likely some differences in DMA between these two, but I couldn’t tell you what to look for. I’m not sure what would be required for a “proper” port to ARMv8-a.
Our company also manufactures a TW6869 based frame grabber “C351”, or DarkCrystal SD Capture Mini-PCIe Quad, which can capture 4 SD video streams simultaneously.
I have tested C351 on Jetson TX1 running L4T R24.1 with our own proprietary driver, and it seems to work fine. Here’s a “lspci -vvv” output of our C351 frame grabber on Jetson TX1, in case it helps.
This probably wouldn’t help for the other frame grabber, but it is still interesting to compare. Your lspci info will be truncated though unless called with “sudo”. Then you could see things like how fast data lanes are running.
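For comparing the two cards, a minimal sketch of pulling the link speed and width out of the “LnkCap”/“LnkSta” lines might look like the following. The regular expression assumes the common `Speed <n>GT/s, Width x<n>` wording that lspci prints; the helper name `link_info` and the sample lines are hypothetical.

```python
import re

# Matches e.g. "LnkSta: Speed 2.5GT/s, Width x1, ..." (layout assumed).
LINK_RE = re.compile(r"(LnkCap|LnkSta):.*?Speed\s+([\d.]+)GT/s.*?Width\s+x(\d+)")


def link_info(lspci_block):
    """Map 'LnkCap'/'LnkSta' to (speed in GT/s, lane count) for one device."""
    info = {}
    for line in lspci_block.splitlines():
        match = LINK_RE.search(line)
        if match:
            info[match.group(1)] = (float(match.group(2)), int(match.group(3)))
    return info


# Illustrative lines only, not taken from either card's real output.
sample_block = (
    "\tLnkCap:\tPort #0, Speed 2.5GT/s, Width x1, ASPM L0s L1\n"
    "\tLnkSta:\tSpeed 2.5GT/s, Width x1, TrErr- Train- SlotClk+\n"
)
print(link_info(sample_block))
```

If “LnkSta” reports a lower speed or width than “LnkCap”, the link trained below what the device advertises, which is worth noting when comparing a working and a non-working setup.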
Did you make any changes in the kernel to make it work? Or does it work as expected by just compiling from source and loading it on the TX1? Did you have to add a quirk like I did? Without it, the kernel didn’t recognize the TW6869 class for me.
Interesting…both the original device and the listed working device are “Intersil Techwell Device 6869 (rev 01)” on lspci. So even if they are different devices, they use the same “chipset”. They also both show driver as “Kernel driver in use: TW6869”. However, the original poster had to compile the driver from an outside source, and ended up with a probable threading issue…@jkjung, is there any way the original poster could get a compiled kernel module driver from your version which he could try? I would bet that with that, plus the kernel command line edits you gave, his camera would work.
On an unrelated note about PCIe which I’ve noticed and wonder about for gen. 1 devices sometimes not showing up on lspci, it seems de-emphasis may be handled incorrectly in the root complex. In gen. 1 de-emphasis is fixed at -3.5dB, it wasn’t until gen. 2 that -6dB was added as an option. The basic idea seems to be that the increased de-emphasis would be used to support longer traces for PCIe devices which were physically further from the root complex. A gen. 1 device would be “hard coded” to behave with expectations of -3.5dB. A motherboard supporting gen. 2 would have a fixed -3.5dB for a PCIe slot close to the root complex, and would have a fixed -6dB de-emphasis for PCIe slots further away (mixing -3.5 and -6dB wouldn’t break things, but correct matching would improve signal). It wasn’t until gen. 3 that de-emphasis became “adjustable” with the endpoint participating in discovery of best de-emphasis. What I’m wondering about is whether the eye pattern is actually better at -6dB for this slot which is close to root complex, or if -3.5dB would be better? If it turns out that -3.5dB is better, then using -6dB could be part of the reason why spread spectrum would cause some of the cheaper PCIe cards to not quite show up.
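To put the two de-emphasis figures in concrete terms: a level in dB corresponds to a voltage ratio of 10^(dB/20). The short sketch below only does that arithmetic (the function name is made up); it says nothing about which setting gives the better eye on this particular slot.

```python
def db_to_voltage_ratio(db):
    """Convert a level in dB to a voltage (amplitude) ratio: 10 ** (dB / 20)."""
    return 10 ** (db / 20.0)


for db in (-3.5, -6.0):
    ratio = db_to_voltage_ratio(db)
    print(f"{db:.1f} dB -> {ratio:.3f} of the full-swing amplitude")
```

So -3.5dB leaves the de-emphasized bits at roughly two thirds of full swing, and -6dB at roughly one half.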
I’m still confused about how the driver works without a patch to the drivers/pci/quirks.c file. There’s no entry for TW686* PCI cards there, so no class would be assigned to those cards, which is exactly the issue I had earlier (I was getting a PCI type 0 class 0 error in dmesg while loading). Here’s the quirk finally added to the Linux kernel in a later version:
I have it working now, somehow. I built the driver as a module so that it doesn’t load with the kernel. The system boots up perfectly without any errors (yet). Then I insert the module with insmod and it works as expected. I still haven’t tested its robustness, but it seems to work this way. I’ll probably add the insmod command to a startup script so it loads automatically after bootup. Thanks for all your help @linuxdev and @jkjung.
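A startup script along those lines could be sketched as below: it checks the contents of /proc/modules and only calls insmod when the module is not already loaded. This is a sketch under assumptions, not a tested boot script: the module name tw6869 is taken from the lspci output in this thread, while the helper names and the `run` parameter are made up for illustration.

```python
import subprocess


def is_module_loaded(name, proc_modules_text):
    """Check /proc/modules text, where the first field of each line is the module name."""
    wanted = name.replace("-", "_")  # the kernel lists names with underscores
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and fields[0] == wanted:
            return True
    return False


def load_if_missing(name, ko_path, proc_modules_text, run=subprocess.check_call):
    """Run `insmod ko_path` unless the module is already loaded.

    Returns True if insmod was invoked. `run` is injectable so the logic
    can be exercised without root; the default raises if insmod fails.
    """
    if is_module_loaded(name, proc_modules_text):
        return False
    run(["insmod", ko_path])
    return True
```

From a script run as root after boot, this would be called with the text read from /proc/modules and the path to wherever the built tw6869.ko file lives on that system.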