TK1 PCI-e link cannot be established

Hi,
When the ambient temperature is low, the PCI-e link between the TK1 and the FPGA cannot be established.

Console message for the PCIe bus error:
tegra-ubuntu login: root (automatic login)

Last login: Sat Jan 1 00:00:28 UTC 2000 on ttyS0
Welcome to Ubuntu 14.04 LTS (GNU/Linux 3.10.40 armv7l)

* Documentation: https://help.ubuntu.com/
root@tegra-ubuntu:~# [ 27.266404] pcieport 0000:00:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=0000(Receiver ID)
[ 27.276746] pcieport 0000:00:00.0: device [10de:0e12] error status/mask=00000001/00002000
[ 27.285147] pcieport 0000:00:00.0: [ 0] Receiver Error (First)
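
For readers unfamiliar with the `error status/mask` field, here is a minimal sketch that decodes the two values from the log above. It assumes they are the AER Correctable Error Status and Mask registers (consistent with `severity=Corrected`) and uses the bit positions from the PCI Express Base Specification; the function name `decode_correctable` is purely illustrative.

```python
# Bit positions of the AER Correctable Error Status/Mask registers,
# as defined by the PCI Express Base Specification.
CORRECTABLE_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}


def decode_correctable(value):
    """Return the names of all known bits set in a status or mask value."""
    return [name for bit, name in sorted(CORRECTABLE_BITS.items())
            if value & (1 << bit)]


# Values copied from the "error status/mask=00000001/00002000" log line.
status, mask = (int(field, 16) for field in "00000001/00002000".split("/"))
print("status:", decode_correctable(status))  # ['Receiver Error']
print("mask:  ", decode_correctable(mask))    # ['Advisory Non-Fatal Error']
```

Read this way, the only bit set in the status is Receiver Error, which matches the `[ 0] Receiver Error (First)` line and points at the physical layer rather than at the data link or transaction layers.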

After many combinations of HW/SW experiments, using heat-up/cool-down at the ambient, chip surface, kernel, and substrate levels, we found:

  1. On a defective chip, the PCI-e link between the TK1 and the FPGA can be established once the chip surface temperature is above 31℃.
  2. A good TK1 chip works properly at -10℃ ambient temperature, and even lower.
  3. The failure symptom follows the CPU; we reproduced it with an A/B swap.

Please help to find the root cause of this issue.
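
One way to put numbers on the temperature dependence would be to capture the console log at each temperature step and tally the AER reports in it. The following is a minimal sketch, assuming the kernel log format shown earlier in this thread; the function name `tally_aer` and the idea of collecting one log per temperature step are illustrative assumptions, not something done in the thread.

```python
import re
from collections import Counter

# Matches the summary line the kernel prints for each AER report, e.g.
# "[ 27.266404] pcieport 0000:00:00.0: PCIe Bus Error: severity=Corrected, ..."
AER_LINE = re.compile(
    r"\[\s*(?P<time>\d+\.\d+)\]\s+pcieport\s+(?P<dev>\S+):\s+PCIe Bus Error:\s+"
    r"severity=(?P<severity>[^,]+),\s+type=(?P<type>[^,]+),"
)


def tally_aer(log_text):
    """Count AER reports per (device, severity, type) in a console log."""
    counts = Counter()
    for match in AER_LINE.finditer(log_text):
        key = (match.group("dev"), match.group("severity"), match.group("type"))
        counts[key] += 1
    return counts


# The three lines quoted in the first post of this thread.
sample = """\
root@tegra-ubuntu:~# [ 27.266404] pcieport 0000:00:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=0000(Receiver ID)
[ 27.276746] pcieport 0000:00:00.0: device [10de:0e12] error status/mask=00000001/00002000
[ 27.285147] pcieport 0000:00:00.0: [ 0] Receiver Error (First)
"""

for key, count in tally_aer(sample).items():
    print(key, count)
```

Comparing these tallies between a good and a failing board at the same temperature would give a simple software-side data point to set alongside any waveform captures.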

Hi, can you please swap the failing chip onto a devkit to confirm whether it is a chip issue?

Also, did you probe the waveforms of the PCIe signals to compare the “good” and “bad” TK1 chips? If yes, please upload screenshots here for review.

Hi,

  1. I will first check whether a devkit is available and functioning, then swap the failing chip onto it. It will take some time because we must run the BGA rework process internally.
  2. PCIe compliance testing was done a long time ago. I will double-check it.

Thanks

Hi Sir,
Could you provide me with a TK1 devkit?
Thanks

Please check with your local sales contact to purchase one.

Hi,
May I know what the next step is for NVIDIA if the same failure happens on the TK1 devkit?
May I send the failing chip for your FA first?

If the failure still happens on the devkit, which might indicate a chip issue, then you can send the failing chips in for RMA.

Also, please consider the suggestion in comment #3 to measure the waveforms of a “good” one and a “bad” one.

Hi,
I am worried that the chip will be damaged by repeated rework.
We need to mount this chip by IR reflow. If it fails, we need to unmount it and then mount it on the devkit. Finally, we unmount it again and ship it for your FA.
Is it OK to rework a TK1 repeatedly?

Can you compare the waveforms first? That might help confirm whether this is a signal-related problem.

How did you do the swap on your own boards? The devkit should be able to handle that too, since your A/B rework succeeded.

Hi,
Actually, we have no failing chip on hand now and need to wait for the next build.
I also ran the PCIe compliance test a long time ago and, as far as I remember, there was no difference compared with a good board at that time. After that, we also did many temperature experiments. My OA was broken and has been replaced with a new one, so some data was lost.

We can’t do FA without a chip. The suggestion above also applies to your new build if the same failure happens.

Hi,
After checking the Jetson schematic, I found some differences on the PEX interface:

  1. PEX[4:3] is used for the FPGA on our M/B.
  2. On the TK1 devkit, PEX3 is NC and PEX4 is used for the mini PCIe slot.

The PCIe design is different, so should we still rework the failed chip onto the TK1 devkit? If yes, we will use a mini PCIe LAN card to verify the PCIe issue.
Anyway, we will rework the failed chip with the reboot issue first.

Per your feedback above, the issue is only related to temperature, right? If so, swapping the failed chip onto the devkit is the better way.