One core at 100% IRQ sometimes

Hello!

I have a Mellanox Technologies MT28800 Family [ConnectX-5 Ex] card in a server with an AMD EPYC 7742 64-Core Processor. I followed all the tuning recommendations for this NIC, but from time to time I see one core at 100% IRQ. It is usually CPU0, or CPU68 if I run set_irq_affinity_bynode.sh 1 eth0.

What should I do?

ethtool -i eth0

driver: mlx5_core

version: 5.0-0

firmware-version: 16.29.2002 (MT_0000000013)

expansion-rom-version:

bus-info: 0000:41:00.0

supports-statistics: yes

supports-test: yes

supports-eeprom-access: no

supports-register-dump: no

supports-priv-flags: yes

Now the NIC stops working:

Hi Андрй,

On AMD-based servers, please assign the affinities to cores that belong to the NUMA nodes with the shortest distance to the NUMA node connected to the NIC.

HowTo find the NUMA node connected to the network adapter:

cat /sys/class/net//device/numa_node

HowTo find the NUMA nodes with the shortest distance:

numactl --hardware

^Please choose the nodes with distance 10 and 11.
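The same distances that numactl --hardware prints can also be read from sysfs. The following is a minimal sketch, not part of the Mellanox tooling: the function name node_distances and the SYSFS_ROOT override are made up for illustration, and it assumes node IDs are contiguous starting at 0.

```shell
#!/bin/sh
# SYSFS_ROOT is a hypothetical override so the function can be tried
# against a copy of sysfs; it defaults to the real /sys.

# Print "node distance" pairs as seen from node $1, closest first.
# Assumption: node IDs are contiguous from 0, so the position of a
# value in the distance file equals the node ID.
node_distances() {
    root="${SYSFS_ROOT:-/sys}"
    tr -s ' ' '\n' < "$root/devices/system/node/node$1/distance" |
        awk 'NF { print NR - 1, $1 }' |
        sort -k2,2n -k1,1n
}
```

Calling node_distances with the node number reported for the NIC lists that node first, followed by the other nodes in order of increasing distance.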

HowTo find the relevant cores:

lscpu
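As an alternative to reading the core ranges out of lscpu by eye, the lookup can be sketched in shell. This is only an illustration under assumptions: the function names and the SYSFS_ROOT override are hypothetical, while the sysfs files numa_node and cpulist are standard kernel interfaces.

```shell
#!/bin/sh
# SYSFS_ROOT is a hypothetical override for trying this against a
# copy of sysfs; it defaults to the real /sys.

# Print the NUMA node the kernel reports for interface $1.
nic_numa_node() {
    cat "${SYSFS_ROOT:-/sys}/class/net/$1/device/numa_node"
}

# Print the CPU list (for example 0-63) of the node local to interface $1.
nic_node_cpulist() {
    node=$(nic_numa_node "$1") || return 1
    # A value of -1 means the platform did not report a node.
    [ "$node" -ge 0 ] 2>/dev/null || {
        echo "no NUMA node reported for $1" >&2
        return 1
    }
    cat "${SYSFS_ROOT:-/sys}/devices/system/node/node$node/cpulist"
}
```

The range printed by nic_node_cpulist has the same form as the core range used in the set_irq_affinity_cpulist.sh example in this thread.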

HowTo assign the cores:

set_irq_affinity_cpulist.sh

For example:

set_irq_affinity_cpulist.sh 72-95 ens0

Notes:

  • Make sure the irqbalance service is not running on the hosts
  • Make sure hyperthreading is disabled
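Both notes can be checked before assigning affinities. A rough sketch follows, assuming pgrep is installed and the kernel exposes /sys/devices/system/cpu/smt/active (older kernels may not); the function names and the SYSFS_ROOT override are hypothetical.

```shell
#!/bin/sh
# Succeeds (exit status 0) if a process named irqbalance is running.
irqbalance_running() {
    pgrep -x irqbalance >/dev/null 2>&1
}

# Print on, off, or unknown for SMT (hyperthreading).
# SYSFS_ROOT is a hypothetical override; it defaults to the real /sys.
smt_state() {
    f="${SYSFS_ROOT:-/sys}/devices/system/cpu/smt/active"
    if [ ! -r "$f" ]; then
        echo unknown      # file missing, e.g. on an older kernel
    elif [ "$(cat "$f")" = 1 ]; then
        echo on
    else
        echo off
    fi
}
```

For example, `irqbalance_running && echo "irqbalance is still running"` prints a warning only when the daemon is active.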

Regards,

Chen

Thanks for your answer!

cat /sys/class/net/eth0/device/numa_node

0

numactl --hardware

available: 2 nodes (0-1)

node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63

node 0 size: 515906 MB

node 0 free: 3090 MB

node 1 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127

node 1 size: 516066 MB

node 1 free: 939 MB

node distances:

node 0 1

0: 10 32

1: 32 10

I ran set_irq_affinity_bynode.sh 0 eth0:

set_irq_affinity_bynode.sh 0 eth0

Discovered irqs for eth0: 563 569 575 581 587 593 599 605 611 617 623 629 635 641 647 653 658 663 668 673 678 683 688 693 698 703 708 713 718 723 728 733 738 743 748 753 758 763 768 773 778 783 788 793 798 803 808 813 818 823 828 833 838 843 847 850 853 855 857 858 859 860 861 862


Optimizing IRQs for Single port traffic


Assign irq 563 core_id 0

Assign irq 569 core_id 1

Assign irq 575 core_id 2

Assign irq 581 core_id 3

Assign irq 587 core_id 4

Assign irq 593 core_id 5

Assign irq 599 core_id 6

Assign irq 605 core_id 7

Assign irq 611 core_id 8

Assign irq 617 core_id 9

Assign irq 623 core_id 10

Assign irq 629 core_id 11

Assign irq 635 core_id 12

Assign irq 641 core_id 13

Assign irq 647 core_id 14

Assign irq 653 core_id 15

Assign irq 658 core_id 16

Assign irq 663 core_id 17

Assign irq 668 core_id 18

Assign irq 673 core_id 19

Assign irq 678 core_id 20

Assign irq 683 core_id 21

Assign irq 688 core_id 22

Assign irq 693 core_id 23

Assign irq 698 core_id 24

Assign irq 703 core_id 25

Assign irq 708 core_id 26

Assign irq 713 core_id 27

Assign irq 718 core_id 28

Assign irq 723 core_id 29

Assign irq 728 core_id 30

Assign irq 733 core_id 31

Assign irq 738 core_id 32

Assign irq 743 core_id 33

Assign irq 748 core_id 34

Assign irq 753 core_id 35

Assign irq 758 core_id 36

Assign irq 763 core_id 37

Assign irq 768 core_id 38

Assign irq 773 core_id 39

Assign irq 778 core_id 40

Assign irq 783 core_id 41

Assign irq 788 core_id 42

Assign irq 793 core_id 43

Assign irq 798 core_id 44

Assign irq 803 core_id 45

Assign irq 808 core_id 46

Assign irq 813 core_id 47

Assign irq 818 core_id 48

Assign irq 823 core_id 49

Assign irq 828 core_id 50

Assign irq 833 core_id 51

Assign irq 838 core_id 52

Assign irq 843 core_id 53

Assign irq 847 core_id 54

Assign irq 850 core_id 55

Assign irq 853 core_id 56

Assign irq 855 core_id 57

Assign irq 857 core_id 58

Assign irq 858 core_id 59

Assign irq 859 core_id 60

Assign irq 860 core_id 61

Assign irq 861 core_id 62

Assign irq 862 core_id 63

done.
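To confirm that the assignments above actually took effect, the affinity of each IRQ can be read back from /proc/irq. This is a small sketch only; the function name irq_affinity and the PROC_ROOT override are made up for illustration.

```shell
#!/bin/sh
# PROC_ROOT is a hypothetical override for trying this against a copy
# of procfs; it defaults to the real /proc.

# For each IRQ number given, print "irq cpulist", or "irq ?" if the
# affinity file cannot be read.
irq_affinity() {
    for irq in "$@"; do
        f="${PROC_ROOT:-/proc}/irq/$irq/smp_affinity_list"
        if [ -r "$f" ]; then
            printf '%s %s\n' "$irq" "$(cat "$f")"
        else
            printf '%s ?\n' "$irq"
        fi
    done
}
```

With the IRQ numbers from the output above, `irq_affinity 563 569 575` should list each IRQ next to the core it was assigned to, if nothing (such as irqbalance) has changed the affinity since.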

Now everything is fine; the most loaded core is at 64% IRQ. But sometimes one core is loaded to 100% IRQ, and traffic drops from 45 Gbps to 30 Gbps.

Everything was fine for 2 weeks. Now I see the same picture every evening: today CPU004 was at 100% IRQ load. Switching the node with set_irq_affinity_bynode.sh 1 eth0 helps for a few minutes; after that, two CPUs, CPU068 and CPU072, go to 100% load. Then I switch back to node 0 with set_irq_affinity_bynode.sh 0 eth0, and everything is fine for almost a day until the evening peak. What is wrong?

I see the same picture with a Mellanox Technologies MT28908 Family [ConnectX-6] card:
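When a single core such as CPU004 sits at 100% IRQ, it may help to see which IRQs are delivering interrupts to that core. The sketch below reads /proc/interrupts for one CPU column; the function name irq_counts_for_cpu is hypothetical, and the counters are cumulative since boot, so two samples taken a few seconds apart are needed to judge the current rate.

```shell
#!/bin/sh
# Print "count irq name" for every numbered IRQ that has fired on CPU $1,
# highest count first. $2 optionally names a file in /proc/interrupts
# format (defaults to /proc/interrupts).
irq_counts_for_cpu() {
    cpu="$1"
    file="${2:-/proc/interrupts}"
    # Column 1 is "IRQ:", so the counter for CPU n is in column n + 2.
    awk -v col="$((cpu + 2))" '
        $1 ~ /^[0-9]+:$/ && $col > 0 {
            sub(":", "", $1)
            print $col, $1, $NF
        }
    ' "$file" | sort -rn
}
```

For example, `irq_counts_for_cpu 4 | head` lists the busiest IRQs on CPU4; the last column is the final word of the interrupt name as the kernel prints it.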

ethtool -i eth0

driver: mlx5_core

version: 5.3-1.0.0

firmware-version: 20.30.1004 (MT_0000000225)

expansion-rom-version:

bus-info: 0000:41:00.0

supports-statistics: yes

supports-test: yes

supports-eeprom-access: no

supports-register-dump: no

supports-priv-flags: yes

I see this picture every day: