Performance decrease on the other NUMA node

Hello Mellanox Community,

I observe a performance decrease of ~15% when I plug my ConnectX-5 (CX556A) into a PCIe slot belonging to NUMA node 1. On node 0 I get better results.

However, I remember a test some months ago on the same machine where the penalty showed up on node 0 and node 1 worked perfectly. So I assume it is not caused by the hardware itself (or by other PCIe components).

Is there any NUMA-related configuration in the Mellanox driver?
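In case it helps with the analysis: the NUMA node of the adapter and its local CPUs can be cross-checked via standard sysfs attributes and numactl (nothing Mellanox-specific; paths shown for my d8:00.0 device):

cat /sys/bus/pci/devices/0000:d8:00.0/numa_node

cat /sys/bus/pci/devices/0000:d8:00.0/local_cpulist

numactl --hardware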

I am using a DPDK application.

My command on NUMA node 1:

l3fwd-bounce -v -w 0000:d8:00.0 --socket-mem=0,16384 -l 1,5,9,49,53,57 -- -p 0x1 --config '(0,0,1),(0,1,49),(0,2,5),(0,3,53),(0,4,9),(0,5,57)' -P

My command on NUMA node 0:

"l3fwd-bounce -v -w 0000:3b:00.0 --socket-mem=16384,0 -l 4,8,10,52,56,58 – -p 0x1 --config ‘(0,0,4),(0,1,52),(0,2,8),(0,3,56),(0,4,10),(0,5,58)’ -P "

All used cores are isolated (isolcpus).
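The isolation and the core-to-node mapping can be double-checked with standard tools (assuming util-linux's lscpu is available):

cat /sys/devices/system/cpu/isolated

lscpu -e=CPU,CORE,SOCKET,NODE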

Do you have any hints on how I can get better results?

I also found suspicious behavior in how the Mellanox card balances frames across the 6 RX queues: queue 1 seems to receive only a few packets.

Command:

l3fwd-bounce -v -w 0000:d8:00.0 --socket-mem=0,16384 -l 1,5,9,49,53,57 -- -p 0x1 --config '(0,0,1),(0,1,49),(0,2,5),(0,3,53),(0,4,9),(0,5,57)' -P

core 1: received 124389578, sent 124389578, empty rx bursts 141061256

core 5: received 796841776, sent 796841776, empty rx bursts 1024321074

core 9: received 796847260, sent 796847260, empty rx bursts 1022132414

core 49: received 806257286, sent 806257286, empty rx bursts 1389903566

core 53: received 796918078, sent 796918078, empty rx bursts 1034221027

core 57: received 796925370, sent 796925370, empty rx bursts 1025396356

port 0: received 4118179348 packets (1054253913088 bytes); sent 4118179348 packets (1037781195696 bytes)

When I do NOT use CPU core 1 and its HT sibling (core 49), the balancing works better and I achieve the expected performance!
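As a quick cross-check, the per-core load during a run can be watched with mpstat from sysstat (assuming it is installed; the CPU list below matches my node 1 configuration), to see whether a "slow" queue is really just a busy core:

mpstat -P 1,5,9,49,53,57 1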

Command:

l3fwd-bounce -v -w 0000:d8:00.0 --socket-mem=0,16384 -l 5,9,11,53,57,59 -- -p 0x1 --config '(0,0,5),(0,1,53),(0,2,9),(0,3,57),(0,4,11),(0,5,59)' -P

core 5: received 806215772, sent 806215772, empty rx bursts 1109925928

core 9: received 796841011, sent 796841011, empty rx bursts 1114308276

core 11: received 796880281, sent 796880281, empty rx bursts 1105713499

core 53: received 806276520, sent 806276520, empty rx bursts 1107186920

core 57: received 796915472, sent 796915472, empty rx bursts 1114022822

core 59: received 796870944, sent 796870944, empty rx bursts 1111857481

port 0: received 4800000000 packets (1228800000000 bytes); sent 4800000000 packets (1209600000000 bytes)
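(For comparison: an even split of the 4 800 000 000 received packets across 6 queues is 800 000 000 per queue, which the per-core counters above match to within roughly 1%, while in the first run core 1 only received about 124 million.)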

Why is that?

Update: the performance drop was caused by a kernel module that was utilizing CPU 1 on node 1. After removing this module I saw the expected performance.

So in the end there was no suspicious frame balancing across the queues; the low counters for queue 1 simply came from CPU 1 being overloaded.
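For anyone hitting something similar: the threads (including kernel threads) running on a given core can be listed with plain ps, e.g. for core 1:

ps -eLo pid,psr,pcpu,comm | awk '$2 == 1'    # psr = processor the thread last ran on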