RDMA Host<->Device performance during external network communications

Hi all,
I am working on a BlueField 1 with 25GbE network (MBF1M332A) which is connected to an X86 host over PCIe Gen3 x8.
In a typical ib_write_bw test between host and device, a bandwidth of 6.8 GB/s can be achieved, which is clearly limited by the host's PCIe Gen3 x8 bus.
When running an ib_write_bw test against another host over the network, a bandwidth of around 2.7 GB/s can be achieved, limited by the 25GbE interface.
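For reference, a quick back-of-the-envelope check (my own arithmetic, not vendor figures) shows that both measurements are close to the respective raw link rates, before protocol overhead:

```python
# Rough link-rate arithmetic; assumptions, not vendor specs.

# PCIe Gen3: 8 GT/s per lane with 128b/130b encoding.
pcie_lane_gbytes = 8e9 * (128 / 130) / 8 / 1e9   # usable GB/s per lane
pcie_x8_gbytes = 8 * pcie_lane_gbytes            # ~7.88 GB/s raw for x8

# 25GbE raw line rate.
eth_gbytes = 25e9 / 8 / 1e9                      # 3.125 GB/s

print(f"PCIe Gen3 x8 raw: {pcie_x8_gbytes:.2f} GB/s (measured: 6.8)")
print(f"25GbE raw:        {eth_gbytes:.3f} GB/s (measured: 2.7)")
```

The gap between raw and measured (TLP headers on PCIe, Ethernet/RoCE headers on the wire) accounts for the difference, so each single test looks healthy on its own.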

Now, running both tests at the same time limits the bandwidth between host and device to 2.7 GB/s per connection. So it is possible to run two ib_write_bw tests in parallel from DPU to host (each capped at around 2.7 GB/s) plus an additional 2.7 GB/s from the DPU to an external host.

My question is: why is a single connection between DPU and host throttled to the 25GbE rate? My guess is that some engine (the RDMA engine or the eSwitch) is metering the bandwidth to match the outgoing interface, but only when traffic is flowing externally. Or perhaps some traffic-management packets are coming back, but I haven't found any yet.
Can someone shed any light on this?