Mellanox OFED 5.5-1.0.3.2 - SEND Bandwidth Improves When Registered Memory is Aligned to System Page Size (4K). How?

OS: RHEL/CentOS 7.9 (latest)

Operation:
Sending a 500MB chunk 21 times from one system to another, with the two systems connected via Mellanox cables.
(Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6])

The registered memory region (500MB) is reused for all the 21 iterations.

Allocating the registered memory with aligned_alloc() (aligned to the 4096B system page size) instead of malloc() raises the message send bandwidth by about 35Gbps.

with malloc() : ~86Gbps
with aligned_alloc() : ~121Gbps
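
For reference, here is a minimal sketch of the allocation difference being compared. It is not the benchmark code. The helper names (round_up, alloc_page_aligned, pages_spanned) are made up for illustration, and no verbs calls are shown. It only demonstrates that a buffer which does not start on a page boundary touches one more page than an aligned buffer of the same length. Whether that accounts for the bandwidth gap is not something this sketch proves.

```c
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Round size up to a multiple of align (align must be a power of two). */
size_t round_up(size_t size, size_t align) {
    return (size + align - 1) & ~(align - 1);
}

/* Hypothetical helper: allocate a buffer that starts on a page boundary.
 * C11 aligned_alloc() expects the size to be a multiple of the alignment,
 * so the size is rounded up first. Returns NULL on failure; release the
 * buffer with free(). */
void *alloc_page_aligned(size_t size) {
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return NULL;
    return aligned_alloc((size_t)page, round_up(size, (size_t)page));
}

/* Number of pages touched by the byte range [addr, addr + len). */
size_t pages_spanned(uintptr_t addr, size_t len, size_t page) {
    if (len == 0)
        return 0;
    uintptr_t first = addr & ~(uintptr_t)(page - 1);
    uintptr_t last = (addr + len - 1) & ~(uintptr_t)(page - 1);
    return (size_t)((last - first) / page) + 1;
}
```

Calling pages_spanned((uintptr_t)buf, len, 4096) on the malloc() buffer and on the aligned_alloc() buffer shows how many 4K pages each one covers. Building with -std=c11 (or newer) may be needed on older compilers for aligned_alloc() to be declared.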

Since the CPU is not involved in these operations, why is the transfer faster with aligned memory?
Please provide reference links, if available, that explain this.
What does aligned memory change about the read/write operations?

Aligning buffers to the cache line size can avoid read-modify-write cycles.
This reduces the number of CPU cycles and memory accesses, which improves performance compared to unaligned buffers.
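
To make the cache line point concrete, here is a small sketch that counts how many cache lines a buffer covers only partially. Those partial lines are where a write cannot replace a whole line and a read-modify-write may be needed. The 64-byte line size and the function name partial_cache_lines are assumptions for illustration; check the actual line size on your platform.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed cache line size in bytes; not queried from the hardware. */
#define CACHE_LINE 64u

/* Count the cache lines that [addr, addr + len) covers only partially.
 * The result is 0, 1, or 2: only the first and last line can be partial. */
unsigned partial_cache_lines(uintptr_t addr, size_t len) {
    if (len == 0)
        return 0;
    uintptr_t head = addr % CACHE_LINE;         /* offset of the first byte in its line */
    uintptr_t tail = (addr + len) % CACHE_LINE; /* 0 if the range ends on a line boundary */
    uintptr_t first_line = addr / CACHE_LINE;
    uintptr_t last_line = (addr + len - 1) / CACHE_LINE;
    if (first_line == last_line)
        return (head != 0 || tail != 0) ? 1u : 0u;
    return (unsigned)(head != 0) + (unsigned)(tail != 0);
}
```

A page-aligned buffer whose length is a multiple of the line size has no partial lines, while the same buffer shifted by 16 bytes has two (one at each end).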