P2P Tesla K80 gives 7 GB/s cudaMemcpy Bandwidth!!!

I have a Gigabyte MZ31-AR0 motherboard with 2 Tesla K80 cards connected at PCI Express 3.0 x16 (slots 1 & 6). When I run the ./simpleP2P test between the two GPUs of one Tesla K80 card, I get a bandwidth of 7 GB/s, which is less than half of what I expected (16 GB/s). ACS is disabled, as is the IOMMU. If I run across the two K80 cards, the bandwidth drops even further, to 3 GB/s. The AMD EPYC 7251 has 4 NUMA nodes, and the GPU topology reported is PIX and SYS. Is it normal to get only about half the expected bandwidth within the same card, and less than a quarter across NUMA nodes? Any help would be highly appreciated.
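
For reference, this is roughly how the 16 GB/s I expected compares with the measured numbers. It is only a back-of-the-envelope sketch: it assumes the standard PCIe 3.0 figures of 8 GT/s per lane with 128b/130b encoding, and it ignores packet and protocol overhead, so the achievable peak is somewhat lower than what it prints. The function and variable names are my own.

```python
# Theoretical PCIe 3.0 x16 bandwidth (one direction) versus my measurements.
# Assumptions: 8 GT/s per lane and 128b/130b line encoding (PCIe 3.0 values).
# Packet/protocol overhead is ignored, so the real achievable peak is lower.

GT_PER_S_PER_LANE = 8.0          # giga-transfers per second, one bit each
ENCODING_EFFICIENCY = 128 / 130  # payload bits per transmitted bit
LANES = 16


def pcie3_peak_gb_per_s(lanes: int = LANES) -> float:
    """Peak one-direction bandwidth in GB/s for a PCIe 3.0 link."""
    gbit_per_s = GT_PER_S_PER_LANE * ENCODING_EFFICIENCY * lanes
    return gbit_per_s / 8  # bits -> bytes


def fraction_of_peak(measured_gb_per_s: float, lanes: int = LANES) -> float:
    """Measured bandwidth as a fraction of the theoretical link peak."""
    return measured_gb_per_s / pcie3_peak_gb_per_s(lanes)


if __name__ == "__main__":
    # Values reported by simpleP2P on my system.
    measured = {"same K80 card": 7.0, "across K80 cards": 3.0}
    print(f"theoretical x16 peak: {pcie3_peak_gb_per_s():.2f} GB/s")
    for label, gb_per_s in measured.items():
        share = fraction_of_peak(gb_per_s)
        print(f"{label}: {gb_per_s:.1f} GB/s = {share:.0%} of peak")
```

So against the theoretical link rate, the same-card result is a bit under half and the cross-card result is under a fifth.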