Sparse matrix vector multiplication and Multi GPU

I am trying to implement sparse matrix-vector multiplication on multiple GPUs.

To do this I need to communicate from GPU to GPU on each iteration.
I use clEnqueueCopyBuffer for this (I have also tried mapped memory, with the same result).

When I time, for example, consecutive clEnqueueCopyBuffer transfers on their own, I get a bandwidth of about 5230 MB/s.

Unfortunately, this is not the real situation, where one kernel execution is followed by n transfers (n = 1, 2, 4).

In my case the kernel execution takes about 1 or 2 milliseconds and is followed by 4 transfers of 196608 bytes each.
In this situation the transfers need 18 milliseconds, which is a bandwidth of only 42-43 MB/s. This means the time is almost entirely latency.

Is it normal to have a latency of 18 milliseconds (a LAN has lower latency) on a server running Red Hat with a
PCI bridge: Intel Corporation 5520/5500/X58 I/O, 8 Tesla S1070 and 16 Xeon 5520?

I would appreciate any answer or hint that helps me understand this poor performance.