For CUDA memory transfers, the actual host-to-device and device-to-device transfer rates can be measured with CUDA-Z.
But is there a way to calculate the theoretical device-to-device transfer rate?
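For reference, one common back-of-the-envelope estimate of peak global memory bandwidth is memory clock times bus width times transfers per clock. The sketch below is only an illustration of that arithmetic: the helper name `theoretical_gbs` is made up here, and the factor of 2 assumes double-data-rate memory, which may not hold for every memory type. It queries device 0 through `cudaDeviceGetAttribute`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: peak bandwidth in GB/s from the memory clock (kHz)
// and the bus width (bits). transfers_per_clock = 2 is an ASSUMPTION
// (double data rate); adjust it for the memory type actually installed.
double theoretical_gbs(int clock_khz, int bus_width_bits, int transfers_per_clock = 2) {
    double transfers_per_sec = static_cast<double>(clock_khz) * 1e3 * transfers_per_clock;
    double bytes_per_transfer = bus_width_bits / 8.0;
    return transfers_per_sec * bytes_per_transfer / 1e9;
}

int main() {
    int device = 0;      // assumes the GPU of interest is device 0
    int clock_khz = 0;   // peak memory clock, reported in kHz
    int bus_bits = 0;    // global memory bus width, reported in bits

    if (cudaDeviceGetAttribute(&clock_khz, cudaDevAttrMemoryClockRate, device) != cudaSuccess ||
        cudaDeviceGetAttribute(&bus_bits, cudaDevAttrGlobalMemoryBusWidth, device) != cudaSuccess) {
        std::fprintf(stderr, "could not query device attributes\n");
        return 1;
    }

    std::printf("memory clock: %d kHz, bus width: %d bits\n", clock_khz, bus_bits);
    std::printf("estimated peak bandwidth: %.1f GB/s\n", theoretical_gbs(clock_khz, bus_bits));
    return 0;
}
```

Note that a device-to-device copy both reads and writes global memory, so how a measured copy rate compares to this peak depends on whether the tool counts the copied bytes once or twice.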
In addition, the theoretical rate of traffic between global memory and the GPU chip should be given by the memory bandwidth. The bandwidth of shared memory inside the GPU is obviously much higher than the global memory bandwidth. But is there any way to measure the actual transfer rate between global memory and shared memory?
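One way to approach this, sketched under assumptions rather than as a definitive benchmark, is a small kernel that stages global memory through shared memory, timed with CUDA events. The kernel name `stage_through_shared`, the helper `effective_gbs`, and the sizes (1024 blocks of 256 threads, 2^26 floats) are all choices made for this illustration. The number it prints is the effective rate at which global memory is read into shared memory, which in practice is limited by the global memory side.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kThreads = 256;   // threads per block (illustrative choice)
constexpr int kBlocks  = 1024;  // blocks in the grid (illustrative choice)

// Hypothetical helper: bytes moved in `ms` milliseconds, as GB/s.
double effective_gbs(size_t bytes, float ms) { return bytes / (ms * 1.0e6); }

// Each thread copies one float per iteration from global to shared memory,
// then reads a neighbour's slot so the shared-memory store cannot be elided.
// n must be a multiple of kBlocks * kThreads so that every thread runs the
// same number of iterations and __syncthreads() stays well defined.
__global__ void stage_through_shared(const float* in, float* out, size_t n) {
    __shared__ float tile[kThreads];
    float acc = 0.0f;
    size_t stride = static_cast<size_t>(gridDim.x) * blockDim.x;
    for (size_t i = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x; i < n; i += stride) {
        tile[threadIdx.x] = in[i];                  // global -> shared
        __syncthreads();
        acc += tile[(threadIdx.x + 1) % kThreads];  // shared -> register
        __syncthreads();
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;  // keep the result live
}

int main() {
    const size_t n = size_t(1) << 26;  // 64 Mi floats = 256 MiB of input
    float *in = nullptr, *out = nullptr;
    if (cudaMalloc(&in, n * sizeof(float)) != cudaSuccess ||
        cudaMalloc(&out, size_t(kBlocks) * kThreads * sizeof(float)) != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(in, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    stage_through_shared<<<kBlocks, kThreads>>>(in, out, n);  // warm-up launch
    cudaEventRecord(start);
    stage_through_shared<<<kBlocks, kThreads>>>(in, out, n);  // timed launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    if (cudaGetLastError() != cudaSuccess) {
        std::fprintf(stderr, "kernel launch failed\n");
        return 1;
    }

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    std::printf("global -> shared: %.1f GB/s (%.3f ms)\n",
                effective_gbs(n * sizeof(float), ms), ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

A single timed launch is noisy, so averaging over several launches would give a steadier figure.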