Theoretical and actual values of CUDA memory transfer rate

I recently ran into some problems while learning CUDA and would like to ask a few questions. As we know, CUDA-Z can measure the actual transfer rate between host and device and from device to device. But how should the theoretical device-to-device rate be calculated? In addition, the theoretical rate of traffic between global memory and the GPU chip should be given by the memory bandwidth. The shared memory bandwidth inside the GPU is obviously much higher than the memory bandwidth. But is there any way to measure the actual transfer rate between global memory and shared memory? I very much hope that you can help me answer these questions or point me to some information on the subject. Thank you very much.

The theoretical transfer rate is the product of the memory interface width (converted from bits to bytes), the memory interface clock, and a memory-type-specific multiplier (a power of two, e.g. 2 for DDR3). Note that the memory clock of a given GPU generally fluctuates with the power management state, so one would have to find the maximum memory clock by running a memory-intensive GPU task and observing the clock rate. I have not seen the memory clock being influenced by clock boosting mechanisms, but I cannot exclude the possibility of a boostable memory clock on some GPUs.
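To make the product above concrete, here is a minimal Python sketch of the calculation. The function name and the example numbers (a 128-bit interface, a 900 MHz memory clock, a multiplier of 2) are hypothetical illustration values, not the specification of any particular GPU. The division by 8 converts the interface width from bits to bytes.

```python
def theoretical_bandwidth_gb_per_s(bus_width_bits, memory_clock_mhz, data_rate_multiplier):
    """Theoretical memory bandwidth in GB/s, with 1 GB taken as 1e9 bytes."""
    # Bytes moved across the interface per transfer.
    bytes_per_transfer = bus_width_bits / 8
    # Transfers per second: clock rate times the memory-type multiplier.
    transfers_per_second = memory_clock_mhz * 1e6 * data_rate_multiplier
    return bytes_per_transfer * transfers_per_second / 1e9


# Hypothetical example values, chosen only for illustration.
example_bandwidth = theoretical_bandwidth_gb_per_s(128, 900, 2)
print(f"theoretical bandwidth: {example_bandwidth:.1f} GB/s")
```

The memory clock passed in should be the maximum observed under load, for the reason given above.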

Since the theoretical transfer rate is not achievable in practice, what is usually of interest (e.g. for a roofline model) is the maximum bandwidth that can be achieved using the most favorable access pattern. For the memory subsystem of modern CPUs and GPUs that is typically on the order of 80% of theoretical.
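As a rough illustration of how the achievable bandwidth enters a roofline model, the Python sketch below caps a kernel's attainable performance at the smaller of the compute peak and the product of arithmetic intensity and achievable bandwidth. The 0.8 factor is the rule of thumb mentioned above, not a measured value, and the function names, the theoretical bandwidth, and the compute peak are all hypothetical.

```python
def achievable_bandwidth_gb_per_s(theoretical_gb_per_s, efficiency=0.8):
    """Rule-of-thumb achievable bandwidth; 0.8 is an assumed efficiency."""
    return theoretical_gb_per_s * efficiency


def roofline_gflops(arithmetic_intensity, peak_gflops, bandwidth_gb_per_s):
    """Attainable GFLOP/s for a kernel with the given FLOP-per-byte ratio."""
    # Memory-bound kernels are limited by bandwidth times intensity,
    # compute-bound kernels by the peak compute rate.
    return min(peak_gflops, arithmetic_intensity * bandwidth_gb_per_s)


# Hypothetical device: 28.8 GB/s theoretical, 300 GFLOP/s peak.
bandwidth = achievable_bandwidth_gb_per_s(28.8)
for intensity in (0.25, 4.0, 64.0):
    attainable = roofline_gflops(intensity, 300.0, bandwidth)
    print(f"intensity {intensity:6.2f} FLOP/byte -> {attainable:7.2f} GFLOP/s")
```

In practice the bandwidth term would come from a measurement with a favorable access pattern rather than from the assumed efficiency.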

I am not sure CUDA-Z is a particularly reliable way of determining the maximum achievable device-to-device throughput. In the machine I am typing on right now there is a tiny GPU, a Quadro K420. Two different versions of CUDA-Z report the device-to-device memory copy speed as 10026 MiB/sec and 10031 MiB/sec, whereas with my own program I measure a memory throughput of 28.51 GB/sec during copying. That figure counts both the reads and the writes, so the copy itself transfers data at half that rate, 14.25 GB/sec.
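The figures above use different units and count different things, which the Python sketch below tries to make explicit. It assumes that MiB means 2^20 bytes, that GB means 10^9 bytes, and that the CUDA-Z number is a copy rate (bytes copied per second) while the 28.51 GB/sec figure counts bytes read plus bytes written. The helper names are invented for this illustration.

```python
MIB = 2**20   # bytes per MiB
GB = 10**9    # bytes per GB


def mib_per_s_to_gb_per_s(rate_mib_per_s):
    """Convert a rate in MiB/s to GB/s."""
    return rate_mib_per_s * MIB / GB


def copy_rate_from_memory_throughput(throughput_gb_per_s):
    """Copy rate implied by a throughput that counts reads and writes."""
    # Every byte copied is read once and written once.
    return throughput_gb_per_s / 2


def memory_throughput_gb_per_s(bytes_copied, seconds):
    """Memory throughput of a device-to-device copy, counting both directions."""
    return 2 * bytes_copied / seconds / GB


cuda_z_copy_rate = mib_per_s_to_gb_per_s(10026)
own_copy_rate = copy_rate_from_memory_throughput(28.51)
print(f"CUDA-Z copy rate:  {cuda_z_copy_rate:.3f} GB/s")
print(f"implied copy rate: {own_copy_rate:.3f} GB/s")
```

Under these assumptions the two copy rates are directly comparable, and the CUDA-Z result is still noticeably lower.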

First of all, I want to thank you for your patience.

Question 1: Based on your answer above, can the theoretical device-to-device transfer rate be taken to be the video memory bandwidth?

Question 2: I think the video memory bandwidth is the theoretical upper limit of the transfer rate from the global memory to the GPU chip (for example, shared memory). Is this correct?

Question 3: From your answer, I still don’t know how to measure the transfer rate from global memory to shared memory. Can you suggest a method?

I sincerely hope to have further communication with you, thank you very much!

Are you here?

While not directly addressing your questions, you may find useful related information here and the bibliography may offer direction on how to perform the testing:

Thank you very much for the documentation; I got some useful information from it. Can you also tell me where you got this document, or whether you have similar documents? I look forward to your reply. Thank you very much.

I got it from the same place as you. There is another paper with the same name, but “Volta” instead of “Turing”, which is almost the same.