Word Transfer time in GPU

Can you please help me with the typical range of values for word transfer time in a GPU?

You might want to define that a little more. Transferring a word from where to where?

device memory → device memory
host memory → device memory
shared memory → global memory

Also, most transfers will have some kind of overhead/startup cost/latency, followed by a period during which the transfer is actually occurring; for that period it is more practical to talk about the amount of time it takes to transfer a byte.

And you may want to define the size of the word you are asking about.

The smallest end of the range, considering all the cases above, is somewhere around 1 TB/s (device-memory-class bandwidth), so about 1 picosecond per byte.

The largest end of the range, considering all the cases above but ignoring latency, is probably on the order of magnitude of 0.1 nanosecond per byte.

Note that there are about 2 orders of magnitude between those estimates (1 picosecond versus 100 picoseconds per byte).

Including all possible/conceivable latency means the upper bound is arbitrarily long.
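To make the arithmetic behind those two endpoints concrete, here is a minimal sketch. The bandwidth figures are the illustrative round numbers used above (1 TB/s and 10 GB/s), not measurements of any particular GPU:

```python
# Convert a bandwidth figure into a per-byte transfer time.
# Bandwidth values below are illustrative assumptions from the
# discussion above, not measured numbers.

def time_per_byte_ps(bandwidth_bytes_per_s):
    """Seconds per byte, expressed in picoseconds."""
    return 1e12 / bandwidth_bytes_per_s

fast = time_per_byte_ps(1e12)   # ~1 TB/s, device-memory-class bandwidth
slow = time_per_byte_ps(10e9)   # ~10 GB/s, PCIe Gen3 x16-class bandwidth

print(f"fast: {fast:.1f} ps/byte")    # 1.0 ps/byte
print(f"slow: {slow:.1f} ps/byte")    # 100.0 ps/byte
print(f"ratio: {slow / fast:.0f}x")   # two orders of magnitude
```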

Thank you for your reply, sir.
Can you please give the host-to-device memory transfer time ranges?

A typical PCIE Gen3 link used for GPUs (x16) can reach an achievable transfer rate of ~10GB/s. So that would be 0.1 nanosecond per byte. Some GPUs have slower links, and some have faster links. A typical PCIE Gen2 link is half as fast. A typical PCIE Gen4 link is twice as fast.
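The doubling relationship between generations can be sketched as below. The 10 GB/s Gen3 x16 figure is the achievable rate quoted above; the Gen2 and Gen4 numbers are simply scaled from it (roughly a factor of two per generation), not measured values:

```python
# Approximate achievable x16 host<->device rates by PCIe generation,
# assuming each generation roughly doubles the previous one.
# Only the Gen3 figure (10 GB/s achievable) comes from the discussion;
# the others are scaled estimates.

GEN3_X16_GBPS = 10.0

rates = {gen: GEN3_X16_GBPS * 2 ** (gen - 3) for gen in (2, 3, 4)}

for gen, gbps in rates.items():
    ns_per_byte = 1.0 / gbps  # 1 / (GB/s) = ns per byte
    print(f"Gen{gen} x16: ~{gbps:g} GB/s, ~{ns_per_byte:g} ns/byte")
```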

Sir, can I get any reference for these values?

Try your search skills. When I did a Google search for “PCIE Gen3 Bandwidth”, this was the first hit I got. The reason it shows 16GB/s instead of 10GB/s is that the first number is the peak theoretical value for the link; it is not achievable in practice.

Thank you for your answers and suggestions, sir.
In my work I ran my implementation on an NVIDIA Tesla T4 GPU using Google Colab, and I got good implementation results.
I have analytical expressions in terms of transfer time (tw) and multiplication time (tm).
While matching the implementation results against the numerical results, I am seeing a difference between the two.
Can you please suggest where and how I can get the values of tw and tm for the Tesla T4 GPU?

GPUs don’t have published numbers for tm and tw that I am aware of. Similar to the previous discussion, you would have to decide how to derive these from other published data. Haven’t you already come up with a number for tw? Why are you asking for it again?

The transfer rate across PCIe is not just a function of the GPU, but dependent on system configuration.

For a given system, it is possible to model the transfer rate fairly accurately by modelling total transfer time as consisting of a fixed overhead plus transport time that scales linearly with the number of bytes transferred. To do this, measure the throughput at various transfer sizes, then set up an overdetermined system of equations to solve for the two coefficients.
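The fitting step described above can be sketched as follows. The timing data here is synthetic, generated from assumed coefficients (including the 1.125 microsecond overhead mentioned below) purely to stand in for real measurements at several transfer sizes:

```python
# Model total transfer time as t(n) = t_overhead + n * t_per_byte and
# solve the overdetermined system with least squares.
# The "measurements" are synthesized from assumed coefficients so the
# example is self-contained; on a real system you would time actual
# transfers at each size.
import numpy as np

# Assumed coefficients, used only to synthesize example data:
true_overhead_s = 1.125e-6   # fixed per-transfer overhead
true_per_byte_s = 1.0e-10    # 0.1 ns per byte (~10 GB/s)

sizes = np.array([4e3, 64e3, 1e6, 16e6, 256e6])    # transfer sizes, bytes
times = true_overhead_s + sizes * true_per_byte_s  # "measured" times, s

# One equation per measurement: [1, n] @ [overhead, per_byte] = t
A = np.column_stack([np.ones_like(sizes), sizes])
(overhead, per_byte), *_ = np.linalg.lstsq(A, times, rcond=None)

print(f"overhead ~ {overhead * 1e6:.3f} us, "
      f"per byte ~ {per_byte * 1e9:.3f} ns")
```

With noiseless synthetic data the fit recovers the assumed coefficients exactly; with real measurements the least-squares solution averages out the timing noise across transfer sizes.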

A worked example can be found in this previous question. In that case the fixed per-transfer overhead came out to 1.125 microseconds (the number on your system may or may not be similar).