That’s in fact quite normal. The number NVIDIA reports is the maximum theoretical bandwidth (bus width × frequency), i.e. what you would get if the bus were 100% active on every clock cycle for the duration of the transfer.
Unfortunately, theory and practice rarely match up. Once you factor in memory latency, plus any time the GPU spends receiving commands, you arrive at the real-life figure you are measuring.
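As a purely illustrative example (the numbers are made up, not those of any particular card): a 384-bit bus with memory at an effective 2 GHz gives a peak of (384 / 8) bytes × 2×10⁹ transfers/s = 96 GB/s, but that assumes the bus is perfectly busy; a measured copy will land some way below it.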
I’m aware of that and in fact I do not expect to get exactly the theoretical bandwidth.
However, the test I made uses a memcpy call, which I expect to be extremely well optimized. A kernel that implements such a copy operation is extremely short and hence doesn’t use many instructions.
Moreover, I run the test on a very large array (150 MB). In such a case the GPU is fully occupied and the latency should be completely hidden. I would like to know what exactly causes the slowdown and why.
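For reference, here is a minimal sketch of the kind of test described above. The 150 MB size is from the post; the use of a device-to-device cudaMemcpy and cudaEvent timing is an assumption about how such a test might be written, not the original code:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 150 * 1024 * 1024;   // 150 MB, as in the test above
    char *src = 0, *dst = 0;
    cudaMalloc(&src, bytes);
    cudaMalloc(&dst, bytes);

    // Warm-up copy so one-time initialization cost is not timed.
    cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int reps = 10;
    cudaEventRecord(start, 0);
    for (int i = 0; i < reps; ++i)
        cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // A device-to-device copy both reads and writes each byte,
    // so 2 * bytes cross the memory bus per repetition.
    double gbPerSec = (2.0 * bytes * reps) / (ms / 1000.0) / 1e9;
    printf("Effective bandwidth: %.1f GB/s\n", gbPerSec);

    cudaFree(src); cudaFree(dst);
    cudaEventDestroy(start); cudaEventDestroy(stop);
    return 0;
}
```

Note that each copy moves the data across the bus twice (one read, one write), so the achievable copy bandwidth is bounded by half the raw bus rate counted per direction; forgetting the factor of 2 is a common reason a result looks half as fast as expected.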
The device performs context switches between blocks so it can hide the latency. Is it possible that this is what’s causing the delay?
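For comparison, a hand-written copy kernel of the kind mentioned above really is only a few instructions per thread. A grid-stride version (a common idiom, assumed here rather than taken from the original test) might look like:

```cuda
// Trivial device-to-device copy kernel: a handful of instructions per thread.
__global__ void copyKernel(const float *src, float *dst, size_t n)
{
    // Grid-stride loop so any launch configuration covers the whole array.
    for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
         i < n;
         i += (size_t)gridDim.x * blockDim.x)
        dst[i] = src[i];
}
```

With enough resident warps per multiprocessor, the hardware issues from one warp while another waits on memory; this switching is done by the warp scheduler in hardware and costs no extra cycles, so by itself it hides latency rather than adding overhead.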