Strange behaviour of constant memory: copying to constant memory from device is not faster than from host

I am trying to measure the time a program spends copying data to constant memory. The results surprised me.

Size (ints)  Device-to-Const  Host-to-Const
1            0.117            0.109
2            0.094            0.098
4            0.091            0.099
8            0.093            0.097
16           0.095            0.100
32           0.093            0.097
64           0.091            0.102
128          0.097            0.106
256          0.044            0.102
512          0.040            0.109
1024         0.044            0.121
2048         0.043            0.142
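
For reference, here is a minimal sketch of this kind of measurement. It is not the attached copyToConst.cu. The names c_data, timeCopy, MAX_INTS and REPS are made up for illustration, and timing with CUDA events averaged over repeated calls is an assumption about one reasonable way to measure, so the absolute numbers it prints may differ from the table above.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define MAX_INTS 2048   // largest size tested (8 KB, well under the 64 KB constant memory limit)
#define REPS     100    // hypothetical repetition count used for averaging

// Destination buffer in constant memory (illustrative name).
__constant__ int c_data[MAX_INTS];

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Average time in milliseconds of one cudaMemcpyToSymbol call copying n ints.
// kind is cudaMemcpyHostToDevice for a host source pointer and
// cudaMemcpyDeviceToDevice for a device source pointer.
static float timeCopy(const void* src, size_t n, cudaMemcpyKind kind)
{
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    // One untimed warm-up copy so first-call setup cost is not measured.
    CHECK(cudaMemcpyToSymbol(c_data, src, n * sizeof(int), 0, kind));
    CHECK(cudaDeviceSynchronize());

    CHECK(cudaEventRecord(start));
    for (int i = 0; i < REPS; ++i)
        CHECK(cudaMemcpyToSymbol(c_data, src, n * sizeof(int), 0, kind));
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    return ms / REPS;
}

int main()
{
    int* h_src = (int*)malloc(MAX_INTS * sizeof(int));
    if (!h_src) return EXIT_FAILURE;
    for (int i = 0; i < MAX_INTS; ++i) h_src[i] = i;

    // Device-side copy of the same data, used as the device-to-constant source.
    int* d_src = nullptr;
    CHECK(cudaMalloc(&d_src, MAX_INTS * sizeof(int)));
    CHECK(cudaMemcpy(d_src, h_src, MAX_INTS * sizeof(int), cudaMemcpyHostToDevice));

    printf("%-12s %-16s %-14s\n", "Size (ints)", "Device-to-Const", "Host-to-Const");
    for (size_t n = 1; n <= MAX_INTS; n *= 2) {
        float dev  = timeCopy(d_src, n, cudaMemcpyDeviceToDevice);
        float host = timeCopy(h_src, n, cudaMemcpyHostToDevice);
        printf("%-12zu %-16.3f %-14.3f\n", n, dev, host);
    }

    CHECK(cudaFree(d_src));
    free(h_src);
    return 0;
}
```

Averaging over many calls is meant to smooth out run-to-run jitter, which matters when each copy takes only a fraction of a millisecond.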

Copying from device memory to constant memory is clearly not meaningfully faster than copying directly from host to constant memory for sizes up to 128 ints. I don't understand why. Since the data is already in graphics memory, it should be transferred to constant memory at the device-to-device rate, which should be much faster than host to device.

One explanation would be that call overhead dominates in both cases, because the actual transfer times are much smaller than the overhead. But then I cannot explain why, for sizes of 256 and above, the time to copy from device memory to constant memory drops significantly: the copy takes less than half the time, when doubling the size should make it slower, not faster.

Can someone explain what is going on here? Thanks.
copyToConst.cu (1.36 KB)