Weird memory write bytes reported by nv-nsight-cu-cli

I modified the sample 0_Simple/vectorAdd to add 5,000,000 elements.
However, Nsight Compute reports fewer DRAM write bytes than I expect. The output vector is 5,000,000 floats, so it should report 20 MB written, but it reports 17 MB.
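
To spell out where the 20 MB expectation comes from, here is a rough sketch of the arithmetic in Python. It assumes 4-byte floats, that 1 MB means 10^6 bytes, and that every element crosses DRAM exactly once (two input vectors read, one output vector written). The function name `expected_dram_traffic_mb` is just made up for this illustration.

```python
# Back-of-the-envelope DRAM traffic for vectorAdd (C[i] = A[i] + B[i]).
# Assumptions: 4-byte floats, 1 MB = 10**6 bytes, and every element
# crosses DRAM exactly once (no caching effects considered).

BYTES_PER_FLOAT = 4
MB = 10**6


def expected_dram_traffic_mb(num_elements):
    """Return (read_mb, write_mb) expected for one vectorAdd launch."""
    vector_bytes = num_elements * BYTES_PER_FLOAT
    read_mb = 2 * vector_bytes / MB   # two input vectors, A and B
    write_mb = 1 * vector_bytes / MB  # one output vector, C
    return read_mb, write_mb


read_mb, write_mb = expected_dram_traffic_mb(5_000_000)
reported_write_mb = 17.0  # write value the profiler reported in my run

print(f"expected read:   {read_mb:.2f} MB")
print(f"expected write:  {write_mb:.2f} MB")
print(f"write shortfall: {write_mb - reported_write_mb:.2f} MB")
```

With these assumptions the expected read traffic is 40 MB and the expected write traffic is 20 MB, so the reported write value is 3 MB short.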

With this command:

nv-nsight-cu-cli --sampling-interval 0 --metric dram__bytes_read.sum,dram__bytes_write.sum ./vectorAdd

I get:

[Vector addition of 5000000 elements]
==PROF== Connected to process 61664 (/scale/cal/home/jungwk/NVIDIA_CUDA-11.2_Samples/0_Simple/vectorAdd/vectorAdd)
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 19532 blocks of 256 threads
==PROF== Profiling "vectorAdd" - 1: 0%…50%…100% - 1 pass
Copy output data from the CUDA device to the host memory
==PROF== Disconnected from process 61664
[61664] vectorAdd@
vectorAdd(float const*, float const*, float*, int), 2021-Jan-05 13:14:30, Context 1, Stream 7
Section: Command line profiler metrics
---------------------------------------------------------------------- --------------- ------------------------------
dram__bytes_read.sum                                                             Mbyte                          40.00
dram__bytes_write.sum                                                            Mbyte                          17.00
---------------------------------------------------------------------- --------------- ------------------------------

Why is it slightly less than 20 MB? From further experiments I found that, while the read amount is as expected, about 3 MB is always missing from the DRAM write bytes, regardless of the vector size.
Is this normal behavior?

I tested on a Tesla V100 16GB with Nsight Compute version 2020.3.0.0 (build 29307467).