Welcome to the Nsight Compute Forum.
Could you please provide the full source code so that we can compile it and generate a report? That will let us check this in detail. Thanks!
In the NVIDIA GPU architecture, the L2 cache is the point of coherence for GPU device memory. The RTX 4070 has more than 30 MB of L2 cache, which is larger than the output buffer. The written data has not yet been evicted from the L2 cache; therefore, the device memory write size is only 2.43 KB instead of the expected ~16.78 MB.
If you make N 100x larger, the amount of data written to device memory should be within a few percent of the expected value.
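
As a rough illustration of why the larger N helps, the arithmetic can be sketched as follows. This is only a back-of-the-envelope estimate, not a measurement: the 36 MB L2 size is an assumed figure for the RTX 4070 (the point above only relies on it being more than 30 MB), the 16,777,216-byte buffer is an assumed reading of the ~16.78 MB expected write size, and the helper name max_fraction_left_in_l2 is made up for this example.

```python
# Rough arithmetic only; no GPU is queried here.
L2_BYTES = 36 * 1024 * 1024   # assumption: RTX 4070 L2 size (the reply only states "> 30 MB")
OUTPUT_BYTES = 16_777_216     # assumption: the ~16.78 MB output buffer from this thread


def max_fraction_left_in_l2(buffer_bytes, l2_bytes=L2_BYTES):
    """Upper bound on the share of written bytes that can still be resident in L2."""
    return min(1.0, l2_bytes / buffer_bytes)


for scale in (1, 100):
    size = OUTPUT_BYTES * scale
    frac = max_fraction_left_in_l2(size)
    print(f"{scale:>3}x buffer: {size / 1e6:9.2f} MB, up to {frac:.1%} can remain in L2")
```

With the original N the whole buffer can fit in L2, so almost nothing has to reach device memory. With N 100x larger, only a small share of the buffer can stay cached, so most writes must be evicted to device memory and the metric should approach the expected size.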