How to Implement Performance Metrics in CUDA C/C++

Yes, the timing covers the entire kernel. But because the kernel is bandwidth-bound, its run time is dominated by memory traffic, so we are effectively measuring bandwidth. We could also calculate the kernel's compute throughput, but it would be low relative to the GPU's peak compute throughput, since bandwidth is the bottleneck in this case.
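To make the comparison concrete, here is a minimal host-side sketch of how both numbers can be derived from the same elapsed time. The helper names `effective_bandwidth_gbs` and `compute_throughput_gflops` are hypothetical (they are not part of the CUDA API), and the elapsed time is assumed to be in milliseconds, as a timer such as `cudaEventElapsedTime` would report it.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not a CUDA API call): effective bandwidth in GB/s,
// given the bytes the kernel reads and writes and its elapsed time in ms.
double effective_bandwidth_gbs(std::size_t bytes_read,
                               std::size_t bytes_written,
                               double elapsed_ms) {
    assert(elapsed_ms > 0.0);
    // bytes / 1e9 gives GB, elapsed_ms / 1e3 gives seconds,
    // so GB/s = bytes / (elapsed_ms * 1e6).
    return static_cast<double>(bytes_read + bytes_written) / (elapsed_ms * 1e6);
}

// Hypothetical helper: compute throughput in GFLOP/s, given the number of
// floating-point operations the kernel performs and its elapsed time in ms.
double compute_throughput_gflops(std::size_t flops, double elapsed_ms) {
    assert(elapsed_ms > 0.0);
    // flops / 1e9 gives GFLOP, elapsed_ms / 1e3 gives seconds.
    return static_cast<double>(flops) / (elapsed_ms * 1e6);
}
```

As an illustration only, assume a SAXPY-style kernel over N = 20 * 2^20 floats that reads two arrays, writes one, and does 2 flops per element, with a made-up elapsed time of 2.0 ms. That works out to about 125.8 GB/s of effective bandwidth but only about 21.0 GFLOP/s, which is why the compute figure looks small next to a GPU's peak.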