Yes, the whole kernel is timed. But since the kernel is bandwidth-bound, the measurement effectively reflects memory bandwidth. We could calculate the kernel's compute throughput, but it would be low relative to the GPU's peak compute throughput (since bandwidth is the bottleneck in this case).
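For reference, here is a minimal sketch of how such a measurement is typically done: time the whole kernel with CUDA events and derive effective bandwidth from the bytes moved per unit time. The copy kernel and array size below are placeholders for illustration, not the kernel from the original question.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder bandwidth-bound kernel: a simple element-wise copy.
__global__ void copyKernel(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

int main()
{
    const int n = 1 << 24;                   // 16M elements (arbitrary size)
    const size_t bytes = n * sizeof(float);

    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time the whole kernel with CUDA events.
    cudaEventRecord(start);
    copyKernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // Effective bandwidth: bytes read plus bytes written, divided by elapsed time.
    double gbps = (2.0 * bytes) / (ms / 1e3) / 1e9;
    printf("Kernel time: %.3f ms, effective bandwidth: %.1f GB/s\n", ms, gbps);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Because this kernel does essentially no arithmetic per byte moved, the measured time tracks memory traffic, which is why the result is usually quoted as GB/s rather than FLOP/s.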