And here is my profiling result:
In the picture above, the throughput is shown for ‘Memcpy’, but not for the kernel.
What do I have to do to get the throughput of the kernel?
Also, does ‘throughput’ here mean global memory throughput?
And is global memory throughput the same thing as the effective global memory bandwidth described in https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#effective-bandwidth-calculation ?
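
For reference, this is how I understand the effective bandwidth calculation in that guide: ((bytes read + bytes written) / 10^9) / time in seconds, which gives GB/s. Below is a minimal host-side sketch of it. The function name `effective_bandwidth_gbps` is just something I made up for illustration, and I am assuming the kernel time is measured separately in milliseconds (for example with CUDA events, since `cudaEventElapsedTime` reports milliseconds).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of any CUDA API).
// Effective bandwidth in GB/s, following the formula in the Best Practices Guide:
//   ((bytes_read + bytes_written) / 1e9) / time_in_seconds
// elapsed_ms is the measured kernel time in milliseconds.
double effective_bandwidth_gbps(std::size_t bytes_read,
                                std::size_t bytes_written,
                                double elapsed_ms) {
    assert(elapsed_ms > 0.0);  // a zero or negative time would be meaningless
    const double total_gb = static_cast<double>(bytes_read + bytes_written) / 1e9;
    const double seconds = elapsed_ms / 1e3;
    return total_gb / seconds;
}
```

For example, a kernel that copies a 2048 x 2048 matrix of floats reads 2048 * 2048 * 4 bytes and writes the same amount. With a made-up kernel time of 0.25 ms, `effective_bandwidth_gbps(2048ull * 2048 * 4, 2048ull * 2048 * 4, 0.25)` comes out to about 134.2 GB/s.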