And here is my profiling result:
In the picture above, a throughput value is reported for the ‘Memcpy’ operations, but none is reported for the kernel.
What do I have to do to get the throughput of the kernel?
The meaning of ‘throughput’ is global memory throughput, right?
Also, does global memory throughput mean the same thing as the effective global memory bandwidth described in https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#effective-bandwidth-calculation ?
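
For reference, this is how I understand the effective bandwidth calculation in that guide. It is computed by hand from the bytes the kernel reads ($B_r$) and writes ($B_w$) and the elapsed time in seconds, so it may not be the same quantity the profiler reports as throughput. The 1 ms kernel time in the worked line is a made-up value for illustration, not a number from my profile.

```latex
\[
  \text{Effective bandwidth (GB/s)} = \frac{(B_r + B_w) / 10^{9}}{\text{time (s)}}
\]

% Example: copying a 2048 x 2048 matrix of 4-byte floats (one read + one write per element),
% with an assumed (hypothetical) kernel time of 1 ms:
\[
  \frac{(2048^{2} \times 4 \times 2) / 10^{9}}{0.001\ \text{s}}
  = \frac{0.033554432\ \text{GB}}{0.001\ \text{s}}
  \approx 33.55\ \text{GB/s}
\]
```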