The kernel loads 8192 * 2 * 4 = 65536 bytes from SoC memory and stores 8192 * 1 * 4 = 32768 bytes back to SoC memory, so the total transferred data size is 65536 + 32768 = 98304 bytes.
The kernel execution time is 5.25 us, which yields a bandwidth of (98304 / 1e9) / (5.25 / 1e6) = 18.72 GB/s. From this I would conclude that the kernel achieves about 45% (18.72 / 41.59) of the maximum memory bandwidth.
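For reference, here is the arithmetic above as a short script (the thread count, element sizes, kernel time, and 41.59 GB/s peak are the figures from my description, not measured here):

```python
# Recompute the bandwidth estimate from the numbers in the post.
loads = 8192 * 2 * 4       # bytes read from SoC memory
stores = 8192 * 1 * 4      # bytes written back to SoC memory
total = loads + stores     # total bytes transferred

kernel_time_s = 5.25e-6    # measured kernel time: 5.25 us
peak_gbs = 41.59           # theoretical peak bandwidth from the spec

bandwidth_gbs = (total / 1e9) / kernel_time_s
fraction = bandwidth_gbs / peak_gbs

print(total)                      # 98304
print(round(bandwidth_gbs, 2))    # 18.72
print(round(fraction * 100, 1))   # 45.0
```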
However, profiling the kernel with Nsight Compute reports a maximum bandwidth utilization of only 9.96%, as shown in the picture below. I am confused about which part of my analysis is incorrect.
I have read the technical brief you mentioned, and my understanding is that the CPU and the GPU share the same physical memory. Is that right?
If the answer to the first question is yes, I have a follow-up question: does any data movement actually occur when using cudaMemcpy between host and device? It should be more efficient than a cudaMemcpy between separate physical memories, correct?