I am using a GPU for large-scale CT reconstruction, and I have a question about data transfer rates inside the GPU.
How can I find (or calculate) the data transfer rate from GPU global memory to GPU shared memory (data loaded
by the threads of each block)? Likewise, how can I find the transfer rate from GPU global memory to GPU
texture memory (via cudaMemcpy3D from global memory to a cudaArray)?
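For the texture-memory path, here is a minimal sketch of how I imagine timing the cudaMemcpy3D step with CUDA events; the dimensions (nx, ny, nz) and the buffer name d_volume are placeholders, not my real reconstruction code:

```cuda
// Sketch: time a device-to-device cudaMemcpy3D (linear global memory -> cudaArray)
// with CUDA events, then report the effective transfer rate.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int nx = 256, ny = 256, nz = 256;              // assumed volume size
    size_t bytes = (size_t)nx * ny * nz * sizeof(float);

    float *d_volume;
    cudaMalloc(&d_volume, bytes);

    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaExtent extent = make_cudaExtent(nx, ny, nz);
    cudaArray *cuArr;
    cudaMalloc3DArray(&cuArr, &desc, extent);

    cudaMemcpy3DParms p = {0};
    p.srcPtr   = make_cudaPitchedPtr(d_volume, nx * sizeof(float), nx, ny);
    p.dstArray = cuArr;
    p.extent   = extent;
    p.kind     = cudaMemcpyDeviceToDevice;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy3D(&p);                                    // the transfer being measured
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // bytes / (ms * 1e-3 s) / 1e9 = bytes / ms / 1e6 in GB/s
    printf("memcpy3D: %.3f ms, %.2f GB/s\n", ms, bytes / ms / 1e6);

    cudaFreeArray(cuArr);
    cudaFree(d_volume);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

Is event-based timing like this the right way to measure that copy, or is there a profiler counter that reports it directly?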
The bottleneck of my current implementation is that I have to copy data from GPU global memory
to GPU texture memory after every iteration of my algorithm, which is very time-consuming: more than 95%
of the run time is spent on data transfer rather than computation. I tried loading the data from
global memory into shared memory instead, to avoid texture memory altogether, but I saw no speedup; the
transfer from global memory to shared memory is time-consuming too. I want a way to measure and compare
the data transfer rates of the two methods.
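For the shared-memory path, the only way I can think of to isolate the global-to-shared rate is to time a kernel whose only real work is the shared-memory load, again with CUDA events. A sketch of that idea (all names here, loadToShared, d_in, d_sink, and the problem size, are illustrative assumptions):

```cuda
// Sketch: estimate the global -> shared load rate by timing a kernel that does
// nothing but stage data into shared memory. A token write back to global
// memory keeps the compiler from optimizing the load away.
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK 256

__global__ void loadToShared(const float *g, float *sink, int n) {
    __shared__ float tile[BLOCK];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = g[i];            // the global -> shared transfer
        __syncthreads();
        if (threadIdx.x == 0) sink[blockIdx.x] = tile[0];
    }
}

int main() {
    const int n = 1 << 24;                   // assumed data size (~64 MB of floats)
    size_t bytes = (size_t)n * sizeof(float);
    int blocks = (n + BLOCK - 1) / BLOCK;

    float *d_in, *d_sink;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_sink, blocks * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    loadToShared<<<blocks, BLOCK>>>(d_in, d_sink, n);   // warm-up launch

    cudaEventRecord(start, 0);
    loadToShared<<<blocks, BLOCK>>>(d_in, d_sink, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("global->shared: %.3f ms, %.2f GB/s\n", ms, bytes / ms / 1e6);

    cudaFree(d_in); cudaFree(d_sink);
    cudaEventDestroy(start); cudaEventDestroy(stop);
    return 0;
}
```

Would comparing the GB/s figures from these two timings against the C2050's theoretical peak memory bandwidth be a fair comparison, or does the bandwidthTest sample / the profiler give a more reliable number?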
I am using a Tesla C2050 on Red Hat Linux.
Any help would be highly appreciated.