How to calculate the data transfer rate inside the GPU

I am using a GPU for large-data CT reconstruction, and I have a question about data transfer rates inside the GPU.
How can I find (or calculate) the data transfer rate from GPU global memory to GPU shared memory (data loaded
by the threads inside each block)? Also, how can I find the transfer rate from GPU global memory to GPU
texture memory (via cudaMemcpy3D from global memory to a cudaArray)?

The bottleneck of my current implementation is that I have to transfer data from GPU global memory
to GPU texture memory after every iteration of my algorithm, which is very time-consuming. In fact, most
of the time (above 95%) is spent on data transfer rather than on computation. I tried loading the data from
global memory into shared memory to avoid using texture memory altogether, but it gave no speedup; the
transfer from global memory to shared memory is time-consuming too. I want to find a way to compare the
data transfer rates of the two methods.
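One common way to measure an on-device transfer rate is to bracket the copy with CUDA events and divide the bytes moved by the elapsed time. The sketch below (untested; the 512^3 volume size is a placeholder, and the plain device-to-device cudaMemcpy stands in for whichever copy you want to time, e.g. a cudaMemcpy3D into a cudaArray or a kernel that stages tiles through shared memory):

```cuda
// Sketch: timing an on-device copy with CUDA events and computing
// the effective bandwidth. The volume size is an assumed example.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t nbytes = 512u * 512u * 512u * sizeof(float); // example volume

    float *d_src, *d_dst;
    cudaMalloc(&d_src, nbytes);
    cudaMalloc(&d_dst, nbytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    // Replace this with the transfer you want to measure, e.g. a
    // cudaMemcpy3D into a cudaArray, or a kernel launch that loads
    // global memory into shared memory.
    cudaMemcpy(d_dst, d_src, nbytes, cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // Effective bandwidth counts bytes read plus bytes written.
    double gbps = 2.0 * (double)nbytes / (ms * 1.0e6);
    printf("elapsed: %.3f ms, effective bandwidth: %.1f GB/s\n", ms, gbps);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_src);
    cudaFree(d_dst);
    return 0;
}
```

Timing a shared-memory staging kernel the same way lets you compare the two paths with one consistent metric.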

I am using a Tesla C2050 on Red Hat Linux.

Any help would be highly appreciated.

Yongsheng

I have finally figured out the problem. Data transfer inside the GPU between global memory and texture memory turns out not to be the issue; the bottleneck lies in the reconstruction algorithm itself. Forget about this question. ^_^

Hello, have you found a way to calculate the transfer rate between global memory and shared memory?

The CUDA profilers have throughput metrics for SM <-> device memory and SM <-> shared memory. Please run Nsight Compute, nvprof, or NVVP to collect that information.
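For example, the legacy nvprof profiler (the one that supports Fermi-generation cards like the C2050) can report global and shared memory throughput per kernel. A command-line sketch, where `./app` is a placeholder for your binary and the exact metric names should be confirmed with `nvprof --query-metrics` on your toolkit version:

```shell
# Per-kernel global and shared memory throughput (legacy nvprof):
nvprof --metrics gld_throughput,gst_throughput,shared_load_throughput,shared_store_throughput ./app

# On newer GPUs/toolkits, Nsight Compute replaces nvprof:
ncu --set full ./app
```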