I am using a GPU for large-scale CT reconstruction, and I have a question about data transfer rates inside the GPU.
How can I find (or calculate) the data transfer rate from GPU global memory to GPU shared memory (data loaded
by the threads of each block)? Likewise, how can I find the transfer rate from GPU global memory to GPU
texture memory (via cudaMemcpy3D from global memory to a cudaArray)?
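For the texture-memory path, here is a minimal sketch of how I imagine timing the cudaMemcpy3D step with CUDA events; the dimensions (nx, ny, nz) and the buffer name d_volume are placeholders, not my real reconstruction code:

```cuda
// Sketch: time a device-to-device cudaMemcpy3D (linear global memory -> cudaArray)
// with CUDA events, then report the effective transfer rate.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int nx = 256, ny = 256, nz = 256;              // assumed volume size
    size_t bytes = (size_t)nx * ny * nz * sizeof(float);

    float *d_volume;
    cudaMalloc(&d_volume, bytes);

    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaExtent extent = make_cudaExtent(nx, ny, nz);
    cudaArray *cuArr;
    cudaMalloc3DArray(&cuArr, &desc, extent);

    cudaMemcpy3DParms p = {0};
    p.srcPtr   = make_cudaPitchedPtr(d_volume, nx * sizeof(float), nx, ny);
    p.dstArray = cuArr;
    p.extent   = extent;
    p.kind     = cudaMemcpyDeviceToDevice;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy3D(&p);                                    // the transfer being measured
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // bytes / (ms * 1e-3 s) / 1e9 = bytes / ms / 1e6 in GB/s
    printf("memcpy3D: %.3f ms, %.2f GB/s\n", ms, bytes / ms / 1e6);

    cudaFreeArray(cuArr);
    cudaFree(d_volume);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

Is event-based timing like this the right way to measure that copy, or is there a profiler counter that reports it directly?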
The bottleneck of my current implementation is that I have to copy data from GPU global memory
to GPU texture memory after every iteration of my algorithm, which is very time-consuming: more than 95%
of the run time is spent on data transfer rather than computation. I tried loading the data from
global memory into shared memory instead, to avoid texture memory altogether, but I saw no speedup; the
transfer from global memory to shared memory is time-consuming too. I want a way to measure and compare
the data transfer rates of the two methods.
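For the shared-memory path, the only way I can think of to isolate the global-to-shared rate is to time a kernel whose only real work is the shared-memory load, again with CUDA events. A sketch of that idea (all names here, loadToShared, d_in, d_sink, and the problem size, are illustrative assumptions):

```cuda
// Sketch: estimate the global -> shared load rate by timing a kernel that does
// nothing but stage data into shared memory. A token write back to global
// memory keeps the compiler from optimizing the load away.
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK 256

__global__ void loadToShared(const float *g, float *sink, int n) {
    __shared__ float tile[BLOCK];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = g[i];            // the global -> shared transfer
        __syncthreads();
        if (threadIdx.x == 0) sink[blockIdx.x] = tile[0];
    }
}

int main() {
    const int n = 1 << 24;                   // assumed data size (~64 MB of floats)
    size_t bytes = (size_t)n * sizeof(float);
    int blocks = (n + BLOCK - 1) / BLOCK;

    float *d_in, *d_sink;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_sink, blocks * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    loadToShared<<<blocks, BLOCK>>>(d_in, d_sink, n);   // warm-up launch

    cudaEventRecord(start, 0);
    loadToShared<<<blocks, BLOCK>>>(d_in, d_sink, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("global->shared: %.3f ms, %.2f GB/s\n", ms, bytes / ms / 1e6);

    cudaFree(d_in); cudaFree(d_sink);
    cudaEventDestroy(start); cudaEventDestroy(stop);
    return 0;
}
```

Would comparing the GB/s figures from these two timings against the C2050's theoretical peak memory bandwidth be a fair comparison, or does the bandwidthTest sample / the profiler give a more reliable number?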
I am using a Tesla C2050 on Red Hat Linux.
Any help would be highly appreciated.