I was experimenting with device memory reads through texture fetching. In the programming guide (section 5.4), one of the listed benefits is that “they are not subject to the constraints on memory access patterns that global or constant memory reads must respect to get good performance”. To check this, I took the CUDA file provided by MisterAnderson (Nvidia Topic), which implements a bandwidth test, and changed the access pattern from
const unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;
to
const unsigned int idx = threadIdx.x * blockDim.x + blockIdx.x;
hoping that the access would stay almost as fast, as the programming guide seems to promise. Instead, the read-only texture bandwidth dropped from 30 GiB/s to 8 GiB/s (GTS 8800). The other data types and read types behave similarly.
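To make the effect of the change concrete, here is a small host-side C++ sketch (not the benchmark code itself) that evaluates both index expressions for given thread and block coordinates. The function names and the block size of 128 used in the note below are my own assumptions for illustration; the real kernel may use different values.

```cpp
#include <cstdio>

// Signature shared by both index expressions (hypothetical helper type).
typedef unsigned int (*IndexFn)(unsigned int thread_idx,
                                unsigned int block_idx,
                                unsigned int block_dim);

// Original pattern: idx = threadIdx.x + blockIdx.x * blockDim.x
unsigned int original_idx(unsigned int thread_idx,
                          unsigned int block_idx,
                          unsigned int block_dim) {
    return thread_idx + block_idx * block_dim;
}

// Modified pattern: idx = threadIdx.x * blockDim.x + blockIdx.x
unsigned int modified_idx(unsigned int thread_idx,
                          unsigned int block_idx,
                          unsigned int block_dim) {
    return thread_idx * block_dim + block_idx;
}

// Distance, in elements, between the indices read by thread 0 and thread 1
// of the same block.
unsigned int neighbor_stride(IndexFn fn,
                             unsigned int block_idx,
                             unsigned int block_dim) {
    return fn(1, block_idx, block_dim) - fn(0, block_idx, block_dim);
}

// Print the indices read by the first `count` threads of one block.
void print_indices(const char* label, IndexFn fn, unsigned int block_idx,
                   unsigned int block_dim, unsigned int count) {
    std::printf("%s:", label);
    for (unsigned int t = 0; t < count; ++t) {
        std::printf(" %u", fn(t, block_idx, block_dim));
    }
    std::printf("\n");
}
```

With an assumed block size of 128, neighbor_stride(original_idx, 0, 128) is 1, while neighbor_stride(modified_idx, 0, 128) is 128. So with the modified expression, neighboring threads read elements that are a full blockDim.x apart (512 bytes for 4-byte elements), which presumably leaves the texture cache very little spatial locality to work with.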
Did I misunderstand something in the programming guide? Why do device memory reads through texture fetching behave like normal device memory reads if they are not subject to those constraints?