Why is the requested global throughput so low?


I have allocated all of my variables (about 5 MB in total) in pinned memory with cudaHostAlloc to save the time spent copying data from host to device.
However, when I test with the Visual Profiler, the requested global load and store throughput is very low, about 100 MB/s, and the global hit rate is zero. I'm also confused that "global cache requested" and "global cache executed" are both shown as off for every kernel, and I don't know what that means. The performance is far below what I expected. I would appreciate any help.