Bandwidth of reading data from global device memory

Hi everyone, I have written a micro-benchmark to test the bandwidth of reading data from global device memory in two ways. Their kernel functions are listed as follows (these two kernel read the same amout of data from memory):


@each work item read one element from the 2-d matrix

__kernel void load_memory_scalar(const __global datatype * i_matrix, const int w, const int h){

int col = get_global_id(0);
int row = get_global_id(1);
datatype res = i_matrix(row, col, w);     



@each work item read four elements from the 2-d matrix

__kernel void load_memory_vector_row(const __global datatype * i_matrix, const int w, const int h){

int col = get_global_id(0)*VF;
int row = get_global_id(1);
datatype res_1 = i_matrix(row, (col+0), w);     
datatype res_2 = i_matrix(row, (col+1), w);
datatype res_3 = i_matrix(row, (col+2), w);
datatype res_4 = i_matrix(row, (col+3), w);


The kernel program in OCL runs on Quadro5000, and CUDA v4.0.

The final results/bandwidth (GB/s) are listed as follows:

#wxh #kernel-1 #kernel-2

512x512 25.825288 28.577300
512x1024 45.789136 55.321340
1024x1024 48.436636 99.616003
1024x2048 105.405548 143.903362
2048x2048 116.601067 275.523666
2048x4096 161.155947 312.749974
4096x4096 182.113727 383.689990
4096x8192 191.141503 526.958676

The question is why the bandwidth changes so significantly, and even overtake the theoretical bandwidth (reading data from global memory). Can anybody tell me the reasons?

Thanks in advance.

This is the performance data I tested on GTX280. The bandwidth also changes significantly with the data amount reading from global device memory.
#wxh #kernel-1 #kernel-2
512x512 15.442579 9.782132
512x1024 26.957073 54.196507
1024x1024 72.069586 31.487714
1024x2048 102.668141 117.791671
2048x2048 137.627116 126.131465
2048x4096 121.179170 176.805890