I’m using a Tesla C1060 with 4 GB of GDDR3 global memory. In my application, the input is a large array of int (size = SIZE), and each thread needs to read N (= ele_per_thread) random elements and sum them up.
In the following code:
d_L: input array of int
d_sum: result array to store the sum of each thread
ran_num: input array of random indices in [0, SIZE-1]
ele_per_thread: N above
The grid size is 30 (one block per SM; the C1060 has 30 SMs), the block size is 64, and N = 256.
The interesting thing is that when SIZE < 256M (268,435,456), i.e. file size < 1 GB, the running times are all around 2 ms.
But at SIZE = 256M (268,435,456), file size = 1 GB, the running time doubles to 4 ms.
And at SIZE = 512M (536,870,912), file size = 2 GB, it increases further to 6 ms.
I can’t figure out the reason. Since all of these are random memory accesses, I’d expect the running time to be roughly independent of SIZE. In my case, however, accessing an array larger than 1 GB clearly costs much more time.
Can anyone explain this? Thanks!
[codebox]__global__ void
test_mem(int* d_L, int* d_sum, int* ran_num, int ele_per_thread)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int i_ele = tid * ele_per_thread;  // start of this thread's slice of ran_num
    d_sum[tid] = 0;
    for (int j = 0; j < ele_per_thread; j++)
    {
        // gather one random element from global memory per iteration
        d_sum[tid] += d_L[ran_num[i_ele]];
        i_ele++;
    }
    __syncthreads();
}[/codebox]