I’m currently using a Tesla C1060 with 4 GB of GDDR3 global memory. In my application, the input is a large array of int (size = SIZE), and each thread needs to read N (= ele_per_thread) random elements and sum them up.

In the following code:

d_L: input array of int

d_sum: result array to store the sum of each thread

ran_num: input array of random indices, each in the range 0 to SIZE-1

ele_per_thread: N above

The grid size is 30 (the C1060 has 30 SMs), the block size is 64, and N = 256.
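On the host side, that launch configuration would look roughly like the sketch below (buffer names match the kernel arguments; how d_L and ran_num are filled, and all error checking, are omitted, and the exact setup may differ from my real code):

[codebox]// Sketch of the host-side setup for the configuration described above.
const int GRID = 30, BLOCK = 64, N = 256;      // 30 blocks, 64 threads, 256 elements/thread
const size_t SIZE = 268435456;                 // 256M ints = 1 GB of data
const int num_threads = GRID * BLOCK;

int *d_L, *d_sum, *ran_num;
cudaMalloc((void**)&d_L, SIZE * sizeof(int));
cudaMalloc((void**)&d_sum, num_threads * sizeof(int));
cudaMalloc((void**)&ran_num, (size_t)num_threads * N * sizeof(int));
// ... copy the input array into d_L and the random indices into ran_num ...

test_mem<<<GRID, BLOCK>>>(d_L, d_sum, ran_num, N);
cudaThreadSynchronize();   // wait for the kernel before timing/reading results[/codebox]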

The interesting thing is that when SIZE < 256M (268,435,456), i.e., less than 1 GB of data, the running times are all around 2 ms.

But when SIZE = 256M (268,435,456), exactly 1 GB of data, the running time increases to 4 ms: doubled!

And when SIZE = 512M (536,870,912), 2 GB of data, the running time increases to 6 ms.

I can’t figure out the reason. Since these are all random memory accesses, I would expect the running time to stay roughly the same regardless of SIZE. In my case, however, accessing an array larger than 1 GB costs noticeably more time.

Can anyone explain this? Thanks!

[codebox]__global__ void
test_mem(int* d_L, int* d_sum, int* ran_num, int ele_per_thread)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int i_ele = tid * ele_per_thread;   // start of this thread's slice of ran_num
    d_sum[tid] = 0;
    for (int j = 0; j < ele_per_thread; j++)
    {
        // gather: ran_num supplies a random index into d_L
        d_sum[tid] += d_L[ ran_num[i_ele] ];
        i_ele++;
    }
    __syncthreads();   // not strictly needed: no shared data is exchanged
}[/codebox]
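For reference, kernel times like the 2-6 ms above are usually measured with CUDA events, roughly like this (a sketch assuming the launch configuration given earlier; this may not exactly match my measurement code):

[codebox]// Time a single launch of test_mem with CUDA events (millisecond resolution).
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
test_mem<<<30, 64>>>(d_L, d_sum, ran_num, 256);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);          // block until the kernel has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed time between the two events, in ms
printf("kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);[/codebox]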