Sharing a single global array between threads slows down kernel

Hi all. Say you have a kernel that reads from an array of integers in device global memory (e.g. int arr[] = {1, 2, 3};). If I give each thread of that kernel its own identical copy of the array in global memory, I see a considerable speedup. If I instead keep a single global copy of arr that all threads share, I save memory but lose a lot of performance. Does anyone have an explanation for why execution is faster when each thread has a separate global copy of the array?

For example:

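__global__ void kernel(int *arr) {
    // Global thread index across all blocks
    int tid = threadIdx.x + blockDim.x * blockIdx.x;
    // Each thread points at its own 3-element copy of the array
    int *my_arr = &arr[tid * 3];
}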

versus

__global__ void kernel(int *arr) {
    // Every thread points at the same single copy of the array
    int *my_arr = arr;
}
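
In case it helps anyone reproduce this, here is a minimal sketch of how the two variants could be timed side by side. It is not my real kernel: the summing body, the `out` buffer, the names `kernel_per_thread_copy`, `kernel_single_copy` and `run_variant`, the array length of 3, and the launch sizes are all assumptions made for illustration, and error checking is left out to keep it short.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int ARR_LEN = 3;  // assumed length, matching the {1, 2, 3} example

// Variant 1: every thread reads its own copy of the array.
__global__ void kernel_per_thread_copy(const int *arr, int *out, int n) {
    int tid = threadIdx.x + blockDim.x * blockIdx.x;
    if (tid >= n) return;
    const int *my_arr = &arr[tid * ARR_LEN];
    out[tid] = my_arr[0] + my_arr[1] + my_arr[2];  // placeholder work (assumption)
}

// Variant 2: all threads read the same single copy of the array.
__global__ void kernel_single_copy(const int *arr, int *out, int n) {
    int tid = threadIdx.x + blockDim.x * blockIdx.x;
    if (tid >= n) return;
    const int *my_arr = arr;
    out[tid] = my_arr[0] + my_arr[1] + my_arr[2];  // placeholder work (assumption)
}

// Hypothetical helper: runs one variant on n threads, stores the elapsed
// kernel time in *ms, and returns the sum of all outputs (6 per thread).
long long run_variant(bool per_thread_copy, int n, float *ms) {
    const size_t copies = per_thread_copy ? static_cast<size_t>(n) : 1;
    std::vector<int> h_arr(copies * ARR_LEN);
    for (size_t i = 0; i < h_arr.size(); ++i)
        h_arr[i] = static_cast<int>(i % ARR_LEN) + 1;  // 1, 2, 3, 1, 2, 3, ...

    int *d_arr = nullptr, *d_out = nullptr;
    cudaMalloc(&d_arr, h_arr.size() * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_arr, h_arr.data(), h_arr.size() * sizeof(int),
               cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Two launches: the first is an untimed warm-up, the second is timed.
    for (int pass = 0; pass < 2; ++pass) {
        if (pass == 1) cudaEventRecord(start);
        if (per_thread_copy)
            kernel_per_thread_copy<<<blocks, threads>>>(d_arr, d_out, n);
        else
            kernel_single_copy<<<blocks, threads>>>(d_arr, d_out, n);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(ms, start, stop);

    // Copy the results back and sum them as a sanity check.
    std::vector<int> h_out(n);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    long long sum = 0;
    for (int v : h_out) sum += v;

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_arr);
    cudaFree(d_out);
    return sum;
}

int main() {
    const int n = 1 << 20;  // assumed thread count
    float ms_copy = 0.0f, ms_single = 0.0f;
    long long sum_copy = run_variant(true, n, &ms_copy);
    long long sum_single = run_variant(false, n, &ms_single);
    printf("per-thread copy: %.3f ms (sum %lld)\n", ms_copy, sum_copy);
    printf("single copy:     %.3f ms (sum %lld)\n", ms_single, sum_single);
    return 0;
}
```

Both variants should produce the same sum, so any timing difference comes from the memory access pattern rather than from the result. A single timed launch is noisy, so averaging over several runs would give a fairer comparison.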