Multiple threads on one large dataset

I am trying to achieve maximum performance in a task of parallel computation on a large dataset.
Assume I have a large array of floats (much larger than the amount of shared or constant memory), allocated in global memory with cudaMalloc.
Every thread executing the kernel must run some algorithm on that data. The algorithm needs to access all the array elements sequentially in a loop. The only difference between threads is the set of algorithm parameters. There is no way to restructure the task, as the algorithm itself is sequential. My current approach is to read directly from global memory in every thread, and it shows very poor performance.
What is the best approach for solving such a problem on a GPU?

A synthetic example (compute the sum of the elements multiplied by the thread index):

__global__ void kernel(float* data, int dataSize)
{
  float sum = 0.0f;
  unsigned int threadIndex = blockIdx.x * blockDim.x + threadIdx.x;
  // every thread walks the whole array sequentially
  for (int i = 0; i < dataSize; i++)
    sum += threadIndex * data[i];
  // (synthetic example: sum is not written back anywhere)
}


float* dev_data;
cudaMalloc((void**)&dev_data, numPoints * sizeof(float));
cudaMemcpy(dev_data, dataSource, numPoints * sizeof(float), cudaMemcpyHostToDevice);
kernel<<<blocks, threadsPerBlock, 0>>>(dev_data, numPoints);
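For what it's worth, one variant I have been considering is staging the array through shared memory in tiles, so each element is fetched from global memory once per block (with coalesced loads) instead of once per thread. This is only a sketch of the idea, not tested code, and the tile size is an assumption tied to blockDim.x:

#define TILE_SIZE 256  // assumed; the block must be launched with TILE_SIZE threads

__global__ void kernelTiled(const float* data, int dataSize)
{
  __shared__ float tile[TILE_SIZE];
  float sum = 0.0f;
  unsigned int threadIndex = blockIdx.x * blockDim.x + threadIdx.x;

  for (int base = 0; base < dataSize; base += TILE_SIZE)
  {
    // each thread of the block loads one element of the current tile (coalesced)
    int loadIndex = base + threadIdx.x;
    tile[threadIdx.x] = (loadIndex < dataSize) ? data[loadIndex] : 0.0f;
    __syncthreads();

    // every thread then iterates over the tile from fast shared memory
    int tileLen = min(TILE_SIZE, dataSize - base);
    for (int i = 0; i < tileLen; i++)
      sum += threadIndex * tile[i];
    __syncthreads();  // don't overwrite the tile while other threads still read it
  }
  // (synthetic example: as above, sum is not written back)
}

Is this tiling pattern the right direction here, or is there a better-suited approach for this access pattern?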