Question on a kernel optimizations

I recently have a CUDA kernel which seems very “tough” to optimize as below:

__global__ void DataLayoutTransformKernel(cuDoubleComplex* d_origx, cuDoubleComplex* d_origx_remap, int n, int filter_size, int ai )
    for(int idx = blockIdx.x * blockDim.x + threadIdx.x; idx < filter_size; idx += blockDim.x * gridDim.x)
        int index = (idx * ai) & (n-1);
        d_origx_remap[idx] = d_origx[index];

//Parameters were defined before
int permute[loops] = {29165143,3831769,17603771,9301169,32350975, ...} 
int n = 33554432;
int filter_size = 1783157;

for(int i=0; i<loops; i++)
    DataLayoutTransformKernel<<<dimGrid, dimBlock, 0, stream[i]>>>((cuDoubleComplex*) d_origx,    (cuDoubleComplex*)d_origx_remap+i*filter_size, n, filter_size, permute[i]);


The purpose of the kernel is to reorder the data layout of d_origx from irregular to regular (d_origx_remap). It is repeated with several times with each has different access stride (ai).

The challenge here is the irregular memory access pattern in referring the array of d_origx[index]. My idea is to use shared memory. But here for this case it seems very hard to use shared memory to coalesce global memory access.

So could anyone please give me some suggestions on how to optimize this kernel?