Writing to Global Memory

I am trying to find the most efficient way to write to global memory. In CODE_1 the accesses are coalesced in both directions: the read from global memory into shared memory, and the write of the result from a register back to global memory.

CODE_1

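__global__ void GlobTest(float* C, float* X) {
    int bx = blockIdx.x;
    int tx = threadIdx.x;
    int index = tx + BLOCK_SIZE * bx;   // one element per thread

    // Stage the input in shared memory (coalesced read from global memory).
    __shared__ float sh_x[BLOCK_SIZE];  // float, to match X
    sh_x[tx] = X[index];
    __syncthreads();

    float a = sh_x[tx] * sh_x[tx];

    // Writing to GLOB MEM: consecutive threads write consecutive addresses.
    C[index] = a;
}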

But I need something more: I would like to use C[ New_addr[index] ] instead of C[index] (see CODE_2), because my algorithm requires a different addressing pattern. When I do this, the execution time grows 12 times!

CODE_2
__global__ void GlobTestIndex2(float* C, float* X, int* New_addr) {
    int bx = blockIdx.x;
    int tx = threadIdx.x;
    int index = tx + BLOCK_SIZE * bx;

    __shared__ float sh_x[BLOCK_SIZE];  // float, to match X
    sh_x[tx] = X[index];
    __syncthreads();

    float a = sh_x[tx] * sh_x[tx];

    // Indirect write: the destination comes from the New_addr table.
    C[ New_addr[index] ] = a;
}
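
To get a feel for why the indexed write could cost so much more, it may help to count how many separate memory segments one warp touches. The sketch below is host-only code and rests on assumptions that are not from my real setup: a 32-thread warp, 128-byte segments (the actual granularity depends on the GPU generation), and a made-up transpose-style table standing in for New_addr. The helper names segmentsTouched, identityTable and transposeTable are hypothetical.

```cuda
#include <cstdio>
#include <set>
#include <vector>

// Hypothetical helper, not part of the kernels above: counts the distinct
// aligned memory segments hit when one warp writes 4-byte floats to
// C[addr[warpStart + i]]. warpSize = 32 and segmentBytes = 128 are
// assumptions about the hardware.
int segmentsTouched(const std::vector<int>& addr, int warpStart,
                    int warpSize = 32, int segmentBytes = 128) {
    std::set<long long> segments;
    for (int i = 0; i < warpSize; ++i) {
        long long byteOffset = 4LL * addr[warpStart + i];  // 4 bytes per float
        segments.insert(byteOffset / segmentBytes);
    }
    return static_cast<int>(segments.size());
}

// Identity table: reproduces the C[index] pattern of CODE_1.
std::vector<int> identityTable(int n) {
    std::vector<int> t(n);
    for (int i = 0; i < n; ++i) t[i] = i;
    return t;
}

// Made-up permutation (a 32-column transpose) standing in for New_addr.
// n is assumed to be a multiple of 32.
std::vector<int> transposeTable(int n) {
    std::vector<int> t(n);
    for (int i = 0; i < n; ++i) t[i] = (i % 32) * (n / 32) + i / 32;
    return t;
}

int main() {
    const int n = 1024;
    printf("identity:  %d segment(s) per warp\n",
           segmentsTouched(identityTable(n), 0));
    printf("transpose: %d segment(s) per warp\n",
           segmentsTouched(transposeTable(n), 0));
    return 0;
}
```

With these assumed parameters the first warp touches 1 segment for the identity table and 32 for the transpose table, so the hardware may need many more memory transactions for the same amount of useful data. My real New_addr table will give a different count.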

I tried keeping the table of permuted addresses (New_addr) in shared memory (CODE_3). It helped a little, but the kernel is still 10 times slower than CODE_1, compared with 12 times slower for CODE_2.

CODE_3
__global__ void GlobTestIndex4(float* C, float* X, int* New_addr) {
    int bx = blockIdx.x;
    int tx = threadIdx.x;
    int index = tx + BLOCK_SIZE * bx;

    __shared__ float sh_x[BLOCK_SIZE];  // float, to match X
    __shared__ int sh_ind[BLOCK_SIZE];
    sh_x[tx] = X[index];
    // Changed with respect to CODE_2: stage the permuted addresses in shared memory.
    sh_ind[tx] = New_addr[index];
    __syncthreads();

    float a = sh_x[tx] * sh_x[tx];

    C[ sh_ind[tx] ] = a;
}
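
In case someone wants to reproduce the comparison, here is a minimal timing sketch using CUDA events. It is a simplified stand-in, not my exact test program: BLOCK_SIZE = 256, the array size, the kernel names writeDirect and writeIndexed, and the transpose-style address table are all assumptions, the shared-memory staging is left out, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK_SIZE 256  // assumed value, chosen only for this sketch

// Direct write, as in CODE_1 (shared-memory staging omitted).
__global__ void writeDirect(float* C, const float* X) {
    int index = threadIdx.x + BLOCK_SIZE * blockIdx.x;
    C[index] = X[index] * X[index];
}

// Indexed write, as in CODE_2.
__global__ void writeIndexed(float* C, const float* X, const int* New_addr) {
    int index = threadIdx.x + BLOCK_SIZE * blockIdx.x;
    C[New_addr[index]] = X[index] * X[index];
}

int main() {
    const int n = 1 << 20;               // multiple of BLOCK_SIZE and of 32
    const int blocks = n / BLOCK_SIZE;
    std::vector<float> hX(n, 2.0f);
    std::vector<int> hAddr(n);
    // Made-up permutation: neighbouring threads write n/32 elements apart.
    for (int i = 0; i < n; ++i) hAddr[i] = (i % 32) * (n / 32) + i / 32;

    float *dX, *dC;
    int* dAddr;
    cudaMalloc(&dX, n * sizeof(float));
    cudaMalloc(&dC, n * sizeof(float));
    cudaMalloc(&dAddr, n * sizeof(int));
    cudaMemcpy(dX, hX.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dAddr, hAddr.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float msDirect = 0.0f, msIndexed = 0.0f;

    writeDirect<<<blocks, BLOCK_SIZE>>>(dC, dX);  // warm-up launch, not timed

    cudaEventRecord(start);
    writeDirect<<<blocks, BLOCK_SIZE>>>(dC, dX);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&msDirect, start, stop);

    cudaEventRecord(start);
    writeIndexed<<<blocks, BLOCK_SIZE>>>(dC, dX, dAddr);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&msIndexed, start, stop);

    printf("direct: %.3f ms, indexed: %.3f ms\n", msDirect, msIndexed);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dX);
    cudaFree(dC);
    cudaFree(dAddr);
    return 0;
}
```

The ratio between the two times will depend on the GPU and on the actual contents of the address table, so this only shows how the measurement can be set up.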

Do you have any ideas on how to speed it up?
I would be grateful for any piece of advice,

Y.