In-Place Memory Transfers (without temp buffers)

I am working on a larger project that requires data to be copied from one set of indices to another within the same buffer.
Currently, I am doing this with two kernels and a LOT of extra memory:

  1. given global_data
  2. kernel1 - computes index1 and index2, setting global_index1 and global_index2
  3. cudaMemcpy - copies the data into a temporary buffer: temp_buffer = global_data
  4. kernel2 - transfers data from temp_buffer at index1 to global_data at index2

Current solution (pseudocode):

__global__ void kernel2(const int *temp_buffer, int *global_data,
                        const int *global_index1, const int *global_index2, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;                           // guard stray threads
    int value = temp_buffer[global_index1[idx]];    // gather from the snapshot
    global_data[global_index2[idx]] = value;        // scatter into the live buffer
}
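
For reference, the host side currently looks something like this (a rough sketch; N and the launch configuration are placeholders):

int *temp_buffer;
cudaMalloc((void **)&temp_buffer, N * sizeof(int));    // the extra 500 MB
cudaMemcpy(temp_buffer, global_data, N * sizeof(int),
           cudaMemcpyDeviceToDevice);                  // snapshot global_data
kernel2<<<(N + 255) / 256, 256>>>(temp_buffer, global_data,
                                  global_index1, global_index2, N);
cudaFree(temp_buffer);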

But this means that for a large problem, with say global_data = 500 MB, I need a second 500 MB buffer just to hold the copy.
It occurs to me that if I could stage the values in per-thread registers, I could read everything into registers and then write it back to global memory without the additional buffer.
That would avoid both the cudaMemcpy and the 500 MB of wasted space.

Something like:

__global__ void kernel2(int *global_data,
                        const int *global_index1, const int *global_index2, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    int value = 0;
    if (idx < n)
        value = global_data[global_index1[idx]];    // not using temp_buffer!

    __syncthreads();    // all reads done before any write... or so I hope

    if (idx < n)
        global_data[global_index2[idx]] = value;
}

Notice that every thread must have finished reading its value into a register before any thread writes; otherwise some threads would read values already overwritten by others.
They must all read first, then all write.
However, __syncthreads does not synchronize the whole grid, only the threads within a single block.
Also, is there anywhere near enough register space to hold 500 MB?
It would be great to avoid the extra 500 MB of temporary storage.
Is there any way to solve this?

There is definitely not enough on-chip storage anywhere on the GPU to temporarily hold 500 MB: the register file is on the order of 256 KB per SM, so even a large GPU has only tens of megabytes of registers in total, and shared memory adds little more. A true grid-wide barrier does exist (cooperative groups' grid.sync(), launched with cudaLaunchCooperativeKernel), but it only lets each resident thread park one value in a register between the read and the write, and a GPU runs at most a few hundred thousand resident threads, orders of magnitude short of the 100+ million ints in 500 MB. Your original solution with the temporary buffer is the only safe one in general, unless there is some very specific structure to the pattern of global_index1 and global_index2 that would allow you to construct a faster solution without race conditions.
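
As an illustration of the kind of structure that would help: if you can guarantee that the locations read through global_index1 and the locations written through global_index2 never overlap (say, data only ever moves from the front half of the buffer into the back half), then no thread can overwrite a value another thread still needs, and a single kernel with no temporary buffer and no synchronization at all is already race-free. A minimal sketch, assuming that disjointness holds (the kernel name is mine):

__global__ void move_disjoint(int *global_data,
                              const int *global_index1,
                              const int *global_index2, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        // Safe ONLY because no location read here is a write target of any
        // thread: the source and destination index sets are disjoint.
        int value = global_data[global_index1[idx]];
        global_data[global_index2[idx]] = value;
    }
}

Whether anything like this applies depends entirely on where your indices come from; for an arbitrary index pattern, the snapshot buffer remains the robust answer.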