How to optimise random reads/writes to global memory

Hi, I hope someone can point me in the right direction. I have a kernel that is heavily bound by global memory: each thread needs its own large scratchpad, from which it picks pseudo-random values and then writes them back.

Currently I fill global memory with about 2^16 copies of the key (522 uint4s each) to match the global work size, so that each thread has its own key to work on (64 threads per block seems to be the optimum).
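To make the layout concrete, here is a simplified sketch of the fill step. It is not the real code: the names `KEY_WORDS`, `c_key`, `key_offset` and `fill_scratchpads` are made up for this post, and it assumes each thread's copy is stored as one contiguous run of 522 uint4s.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

#define KEY_WORDS 522  // uint4 elements per key copy, as described above

// Hypothetical constant key: 522 * 16 bytes = 8352 bytes of constant memory.
__constant__ uint4 c_key[KEY_WORDS];

// Offset (in uint4 elements) of a thread's private copy.
// Assumption: copies are laid out back to back, one per thread.
__host__ __device__ inline size_t key_offset(unsigned int tid)
{
    return (size_t)tid * KEY_WORDS;
}

// Each thread copies the constant key into its own slice of the scratchpad.
// With 2^16 threads the scratchpad is 2^16 * 522 * 16 bytes = 522 MiB.
__global__ void fill_scratchpads(uint4* scratch, unsigned int num_threads)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= num_threads) return;

    uint4* my_key = scratch + key_offset(tid);
    for (int i = 0; i < KEY_WORDS; ++i)
        my_key[i] = c_key[i];
}
```

In this sketch the host would upload the key once with `cudaMemcpyToSymbol(c_key, host_key, sizeof(c_key))` and then launch `fill_scratchpads<<<num_threads / 64, 64>>>(d_scratch, num_threads)`, where `host_key` and `d_scratch` are again placeholder names.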

The main kernel then does just 3 uint4 reads from global memory, performs a calculation on them, and writes the results back into the large scratchpad. It repeats this read >> calc >> write cycle 32 times. At the end a final value is found, and the key can be reset simply by refreshing it from a constant key, ready to start again.
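Stripped down, the access pattern of the main kernel looks roughly like the sketch below. The index generator `next_index`, the `mix` function and the choice of writing one value back per round are placeholders I made up to show the shape of the loop; the real calculation and write pattern are different.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

#define KEY_WORDS 522  // uint4 elements per key copy
#define ROUNDS    32   // read >> calc >> write iterations

// Placeholder pseudo-random index: one LCG step reduced to the key size.
__host__ __device__ inline unsigned int next_index(unsigned int& state)
{
    state = state * 1664525u + 1013904223u;
    return (state >> 8) % KEY_WORDS;
}

// Placeholder "calc": XOR the three values together, component by component.
__host__ __device__ inline uint4 mix(uint4 a, uint4 b, uint4 c)
{
    return make_uint4(a.x ^ b.x ^ c.x, a.y ^ b.y ^ c.y,
                      a.z ^ b.z ^ c.z, a.w ^ b.w ^ c.w);
}

__global__ void process_keys(uint4* scratch, uint4* results, unsigned int num_threads)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= num_threads) return;

    uint4* my_key = scratch + (size_t)tid * KEY_WORDS;  // this thread's private copy
    unsigned int state = tid + 1u;                      // per-thread seed (assumption)
    uint4 v = make_uint4(0u, 0u, 0u, 0u);

    for (int r = 0; r < ROUNDS; ++r) {
        unsigned int i0 = next_index(state);
        unsigned int i1 = next_index(state);
        unsigned int i2 = next_index(state);

        // Three scattered 16-byte reads from global memory.
        uint4 a = my_key[i0];
        uint4 b = my_key[i1];
        uint4 c = my_key[i2];

        v = mix(a, b, c);   // calc
        my_key[i0] = v;     // write back into the scratchpad
    }
    results[tid] = v;       // final value for this key
}
```

Because each thread walks its own 522-element copy at unrelated offsets, neighbouring threads in a warp touch addresses that are far apart, which is why the kernel ends up bound by global memory.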

What has been successful in increasing the speed so far:

Putting a thread fence (__threadfence_block) just after the global reads, which seems to make the kernel coalesce the global memory reads better.

Using shared memory to store the key mutations, so they can be fixed up in another kernel.

Are there any methods for doing the reads and writes to global memory simultaneously?

Here's the code: