CUDA to implement a parallel tempering optimization algorithm

Hello, I am new to CUDA and as such I am going through the tutorials. My goal is to write a CUDA implementation of parallel tempering to minimize a cost function in my research. Does such a code already exist? If not, I am more than willing, maybe even excited, to write one. (I am a physical chemistry graduate student and I see many uses for GPGPU.)

In parallel tempering one runs N replicas of the system in parallel, each at a different temperature. After a number of MC steps you attempt to swap states between two temperature-adjacent replicas, accept or reject the swap, and then continue. As the number of degrees of freedom in the system increases, more replicas are needed so that the acceptance ratio of the swaps stays nonzero. It seems to me that this problem is well suited for GPGPU: one can map each replica to a thread and thus run 1000’s of replicas. Am I being naive?
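For reference, the standard swap criterion between replicas at temperatures T_i and T_j with current energies E_i and E_j is to accept with probability min(1, exp((1/T_i − 1/T_j)(E_i − E_j))). A minimal sketch of that acceptance probability in plain C++ (also valid as host code in a .cu file; the function name is mine):

```cpp
#include <algorithm>
#include <cmath>

// Metropolis acceptance probability for swapping the states of two
// replicas at temperatures t_i and t_j with energies e_i and e_j:
// p = min(1, exp((1/t_i - 1/t_j) * (e_i - e_j))).
double swap_probability(double t_i, double t_j, double e_i, double e_j) {
    double delta = (1.0 / t_i - 1.0 / t_j) * (e_i - e_j);
    return std::min(1.0, std::exp(delta));
}
```

A swap attempt is then accepted if a uniform random number in [0, 1) falls below this value; when the colder replica has the higher energy the swap is always accepted.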

My Pseudo-Code for the Parallel Tempering follows:
=> Read in Data on Host
=> Transfer data from Host to Device

=> On Device
=> Initialize State
=> Calculate Cost Function
=> Make MC moves
=> Recalculate Cost Function
=> Accept/Reject new state based on the common MC criterion
=> After N MC moves try to swap replicas/threads (states)
=> Sync Threads
=> Place Cost and State descriptor in shared memory
=> On one thread, say index 1, determine which replicas (states) to swap
=> Accept/Reject swap based on the MC criterion
=> Sync Threads
=> Propagate each thread for another N MC steps and continue as above
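The pseudocode above might look roughly like the following kernel. This is an untested sketch under the assumption of one replica per thread with swaps restricted to a single block; `State`, `cost()`, `mc_move()`, and `rand_uniform()` are placeholders for problem-specific pieces (in practice the random numbers would come from something like cuRAND):

```cuda
// Sketch only: one replica per thread, swaps within one block.
// State, cost(), mc_move(), rand_uniform() are placeholders.
#define BLOCK_SIZE 128

__global__ void parallel_tempering(State *states, const float *temps,
                                   int n_sweeps, int n_steps)
{
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    __shared__ float s_energy[BLOCK_SIZE];  // costs visible to the whole block
    __shared__ int   s_swap[BLOCK_SIZE];    // swap decisions made by one thread

    State s = states[gid];
    float T = temps[tid];
    float E = cost(s);

    for (int sweep = 0; sweep < n_sweeps; ++sweep) {
        // N ordinary Metropolis moves at this thread's own temperature
        for (int step = 0; step < n_steps; ++step) {
            State trial = mc_move(s);
            float dE = cost(trial) - E;
            if (dE <= 0.0f || rand_uniform() < expf(-dE / T)) {
                s = trial;
                E += dE;
            }
        }

        s_energy[tid] = E;
        __syncthreads();

        // One thread applies the swap criterion to temperature-adjacent
        // pairs using s_energy[] and temps[], and records the decisions.
        if (tid == 0) {
            /* fill s_swap[] here */
        }
        __syncthreads();

        /* each thread exchanges state with its partner if s_swap says so */
        __syncthreads();
    }
    states[gid] = s;
}
```

Note the two `__syncthreads()` barriers around the swap decision, matching the "Sync Threads" lines in the pseudocode; without them thread 1 could read stale energies or other threads could act before the decisions are written.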

Any general and technical advice from the community on how to maximize efficiency of the code and such?
Thanks in Advance

It seems to me that this problem is well suited for GPGPU. One can map each replica to a thread and thus run 1000’s of replicas. Am I being naive?
It seems so to me too :)

I can’t claim to understand your pseudo-code anywhere near fully, but the one problem I can see is the swapping of data between threads. Using global memory would be slow, and shared memory limits you to swapping within the same block.
Perhaps otherwise well-optimized code would be capable of hiding the global loads. Or you could use some kind of lookup table listing where each thread is supposed to look for its data. That way the threads won’t actually need to move any data except a pointer (an index) when they swap.
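That lookup-table idea can be as simple as an index permutation: thread k works on the state at `perm[k]`, and a replica swap just exchanges two integers instead of two full state vectors. A plain C++ illustration of the indirection (the names are mine):

```cpp
#include <utility>
#include <vector>

// Indirection table: thread k keeps temperature k but operates on the
// state whose index is perm[k]. A swap exchanges indices, not states.
struct ReplicaTable {
    std::vector<int> perm;  // perm[k] = index of the state thread k owns
    explicit ReplicaTable(int n) : perm(n) {
        for (int k = 0; k < n; ++k) perm[k] = k;
    }
    int state_of(int thread) const { return perm[thread]; }
    void swap(int i, int j) { std::swap(perm[i], perm[j]); }
};
```

On the device the table could live in shared or global memory; either way, only a few integers move per swap, which is what makes the pointer-swap trick cheap.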

On a side-note: Well stated questions are rare. I like them ;)