motion estimation function

Hey guys,
I'm trying to code motion estimation for medical image volumes as a cube matching algorithm with SAD/SSD. My approach is to let each CUDA block handle one candidate volume. The problem I'm facing is in the final step: in a CPU implementation you usually compare each SAD/SSD serially, and if it is better than the best one so far you keep its motion vector. On the GPU the comparisons can happen concurrently, so when a thread checks the current best SAD/SSD it cannot be sure that value is still the right one by the time it writes its own (a read-modify-write race condition). If it were only about the SAD/SSD value I could use atomics, but I also need to save the motion vector that was computed with it… Does anybody have an idea how to solve that problem?
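To make the serial version concrete, here is a rough CPU sketch of what I mean. All the names (`Volume`, `MotionVector`, `Match`, `cube_sad`, `best_match_serial`) are made up for this post, and it assumes 8-bit voxels stored x-fastest and a cubic search range, so take it as an illustration rather than my actual code.

```cpp
#include <cstdint>
#include <cstdlib>
#include <limits>
#include <vector>

// Hypothetical types for illustration only.
struct MotionVector { int dx, dy, dz; };

struct Match {
    std::uint32_t sad;   // best SAD found so far
    MotionVector  mv;    // motion vector that produced it
};

struct Volume {
    int nx, ny, nz;
    std::vector<std::uint8_t> data;  // assumed layout: x fastest, then y, then z
    std::uint8_t at(int x, int y, int z) const {
        return data[(z * ny + y) * nx + x];
    }
};

// SAD between the cube at (x0,y0,z0) in `ref` and the cube displaced by `mv` in `cur`.
std::uint32_t cube_sad(const Volume& ref, const Volume& cur,
                       int x0, int y0, int z0, int size, MotionVector mv)
{
    std::uint32_t sad = 0;
    for (int z = 0; z < size; ++z)
        for (int y = 0; y < size; ++y)
            for (int x = 0; x < size; ++x) {
                int a = ref.at(x0 + x, y0 + y, z0 + z);
                int b = cur.at(x0 + x + mv.dx, y0 + y + mv.dy, z0 + z + mv.dz);
                sad += static_cast<std::uint32_t>(std::abs(a - b));
            }
    return sad;
}

// Serial search: every candidate is compared one after the other,
// so the SAD and its motion vector are always updated together.
Match best_match_serial(const Volume& ref, const Volume& cur,
                        int x0, int y0, int z0, int size, int range)
{
    Match best{std::numeric_limits<std::uint32_t>::max(), {0, 0, 0}};
    for (int dz = -range; dz <= range; ++dz)
        for (int dy = -range; dy <= range; ++dy)
            for (int dx = -range; dx <= range; ++dx) {
                // Skip candidate cubes that would leave the volume.
                if (x0 + dx < 0 || x0 + dx + size > cur.nx ||
                    y0 + dy < 0 || y0 + dy + size > cur.ny ||
                    z0 + dz < 0 || z0 + dz + size > cur.nz)
                    continue;
                MotionVector mv{dx, dy, dz};
                std::uint32_t sad = cube_sad(ref, cur, x0, y0, z0, size, mv);
                if (sad < best.sad) {   // read-compare-write on (sad, mv) as a pair
                    best.sad = sad;
                    best.mv  = mv;
                }
            }
    return best;
}
```

The `if (sad < best.sad)` update at the end is the part that worries me: on the GPU every block would run it against the same shared `best` at the same time, and the SAD and the motion vector have to change together.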

Would appreciate any help, thanks.

adel