Implementing Radix Sort over Multiple Devices

Has anyone tried implementing the Radix Sort algorithm over multiple devices? I have adapted the Radix Sort from the SDK particles project and it works fantastically on one device, but I’m not sure how it can be implemented over multiple devices.

If anyone has tried, was it successful?

I think it would be quite simple to shift the particle indices so that each device works on its own portion of the data, but would there need to be some communication and synchronisation between the devices to decide on cell sizes?
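
To make the question more concrete, here is a rough CPU-only sketch of one way the synchronisation might look. It is not taken from the SDK. The function name multiDeviceSort, the parameters numDevices and numCells, and the assumption that every key is a cell hash smaller than numCells are all made up for illustration. Each "device" is simulated by a plain loop, and std::sort stands in for the per-device radix sort. The idea is that each device histograms its own slice, the histograms are summed at a single synchronisation point, and contiguous cell ranges are then handed out so every device gets roughly the same number of keys.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side simulation of a multi-device sort by cell hash.
// Assumes numDevices >= 1 and every key < numCells.
std::vector<uint32_t> multiDeviceSort(const std::vector<uint32_t>& keys,
                                      std::size_t numDevices,
                                      uint32_t numCells)
{
    assert(numDevices >= 1);

    // Step 1: each device counts keys per cell in its own slice of the input.
    std::vector<std::vector<std::size_t>> localHist(
        numDevices, std::vector<std::size_t>(numCells, 0));
    const std::size_t chunk = (keys.size() + numDevices - 1) / numDevices;
    for (std::size_t d = 0; d < numDevices; ++d) {
        const std::size_t begin = std::min(d * chunk, keys.size());
        const std::size_t end = std::min(begin + chunk, keys.size());
        for (std::size_t i = begin; i < end; ++i)
            ++localHist[d][keys[i]];
    }

    // Step 2 (the synchronisation point): sum the per-device histograms.
    std::vector<std::size_t> globalHist(numCells, 0);
    for (std::size_t d = 0; d < numDevices; ++d)
        for (uint32_t c = 0; c < numCells; ++c)
            globalHist[c] += localHist[d][c];

    // Step 3: give each device a contiguous range of cells holding roughly
    // keys.size() / numDevices keys. owner[] is non-decreasing in the cell id.
    std::vector<std::size_t> owner(numCells, 0);
    const std::size_t target = (keys.size() + numDevices - 1) / numDevices;
    std::size_t dev = 0, assigned = 0;
    for (uint32_t c = 0; c < numCells; ++c) {
        owner[c] = dev;
        assigned += globalHist[c];
        if (assigned >= target * (dev + 1) && dev + 1 < numDevices)
            ++dev;
    }

    // Step 4: route every key to the device that owns its cell
    // (on real hardware this would be device-to-device or via-host copies).
    std::vector<std::vector<uint32_t>> bucket(numDevices);
    for (uint32_t k : keys)
        bucket[owner[k]].push_back(k);

    // Step 5: each device sorts its bucket locally. Because the cell ranges
    // are contiguous and increasing, concatenating them is globally sorted.
    std::vector<uint32_t> sorted;
    sorted.reserve(keys.size());
    for (auto& b : bucket) {
        std::sort(b.begin(), b.end());
        sorted.insert(sorted.end(), b.begin(), b.end());
    }
    return sorted;
}
```

In this sketch the only cross-device communication is the histogram sum in step 2 and the key exchange in step 4. Whether that exchange is cheap enough on real hardware to beat a single-device sort is exactly what I am unsure about.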