Quick update: Michael Garland tells me that he has just published a pre-print of their IPDPS paper here, which has a better explanation of the algorithm and more recent performance data:
In the paper I also noticed a method from some Intel guys: the GTX200 is on average 23% faster than their multicore CPU merge sort. But if they tested their method on an 8-core Intel Xeon CPU, the CPU method might beat the GPU method.