If all you’re doing is sorting 10^4+ buckets of ~10 elements then I doubt you need a GPU of that caliber.
For example, a dusty old GT 240 (a GT215 chip w/96 cores) can sort 10,000+ buckets of 1024 32-bit keys at over 1000 Mkeys/sec. This means sorting 32K buckets of 1K keys (about 32M keys in total) takes roughly 32 milliseconds.
A GTX 680 is ~10x faster (~3.25 ms).
Your described problem is far simpler than this, though, since no merging is required. I can’t estimate the performance, but it will be silly fast. Sorting 10 or 20 elements just isn’t that much work, and you’ll probably be able to run at some high percentage of device memory bandwidth (GTS 250 = ~70 GB/sec), since your CUDA kernel will basically resemble a memcpy(). In practice that means you’ll be running at PCIe bus speed if you’re round-tripping data between the CPU and GPU.
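To make the "no merging required" point concrete, here is a rough sketch of the kind of kernel I have in mind, not a tuned implementation. It assumes 32-bit int keys, a fixed bucket size of 10, and buckets stored back to back in one array. The names `BUCKET_SIZE`, `sortSmallBuckets` and `sortBucketsOnGpu` are made up for illustration. Each thread loads one bucket, insertion-sorts it in a private array, and writes it back.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Assumed bucket size; the question only says "~10 elements".
constexpr int BUCKET_SIZE = 10;

// One thread sorts one bucket. Buckets are independent, so there is
// no merge step: load, sort a private copy, store.
__global__ void sortSmallBuckets(int* keys, int numBuckets)
{
    int bucket = blockIdx.x * blockDim.x + threadIdx.x;
    if (bucket >= numBuckets) return;

    int* base = keys + static_cast<size_t>(bucket) * BUCKET_SIZE;

    // Private per-thread copy of the bucket.
    int local[BUCKET_SIZE];
    #pragma unroll
    for (int i = 0; i < BUCKET_SIZE; ++i) local[i] = base[i];

    // Plain insertion sort; cheap for ~10 elements.
    for (int i = 1; i < BUCKET_SIZE; ++i) {
        int v = local[i];
        int j = i - 1;
        while (j >= 0 && local[j] > v) {
            local[j + 1] = local[j];
            --j;
        }
        local[j + 1] = v;
    }

    #pragma unroll
    for (int i = 0; i < BUCKET_SIZE; ++i) base[i] = local[i];
}

// Hypothetical host wrapper: copies the keys to the GPU, sorts every
// bucket in place, and copies them back. Returns false on any error
// or if the input is not a whole number of buckets.
bool sortBucketsOnGpu(std::vector<int>& keys)
{
    if (keys.empty() || keys.size() % BUCKET_SIZE != 0) return false;

    const int numBuckets = static_cast<int>(keys.size() / BUCKET_SIZE);
    const size_t bytes = keys.size() * sizeof(int);

    int* dKeys = nullptr;
    if (cudaMalloc(&dKeys, bytes) != cudaSuccess) return false;

    bool ok = cudaMemcpy(dKeys, keys.data(), bytes,
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        const int threads = 128;
        const int blocks = (numBuckets + threads - 1) / threads;
        sortSmallBuckets<<<blocks, threads>>>(dKeys, numBuckets);
        // The device-to-host copy waits for the kernel to finish.
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(keys.data(), dKeys, bytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dKeys);
    return ok;
}
```

You’d call `sortBucketsOnGpu(keys)` on a `std::vector<int>` whose length is a multiple of `BUCKET_SIZE`. Note that with this layout neighboring threads read addresses 10 ints apart, so the loads aren’t coalesced. To get closer to memcpy() speed you would stage the data through shared memory or store the buckets transposed. The two `cudaMemcpy` calls are the PCIe round trip mentioned above.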
Unless you have some sort of hard real-time requirement, I would skip thinking about TITANs until it works on your GTS 250 or a regular CPU. :)