Radix Sort

Hello, I have a problem. I am working on a radix sort algorithm and want to sort one million numbers. I chose a block size of 256 threads. At this stage the array is divided into 3,907 blocks, and the numbers within each block are sorted in ascending order.
Now I do not know how to combine these blocks efficiently and in parallel so that the whole array ends up sorted.
I have studied a variety of literature, but I still cannot understand how the algorithm does this.

Thank you in advance for any advice and experience with this algorithm.

This should give you enough hints:
http://www.compsci.hunter.cuny.edu/~sweiss/course_materials/csci360/lecture_notes/radix_sort_cuda.cc

Thank you. But I already used a similar algorithm to sort the individual blocks (256 numbers each), and it works well. What I do not know now is how to put the sorted blocks together efficiently.