Implementing Gather in CUDA

I am trying to implement the nearest neighbors algorithm in CUDA C. I have a query array and a document array, and I compute the similarity between each query element and each document. If the similarity score exceeds a threshold, I store it in an array in thread-local memory. After this I want to "gather" these arrays from all the threads into a single array, sort it, and select the top n entries.
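For reference, the final sort-and-select step could look roughly like the sketch below once the gathered results sit in one array. This is only an illustration under assumptions not stated above: it uses Thrust (so C++ rather than plain C), it assumes each gathered entry is a (score, document id) pair, and `top_n_ids` is a hypothetical helper name.

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/functional.h>
#include <thrust/copy.h>
#include <vector>
#include <algorithm>
#include <cassert>

// Hypothetical helper: sorts gathered (score, document id) pairs by score,
// highest first, and returns the ids of the best n matches.
// Assumes scores[i] belongs to doc_ids[i].
std::vector<int> top_n_ids(const std::vector<float>& scores,
                           const std::vector<int>& doc_ids,
                           int n)
{
    assert(scores.size() == doc_ids.size());

    // Copy the gathered results to the device.
    thrust::device_vector<float> d_scores(scores.begin(), scores.end());
    thrust::device_vector<int> d_ids(doc_ids.begin(), doc_ids.end());

    // Sort the scores in descending order; the ids are permuted alongside.
    thrust::sort_by_key(d_scores.begin(), d_scores.end(), d_ids.begin(),
                        thrust::greater<float>());

    // Keep at most n entries (fewer if not enough results were gathered).
    int count = std::min<int>(n, static_cast<int>(d_ids.size()));
    std::vector<int> result(count);
    thrust::copy(d_ids.begin(), d_ids.begin() + count, result.begin());
    return result;
}
```

In a real pipeline the scores would presumably already be on the device after the gather, so the host-to-device copies here exist only to keep the sketch self-contained.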

Could someone please help me with the gather operation? I want to gather the results of all threads within a block. The size of each thread's result array is variable. I considered cudppCompact, but I am not sure how to implement it.
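To make the question concrete, here is a minimal sketch of one way a per-block gather of variable-length thread results might work. It does not use cudppCompact; instead each thread reserves a contiguous range in the block's output with a single `atomicAdd` on a shared-memory counter. Everything named here is an assumption for illustration: the similarity scores are taken as precomputed in `scores`, `MAX_PER_THREAD` and `BLOCK_SIZE` are made-up limits, and `gather_hits` and `gather_on_device` are hypothetical names. Error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>
#include <algorithm>
#include <cassert>

#define MAX_PER_THREAD 8   // assumed cap on hits kept per thread
#define BLOCK_SIZE 128     // assumed threads per block (launch must not exceed it)

// Each thread scans a strided subset of documents, keeps the indices whose
// score exceeds the threshold in a local array, then copies them into the
// block's region of block_hits. block_counts[b] receives the total for block b.
__global__ void gather_hits(const float* scores, int num_docs, float threshold,
                            int* block_hits, int* block_counts)
{
    __shared__ int block_total;
    if (threadIdx.x == 0) block_total = 0;
    __syncthreads();

    int local[MAX_PER_THREAD];
    int local_count = 0;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < num_docs;
         i += gridDim.x * blockDim.x) {
        // Hits beyond MAX_PER_THREAD are silently dropped in this sketch.
        if (scores[i] > threshold && local_count < MAX_PER_THREAD)
            local[local_count++] = i;
    }

    // Reserve local_count consecutive slots; offset is where this thread writes.
    int offset = atomicAdd(&block_total, local_count);
    int* out = block_hits + blockIdx.x * BLOCK_SIZE * MAX_PER_THREAD;
    for (int k = 0; k < local_count; ++k)
        out[offset + k] = local[k];

    __syncthreads();  // all reservations are done before the total is read
    if (threadIdx.x == 0) block_counts[blockIdx.x] = block_total;
}

// Hypothetical host wrapper: runs one block and returns the gathered indices.
std::vector<int> gather_on_device(const std::vector<float>& scores, float threshold)
{
    const int num_docs = static_cast<int>(scores.size());
    const int capacity = BLOCK_SIZE * MAX_PER_THREAD;
    float* d_scores = nullptr;
    int* d_hits = nullptr;
    int* d_count = nullptr;
    cudaMalloc(&d_scores, num_docs * sizeof(float));
    cudaMalloc(&d_hits, capacity * sizeof(int));
    cudaMalloc(&d_count, sizeof(int));
    cudaMemcpy(d_scores, scores.data(), num_docs * sizeof(float),
               cudaMemcpyHostToDevice);

    gather_hits<<<1, BLOCK_SIZE>>>(d_scores, num_docs, threshold, d_hits, d_count);

    int count = 0;
    cudaMemcpy(&count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
    assert(count <= capacity);
    std::vector<int> hits(count);
    cudaMemcpy(hits.data(), d_hits, count * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_scores);
    cudaFree(d_hits);
    cudaFree(d_count);

    // The order in which threads reserve slots is not deterministic, so sort.
    std::sort(hits.begin(), hits.end());
    return hits;
}
```

The atomic counter keeps the code short but makes the output order depend on thread scheduling. If a stable order matters, a block-wide exclusive prefix sum over the per-thread counts would give each thread a deterministic offset instead.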

Thank you!