In my task, I need to sort around 5000 arrays (lists) independently, each with roughly 1000 to 1024 elements. That is, each thread block sorts its own list, and the 5000 blocks run in parallel. There is no need to merge the sorted lists across blocks. Each array is generated in, and already resides in, the shared memory of one thread block.
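To make the setup concrete, here is a minimal sketch of the one-block-per-list pattern. It uses a bitonic sort over shared memory purely as an illustration; it has not been benchmarked against the 10 ms target. The names `sortBlocks`, `sortAllLists`, `N`, and `THREADS` are hypothetical. It also assumes `float` keys, lists padded to 1024 slots, and a load from global memory standing in for the step that generates the data in shared memory.

```cuda
#include <cassert>
#include <cfloat>
#include <vector>
#include <cuda_runtime.h>

constexpr int N = 1024;         // assumed padded list length (power of two)
constexpr int THREADS = N / 2;  // one thread per compare-exchange pair

// Each thread block sorts one list (ascending) entirely in shared memory.
__global__ void sortBlocks(float *data, const int *lengths)
{
    __shared__ float s[N];
    const int tid = threadIdx.x;
    float *list = data + (size_t)blockIdx.x * N;
    const int len = lengths[blockIdx.x];

    // Stand-in for the real generation step: load the list and pad the
    // unused slots with FLT_MAX so they sort to the end.
    for (int i = tid; i < N; i += blockDim.x)
        s[i] = (i < len) ? list[i] : FLT_MAX;
    __syncthreads();

    // Bitonic sorting network: k is the size of the sequences being merged,
    // j is the distance between the two elements of a pair.
    for (int k = 2; k <= N; k <<= 1) {
        for (int j = k >> 1; j > 0; j >>= 1) {
            const int i = 2 * tid - (tid & (j - 1));  // index with bit j clear
            const int p = i + j;                      // its partner
            const bool ascending = (i & k) == 0;
            const float a = s[i], b = s[p];
            if ((a > b) == ascending) { s[i] = b; s[p] = a; }
            __syncthreads();
        }
    }

    // Write back; slots at index >= len now hold the FLT_MAX padding.
    for (int i = tid; i < N; i += blockDim.x)
        list[i] = s[i];
}

// Hypothetical host helper: data holds lengths.size() slots of N floats.
// Error checking of the CUDA calls is omitted for brevity.
void sortAllLists(std::vector<float> &data, const std::vector<int> &lengths)
{
    const size_t numLists = lengths.size();
    assert(data.size() == numLists * N);

    float *dData = nullptr;
    int *dLen = nullptr;
    cudaMalloc(&dData, data.size() * sizeof(float));
    cudaMalloc(&dLen, numLists * sizeof(int));
    cudaMemcpy(dData, data.data(), data.size() * sizeof(float),
               cudaMemcpyHostToDevice);
    cudaMemcpy(dLen, lengths.data(), numLists * sizeof(int),
               cudaMemcpyHostToDevice);

    sortBlocks<<<(unsigned)numLists, THREADS>>>(dData, dLen);

    // This blocking copy on the default stream also waits for the kernel.
    cudaMemcpy(data.data(), dData, data.size() * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dLen);
}
```

In the real case the lists never leave shared memory, so only the double loop in the middle of the kernel would be needed, and the host-side copies would not be part of the timing.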
I was wondering whether there are any existing sorting methods (or implementations) that can sort each list within its thread block quickly, ideally in under 10 ms. I've tried Thrust, merge sort (my own implementation), and bubble sort (my own implementation), but they all take 350+ ms, which is far from the goal.
Thank you in advance.