I have recently been working on a merge sort implementation in CUDA. I already have a working CPU version, but I am unsure how to structure the logic so that it runs in parallel.
My input array size can range from 1 to 2^20 elements. In CUDA, how can I sort small parts of a large array first, and then grow the sorted slices step by step until the whole array is sorted?
For example, suppose my array has 32 elements. First I use 8 threads, each sorting 4 elements, and then I merge the sorted slices into slices of 8, 16, and finally 32 elements. How can I do this efficiently in CUDA?
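
To make the scheme concrete, here is a minimal sketch of how I imagine it, not a tuned implementation. It assumes `int` keys, a chunk size of 4, 256 threads per block, and two device buffers that are swapped after every pass. The names `sortChunks`, `mergePass`, `mergeRuns`, and `mergeSortGpu` are made up for illustration, and CUDA error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <utility>
#include <vector>

// Step 1: each thread sorts its own chunk of `chunk` elements in place
// with insertion sort (the "8 threads sort 4 elements each" step).
__global__ void sortChunks(int* data, int n, int chunk) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    int lo = t * chunk;
    if (lo >= n) return;
    int hi = min(lo + chunk, n);
    for (int i = lo + 1; i < hi; ++i) {
        int key = data[i];
        int j = i - 1;
        while (j >= lo && data[j] > key) { data[j + 1] = data[j]; --j; }
        data[j + 1] = key;
    }
}

// Merge the sorted runs src[lo, mid) and src[mid, hi) into dst[lo, hi).
__device__ void mergeRuns(const int* src, int* dst, int lo, int mid, int hi) {
    int i = lo, j = mid, k = lo;
    while (i < mid && j < hi) dst[k++] = (src[j] < src[i]) ? src[j++] : src[i++];
    while (i < mid) dst[k++] = src[i++];
    while (j < hi) dst[k++] = src[j++];
}

// Step 2: one pass in which each thread merges one pair of adjacent runs
// of length `width`. A leftover run without a partner is simply copied.
__global__ void mergePass(const int* src, int* dst, int n, int width) {
    int pair = blockIdx.x * blockDim.x + threadIdx.x;
    int numPairs = (n + 2 * width - 1) / (2 * width);
    if (pair >= numPairs) return;
    int lo = pair * 2 * width;
    int mid = min(lo + width, n);
    int hi = min(lo + 2 * width, n);
    mergeRuns(src, dst, lo, mid, hi);
}

// Host driver (hypothetical helper): sorts `data` in ascending order.
void mergeSortGpu(std::vector<int>& data) {
    const int n = static_cast<int>(data.size());
    if (n < 2) return;
    const int chunk = 4;      // assumed initial slice size
    const int threads = 256;  // assumed block size

    int* a = nullptr;
    int* b = nullptr;
    cudaMalloc(&a, n * sizeof(int));
    cudaMalloc(&b, n * sizeof(int));
    cudaMemcpy(a, data.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    int numChunks = (n + chunk - 1) / chunk;
    sortChunks<<<(numChunks + threads - 1) / threads, threads>>>(a, n, chunk);

    // Sorted run length doubles each pass: 4 -> 8 -> 16 -> 32 -> ...
    // Launches on the default stream run in order, so no extra sync is needed.
    for (int width = chunk; width < n; width *= 2) {
        int numPairs = (n + 2 * width - 1) / (2 * width);
        mergePass<<<(numPairs + threads - 1) / threads, threads>>>(a, b, n, width);
        std::swap(a, b);  // the freshly merged data is now in `a`
    }

    cudaMemcpy(data.data(), a, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(a);
    cudaFree(b);
}
```

Calling `mergeSortGpu(v)` on a `std::vector<int>` with 32 elements would launch `sortChunks` with 8 active threads and then three merge passes with 4, 2, and 1 active threads. That is exactly what worries me: the number of working threads halves on every pass, so the last pass is a single thread merging the whole array. I suspect each large merge has to be split across many threads somehow, or that a library routine such as `thrust::sort` is the better choice, but I do not know what the recommended approach is.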