Perhaps counter-intuitively, sorting columns doesn’t lend itself to parallel acceleration as easily as sorting rows. Considerable work has gone into CUDA sorting of data that is adjacent in memory. If you can convert your column-sort into a row-sort, the problem is easily solved with a single thrust call (a segmented sort, i.e. sort-by-key) or via cub, and it might be faster than what you have now.
I’m not suggesting you transpose the data to make this possible, but if you can reorganize how your data is stored so that the values to be sorted are adjacent, without hampering other work you are doing, then that may be another avenue to explore.
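To make the segmented-sort idea concrete, here is a minimal CPU-only sketch in plain Python. It is not thrust code: the helper names (stable_sort_by_key, segmented_sort) and the 3x4 example data are made up for illustration. It uses the two-pass formulation, where each pass would correspond to one thrust::stable_sort_by_key call on the GPU; a single sort on (segment, value) pairs should give the same grouping.

```python
# CPU-only sketch of a segmented sort built from two stable sort-by-key passes.
# Helper names and data are hypothetical; on the GPU each pass would map to
# one thrust::stable_sort_by_key call.

def stable_sort_by_key(keys, values):
    """Stable sort of (key, value) pairs by key; returns both reordered lists."""
    # Python's sorted() is guaranteed stable, so ties keep their input order.
    order = sorted(range(len(keys)), key=lambda i: keys[i])
    return [keys[i] for i in order], [values[i] for i in order]

def segmented_sort(values, segment_ids):
    """Sort values within each segment; segments come out grouped in id order."""
    # Pass 1: sort everything by value, carrying the segment ids along.
    vals, segs = stable_sort_by_key(values, segment_ids)
    # Pass 2: stable sort by segment id; the value order inside each
    # segment survives because the sort is stable.
    segs, vals = stable_sort_by_key(segs, vals)
    return vals, segs

# Hypothetical 3x4 matrix stored row-major; each row is one segment.
n_rows, n_cols = 3, 4
data = [9, 2, 7, 4,
        1, 8, 3, 6,
        5, 0, 11, 10]

row_ids = [i // n_cols for i in range(len(data))]
sorted_rows, _ = segmented_sort(data, row_ids)
print(sorted_rows)  # [2, 4, 7, 9, 1, 3, 6, 8, 0, 5, 10, 11]
```

With row-major storage, using i % n_cols as the segment id instead sorts each column, but the output then comes back column by column (effectively a transposed layout), which is one reason the row-sort case is the more natural fit.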
Although it may not be convenient to use here, cupy's sort probably does a sensible job of sorting the columns of an array.