High-performance __device__ sorting function

Hi all,

I am looking for a __device__ sorting function for variable-length arrays, because I want a warp or a block of threads to cooperatively sort an array whose length is only known at run time.
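
To make the request concrete, here is a rough sketch of the kind of interface I have in mind. The names `block_bitonic_sort` and `sort_segments` are made up for illustration and do not come from any library. The sketch assumes that every thread of the block calls the function with the same pointer and the same length, and that the array lives in shared or global memory. It is a plain bitonic network that skips out-of-range partners so the length does not have to be a power of two. It is not tuned, so I would not call it high-performance.

```cuda
#include <cuda_runtime.h>

// Hypothetical block-wide sort: ALL threads of the block must call this
// with identical `data` and `n`. Sorts data[0..n) ascending, in place.
// Every comparator puts the smaller value at the lower index, so indices
// >= n behave like +infinity padding and can simply be skipped.
template <typename T>
__device__ void block_bitonic_sort(T* data, int n)
{
    __syncthreads();  // make sure the caller's writes to data are visible

    // k is the size of the blocks being merged: 2, 4, 8, ...
    for (int k = 2; (k >> 1) < n; k <<= 1) {
        // First merge step: compare element i with its mirror image
        // inside the k-sized block.
        for (int i = threadIdx.x; i < n; i += blockDim.x) {
            int p = i ^ (k - 1);
            if (p > i && p < n && data[p] < data[i]) {
                T tmp = data[i];
                data[i] = data[p];
                data[p] = tmp;
            }
        }
        __syncthreads();

        // Remaining merge steps: compare elements at distance j.
        for (int j = k >> 2; j > 0; j >>= 1) {
            for (int i = threadIdx.x; i < n; i += blockDim.x) {
                int p = i ^ j;
                if (p > i && p < n && data[p] < data[i]) {
                    T tmp = data[i];
                    data[i] = data[p];
                    data[p] = tmp;
                }
            }
            __syncthreads();
        }
    }
}

// Hypothetical caller: block b sorts the segment
// data[offsets[b] .. offsets[b + 1]), so each block gets its own length.
__global__ void sort_segments(int* data, const int* offsets)
{
    int begin = offsets[blockIdx.x];
    int end   = offsets[blockIdx.x + 1];
    block_bitonic_sort(data + begin, end - begin);
}
```

A launch such as `sort_segments<<<num_segments, 128>>>(d_data, d_offsets);` would then sort each segment with one block. What I am asking for is a faster, well-tested version of `block_bitonic_sort`.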

Most of the sorting implementations I found online are __global__ kernels that are launched from the host side.

Is there any high-performance implementation of a __device__ sorting function?

Thanks!