Hi all, I have code consisting of multiple kernels that need to be executed in sequence. The first kernel does stream compaction on global memory, using shared memory as a buffer. The following kernel sorts the compacted stream (using thrust::sort). I am thinking of sorting each chunk in the first kernel while it is still in shared memory, to increase performance. In that case, the following sort kernel would have to be some merge sort variant that can merge chunks that each contain an arbitrary number of sorted elements. Do you think it is worth it? Is there a way to achieve this without hacking into Thrust (and without coding the actual sort myself)?
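To make the merge step concrete, here is a rough host-side C++ sketch of what I have in mind. It is not GPU code. The name `merge_sorted_runs` and the `offsets` array are made up for illustration; I am assuming the compaction kernel could also write out where each sorted chunk starts and ends. On the device, each pairwise `std::inplace_merge` would presumably become something like a `thrust::merge` call, which writes to a separate output range and so would need a second buffer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of Thrust): `data` holds sorted runs stored
// back to back. Run i occupies [offsets[i], offsets[i + 1]), and runs may
// have different lengths. Assumes offsets.front() == 0 and
// offsets.back() == data.size().
void merge_sorted_runs(std::vector<int>& data, std::vector<std::size_t> offsets)
{
    assert(offsets.size() >= 2);
    assert(offsets.front() == 0 && offsets.back() == data.size());

    // Turn an element offset into an iterator into `data`.
    auto at = [&data](std::size_t pos) {
        return data.begin() + static_cast<std::ptrdiff_t>(pos);
    };

    // Each pass merges neighbouring runs pairwise, roughly halving the
    // number of runs, until a single sorted run remains.
    while (offsets.size() > 2) {
        std::vector<std::size_t> next;
        next.push_back(0);

        std::size_t i = 0;
        for (; i + 2 < offsets.size(); i += 2) {
            // Merge run i and run i + 1 into one sorted run.
            std::inplace_merge(at(offsets[i]), at(offsets[i + 1]), at(offsets[i + 2]));
            next.push_back(offsets[i + 2]);
        }

        // With an odd number of runs, the last run has no partner in this
        // pass and is carried over unchanged.
        if (i + 1 < offsets.size()) {
            next.push_back(offsets[i + 1]);
        }

        offsets.swap(next);
    }
}
```

For example, calling `merge_sorted_runs(v, {0, 3, 7, 8, 9})` on a vector with four sorted chunks of lengths 3, 4, 1 and 1 should leave `v` fully sorted. Whether this beats a plain thrust::sort on the whole compacted stream is exactly what I am unsure about.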