Thrust `zip_iterator` with arbitrary number of iterators

@striker159 Thanks to your suggestion, I found a way to solve my problem. You were correct that using `thrust::cuda::par_nosync` allows for efficient calls on multiple iterators.

This technique, however, needs a little rework before it can be used with `thrust::sort_by_key`. The first call sorts the keys in place, so their original order is lost and they cannot be reused to sort the remaining iterators. Sorting once per iterator would also be inefficient, since the same key comparisons would be repeated each time. My solution is to create an index vector with `thrust::sequence`, then call `thrust::sort_by_key` once with the original keys as keys and the index vector as values. The sorted index vector is then the permutation that puts the keys in order, and every other iterator can be reordered one by one, asynchronously, with `thrust::gather` using this index vector as the map.

Maybe this can help others. Anyway, thank you for your suggestion!