The thrust sort backend has two implementations: a fast radix sort and a much slower but more general comparison sort.

Deep in the dispatch files there’s a line that compares the passed-in comparison predicate to the thrust-provided less<> template:
    static const bool use_radix_sort = thrust::detail::is_arithmetic<KeyType>::value &&
        (thrust::detail::is_same<StrictWeakOrdering, typename thrust::less<KeyType> >::value ||
         thrust::detail::is_same<StrictWeakOrdering, typename thrust::greater<KeyType> >::value);

As you can see, it uses the radix sort implementation if the key is arithmetic (i.e. a built-in numeric type) and if thrust::less<> or thrust::greater<> is used. Although your predicate is identical in implementation to thrust::less<>, it’s not the same type, so the thrust dispatch mechanism is using the slower comparison sort.
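To see why an identical predicate still fails the check, here is a minimal sketch that mirrors the condition using the standard library's type traits as a stand-in for the thrust::detail ones. The names my_less and would_use_radix_sort are hypothetical and exist only for this illustration; this is not Thrust's actual dispatch code.

```cpp
#include <functional>
#include <type_traits>

// Hypothetical user-defined predicate: same behavior as std::less<unsigned int>,
// but a distinct type as far as the compiler is concerned.
struct my_less {
    bool operator()(unsigned int a, unsigned int b) const { return a < b; }
};

// Stand-in for the dispatch condition, written with std:: traits
// (an assumption for illustration, not the library's own code).
template <typename KeyType, typename StrictWeakOrdering>
struct would_use_radix_sort {
    static constexpr bool value =
        std::is_arithmetic<KeyType>::value &&
        (std::is_same<StrictWeakOrdering, std::less<KeyType> >::value ||
         std::is_same<StrictWeakOrdering, std::greater<KeyType> >::value);
};

// The library-provided functors pass the check...
static_assert(would_use_radix_sort<unsigned int, std::less<unsigned int> >::value,
              "std::less on an arithmetic key selects the radix path");
static_assert(would_use_radix_sort<unsigned int, std::greater<unsigned int> >::value,
              "std::greater on an arithmetic key selects the radix path");

// ...but a hand-written functor with the same body does not.
static_assert(!would_use_radix_sort<unsigned int, my_less>::value,
              "a custom functor is a different type, so the check fails");
```

The comparison is purely on types, so the body of the functor never enters into it. Passing the library's own less<> (or no predicate at all) is what lets the check succeed.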

It’s easiest and fastest if you just supply a single 32-bit unsigned integer key. Try to invoke the sort that way.

If you want to sort by val1-val4… so that you effectively have a 128-bit key (!), do four 32-bit key/index sorts instead. Because radix sort is stable, sort by the least significant key first, then the next least significant key, and so on. At the end, gather your data using the indices that were carried along through the sorts.
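The multi-pass idea can be sketched on the CPU, with std::stable_sort standing in for a stable key/index sort on the device. The Record layout (val[0] most significant, val[3] least significant) and the helper names sorted_indices and gather are assumptions made for this example, not part of your code or of Thrust.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical record: four 32-bit fields forming one 128-bit key,
// with val[0] the most significant and val[3] the least significant.
struct Record {
    std::array<std::uint32_t, 4> val;
};

// Returns the permutation that orders recs by (val[0], val[1], val[2], val[3]).
std::vector<std::size_t> sorted_indices(const std::vector<Record>& recs) {
    std::vector<std::size_t> idx(recs.size());
    std::iota(idx.begin(), idx.end(), std::size_t(0));

    // One stable pass per field, least significant field first.
    // Stability preserves the order established by the earlier passes
    // among elements that tie on the current field.
    for (int field = 3; field >= 0; --field) {
        std::stable_sort(idx.begin(), idx.end(),
                         [&recs, field](std::size_t a, std::size_t b) {
                             return recs[a].val[field] < recs[b].val[field];
                         });
    }
    return idx;
}

// Gathers the records into sorted order using the index permutation.
std::vector<Record> gather(const std::vector<Record>& recs,
                           const std::vector<std::size_t>& idx) {
    std::vector<Record> out;
    out.reserve(recs.size());
    for (std::size_t i : idx) {
        out.push_back(recs[i]);
    }
    return out;
}
```

On the GPU, each pass would be a stable key/index sort on one 32-bit field, and the final gather moves the full records only once.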