But this feels like too many if-statements, and it doesn’t really sort things that well.

I want to treat each uint4 as one really big number, basically. Using this algorithm, some sample output is:

for (int i = 0; i < hf.size(); ++i)
{
    const uint4 x = hf[i];
    std::cout << x.x << x.y << x.z << x.w << std::endl;
}
=>
1297995717402077544712960039091296003909
1871294381966247556202119643202119643 // shouldn't this value be "smaller"?
1871294381966247556202119643202119643
1896240623155948283137955853523795585352
2049656309427418867014147655021414765502
2049656309427418867014147655021414765502
2190441536208074642621427028732142702873
2443920497413065313011105256511110525651
254049324260073726413352427991335242799
3640923726229267969420487107022048710702
3774043292342266168324004503752400450375
3774043292342266168324004503752400450375

All I really need out of this algorithm is to keep like entries next to each other, so I guess the order doesn’t really matter, but I’m not particularly fond of what my comparator is doing at the moment.

The default behavior of std::cout for a uint quantity isn’t going to print leading zeros, right?

That could explain your “apparently misordered” output. I’m pretty sure you can modify that with some precursor code for std::cout.

I’m pretty sure this statement is redundant:

if (lhs.w == rhs.w)
{
    return false;
}

and could simply be deleted with no functional change, but the compiler might already be doing that for you.

And even if a vector type for unsigned long long is not available (I’m not sure whether it is), you could play around with recasting the uint4 into a sequence of two unsigned long long quantities, reducing your functor from four levels of testing to two. Whether that is worth it performance-wise, I don’t know. This last suggestion depends on a proper little-endian storage order in .x, .y, .z, .w.
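For reference, a host-side sketch of the two-level idea, assuming little-endian word order (.x least significant) and using a stand-in struct in place of CUDA’s uint4:

```cpp
#include <cstdint>

// Stand-in for CUDA's uint4 on the host (illustration only).
struct uint4_t { unsigned int x, y, z, w; };

// Assemble each uint4 into two 64-bit halves, assuming .x is the least
// significant word, then compare with two tests instead of four.
bool less_as_two_u64(const uint4_t &a, const uint4_t &b)
{
    std::uint64_t a_hi = (std::uint64_t(a.w) << 32) | a.z;
    std::uint64_t b_hi = (std::uint64_t(b.w) << 32) | b.z;
    if (a_hi != b_hi) return a_hi < b_hi;
    std::uint64_t a_lo = (std::uint64_t(a.y) << 32) | a.x;
    std::uint64_t b_lo = (std::uint64_t(b.y) << 32) | b.x;
    return a_lo < b_lo;
}
```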

It is not really clear to me how you would like the uint4 variables to be ordered. Is the intention to treat each uint4 as an unsigned 128-bit integer and order accordingly? If so, the easiest way would be to perform a 128-bit integer subtraction and examine the carry-out of the result. You could use the following code for 128-bit integer addition as a starting point:
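(The addition code referred to above is not reproduced in this quote. As a hedged stand-in, not the originally posted code, a C-level sketch of carry-propagating 128-bit addition over four 32-bit words, with .x least significant, might look like this:)

```cpp
#include <cstdint>

struct u128_t { std::uint32_t x, y, z, w; };  // .x = least significant word

// Sketch only: add word by word, carrying the overflow of each 32-bit
// partial sum (held in the upper half of a 64-bit accumulator) into the
// next word.
u128_t add_u128(const u128_t &a, const u128_t &b)
{
    u128_t r;
    std::uint64_t s = std::uint64_t(a.x) + b.x;
    r.x = std::uint32_t(s);
    s = (s >> 32) + a.y + b.y;
    r.y = std::uint32_t(s);
    s = (s >> 32) + a.z + b.z;
    r.z = std::uint32_t(s);
    s = (s >> 32) + a.w + b.w;
    r.w = std::uint32_t(s);
    return r;
}
```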

njuffa, I’ll have to look into your solution a bit later but it seems promising. I can’t believe I actually have a need for learning some assembly. This is the first time for me and it seems really, really cool.

One can perform a 128-bit integer subtraction at C level, but it will not be nearly as efficient as when using inline PTX since C/C++ do not have a native notion of a carry flag. You may want to give this a try:

/* .x = least significant bits, .w = most significant bits */
typedef uint4 my_uint128_t;
/* uint128_lt determines whether argument 'a' is less than argument 'b';
   returns 0 if a >= b, 1 if a < b
*/
__device__ int uint128_lt (my_uint128_t a, my_uint128_t b)
{
    int res;
    asm (".reg .u32 tmp;\n\t"
         "sub.cc.u32  tmp, %1, %5;\n\t"
         "subc.cc.u32 tmp, %2, %6;\n\t"
         "subc.cc.u32 tmp, %3, %7;\n\t"
         "subc.cc.u32 tmp, %4, %8;\n\t"
         "subc.u32    %0, 0, 0;\n\t"  // capture carry out: 0 or 0xffffffff
         "and.b32     %0, %0, 1;\n\t" // create clean boolean
         : "=r"(res)
         : "r"(a.x), "r"(a.y), "r"(a.z), "r"(a.w),
           "r"(b.x), "r"(b.y), "r"(b.z), "r"(b.w));
    return res;
}

If the keys are effectively uniformly distributed, the first word alone is enough to decide the ordering, with the exception of only about 1 in 4 billion cases. So optimize for that case instead of doing an expensive 128-bit subtraction every single time. You’ll rarely get divergence (less than one in a billion comparisons!), so there’s less worry about the ugly chain of tests.

This is a cleaner way to write the comparison function. It’s about the same efficiency, but looks nicer.

__host__ __device__ bool operator()(const uint4 &lhs, const uint4 &rhs) const
{
    if (lhs.x < rhs.x) return true;
    if (lhs.x > rhs.x) return false;
    if (lhs.y < rhs.y) return true;
    if (lhs.y > rhs.y) return false;
    if (lhs.z < rhs.z) return true;
    if (lhs.z > rhs.z) return false;
    return lhs.w < rhs.w;
}

Excellent advice. I now recall optimizing code this way back in the early x86 days when arithmetic operations had to be synthesized from 16-bit machine instructions, but apparently I had completely forgotten about this important probability aspect!

I assume by “first word” you mean “most significant word”.
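Under the convention from the PTX version earlier in the thread (.x least significant, .w most significant), the fast-path comparator would test .w first. A host-side sketch (the struct stands in for CUDA’s uint4):

```cpp
#include <cstdint>

// Stand-in for CUDA's uint4 on the host (illustration only).
struct uint4_t { unsigned int x, y, z, w; };

// Sketch assuming .x is the least significant word: compare from the most
// significant word (.w) downward, so that for uniformly distributed keys
// the first test almost always decides the ordering.
bool less128(const uint4_t &lhs, const uint4_t &rhs)
{
    if (lhs.w < rhs.w) return true;
    if (lhs.w > rhs.w) return false;
    if (lhs.z < rhs.z) return true;
    if (lhs.z > rhs.z) return false;
    if (lhs.y < rhs.y) return true;
    if (lhs.y > rhs.y) return false;
    return lhs.x < rhs.x;
}
```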