Bitslice-DES optimization

I am trying to find a higher-throughput bit-conditional inverse scheme:

// int a, b, c, d;
d = (a & (1 << b)) ? c : ~c;

Here’s what the compiler currently does:

bfe.u32 d, a, b, 1
add.s32 d, d, -1
xor.b32 d, d, c

Quite clever, I’ll have to admit, but are there faster representations? (Inverting c and then using selp is slower, provided the latency is hidden to a degree.)

(I’m using CUDA 7.5RC and targeting sm_52)

EDIT: Please see post #10 below for more information.
EDIT2: I’ve open-sourced my implementation. See post #48.

Could the new LOP3 opcode help at all?

A related thread is here: https://devtalk.nvidia.com/default/topic/690631/cuda-programming-and-performance/so-whats-new-about-maxwell-/10/

Unfortunately I don’t see how it would help in this case. If you can get to 0xFFFFFFFF (-1) and 0, you only need an xor.b32.

While BFE probably has limited throughput, any other fully general solution I can think of has at least four instructions and is unlikely to be faster. The question itself suggests that this may be a case of an XY problem.

njuffa, I am implementing bitslice DES, which has a lot of lines that look like this:

s1(d_31^k_12,d_0^k_46,d_1^k_33,d_2^k_52,d_3^k_48,d_4^k_20,&d_40,&d_48,&d_54,&d_62);

Due to the nature of bitslice DES, a lot of the keys (k_n) are all ones (0xFFFFFFFF) or all zeros (0), but whether they are ones or zeros is not known until runtime, so the data words (d_m) are simply either flipped or not before being fed into the s1 function. To relieve register pressure and aim for higher occupancy, I thought of compressing 32 all-ones or all-zeros keys into one integer and using a bit-conditional scheme to flip the data. After benchmarking, I discovered that the scheme posted in the OP can indeed increase occupancy, but the actual speed of the algorithm is reduced due to the additional instructions, so I asked if there is a better bit-conditional inverse scheme.
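Roughly, the idea looks like this (a sketch only; kbits, k and the other names here are just illustrative, not the actual code):

// pack 32 keys that are each either all ones or all zeros into one word,
// one bit per key
unsigned int kbits = 0;
for (int n = 0; n < 32; n++)
    kbits |= (k[n] & 1u) << n;                // k[n] is 0x00000000 or 0xFFFFFFFF

// later, instead of keeping k_12 in a register, the xor d_31 ^ k_12 is
// reconstructed from bit 12 of kbits:
unsigned int m = 0u - ((kbits >> 12) & 1u);   // 0xFFFFFFFF or 0x00000000
unsigned int x = d_31 ^ m;                    // conditionally inverted data word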

s1 is a complex function that is regarded as already optimal and cannot be changed.

I am afraid that without a lot more context that line of code means nothing to me. Are you porting a particular public code base that you could point at? It’s been a long time since I last worked on a DES implementation, but as I recall minimizing the SBOX logic with espresso was part of the process. I assume that kind of optimization has already been applied by the author(s) of the code?

How high is the occupancy currently? If the code is compute-limited (and from your comments it appears it may be) you very likely do not need extremely high occupancy for full performance.

How about the following PTX:

d = slct(~c, c, shl(a, 31-b))

Or more precisely:

shl.b32       e,a,31-b;
slct.b32.s32  d,~c,c,e;

It should be 2 ops ― shl + slct ― but only if you can somehow not count the adjustment to ‘b’. [ Edit: and the ~c. ]

I don’t know if this will generate the PTX sequence I’m imagining or if you’ll have to drop down to writing inline PTX:

d = (a << (31 - b)) >= 0 ? ~c : c;

You’ll have to dump the SASS on your target arch to see what really gets emitted though.

Later… Ah, but I see that ~c is also dynamic.

OK, it’s going to be more than 2 ops but the SLCT opcode is interesting. :)
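If you do drop down to inline PTX, a rough (untested) sketch of the shl + slct idea, with ‘b’ as a compile-time constant, would be something like the following; the ~c is the extra op:

__device__ __forceinline__ int cond_invert_slct(int a, int b, int c)
{
    int e = a << (31 - b);              // move bit b of a into the sign position
    int d;
    asm("slct.b32.s32 %0, %1, %2, %3;"  // e >= 0 (bit clear) -> ~c, e < 0 (bit set) -> c
        : "=r"(d)
        : "r"(~c), "r"(c), "r"(e));
    return d;
}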

Very clever. I was thinking of a right shift, instead of a left shift to exploit the sign bit, which is often a good idea.

I’ll take “Beat the Compiler” for $100, please, Alex.

njuffa, here’s a sample code snippet which is already a simplification of this.

Things I’ve tried to optimize it:

  1. Partially unroll the r loop (#pragma unroll 2, #pragma unroll 4), including manually unrolling it and eliminating the switches. This helps to a degree, but is not included in the file for clarity. You can completely eliminate the switches by fully unrolling the r loop (see “this” above), but doing so makes the code bigger than the instruction cache, leading to instruction fetch stalls that incur a ~60% performance penalty.
  2. Compute two or more s functions after one switch. This leads to better interleaving and fewer execution dependency stalls, but increases register pressure since you need more sets of temporary variables p_n. The sweet spot seems to be two s functions after one switch (at 168 regs) or eight s functions after one switch (at ~220 regs), but I’m only showing one s function after one switch here for clarity.

Things I’m currently investigating:

  1. Whether compressing the keys helps. Note that the additional instructions may bloat the code and make loop unrolling less effective.
  2. Whether the data swap at the end of each round (i loop) can be eliminated. Since it could potentially lead to more switches, I am not holding out much hope.
  3. Using a key swapping method to eliminate switches. This seems to be faster but also incurs additional instructions.
  4. Whether the xor in the SBOXes and the xor outside the SBOXes can be combined with a single LOP3.

The SBOXes are from here, discovered by myself with a stochastic algorithm that I developed and partially adapted from Matthew Kwan’s “Reducing the Gate Count of Bitslice DES”. They are the only full set of SBOXes using LOP3.LUT on the web at the time of writing, and I believe they are very close to optimal (23.5 average gate count). They were discovered with $50 of Amazon EC2 credit spent on a 36-vcore server (a c4.8xlarge spot instance running for a week), and I am not yet ready to invest more money. Roman Rusakov mentioned an s4 (the least complex SBOX function) expression here that is one gate smaller, but he uses a deterministic algorithm that would take even longer to find the others (which I assume is why only s4 was given at the time), and my average gate count is better than his estimate at the time of this post.

The algorithm is compute bound, with 80% warp issue efficiency. The stalls are split roughly equally between execution dependency and instruction fetch. The arithmetic pipe is at 80% utilization, and I’m trying to push it another 10%. Occupancy is 18.75% with 168 registers.

Tell me if you want more code or even the whole project.

allanmac, that is very interesting, but I can’t cache the inverse of c. However, b is known at compile time, so I’ll definitely try the shl.b32 and then another two instructions to get d. Since SHL is faster than BFE, I’ll see if that helps.

Ah, then maybe you could try:

xor(c,shr.s32(shl.b32(a,31-b),31))

This will return ~c if bit ‘b’ in ‘a’ is set (and c if not set), which is the opposite polarity of what you originally asked for.
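In C terms that’s just (sketch; it relies on ‘a’ being signed so the right shift replicates the sign bit):

d = c ^ ((a << (31 - b)) >> 31);   // shifted term is all ones if bit b of a is set, else all zeros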

If b is known at compile time, I see no need to shift at all. You can simply precompute (1 << b) and use the resulting mask as an immediate constant to test the bit, then use ISET to check the result against zero (this gives either 0xFFFFFFFF or 0x00000000), then XOR with the ISET result. That is three fast instructions.

I am surprised by the statement that if the unrolled inner loop exceeds the ICache size there is a hefty performance penalty. In my experience that penalty has never been larger than about 3%. I wonder whether you are hitting a pathological case of ICache thrashing? Or maybe some Maxwell-specific issue (I have zero experience with Maxwell). The GPU does not have branch prediction and always fetches in straight ascending address order. With a large loop body that exceeds the ICache, the loop-closing branch causes fetching to be restarted at the top of the loop, which immediately incurs an ICache miss.

@allanmac: In all likelihood you will find that arithmetic right shift is slow, because if I recall correctly the funnel shifter does not support arithmetic right shift. Which is fine, since it is a rare operation in general. To generate masks of all 0s or all 1s from the sign bit, one would either want to use ICMP, ISET, or the sign-extension feature of PERM (only accessible from PTX).
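If I remember the selector encoding correctly, the PERM variant is a single prmt where each selector nibble picks the most significant byte and replicates its sign; please double-check this against the PTX manual:

prmt.b32  m, x, 0, 0xbbbb;   // m = (x < 0) ? 0xffffffff : 0x00000000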

I give up — you can’t beat @njuffa or the compiler!

Thank you njuffa. The fact that you don’t need to shift at all eluded me, presumably due to lack of sleep. I’ll try that now.

I observed a 5% performance gain after using what njuffa suggested, but that’s still not enough to offset the performance lost to the bloated code and additional instructions. If anyone has any suggestions regarding what I posted in post #10, I’m all ears.

If you check the SASS, do you actually see an ISET instruction? You may need to code with inline PTX because my observation is that the compiler has acquired a distaste for ISET in recent years. Not sure whether that is indicative of reduced throughput for that instruction. So at PTX level you’d have something like

and.b32         t, a, k;   // t = a & k, with k = (1 << b), an immediate constant
set.eq.u32.b32  m, t, 0;   // m = (t == 0) ? 0xffffffff : 0
xor.b32         d, m, c;   // (t == 0) ? ~c : c
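Wrapped in inline PTX, this might look roughly like the following (untested sketch; B stands in for the compile-time bit position):

template <unsigned int B>
__device__ __forceinline__ unsigned int cond_invert(unsigned int a, unsigned int c)
{
    unsigned int d;
    asm("{\n\t"
        ".reg .u32 t, m;\n\t"
        "and.b32        t, %1, %2;\n\t"    // t = a & (1 << B), immediate mask
        "set.eq.u32.b32 m, t, 0;\n\t"      // m = (t == 0) ? 0xffffffff : 0
        "xor.b32        %0, m, %3;\n\t"    // d = (t == 0) ? ~c : c
        "}"
        : "=r"(d)
        : "r"(a), "n"(1u << B), "r"(c));
    return d;
}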

Speaking of generating a mask of all 0s or 1s from the sign bit, I recently saw the CUDA compiler use __mulhi(x, 2) to do that. A clever idea, and it works, but I am still wondering when this would ever be advantageous compared to the other methods. Maybe the compiler analyzed the local pressure on the various execution units and found that it was high while the multiplier was unused at that point.
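That is, presumably something along the lines of:

int mask = __mulhi(x, 2);   // high 32 bits of the 64-bit product 2*x, i.e. x >> 31 arithmetically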

Yes, I am using inline PTX. The performance is compared with when using what’s mentioned in OP with inline PTX.

It is hard to tell what the code snippet in #10 does since the macro is not shown. Here is an idea in the “wave a rubber chicken over your monitor” voodoo magic class: I notice that all operands are of type ‘int’ although we seem to be doing only logic operations. Does anything change if you switch that to ‘unsigned int’?

I am still bummed out that fully unrolling the innermost loop does horrible things to performance according to your experiment, because that seems like such an obvious way to boost performance. switch statements are somewhat of a performance killer on the GPU as I recall, and I have avoided them like the plague.

Which snippet are you talking about? I think I’ve shown all the macros.