Bitslice-DES optimization

I am trying to find a higher-throughput bit-conditional inverse scheme:

// int a, b, c, d;
d = (a & (1 << b)) ? c : ~c;

Here’s what the compiler currently does:

bfe.u32 d, a, b, 1
add.s32 d, d, -1
xor.b32 d, d, c

Quite clever, I’ll have to admit, but are there faster representations? (Inverting c and then using selp is slower, provided the latency is hidden to a degree.)

(I’m using CUDA 7.5RC and targeting sm_52)

EDIT: Please see post #10 below for more information.
EDIT2: I’ve open-sourced my implementation. See post #48.

Could the new LOP3 opcode help at all?

A related thread is here: https://devtalk.nvidia.com/default/topic/690631/cuda-programming-and-performance/so-whats-new-about-maxwell-/10/

Unfortunately I don’t see how it would help in this case. If you can get to 0xFFFFFFFF (-1) and 0, you only need an xor.b32.

While BFE probably has limited throughput, any other fully general solution I can think of has at least four instructions and is unlikely to be faster. The question itself suggests that this may be a case of an XY problem.

njuffa, I am implementing bitslice DES, which has a lot of lines that look like this:

s1(d_31^k_12,d_0^k_46,d_1^k_33,d_2^k_52,d_3^k_48,d_4^k_20,&d_40,&d_48,&d_54,&d_62);

Due to the nature of bitslice DES, a lot of the keys (k_n) are all ones (0xFFFFFFFF) or all zeros (0), but whether they are ones or zeros is not known until runtime, so the data words (d_m) are simply either flipped or not before being fed into the s1 function. To relieve register pressure and aim for higher occupancy, I thought of compressing 32 all-ones or all-zeros keys into one integer and using a bit-conditional scheme to flip the data. After benchmarking, I discovered that the scheme posted in the OP can indeed increase occupancy, but the actual speed of the algorithm is reduced due to the additional instructions, so I asked if there is a better bit-conditional inverse scheme.
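Roughly, the idea looks like this (a sketch only; kbits, k and the other names here are just illustrative, not the actual code):

// pack 32 keys that are each either all ones or all zeros into one word,
// one bit per key
unsigned int kbits = 0;
for (int n = 0; n < 32; n++)
    kbits |= (k[n] & 1u) << n;                // k[n] is 0x00000000 or 0xFFFFFFFF

// later, instead of keeping k_12 in a register, the xor d_31 ^ k_12 is
// reconstructed from bit 12 of kbits:
unsigned int m = 0u - ((kbits >> 12) & 1u);   // 0xFFFFFFFF or 0x00000000
unsigned int x = d_31 ^ m;                    // conditionally inverted data word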

s1 is a complex function that is regarded as already optimal and cannot be changed.

I am afraid that without a lot more context that line of code means nothing to me. Are you porting a particular public code base that you could point at? It’s been a long time since I last worked on a DES implementation, but as I recall minimizing the SBOX logic with espresso was part of the process. I assume that kind of optimization has already been applied by the author(s) of the code?

How high is the occupancy currently? If the code is compute-limited (and from your comments it appears it may be) you very likely do not need extremely high occupancy for full performance.

How about the following PTX:

d = slct(~c, c, shl(a, 31-b))

Or more precisely:

shl.b32       e,a,31-b;
slct.b32.s32  d,~c,c,e;

It should be 2 ops ― shl + slct ― but only if you can somehow not count the adjustment to ‘b’. [ Edit: and the ~c. ]

I don’t know if this will generate the PTX sequence I’m imagining or if you’ll have to drop down to writing inline PTX:

d = (a << (31 - b)) >= 0 ? ~c : c;

You’ll have to dump the SASS on your target arch to see what really gets emitted though.

Later… Ah, but I see that ~c is also dynamic.

OK, it’s going to be more than 2 ops but the SLCT opcode is interesting. :)
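If you do drop down to inline PTX, a rough (untested) sketch of the shl + slct idea, with ‘b’ as a compile-time constant, would be something like the following; the ~c is the extra op:

__device__ __forceinline__ int cond_invert_slct(int a, int b, int c)
{
    int e = a << (31 - b);              // move bit b of a into the sign position
    int d;
    asm("slct.b32.s32 %0, %1, %2, %3;"  // e >= 0 (bit clear) -> ~c, e < 0 (bit set) -> c
        : "=r"(d)
        : "r"(~c), "r"(c), "r"(e));
    return d;
}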

Very clever. I was thinking of a right shift, instead of a left shift to exploit the sign bit, which is often a good idea.

I’ll take “Beat the Compiler” for $100, please, Alex.

njuffa, here’s a sample code snippet which is already a simplification of this.

Things I’ve tried to optimize it:

  1. Partially unroll the r loop (#pragma unroll 2, #pragma unroll 4), including manually unrolling it and eliminating the switches. This helps to a degree, but is not included in the file for clarity. You can completely eliminate the switches by fully unrolling the r loop (see “this” above), but doing so makes the code bigger than the instruction cache, leading to instruction fetch stalls that incur a ~60% performance penalty.
  2. Compute two or more s functions after one switch. This leads to better interleaving and fewer execution dependency stalls, but increases register pressure since you need more sets of temporary variables p_n. The sweet spot seems to be two s functions after one switch (at 168 regs) or eight s functions after one switch (at ~220 regs), but I’m only showing one s function after one switch here for clarity.

Things I’m currently investigating:

  1. Whether compressing the keys helps. Note that the additional instructions may bloat the code and make loop unrolling less effective.
  2. Whether the data swap at the end of each round (i loop) can be eliminated. Since it could potentially lead to more switches, I am not holding out much hope.
  3. Using a key swapping method to eliminate switches. This seems to be faster but also incurs additional instructions.
  4. Whether the xor in the SBOXes and the xor outside the SBOXes can be combined with a single LOP3.

The SBOXes are from here, discovered by myself with a stochastic algorithm that I developed and partially adapted from Matthew Kwan’s “Reducing the Gate Count of Bitslice DES”. They are the only full set of SBOXes using LOP3.LUT on the web at the time of writing, and I believe they are very close to optimal (23.5 average gate count). They were discovered with $50 of Amazon EC2 credit spent on a 36-vcore server (a c4.8xlarge spot instance running for a week), and I am not yet ready to invest more money. Roman Rusakov mentioned an s4 (the least complex SBOX function) expression here that is one gate smaller, but he uses a deterministic algorithm that would take even longer to find the others (which I assume is why only s4 was given at the time), and my average gate count is better than his estimate at the time of this post.

The algorithm is compute bound, with 80% warp issue efficiency. The stalls are split roughly equally between execution dependency and instruction fetch. The arithmetic pipe is at 80% utilization, and I’m trying to push it another 10%. Occupancy is 18.75% with 168 registers.

Tell me if you want more code or even the whole project.

allanmac, that is very interesting, but I can’t cache the inverse of c. However, b is known at compile time, so I’ll definitely try the shl.b32 and then another two instructions to get d. Since SHL is faster than BFE, I’ll see if that helps.

Ah, then maybe you could try:

xor(c,shr.s32(shl.b32(a,31-b),31))

This will return ~c if bit ‘b’ in ‘a’ is set (and c if not set), which is the opposite polarity of what you originally asked for.
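In C terms that’s just (sketch; it relies on ‘a’ being signed so the right shift replicates the sign bit):

d = c ^ ((a << (31 - b)) >> 31);   // shifted term is all ones if bit b of a is set, else all zeros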

If b is known at compile time, I see no need to shift at all. You can simply precompute (1 << b) and use the resulting mask as an immediate constant to test the bit, then use ISET to check the result against zero (this gives either 0xFFFFFFFF or 0x00000000), then XOR with the ISET result. That is three fast instructions.

I am surprised by the statement that if the unrolled inner loop exceeds the ICache size there is a hefty performance penalty. In my experience that penalty has never been larger than about 3%. I wonder whether you are hitting a pathological case of ICache thrashing? Or maybe some Maxwell-specific issue (I have zero experience with Maxwell). The GPU does not have branch prediction and always fetches in straight ascending address order. With a large loop body that exceeds the ICache, the loop-closing branch causes fetching to be restarted at the top of the loop, which immediately incurs an ICache miss.

@allanmac: In all likelihood you will find that arithmetic right shift is slow, because if I recall correctly the funnel shifter does not support arithmetic right shift. Which is fine, since it is a rare operation in general. To generate masks of all 0s or all 1s from the sign bit, one would either want to use ICMP, ISET, or the sign-extension feature of PERM (only accessible from PTX).
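If I remember the selector encoding correctly, the PERM variant is a single prmt where each selector nibble picks the most significant byte and replicates its sign; please double-check this against the PTX manual:

prmt.b32  m, x, 0, 0xbbbb;   // m = (x < 0) ? 0xffffffff : 0x00000000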

I give up — you can’t beat @njuffa or the compiler!

Thank you njuffa. The fact that you don’t need to shift at all eluded me, presumably due to lack of sleep. I’ll try that now.

I observed a 5% performance gain after using what njuffa suggested, but that’s still not enough to offset the performance lost to the bloated code and additional instructions. If anyone has any suggestions regarding what I posted in post #10, I’m all ears.

If you check the SASS, do you actually see an ISET instruction? You may need to code with inline PTX because my observation is that the compiler has acquired a distaste for ISET in recent years. Not sure whether that is indicative of reduced throughput for that instruction. So at PTX level you’d have something like

and.b32         t, a, k;   // t = a & k, with k = (1 << b), an immediate constant
set.eq.u32.b32  m, t, 0;   // m = (t == 0) ? 0xffffffff : 0
xor.b32         d, m, c;   // (t == 0) ? ~c : c
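Wrapped in inline PTX, this might look roughly like the following (untested sketch; B stands in for the compile-time bit position):

template <unsigned int B>
__device__ __forceinline__ unsigned int cond_invert(unsigned int a, unsigned int c)
{
    unsigned int d;
    asm("{\n\t"
        ".reg .u32 t, m;\n\t"
        "and.b32        t, %1, %2;\n\t"    // t = a & (1 << B), immediate mask
        "set.eq.u32.b32 m, t, 0;\n\t"      // m = (t == 0) ? 0xffffffff : 0
        "xor.b32        %0, m, %3;\n\t"    // d = (t == 0) ? ~c : c
        "}"
        : "=r"(d)
        : "r"(a), "n"(1u << B), "r"(c));
    return d;
}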

Speaking of generating a mask of all 0s or 1s from the sign bit, I recently saw the CUDA compiler use __mulhi(x, 2) to do that. A clever idea, and it works, but I am still wondering when this would ever be advantageous compared to the other methods. Maybe the compiler analyzed the local pressure on the various execution units and found that it was high while the multiplier was unused at that point.
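That is, presumably something along the lines of:

int mask = __mulhi(x, 2);   // high 32 bits of the 64-bit product 2*x, i.e. x >> 31 arithmetically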

Yes, I am using inline PTX. The performance is compared with when using what’s mentioned in OP with inline PTX.

It is hard to tell what the code snippet in #10 does since the macro is not shown. Here is an idea in the “wave a rubber chicken over your monitor” voodoo magic class: I notice that all operands are of type ‘int’ although we seem to be doing only logic operations. Does anything change if you switch that to ‘unsigned int’?

I am still bummed out that fully unrolling the innermost loop does horrible things to performance according to your experiment, because that seems like such an obvious way to boost performance. switch statements are somewhat of a performance killer on the GPU as I recall, and I have avoided them like the plague.

Which snippet are you talking about? I think I’ve shown all the macros.