Bitslice-DES optimization

[url]http://pastie.org/private/0c3brtsgwzvdxvrirbynrq[/url] There is a macro for each SBOX, but the macros themselves, e.g. s1(), are not shown. Presumably they are pulled in from “sbox.h”. I assume they mostly do a bunch of XORs?
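
For readers unfamiliar with bitslicing, here is a purely illustrative sketch of what such an S-box function amounts to (hypothetical gates, not the contents of sbox.h): each unsigned int lane carries one bit position of 32 independent DES blocks, so every logic operation evaluates 32 keys at once, and a real S-box is a network of a few dozen such AND/OR/XOR/NOT gates.

```cuda
// Purely illustrative sketch (not the real sbox.h): in bitslice DES each
// S-box is a boolean network. Each unsigned int holds one bit position of
// 32 independent DES blocks, so each gate below processes 32 keys at once.
__device__ __forceinline__ void sbox_sketch(unsigned int a1, unsigned int a2,
                                            unsigned int a3, unsigned int a4,
                                            unsigned int a5, unsigned int a6,
                                            unsigned int *out1)
{
    unsigned int t1 = a1 ^ a4;
    unsigned int t2 = t1 | a2;
    unsigned int t3 = t2 & ~a3;
    unsigned int t4 = t3 ^ (a5 & a6);
    *out1 ^= t4;   // XOR the S-box output bit into the destination lane
}
```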

I do not have any additional ideas at the moment. It seems you have thought long and hard about the issue and already explored all the high-level algorithmic details. Now I am simply curious where the intense interest in DES comes from at this time, given that it became obsolete for practical encryption needs many years ago. Already when I worked on an FPGA implementation of DES back in the early 1990s, security-conscious types recommended triple-DES because the effective key length of plain DES was just too short.

Changing datas and keys to unsigned int doesn’t seem to improve performance, and doesn’t even seem to change the SASS at all. Nor should it, because the heavy lifting is all LOP3.LUT and LOP.XOR already (which is very sexy).
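
(As an aside, for anyone reading along: ptxas fuses chains of two-input logic ops into LOP3.LUT on its own, but the instruction can also be requested explicitly via PTX lop3.b32, available from CUDA 7.5 on sm_50+. A minimal sketch, where the immediate 0x96 is the 3-input truth table for a ^ b ^ c:)

```cuda
// Hedged illustration (not from the posted code): explicit LOP3 via inline
// PTX. ptxas normally forms LOP3.LUT from chains of two-input logic ops on
// its own; 0x96 is the truth table for the three-way XOR a ^ b ^ c.
__device__ __forceinline__ unsigned int xor3(unsigned int a,
                                             unsigned int b,
                                             unsigned int c)
{
    unsigned int d;
    asm("lop3.b32 %0, %1, %2, %3, 0x96;"
        : "=r"(d)
        : "r"(a), "r"(b), "r"(c));
    return d;
}
```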

And if full unrolling worked, we wouldn’t be having this conversation :D

I’m eager to see what Pascal has to offer other than 3D memory and faster FP16. Please let there be a larger instruction cache!

They are inline functions from here, which I posted in #10 (you might have missed it).

The interest comes from generating vanity tripcodes. For example,

#騨レNWKJ諤

maps to

!YYYYYYYYYY

You can test it here.


Well, as I said this was a “wave a rubber chicken over the monitor” kind of exercise. We wouldn’t expect any differences, but it is a quick thing to try on the off chance that the signedness is interfering with some compiler optimization. Usually it is the other way around: since unsigned arithmetic has prescribed wraparound behavior, using ‘int’ instead of ‘unsigned int’ is often the higher-performance approach.
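
A minimal illustration of that last point (an illustrative kernel, not the DES code):

```cuda
// Illustrative kernel: with 'int i' the compiler may assume the counter
// never wraps (signed overflow is undefined behavior), which helps it
// strength-reduce the 64-bit address computation for data[i]. With
// 'unsigned int i', the defined modulo-2^32 wraparound can block that.
__global__ void scale(float *data, int n)
{
    for (int i = 0; i < n; i++) {
        data[i] *= 2.0f;
    }
}
```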

One interesting parallel of this LUT-based approach on Maxwell to my FPGA work is that the FPGAs I used at the time, I think Xilinx XC4000, were based on 5-input LUT hardware building blocks. So the SBOXes were the least of my worries. As I recall the wiring done by the primitive automatic place and route tools of the time was killing performance, since I used a significant portion of the CLBs on the chip.

Tripcodes? I had never heard of those so I just learned something new, which is cool.

5-input LUT? That sounds like exactly what I need. Does it take a 32-bit unsigned int as the actual table?

I have another question. In the key swapping approach I posted in #10, you can see the strange "switch(i1)"s in my desperate attempt to make it a jump block. Whether I use switch or if, they seem to be mapped to a bunch of SELs or ICMPs. Is there a way to make the compiler generate an actual jump?

Well the details of these CLBs were more complex. I found what looks like a reasonable description and diagram here: [url]https://www.clear.rice.edu/elec522/w6/xapp043.pdf[/url] Quote: “The F-G-H combination can implement any function of five inputs.”

I haven’t paid attention to how FPGAs evolved later on. However, the trend even back then was to go away from CLBs towards simpler hardware primitives with finer granularity, since oftentimes CLBs were underutilized in real-life designs. How often do we really need arbitrary functions of five inputs? But for the DES SBOXes these complex logic blocks worked well.

Turn off generation of SELs and ICMPs? Assuming the branches are still there at the PTX level, you can try lowering PTXAS optimization to -Xptxas -O2 or even lower to inhibit if-conversion. But that likely has other negative performance consequences. I am familiar with the general problem, but as long as the CUDA toolchain has no way of expressing branch probabilities or profile-driven optimization, there really isn’t a solution, as the compiler tries to eliminate branches, not knowing that in some cases the branch may be taken so rarely that it would be best to leave it as a branch.
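
For completeness, lowering the PTXAS optimization level is just a build-line switch (file name and architecture below are placeholders):

```
nvcc -arch=sm_52 -Xptxas -O2 -o des_kernel des_kernel.cu
nvcc -arch=sm_52 -Xptxas -O1 -o des_kernel des_kernel.cu   # even more conservative
```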

I missed your second paragraph because you edited it.

It’s better to show you:
Using “this” code in #10:

Remove 64 lines of s functions (there are a total of 128 lines) and make the i loop run 50 times instead of 25 times:

Huge difference. Further removal doesn’t seem to improve performance much, because it reduces the interleaving of instructions at the start and the end of the loop. My hypothesis is that the instruction fetch issue when code size exceeds the instruction cache is always there, but for high-occupancy kernels it can be well hidden; since I have used 168 registers, the problem is exacerbated.

Note that the “if (i != 24)” data swapping has little performance impact; the results above were acquired with it removed. The program generates wrong results without it, but for performance analysis that’s irrelevant.

They are gone at the PTX level (which has selp.b32s). Maybe I’ll try comparing against a float instead of an int.

I compared against 0.5f instead of comparing against 1. The swaps are now correctly mapped to MOVs, but I’ve got 200 bytes of register spilling. Maybe inline PTX will fix it.

If you can enforce certain restrictions on the integers being compared, you can use ‘float’ comparison on the re-interpreted bit pattern instead of ‘int’ comparison just to speed things up in general. I actually used this technique in a few places in the CUDA math library. It is pretty hacky for sure, especially if the ‘int’ in question is actually the upper half of a double-precision floating-point number :-)

The operands must be of the same sign, and must avoid NaN and denormal encodings since the ‘float’ comparisons are not properly ordered for those bit patterns. Actually, denormals would be fine if you can ensure the code is never built with -ftz=true. So small positive integers in particular can be compared after re-interpretation with __int_as_float() as long as there is no flush-to-zero.

Note that there are floating-point select instructions as well, so this may not solve your issue with undesired if-conversions being applied by the compiler.
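
A minimal sketch of the re-interpretation trick under those restrictions (illustrative names, not code from this thread):

```cuda
// Minimal sketch: compare two ints using the float comparison path.
// Assumes a and b are non-negative and small enough that their bit
// patterns are not NaN encodings, and that the code is not built with
// -ftz=true (so denormal bit patterns still order correctly). For such
// values the float ordering of the bit patterns matches the int ordering.
__device__ __forceinline__ bool lt_as_float(int a, int b)
{
    return __int_as_float(a) < __int_as_float(b);
}
```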

Using inline PTX helped, and it even eliminated the register spilling from the original key swapping code (from ~70 bytes to 0). To sum it up: the compiler turned a conditional MOV block into branch-free SELs and ICMPs; I prevented that, trading some pipe busy stalls (the SELs and ICMPs) for execution dependency stalls (the SELs and ICMPs used to be interleaved with computation) and instruction fetch stalls (the conditional jumps). The performance and warp issue efficiency don’t seem to change though, but the GPU does use a little bit less power, which is nice.
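
For illustration, this is the kind of inline PTX I mean (a hedged sketch with illustrative names, not the actual key-swap code):

```cuda
// Illustrative sketch: express the conditional move directly in PTX so the
// compiler cannot if-convert it into an ICMP/SEL pair. The predicated mov
// writes dst only when cond is non-zero.
__device__ __forceinline__ unsigned int cond_mov(int cond, unsigned int dst,
                                                 unsigned int src)
{
    asm("{\n\t"
        ".reg .pred p;\n\t"
        "setp.ne.s32 p, %1, 0;\n\t"
        "@p mov.u32 %0, %2;\n\t"
        "}\n\t"
        : "+r"(dst)
        : "r"(cond), "r"(src));
    return dst;
}
```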

Regarding the instruction stall issue with this code mentioned in post #29, I did some testing with the all-new CUDA 7.5 RC Visual Profiler, trying to find on which lines the warps stall.

The stalling (the green color corresponds to instruction fetch stalls) seems to be spaced out evenly, at every 24 instructions:

With the second version in post #29 (after removing 64 lines of s functions), it seems to be stalling (much less) every 12 instructions:

And with further removal, the interval stays at 12.

Why is that, and could it be related to why the original code stalls so much?

Hard to say. Keep in mind that Maxwell inserts a steering instruction for every three instructions, so the 12 instructions are really 16 instruction slots. Each instruction is 8 bytes, so 16 instructions are 128 bytes, which may be equal to the length of a cache line. Just a hypothesis. scottgray has studied the Maxwell architecture in much detail; he may have a better idea.

Thanks. I suspected the numbers might be related to the different levels of cache size, but 12 and 24 didn’t strike me as “GPU numbers” that are powers of 2. Where can I read about the steering instruction in Maxwell? I skimmed through the GTX 980 whitepaper and couldn’t find mentions of it.

To my knowledge there is no public documentation other than what scottgray reverse engineered. However, a simple disassembly of a Maxwell binary will show that these steering instructions are there because there will be a break in the address sequence after every three instructions.

After posting #33, I’ve also done the same for the data swap right before the end of the i loop. Doing so nearly eliminated the pipe busy stalls. Warp issue efficiency is now 84% while arithmetic load stays at 86%. 70% of the stalls are due to instruction fetch, which I partly introduced by forcing the compiler to create a jump block, and which is partly due to the code size, over which I have no control with the current method. (If I further reduced the code length with the key swapping method, the key swap overhead would become non-trivial; that was the finding of my benchmarking.)

With the key swapping method, I am getting close to 3000G bitops per second (I weighted LOP3 as 1, but it should really count as 2 or more…) and roughly 900M tripcodes/s * 25 rounds = 22500M DES keys/s on my 980 Ti.

I think I can squeeze another 5% in there with what I mentioned in post #10, so that fewer instructions are needed inside the inner loop. It might reduce instruction fetch stalls as well.

EDIT: It’s not immediately obvious how to do this correctly without introducing additional instructions beyond the LOP3s, plus additional registers, since it interferes with the computation of all future rounds (of the 16 in DES).

Forgive the noob question. What is the 900M tripcodes/s number? Is the purpose to be able to compute a user’s password from a given tripcode?

Yes, you can put it that way. The “password” is used for identification, not to protect user data, though more commonly the goal is to find vanity tripcodes. A 10-character tripcode is generated with UNIX DES crypt(3) (which runs 25 rounds of DES), with the key being the password, the plaintext being all zeros, and the salt computed from two characters of the password.
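
To make that concrete, here is a minimal host-side sketch of the classic scheme, assuming a plain-ASCII password; the real algorithm also handles Shift_JIS input and remaps salt characters that fall outside crypt(3)’s salt alphabet:

```c
/* Minimal sketch of classic tripcode generation (simplified: plain-ASCII
 * password, no salt-character remapping). On older glibc crypt() is
 * declared in <unistd.h> with _XOPEN_SOURCE; newer systems use <crypt.h>.
 * Link with -lcrypt. */
#define _XOPEN_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>   /* crypt() */

int main(void)
{
    const char *password = "elite";     /* the part after '#' (illustrative) */
    char padded[64], salt[3];

    /* The salt is taken from the 2nd and 3rd characters of the password,
     * padded so short passwords still yield two salt characters. */
    snprintf(padded, sizeof padded, "%sH..", password);
    salt[0] = padded[1];
    salt[1] = padded[2];
    salt[2] = '\0';

    /* crypt(3): 25 iterations of DES on an all-zero block, keyed by the
     * password, with the salt perturbing the expansion permutation. */
    const char *hash = crypt(password, salt);
    if (hash == NULL) {
        fprintf(stderr, "crypt() failed\n");
        return 1;
    }

    /* The tripcode is the last 10 characters of the 13-character result. */
    printf("!%s\n", hash + strlen(hash) - 10);
    return 0;
}
```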