Bitslice-DES optimization

[url]http://pastie.org/private/0c3brtsgwzvdxvrirbynrq[/url] There is a macro for each SBOX, but the macros themselves, e.g. s1(), are not shown. Presumably they are pulled in from “sbox.h”. I assume they mostly do a bunch of XORs?
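
For readers unfamiliar with bitslicing, here is a purely illustrative sketch of what such an S-box function amounts to (hypothetical gates, not the contents of sbox.h): each unsigned int lane carries one bit position of 32 independent DES blocks, so every logic operation evaluates 32 keys at once, and a real S-box is a network of a few dozen such AND/OR/XOR/NOT gates.

```cuda
// Purely illustrative sketch (not the real sbox.h): in bitslice DES each
// S-box is a boolean network. Each unsigned int holds one bit position of
// 32 independent DES blocks, so each gate below processes 32 keys at once.
__device__ __forceinline__ void sbox_sketch(unsigned int a1, unsigned int a2,
                                            unsigned int a3, unsigned int a4,
                                            unsigned int a5, unsigned int a6,
                                            unsigned int *out1)
{
    unsigned int t1 = a1 ^ a4;
    unsigned int t2 = t1 | a2;
    unsigned int t3 = t2 & ~a3;
    unsigned int t4 = t3 ^ (a5 & a6);
    *out1 ^= t4;   // XOR the S-box output bit into the destination lane
}
```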

I do not have any additional ideas at the moment. It seems you have thought long and hard about the issue and already explored all the high-level algorithmic details. Now I am simply curious where the intense interest in DES comes from at this time, given that it became obsolete for practical encryption needs many years ago. Already when I worked on an FPGA implementation of DES back in the early 1990s, security-conscious types recommended triple-DES because the effective key length of plain DES was just too short.

Changing datas and keys to unsigned int doesn’t seem to improve performance, and doesn’t even seem to change the SASS at all. Nor should it, because the heavy lifting is all LOP3.LUT and LOP.XOR already (which is very sexy).
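
(As an aside, for anyone reading along: ptxas fuses chains of two-input logic ops into LOP3.LUT on its own, but the instruction can also be requested explicitly via PTX lop3.b32, available from CUDA 7.5 on sm_50+. A minimal sketch, where the immediate 0x96 is the 3-input truth table for a ^ b ^ c:)

```cuda
// Hedged illustration (not from the posted code): explicit LOP3 via inline
// PTX. ptxas normally forms LOP3.LUT from chains of two-input logic ops on
// its own; 0x96 is the truth table for the three-way XOR a ^ b ^ c.
__device__ __forceinline__ unsigned int xor3(unsigned int a,
                                             unsigned int b,
                                             unsigned int c)
{
    unsigned int d;
    asm("lop3.b32 %0, %1, %2, %3, 0x96;"
        : "=r"(d)
        : "r"(a), "r"(b), "r"(c));
    return d;
}
```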

And if full unrolling worked, we wouldn’t be having this conversation :D

I’m eager to see what Pascal has to offer other than 3D memory and faster FP16. Please let there be a larger instruction cache!

They are inline functions from here, which I posted in #10 (you might have missed it).

The interest comes from generating vanity tripcodes. For example,

#騨レNWKJ諤

maps to

!YYYYYYYYYY

You can test it here.


Well, as I said this was a “wave a rubber chicken over the monitor” kind of exercise. We wouldn’t expect any differences, but it is a quick thing to try on the off chance that the signedness is interfering with some compiler optimization. Usually it is the other way around: since unsigned arithmetic has prescribed wraparound behavior, using ‘int’ instead of ‘unsigned int’ is often the higher-performance approach.
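
A minimal illustration of that last point (an illustrative kernel, not the DES code):

```cuda
// Illustrative kernel: with 'int i' the compiler may assume the counter
// never wraps (signed overflow is undefined behavior), which helps it
// strength-reduce the 64-bit address computation for data[i]. With
// 'unsigned int i', the defined modulo-2^32 wraparound can block that.
__global__ void scale(float *data, int n)
{
    for (int i = 0; i < n; i++) {
        data[i] *= 2.0f;
    }
}
```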

One interesting parallel of this LUT-based approach on Maxwell to my FPGA work is that the FPGAs I used at the time, I think Xilinx XC4000, were based on 5-input LUT hardware building blocks. So the SBOXes were the least of my worries. As I recall the wiring done by the primitive automatic place and route tools of the time was killing performance, since I used a significant portion of the CLBs on the chip.

Tripcodes? I had never heard of those so I just learned something new, which is cool.

5-input LUT? That sounds like exactly what I need. Does it take a 32-bit unsigned int as the actual table?

I have another question. In the key swapping approach I posted in #10, you can see the strange "switch(i1)"s in my desperate attempt to make it a jump block. Whether I use switch or if, they seem to be mapped to a bunch of SELs or ICMPs. Is there a way to make the compiler generate an actual jump?

Well the details of these CLBs were more complex. I found what looks like a reasonable description and diagram here: [url]https://www.clear.rice.edu/elec522/w6/xapp043.pdf[/url] Quote: “The F-G-H combination can implement any function of five inputs.”

I haven’t paid attention to how FPGAs evolved later on. However, the trend even back then was to go away from CLBs towards simpler hardware primitives with finer granularity, since oftentimes CLBs were underutilized in real-life designs. How often do we really need arbitrary functions of five inputs? But for the DES SBOXes these complex logic blocks worked well.

Turn off generation of SELs and ICMPs? Assuming the branches are still there at the PTX level, you can try lowering PTXAS optimization to -Xptxas -O2 or even lower to inhibit if-conversion. But that likely has other negative performance consequences. I am familiar with the general problem, but as long as the CUDA toolchain has no way of expressing branch probabilities or profile-driven optimization, there really isn’t a solution, as the compiler tries to eliminate branches, not knowing that in some cases the branch may be taken so rarely that it would be best to leave it as a branch.
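
For completeness, lowering the PTXAS optimization level is just a build-line switch (file name and architecture below are placeholders):

```
nvcc -arch=sm_52 -Xptxas -O2 -o des_kernel des_kernel.cu
nvcc -arch=sm_52 -Xptxas -O1 -o des_kernel des_kernel.cu   # even more conservative
```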

I missed your second paragraph because you edited it.

It’s better to show you:
Using “this” code in #10:

Remove 64 lines of s functions (there are a total of 128 lines) and make the i loop run 50 times instead of 25 times:

Huge difference. Further removal doesn’t seem to improve performance much, because it reduces the interleaving of instructions at the start and the end of the loop. My hypothesis is that the instruction fetch issue when code size exceeds the instruction cache is always there, but for high-occupancy kernels it can be well hidden; since I have used 168 registers, the problem is exacerbated.

Note that the “if (i != 24)” data swapping has little performance impact; the results above were acquired with it removed. The program generates wrong results without it, but for performance analysis that’s irrelevant.

They are gone at the PTX level (which has selp.b32s). Maybe I’ll try comparing against a float instead of an int.

I compared against 0.5f instead of comparing against 1. The swaps are now correctly mapped to MOVs, but I’ve got 200 bytes of register spilling. Maybe inline PTX will fix it.

If you can enforce certain restrictions on the integers being compared, you can use ‘float’ comparison on the re-interpreted bit pattern instead of ‘int’ comparison just to speed things up in general. I actually used this technique in a few places in the CUDA math library. It is pretty hacky for sure, especially if the ‘int’ in question is actually the upper half of a double-precision floating-point number :-)

The operands must be of the same sign, and must avoid NaN and denormal encodings since the ‘float’ comparisons are not properly ordered for those bit patterns. Actually, denormals would be fine if you can ensure the code is never built with -ftz=true. So small positive integers in particular can be compared after re-interpretation with __int_as_float() as long as there is no flush-to-zero.

Note that there are floating-point select instructions as well, so this may not solve your issue with undesired if-conversions being applied by the compiler.
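
A minimal sketch of the re-interpretation trick under those restrictions (illustrative names, not code from this thread):

```cuda
// Minimal sketch: compare two ints using the float comparison path.
// Assumes a and b are non-negative and small enough that their bit
// patterns are not NaN encodings, and that the code is not built with
// -ftz=true (so denormal bit patterns still order correctly). For such
// values the float ordering of the bit patterns matches the int ordering.
__device__ __forceinline__ bool lt_as_float(int a, int b)
{
    return __int_as_float(a) < __int_as_float(b);
}
```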

Using inline PTX helped, and it even eliminated the register spilling from the original key swapping code (from ~70 bytes to 0). To sum it up: the compiler turned a conditional MOV block into branch-free SELs and ICMPs; I prevented that, trading some pipe busy stalls (the SELs and ICMPs) for execution dependency stalls (the SELs and ICMPs used to be interleaved with computation) and instruction fetch stalls (the conditional jumps). The performance and warp issue efficiency don’t seem to change though, but the GPU does use a little bit less power, which is nice.
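
For illustration, this is the kind of inline PTX I mean (a hedged sketch with illustrative names, not the actual key-swap code):

```cuda
// Illustrative sketch: express the conditional move directly in PTX so the
// compiler cannot if-convert it into an ICMP/SEL pair. The predicated mov
// writes dst only when cond is non-zero.
__device__ __forceinline__ unsigned int cond_mov(int cond, unsigned int dst,
                                                 unsigned int src)
{
    asm("{\n\t"
        ".reg .pred p;\n\t"
        "setp.ne.s32 p, %1, 0;\n\t"
        "@p mov.u32 %0, %2;\n\t"
        "}\n\t"
        : "+r"(dst)
        : "r"(cond), "r"(src));
    return dst;
}
```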

Regarding the instruction stall issue with this code mentioned in post #29, I did some testing with the all-new CUDA 7.5 RC Visual Profiler, trying to find on which lines the warps stall.

The stalling (the green color corresponds to instruction fetch stalls) seems to be spaced out evenly, at every 24 instructions:

With the second version in post #29 (after removing 64 lines of s functions), it seems to be stalling (much less) every 12 instructions:

And with further removal, the interval stays at 12.

Why is that, and could it be related to why the original code stalls so much?

Hard to say. Keep in mind that Maxwell inserts a steering instruction for every three instructions, so the 12 instructions are really 16 instruction slots. Each instruction is 8 bytes, so 16 instructions are 128 bytes, which may be equal to the length of a cache line. Just a hypothesis. scottgray has studied the Maxwell architecture in much detail; he may have a better idea.

Thanks. I suspected the numbers might be related to the different levels of cache size, but 12 and 24 didn’t strike me as “GPU numbers” that are powers of 2. Where can I read about the steering instruction in Maxwell? I skimmed through the GTX 980 whitepaper and couldn’t find mentions of it.

To my knowledge there is no public documentation other than what scottgray reverse engineered. However, a simple disassembly of a Maxwell binary will show that these steering instructions are there because there will be a break in the address sequence after every three instructions.

After posting #33, I’ve also done the same for the data swap right before the end of the i loop. Doing so nearly eliminated the pipe busy stalls. Warp issue efficiency is now 84% while arithmetic load stays at 86%. 70% of the stalls are due to instruction fetch, which I partly introduced by forcing the compiler to create a jump block, and which is partly due to the code size, over which I have no control with the current method. (If I further reduced the code length with the key swapping method, the key swap overhead would become non-trivial; that was the finding of my benchmarking.)

With the key swapping method, I am getting close to 3000G bitops per second (I weighted LOP3 as 1, but it should really count as 2 or more…) and roughly 900M tripcodes/s * 25 rounds = 22500M DES keys/s on my 980 Ti.

I think I can squeeze another 5% in there with what I mentioned in post #10, so that fewer instructions are needed inside the inner loop. It might reduce instruction fetch stalls as well.

EDIT: It’s not immediately obvious how to do this correctly without introducing additional instructions beyond the LOP3s, plus additional registers, since it interferes with the computation of all future rounds (of the 16 in DES).

Forgive the noob question. What is the 900M tripcodes/s number? Is the purpose to be able to compute a user’s password from a given tripcode?

Yes, you can put it that way. The “password” is used for identification, not to protect user data, though more commonly the goal is to find vanity tripcodes. A 10-character tripcode is generated with UNIX DES crypt(3) (which runs 25 rounds of DES), with the key being the password, the plaintext being all zeros, and the salt computed from two characters of the password.
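
To make that concrete, here is a minimal host-side sketch of the classic scheme, assuming a plain-ASCII password; the real algorithm also handles Shift_JIS input and remaps salt characters that fall outside crypt(3)’s salt alphabet:

```c
/* Minimal sketch of classic tripcode generation (simplified: plain-ASCII
 * password, no salt-character remapping). On older glibc crypt() is
 * declared in <unistd.h> with _XOPEN_SOURCE; newer systems use <crypt.h>.
 * Link with -lcrypt. */
#define _XOPEN_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>   /* crypt() */

int main(void)
{
    const char *password = "elite";     /* the part after '#' (illustrative) */
    char padded[64], salt[3];

    /* The salt is taken from the 2nd and 3rd characters of the password,
     * padded so short passwords still yield two salt characters. */
    snprintf(padded, sizeof padded, "%sH..", password);
    salt[0] = padded[1];
    salt[1] = padded[2];
    salt[2] = '\0';

    /* crypt(3): 25 iterations of DES on an all-zero block, keyed by the
     * password, with the salt perturbing the expansion permutation. */
    const char *hash = crypt(password, salt);
    if (hash == NULL) {
        fprintf(stderr, "crypt() failed\n");
        return 1;
    }

    /* The tripcode is the last 10 characters of the 13-character result. */
    printf("!%s\n", hash + strlen(hash) - 10);
    return 0;
}
```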