I decided to get acquainted with CUDA because I have a GeForce GTX 1070 I want to take advantage of, so I ported a C program I had written to run on the GPU. The program cracks a password database generated by an intentionally-vulnerable password manager as part of a CTF contest. The attack works because the password manager serializes the bits of a text password into a 32-bit seed integer and then uses rand() to derive the bytes of the key, so I brute-force the 32-bit range of seeds until I find the one that derives the correct key. On my CPU I can perform about 3 million attempted decryptions per second, but with the GPU code I've only managed about 5,000 attempts per second.
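To give an idea of the overall structure, the CPU version boils down to something like this (a simplified sketch; derive_key() and try_decrypt() are placeholders for my real routines, and the 32-byte key length is just illustrative):

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Placeholders for my real routines: derive_key() does srand(seed) and
 * fills the key from successive rand() calls; try_decrypt() tests the
 * candidate key against the password database. */
void derive_key(uint32_t seed, uint8_t *key, size_t key_len);
bool try_decrypt(const uint8_t *key, size_t key_len);

/* Brute-force every possible 32-bit seed until one of them derives a key
 * that successfully decrypts the database. */
bool crack(uint8_t *found_key, size_t key_len)
{
    for (uint64_t s = 0; s <= 0xFFFFFFFFULL; s++) {
        uint8_t key[32];
        derive_key((uint32_t)s, key, key_len);
        if (try_decrypt(key, key_len)) {
            memcpy(found_key, key, key_len);
            return true;
        }
    }
    return false;
}
```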
The big bottleneck seems to be the way I have to generate the random values. Because the attack depends on reproducing the exact values that rand() produces for each seed, I can't use cuRAND; instead I have to use a custom-rolled version of glibc's rand() to get the correct sequence. That reimplementation seems to be what is slowing everything down, though: if I run the program without generating random values this way, it's much faster (though of course it never finds the right key).
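For reference, here is a minimal sketch of what I mean by a custom-rolled glibc rand() on the device. It follows the usual flattened description of glibc's default TYPE_3 generator (seed the first 31 words with a Park-Miller LCG, then additive feedback, with outputs starting at index 344 of the flattened sequence); my actual code differs in some details, and NEED / glibc_rand_fill are placeholder names:

```cuda
#include <stdint.h>

#define NEED 32   /* how many rand() outputs the key derivation consumes */

/* Reproduce the first `need` values glibc's rand() would return after
 * srand(seed), using the flattened form of the default TYPE_3 generator.
 * State lives in a local array, so every thread can work independently.
 * Assumes need <= NEED; note glibc treats a seed of 0 as 1. */
__device__ void glibc_rand_fill(uint32_t seed, uint32_t *out, int need)
{
    uint32_t r[344 + NEED];
    r[0] = (seed == 0) ? 1u : seed;
    for (int i = 1; i < 31; i++) {
        /* Park-Miller step: 16807 * r[i-1] mod 2^31-1, kept non-negative */
        int64_t v = (16807LL * (int32_t)r[i - 1]) % 2147483647;
        if (v < 0) v += 2147483647;
        r[i] = (uint32_t)v;
    }
    for (int i = 31; i < 34; i++)
        r[i] = r[i - 31];
    for (int i = 34; i < 344 + need; i++)
        r[i] = r[i - 31] + r[i - 3];   /* additive feedback, wraps mod 2^32 */
    for (int i = 0; i < need; i++)
        out[i] = r[i + 344] >> 1;      /* the i-th value rand() would return */
}
```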
I tried asking for help in another CUDA community, but I'm not sure the advice I got was sound. First, someone suggested my main problem was that I was launching 1920 blocks with only 1 thread per block, so I changed it to use multiple threads per block, but that version is actually slower. Someone else claimed the random function was incompatible with CUDA because it needs to maintain a static iterator between successive calls, but I've done a lot of testing: it produces the correct random values and will eventually find the right seed for a given password, it just takes ages. Depending on the threads-per-block and block-count configuration, that version manages 20–300 seeds per second.
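To be clear about the static-iterator point: in my kernels the generator state is a per-thread local buffer rather than a static, and each thread derives its own seed from its global index, roughly like this (a simplified sketch using the glibc_rand_fill placeholder from above; crack_kernel and try_decrypt_device are also placeholder names, and deriving each key byte from the low byte of rand() is just an assumption for the sketch):

```cuda
__device__ bool try_decrypt_device(const uint8_t *key, int key_len);  /* placeholder */

/* Each thread tests one seed per launch: seed_base is advanced by the
 * host between launches, and *found_seed is written if a key works. */
__global__ void crack_kernel(uint32_t seed_base, uint32_t *found_seed)
{
    uint32_t tid  = blockIdx.x * blockDim.x + threadIdx.x;
    uint32_t seed = seed_base + tid;

    uint32_t draws[NEED];               /* per-thread, not static */
    glibc_rand_fill(seed, draws, NEED); /* same sequence rand() would give */

    uint8_t key[32];
    for (int i = 0; i < 32; i++)
        key[i] = (uint8_t)(draws[i] & 0xFF);  /* assumed key derivation */

    if (try_decrypt_device(key, 32))
        *found_seed = seed;
}
```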
This is the version that uses only blocks. I've placed most of the working memory in shared memory, and thanks to that it manages about 5,000 seeds per second. This is the version that uses multiple threads per block; there I can't use shared memory, because every thread would write to the same area and corrupt each other's decryption results, and it's much slower at only 200–300 seeds per second. In general, the more total threads I launch, the slower it runs.
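In the blocks-only version the shared memory usage is essentially this (one thread per block, so the whole allocation belongs to that thread; again just a sketch with the same placeholder names):

```cuda
/* Blocks-only variant: gridDim.x blocks with 1 thread each, so the
 * __shared__ buffers are effectively private to that single thread. */
__global__ void crack_kernel_blocks_only(uint32_t seed_base, uint32_t *found_seed)
{
    __shared__ uint32_t draws[NEED];
    __shared__ uint8_t  key[32];

    uint32_t seed = seed_base + blockIdx.x;   /* one seed per block */
    glibc_rand_fill(seed, draws, NEED);

    for (int i = 0; i < 32; i++)
        key[i] = (uint8_t)(draws[i] & 0xFF);  /* same assumed key derivation */

    if (try_decrypt_device(key, 32))
        *found_seed = seed;
}
```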
I've tried running it through nvprof to see what's taking the most time, but it doesn't give me much resolution: it just says most of the time is spent in my kernel, which was already obvious. So I'm hoping someone can take a look at the source and give me some hints about what is holding the performance back so badly, and whether this can even be sped up to match the CPU version.