Help me understand why my first CUDA program is so slow?

I decided to get acquainted with CUDA because I have a GeForce GTX 1070 I want to take advantage of, so I ported a C program I had written to run on the GPU. The program cracks a password database generated by an intentionally vulnerable password manager as part of a CTF contest. The password manager serializes the bits of a text password into a 32-bit seed integer and then uses rand() to derive the bytes of the key, so I brute-force the 32-bit seed range until I find the seed that derives the correct key. On my CPU I can perform about 3 million attempted decryptions per second, but with the GPU code I've only managed about 5000 attempts per second.

The big bottleneck seems to be the way I have to generate random values. Because the whole thing depends on reproducing the same sequence of random values for each seed, I can't use cuRAND; instead I have to use a hand-rolled version of glibc's rand() function to get the correct values. That method seems to be what is slowing everything down, though, because if I run the program without generating random values this way it's much faster (though of course I never get the right key).

I tried asking for help in another CUDA community, but I'm not really sure the advice given was sound. First, someone suggested my main problem was that I was using only 1920 blocks with 1 thread per block, so I changed it to use multiple threads per block, but that version is actually slower. The other thing someone pointed out was that the random function was supposedly incompatible with CUDA because I need to maintain static state between successive calls, but I've done a lot of testing: it produces the correct random values and eventually finds the right seed for a given password, it just takes ages. Depending on the configuration of threads per block and number of blocks, it can try 20-300 seed values per second.

The first version uses only blocks, with one thread each. I've placed a lot of the memory in shared memory, and because of that it can try about 5000 seeds per second. The second version uses multiple threads per block, but there I can't use shared memory because each thread would overwrite the shared area and corrupt the decryption results, and it's much slower at only 200-300 seeds per second. It seems like the more total threads I run, the slower it performs.

I've tried running it through nvprof to see what's taking the most time, but it doesn't seem to have much resolution: it just says most of the time is spent in my kernel, which was obvious. So I'm hoping someone can take a look at the source and give me some hints about what is holding the performance back so much, and whether this can even be sped up to match the CPU version.

I was only using 1920 blocks with 1 thread per block

That is horrible. You are throwing away up to 97% of the performance. In the absence of any other guiding information, a rule of thumb: the thread count per block should be a multiple of 32 between 128 and 256 inclusive, with the grid comprising >= 100 such blocks, so on the order of 10,000 to 30,000 threads in total. The GTX 1070 has 1920 CUDA cores that you want to keep busy at all times.
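As a rough sketch of what that configuration looks like for this kind of brute-force search (the kernel and variable names here are hypothetical, not from the posted source), with one thread per candidate seed:

```cuda
// Hypothetical sketch: one thread per candidate seed, 256 threads per
// block (a multiple of 32), and thousands of blocks to keep all SMs busy.
__global__ void try_seeds(unsigned int base_seed, unsigned int *found_seed)
{
    unsigned int seed = base_seed + blockIdx.x * blockDim.x + threadIdx.x;
    // ... derive the key from `seed`, attempt decryption, and on success:
    // atomicCAS(found_seed, 0u, seed);
}

int main(void)
{
    unsigned int *d_found;
    cudaMalloc(&d_found, sizeof(unsigned int));
    cudaMemset(d_found, 0, sizeof(unsigned int));

    const int threads = 256;   // multiple of 32, in the 128-256 range
    const int blocks  = 4096;  // ~1M seeds per launch; loop over the full range
    for (unsigned long long base = 0; base < (1ULL << 32);
         base += (unsigned long long)blocks * threads) {
        try_seeds<<<blocks, threads>>>((unsigned int)base, d_found);
    }
    cudaDeviceSynchronize();
    cudaFree(d_found);
    return 0;
}
```

Note that each thread handles an independent seed, so there is no shared state to fight over between threads.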

The major challenge in parallel random number generation is to create uncorrelated streams of random numbers of sufficient length. A long-period serial PRNG can be re-used for that, however its output needs to be partitioned into non-overlapping subsets, either by block splitting or leap frogging. This requires the capability to advance the PRNG by n states efficiently. This is trivial for some PRNGs and quite complicated and costly for others.
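For a plain LCG, block splitting is cheap because the state can be advanced by n steps in O(log n) time by composing affine maps. A minimal host-side sketch, assuming the simple LCG form `x -> a*x + c mod 2^32` with glibc's TYPE_0 rand() constants (the hand-rolled generator in question may use glibc's additive-feedback variant instead, which needs different handling):

```c
#include <assert.h>
#include <stdint.h>

/* One step of the LCG: x -> a*x + c (mod 2^32). */
static uint32_t lcg_step(uint32_t x)
{
    return x * 1103515245u + 12345u;
}

/* Advance the LCG by n steps in O(log n) by composing affine maps.
   (A, C) accumulates the total transform x -> A*x + C; (cur_a, cur_c)
   holds the transform for the LCG applied 2^i times. */
static uint32_t lcg_skip(uint32_t x, uint64_t n)
{
    uint32_t A = 1u, C = 0u;
    uint32_t cur_a = 1103515245u, cur_c = 12345u;
    while (n) {
        if (n & 1) {
            A = cur_a * A;
            C = cur_a * C + cur_c;
        }
        cur_c = (cur_a + 1u) * cur_c;  /* square the current transform */
        cur_a = cur_a * cur_a;
        n >>= 1;
    }
    return A * x + C;
}
```

With this, each thread can jump its private copy of the state directly to the start of its own slice of the stream instead of stepping there serially.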

One value of CURAND is that it performs proper partitioning under the hood for the PRNGs it offers, so programmers don’t have to worry about doing it correctly. Among the generators supported by CURAND, Philox is usually the generator with the fastest state advancement, and typically the second fastest for actually generating a random number from the state information.
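For reference, a minimal sketch of the CURAND device API with Philox (the kernel and parameter names here are illustrative); each thread requests its own subsequence, so the per-thread streams are non-overlapping without any manual partitioning:

```cuda
#include <curand_kernel.h>

__global__ void generate(unsigned int *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    curandStatePhilox4_32_10_t state;
    // Arguments: seed, subsequence (one per thread), offset within it.
    curand_init(1234ULL, (unsigned long long)tid, 0ULL, &state);
    out[tid] = curand(&state);  // one 32-bit random value per thread
}
```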

If you want to use a PRNG other than those offered by CURAND, you will have to figure out how to do arbitrary state advancement yourself, in order to parallelize the PRNG (assuming it offers a sufficiently long stream of random numbers in the first place).

Yeah, I fixed that side of things and have been trying different combinations of threads and blocks, but strangely it seems slower with more threads.

But it seems that trying to use this random algorithm in parallel is fundamentally flawed? It's strange to me because it's not a very complicated algorithm; I would have thought the actual encryption/decryption algorithm would have been more of a performance hit.

One thing I tried that I thought was clever was launching so many threads that each one got a smaller and smaller range of seeds to try, but it didn't pan out that way.

You should try to increase the parallelism by using a number of threads per block that is a multiple of 32. The device properties can be queried with cudaGetDeviceProperties(), which fills in a cudaDeviceProp struct.
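For example, a minimal host-side query of device 0 (the fields shown are from the CUDA runtime's cudaDeviceProp struct):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0
    printf("name:                  %s\n", prop.name);
    printf("multiprocessors:       %d\n", prop.multiProcessorCount);
    printf("warp size:             %d\n", prop.warpSize);
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    return 0;
}
```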

An increased number of blocks ensures high sm_efficiency. Also try to leverage the memory hierarchy to alleviate memory bottlenecks.

Can you elaborate a little on what you mean by that? Do you mean the difference between the slower global and faster shared memory areas on the GPU?


Yes, shared memory enables reuse of input data that is common across all the threads in a block. IIRC, you mentioned that the coefficients for your model are common, so you could even store them in constant or texture memory (device read-only), which can allow more parallelism.
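A minimal sketch of constant memory holding a read-only coefficient table shared by all threads (the table name, size, and the Horner-style kernel are illustrative, not from the posted source):

```cuda
#include <cuda_runtime.h>

// Read-only coefficient table in constant memory; it is cached on chip
// and served efficiently when all threads in a warp read the same address.
__constant__ float coeffs[64];

__global__ void eval_poly(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = in[i], acc = 0.0f;
        // Every thread reads the same coeffs[k] at each step: a broadcast
        // from the constant cache, with no pressure on global memory.
        for (int k = 0; k < 64; ++k)
            acc = acc * x + coeffs[k];
        out[i] = acc;
    }
}

// Host side: copy the table up once before launching kernels, e.g.
// cudaMemcpyToSymbol(coeffs, host_coeffs, sizeof(coeffs));
```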