WPA2-PSK implemented in CUDA

Hi there,

during the last few weeks I’ve implemented the PBKDF2 algorithm in CUDA and built a tool around it to pre-compute parts of the WPA/WPA2-PSK authentication phase. The CUDA code is written in C and called from Python.
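
The boundary between the two looks roughly like the following; a minimal sketch of a CPython extension function, where the names (cpyrit_calc, gpu_calc_pmks) are made up for illustration and are not the real module’s API:

    #include <Python.h>
    #include <stdlib.h>

    /* implemented in the CUDA part; computes n 32-byte PMKs (assumed name) */
    extern int gpu_calc_pmks(const char *essid, char **pws, int n,
                             unsigned char *pmks);

    static PyObject *cpyrit_calc(PyObject *self, PyObject *args)
    {
        char *essid;
        PyObject *pwlist;

        if (!PyArg_ParseTuple(args, "sO!", &essid, &PyList_Type, &pwlist))
            return NULL;

        int n = (int)PyList_Size(pwlist);
        char **pws = malloc(n * sizeof(char *));
        unsigned char *pmks = malloc(n * 32);

        /* borrow the passphrase buffers from the Python string objects */
        for (int i = 0; i < n; i++)
            pws[i] = PyString_AsString(PyList_GetItem(pwlist, i));

        gpu_calc_pmks(essid, pws, n, pmks);   /* hand the whole batch to the GPU */

        PyObject *result = PyList_New(n);
        for (int i = 0; i < n; i++)
            PyList_SetItem(result, i,
                           PyString_FromStringAndSize((char *)pmks + i * 32, 32));
        free(pws);
        free(pmks);
        return result;
    }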

On my MacBook Pro with a 2x2.5 GHz CPU and an 8800M GT I get ~1,500 PMKs per second; this equals around 12,000,000 rounds of SHA-1 per second (each PMK costs 2 blocks of 4,096 HMAC-SHA-1 iterations, i.e. 8,192 invocations).
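
For reference, WPA derives the PMK as PBKDF2(HMAC-SHA1, passphrase, ESSID, 4096, 256): two 160-bit blocks of 4,096 HMAC-SHA-1 iterations each. A minimal CPU-side sketch of one block, assuming a hypothetical hmac_sha1() helper (prototype only):

    #include <stdint.h>
    #include <string.h>

    /* hypothetical helper; stands in for a real HMAC-SHA1 implementation */
    void hmac_sha1(const uint8_t *key, size_t keylen,
                   const uint8_t *msg, size_t msglen, uint8_t out[20]);

    /* One 160-bit PBKDF2 block (RFC 2898); the 256-bit PMK takes blocks
       1 and 2, i.e. 2 x 4096 = 8192 HMAC-SHA-1 invocations per PMK. */
    static void pbkdf2_block(const char *pass, const char *essid,
                             uint32_t blockidx, uint8_t out[20])
    {
        uint8_t salt[36], u[20], t[20];  /* ESSID (max 32 bytes) + 4-byte index */
        size_t elen = strlen(essid);

        memcpy(salt, essid, elen);
        salt[elen + 0] = (uint8_t)(blockidx >> 24);
        salt[elen + 1] = (uint8_t)(blockidx >> 16);
        salt[elen + 2] = (uint8_t)(blockidx >> 8);
        salt[elen + 3] = (uint8_t)(blockidx);

        /* U1 = HMAC(pass, salt || INT(blockidx)); out = U1 ^ U2 ^ ... ^ U4096 */
        hmac_sha1((const uint8_t *)pass, strlen(pass), salt, elen + 4, u);
        memcpy(out, u, 20);
        for (int i = 1; i < 4096; i++) {
            hmac_sha1((const uint8_t *)pass, strlen(pass), u, 20, t);
            memcpy(u, t, 20);
            for (int j = 0; j < 20; j++)
                out[j] ^= u[j];
        }
    }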

Anyone willing to help with the code or to provide benchmarks for other platforms is very welcome.

See the project at this link

I can get you benchmarks on an 8800 GTS 512, 8800 GTX, Tesla C870, and GTX 280, assuming I can get it compiled. What commands need to be run for your standard benchmark, and what values should be reported? How can one select which GPU in the system to run on?

Thanks for your help. I’ve updated the SVN with a 'benchmark' command.

Check out the current revision from google-code, compile and install the module in the cpyrit directory, and then use pyrit_cli.py to get some results. The cpyrit module will always pick the first GPU in the system (probably the display device); picking a device is not implemented yet.
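
For what it’s worth, the runtime side of device selection will be small once it’s wired up; a sketch of what it could look like (pick_device() is hypothetical, not part of cpyrit):

    #include <cuda_runtime.h>

    /* Bind the calling host thread to one GPU. Must run before the first
       CUDA call that would otherwise create a context on device 0. */
    int pick_device(int wanted)
    {
        int count = 0;
        cudaGetDeviceCount(&count);
        if (wanted < 0 || wanted >= count)
            return -1;                     /* no such device */
        return cudaSetDevice(wanted) == cudaSuccess ? 0 : -1;
    }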

On my MacBook Pro with 8800M I get the following

CPU: Opteron 2218 / GPU: Tesla C870

Benchmarking cores 'Standard CPU', 'Nvidia CUDA'
Testing CPU-only core 'Standard CPU'...
10000 PMKs in 26.26 seconds: 380.86 PMKs/s
Result hash: ef747d123821851a9bd1d1e94ba048ac
Testing GPU core 'Nvidia CUDA'...
10000 PMKs in 1.84 seconds: 5425.56 PMKs/s
GPU performance: 5914.37 PMKs/s
CPU performance: 273.46 PMKs/s
Result hash: 3d5aebd3dbb89e1936316d7cfabaaeac

CPU: Intel quad Q9300 / GPU: GTX 280 (NVIDIA standard clocks)

Benchmarking cores 'Standard CPU', 'Nvidia CUDA'
Testing CPU-only core 'Standard CPU'...
10000 PMKs in 12.55 seconds: 796.53 PMKs/s
Result hash: ef747d123821851a9bd1d1e94ba048ac
Testing GPU core 'Nvidia CUDA'...
^C^C^CTerminated

As you can see, it went into an infinite loop or otherwise had issues on the GTX 280. My own application had similar issues until I tested and debugged it on that hardware, too.

Both of these machines run 64-bit Linux, so I had to add -Xcompiler "-fPIC" to the nvcc compile options.
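
In case anyone else hits the same linker errors: the relevant part of the compile line then ends up looking roughly like this (the .cu file name is a placeholder, not necessarily the project’s):

    nvcc -arch sm_11 -Xcompiler "-fPIC" -c cpyrit_cuda.cu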

Thanks a lot for giving me those results.

  • The Tesla C870 performs about 20x faster than a dual-core 2.5 GHz CPU on heavily instruction-bound tasks.
  • Your Tesla C870 performed almost 49,000,000 rounds of SHA-1 per second.
  • The GPU occupancy in your C870 numbers is quite low: while the GPU itself does 5,900 PMKs per second, the overall performance is only 5,400 PMKs per second. This is a problem in my code with fast GPUs; I’ll have to look into that (see the sketch after this list).
  • The result hash in your C870 example is wrong; it should be the same as in the CPU run. Also my problem.
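
On the occupancy point: the usual cause for such a gap between raw kernel throughput and end-to-end throughput is that the GPU sits idle while the host prepares or collects the next batch. A common fix is double-buffering across two streams so transfers overlap kernel execution. A rough sketch, with kernel name, launch shape, and buffer layout made up by me (true overlap also needs pinned host memory and a card whose deviceOverlap flag is set):

    #include <cuda_runtime.h>

    __global__ void pmk_kernel(const char *in, char *out, int nbytes); /* assumed */

    void process(const char *h_in, char *h_out, int nbatches, int batchbytes)
    {
        /* h_in/h_out should come from cudaMallocHost() so the async
           copies can actually overlap with kernel execution */
        cudaStream_t s[2];
        char *d_in[2], *d_out[2];
        for (int i = 0; i < 2; i++) {
            cudaStreamCreate(&s[i]);
            cudaMalloc((void **)&d_in[i], batchbytes);
            cudaMalloc((void **)&d_out[i], batchbytes);
        }
        for (int b = 0; b < nbatches; b++) {
            int i = b & 1;                  /* alternate buffers/streams */
            cudaMemcpyAsync(d_in[i], h_in + (size_t)b * batchbytes, batchbytes,
                            cudaMemcpyHostToDevice, s[i]);
            pmk_kernel<<<128, 64, 0, s[i]>>>(d_in[i], d_out[i], batchbytes);
            cudaMemcpyAsync(h_out + (size_t)b * batchbytes, d_out[i], batchbytes,
                            cudaMemcpyDeviceToHost, s[i]);
        }
        cudaThreadSynchronize();            /* CUDA 2.x name; drains both streams */
    }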

Can you give me some info on why your code broke on a GTX 280? While I don’t own one myself, I may be able to fix the problem from here…

If it breaks on a faster card, my first guess is always that you’ve got a race condition somewhere.

I doubt this is a traditional race condition, as the code around the kernel calls is single-threaded.

Then it’s a race condition somewhere in the kernel.

I just noticed that your make script compiles with the option -arch sm_11. The Tesla C870 is only compute capability 1.0, hence the different result. Do you not do any error checking? Usually, running an sm_11 app on a compute 1.0 device results in a CUDA error. After changing the make script to not build for sm_11, the benchmark now goes into an infinite loop on the C870 too.
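
That kind of mismatch is easy to catch up front by asking the runtime for the device’s compute capability; a small self-contained example:

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        int dev = 0;
        struct cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);   /* query device 0 */
        printf("Device %d: %s, compute capability %d.%d\n",
               dev, prop.name, prop.major, prop.minor);
        return 0;
    }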

I’ve updated the GPGPU core with some error checking. The tool should now at least tell you if there was a problem executing the kernel.
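
The checking boils down to the usual pattern, roughly this (a sketch, not the exact code now in svn):

    #include <stdio.h>
    #include <cuda_runtime.h>

    #define CUDA_CHECK(call)                                             \
        do {                                                             \
            cudaError_t err = (call);                                    \
            if (err != cudaSuccess) {                                    \
                fprintf(stderr, "CUDA error at %s:%d: %s\n",             \
                        __FILE__, __LINE__, cudaGetErrorString(err));    \
                return NULL; /* or propagate however the caller expects */ \
            }                                                            \
        } while (0)

    /* after a kernel launch:
         my_kernel<<<grid, block>>>(...);
         CUDA_CHECK(cudaGetLastError());        // launch-time errors
         CUDA_CHECK(cudaThreadSynchronize());   // execution-time errors */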

Could someone please give the current checkout another shot?

I get 403 forbidden when clicking the link in your first post.

Google is to blame, then. pyrit.googlecode.com should work.

Your error checking didn’t turn up anything interesting on my Tesla C870 system, so I added some printf-style debugging to see what is going on. The kernel call just doesn’t seem to finish, and the app waits forever in your cudaEventQuery loop.

The usual culprit for this is writing past the end of array memory. I re-compiled in device emulation mode and ran it for a while through valgrind, but no errors showed up in the kernel. I’ll let it run on that system overnight so it finishes and we’ll see.

I did discover one other disturbing bit of info with the device emulation build: it reports a different answer every time.

./pyrit_cli.py benchmark
Benchmarking cores 'Standard CPU', 'Nvidia CUDA'
Testing GPU core 'Nvidia CUDA'...
10000 PMKs in 28.33 seconds: 352.98 PMKs/s
GPU performance: 71.30 PMKs/s
CPU performance: 299.21 PMKs/s
Result hash: d3df945e80cc6a745c2c47ca174cb3c9 FAILED

[joaander@teslahoomd pyrit]$ ./pyrit_cli.py benchmark
Benchmarking cores 'Standard CPU', 'Nvidia CUDA'
Testing GPU core 'Nvidia CUDA'...
10000 PMKs in 28.24 seconds: 354.09 PMKs/s
GPU performance: 35.41 PMKs/s
CPU performance: 406.27 PMKs/s
Result hash: e8c4637cf23ec88f2e50239d29546770 FAILED

[joaander@teslahoomd pyrit]$ ./pyrit_cli.py benchmark
Benchmarking cores 'Standard CPU', 'Nvidia CUDA'
Testing GPU core 'Nvidia CUDA'...
10000 PMKs in 32.57 seconds: 307.04 PMKs/s
GPU performance: 61.42 PMKs/s
CPU performance: 355.46 PMKs/s
Result hash: 33d0bc12d9bd672f9d2fec1ab1867d77 FAILED

I’ve removed the shared memory usage from the kernel (it couldn’t hide memory latency anyhow). It should work on most platforms now.
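
As background on why shared memory is the prime suspect for results that differ from run to run: a missing barrier between the write phase and the read phase of a shared buffer makes the output depend on thread scheduling. A hypothetical illustration (not the removed Pyrit kernel):

    /* launched with 64 threads per block */
    __global__ void staged(const unsigned int *in, unsigned int *out)
    {
        __shared__ unsigned int buf[64];
        unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

        buf[threadIdx.x] = in[i];
        __syncthreads();   /* without this barrier, the read below may run
                              before the neighbour's write has landed */
        out[i] = buf[threadIdx.x ^ 1];   /* read the neighbouring slot */
    }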

It hangs in an infinite loop on my machine when running on the GPU…

Same here on my 8800 GTX workstation. (Once my GTX 280 gets done with a job later tomorrow, I can try that too.)

Thanks for testing, I really appreciate that.

It’s hard to tell what went wrong with only this small 8800M GT at hand :-\ However, the cause of the hang may indeed be a race condition between the kernel call and the sleep() loop.

Could you try commenting out the following line, so the code won’t sleep while waiting for the kernel, and then recompile and run the benchmark? Copying the results back to host memory causes an implicit synchronization anyway; that synchronization is based on polling the device, however, and burns CPU cycles (which is why the sleep loop is there in the first place):

while (cudaEventQuery(evt) == cudaErrorNotReady) { usleep(500); }
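
For context, that line is the polling half of the usual event idiom; the surrounding code looks roughly like this (a sketch; evt and the kernel launch come from the real code, the names around them do not):

    cudaEvent_t evt;
    cudaEventCreate(&evt);
    my_kernel<<<grid, block>>>(d_in, d_out);          /* asynchronous launch */
    cudaEventRecord(evt, 0);                          /* marks kernel completion */
    while (cudaEventQuery(evt) == cudaErrorNotReady)
        usleep(500);                                  /* yield the CPU while waiting */
    cudaEventDestroy(evt);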

Hey,

I’m running the following:

Ubuntu 8.04 64-bit
2.6.24-19-generic
CUDA 2.0
8800 GTS
NVIDIA-Linux-x86_64-177.67 (driver)

I am also experiencing the infinite loop when the GPU benchmark is run.


Available cores: 'Standard CPU', 'Nvidia CUDA'
Testing CPU-only core 'Standard CPU'...
10000 PMKs in 10.80 seconds: 926.29 PMKs/s
Result hash: ef747d123821851a9bd1d1e94ba048ac OK
Testing GPU core 'Nvidia CUDA'...
Terminated   <-- had to terminate it


I commented out this line:

while (cudaEventQuery(evt) == cudaErrorNotReady) { usleep(500); }

and now get the following:


Available cores: 'Standard CPU', 'Nvidia CUDA'
Testing CPU-only core 'Standard CPU'...
10000 PMKs in 11.06 seconds: 903.85 PMKs/s
Result hash: ef747d123821851a9bd1d1e94ba048ac OK
Testing GPU core 'Nvidia CUDA'...
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.5/threading.py", line 486, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.5/site-packages/cpyrit.py", line 48, in run
    res = self.func(self.workcontrol[3], pws)
SystemError: Kernel launch failed to complete.
10000 PMKs in 14.86 seconds: 672.87 PMKs/s
Traceback (most recent call last):
  File "./pyrit_cli.py", line 284, in <module>
    p.init(sys.argv)
  File "./pyrit_cli.py", line 118, in init
    self.benchmark()
  File "./pyrit_cli.py", line 275, in benchmark
    print "GPU performance: %.2f PMKs/s" % (core.gpu_perf[0] / core.gpu_perf[1])
ZeroDivisionError: integer division or modulo by zero


Hope it helps.

Interesting. I’ve absolutely no idea why the code fails on those cards… Which version of CUDA are you using?

Anyway, the kernel now spits out better error messages.

Running CUDA 2.0

Hopefully someone will have an idea why this is happening.