During the last few weeks I’ve implemented the PBKDF2 algorithm in CUDA as the core of a tool that pre-computes parts of the WPA/WPA2-PSK authentication phase. The CUDA code is written in C and called from Python.
On my MacBook Pro with a 2x2.5 GHz CPU and an 8800M GT I get ~1,500 rounds per second, which equals around 12,000,000 rounds of SHA-1 per second.
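For reference, WPA/WPA2-PSK derives the master key as PBKDF2(HMAC-SHA1, passphrase, SSID, 4096, 32). Assuming a “round” above means one PBKDF2 output block of 4,096 HMAC-SHA1 iterations (roughly two SHA-1 compressions each once the padded key states are pre-computed), 1,500 rounds per second comes out at about 12 million SHA-1 compressions per second. Here is a minimal sketch of the loop the kernel parallelizes - hmac_sha1() is a hypothetical helper, not the module’s actual API:

    #include <stdint.h>
    #include <string.h>

    /* hypothetical helper: HMAC-SHA1(key, msg) -> 20-byte digest */
    void hmac_sha1(const uint8_t *key, size_t klen,
                   const uint8_t *msg, size_t mlen, uint8_t out[20]);

    /* One PBKDF2-HMAC-SHA1 output block (RFC 2898); the CUDA kernel runs this
       loop for many passphrases in parallel. SSIDs are at most 32 bytes. */
    static void pbkdf2_sha1_block(const uint8_t *passphrase, size_t plen,
                                  const uint8_t *ssid, size_t slen,
                                  uint32_t blockidx, uint8_t out[20])
    {
        uint8_t msg[36], u[20];
        memcpy(msg, ssid, slen);                    /* salt = SSID for WPA */
        msg[slen]     = (uint8_t)(blockidx >> 24);  /* big-endian block index */
        msg[slen + 1] = (uint8_t)(blockidx >> 16);
        msg[slen + 2] = (uint8_t)(blockidx >> 8);
        msg[slen + 3] = (uint8_t)(blockidx);
        hmac_sha1(passphrase, plen, msg, slen + 4, u);   /* U_1 */
        memcpy(out, u, 20);
        for (int i = 1; i < 4096; i++) {                 /* U_2 .. U_4096 */
            hmac_sha1(passphrase, plen, u, 20, u);
            for (int j = 0; j < 20; j++)
                out[j] ^= u[j];                          /* T = U_1 ^ ... ^ U_4096 */
        }
    }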
Anyone willing to help with the code or to provide benchmarks for other platforms is very welcome.
I can get you benchmarks on 8800 GTS 512, 8800 GTX, Tesla C870, and GTX 280. I’m assuming I can get it compiled: what commands need to be run for your standard benchmark, and what values should be reported? How can one select which GPU in the system to run on?
Thanks for your help. I’ve updated the svn with a ‘benchmark’ command.
Check out the current revision from google-code, compile and install the module in the cpyrit directory, and then use pyrit_cli.py to get some results. The cpyrit module will always pick the first GPU in the system (probably the display device); picking a specific device is not implemented yet.
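Device selection would probably look something like the sketch below - just an illustration of the CUDA runtime calls involved, not code from the repository, and the device index is whatever the user would ask for:

    #include <stdio.h>
    #include <cuda_runtime.h>

    /* Hypothetical device selection, not in the module yet: list the GPUs and
       bind the calling host thread to one of them before any other CUDA work. */
    static int pick_gpu(int wanted)
    {
        int count = 0;
        if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0)
            return -1;
        for (int i = 0; i < count; i++) {
            struct cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            printf("Device %d: %s (compute %d.%d)\n", i, prop.name, prop.major, prop.minor);
        }
        if (wanted < 0 || wanted >= count)
            wanted = 0;                 /* fall back to the first (display) device */
        cudaSetDevice(wanted);
        return wanted;
    }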
As you can see, it went into an infinite loop or otherwise had issues on the GTX 280. My own application had similar issues until I tested and debugged it on that hardware, too.
Both of these machines run 64-bit Linux, so I had to add -Xcompiler “-fPIC” to the nvcc compile options.
The Tesla C870 performs about 20x faster than a dual-core 2.5 GHz CPU on heavily instruction-bound tasks.
Your Tesla C870 performed almost 49,000,000 rounds of SHA-1 per second.
The GPU utilization in your C870 numbers is quite low: while the kernel itself does 5,900 rounds per second, the overall performance is only 5,400 rounds per second. This is a problem in my code with fast GPUs; I’ll have to look into that.
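In other words, 5,400 / 5,900 ≈ 0.92, so roughly 8% of the wall-clock time on the C870 is spent outside the kernel (host-side setup, transfers, the polling loop) instead of hashing.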
The result in your C870 example is wrong; the resulting hash should be the same as in the CPU run. That’s also my problem.
Can you give me some info on why your code broke on a GTX 280? While I don’t own one myself, I may be able to fix the problem from here…
I just noticed that your make script compiled with the option -arch sm_11. The Tesla C870 is only compute capability 1.0, hence the different result. Do you not do any error checking? Usually running an sm_11 app on a compute 1.0 device results in a CUDA error. After changing the make script not to build for sm_11, the benchmark now goes into an infinite loop on the C870 too.
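For reference, the kind of check I mean is only a few lines per launch; this is a generic sketch with a placeholder kernel, not your actual code:

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    __global__ void dummy_kernel(int *out) { out[threadIdx.x] = threadIdx.x; }  /* placeholder */

    /* Generic post-launch error check: a binary built only for sm_11 and run on
       a compute 1.0 device should fail here instead of silently producing garbage. */
    static void checked_launch(int *d_out)
    {
        dummy_kernel<<<1, 64>>>(d_out);
        cudaError_t err = cudaGetLastError();          /* launch-time errors    */
        if (err == cudaSuccess)
            err = cudaThreadSynchronize();             /* execution-time errors */
        if (err != cudaSuccess) {
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
            exit(1);
        }
    }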
Your error checking didn’t turn up anything interesting on my Tesla C870 system. I added some printf-style debugging to see what is going on. The kernel call just doesn’t seem to finish and the app waits forever in your cudaEventQuery loop.
The usual culprit for this is writing past the end of array memory. I re-compiled in device emulation mode and ran it for a while through valgrind, but no errors showed up in the kernel. I’ll let it run on that system overnight so it finishes and we’ll see.
I did discover one other disturbing bit of info with the device emulation build: it reports a different answer every time.
It’s hard to tell what went wrong with only this small 8800M GT at hand :-\ However, the cause of the hang may indeed be a race condition between the kernel call and the sleep() loop.
Could you try commenting out the following line so the code won’t sleep while waiting for the kernel? Copying the results back to host memory causes an implicit synchronization - however, that synchronization is based on polling the device and burns CPU cycles.
So please comment out this line, recompile, and run the benchmark again:
while (cudaEventQuery(evt) == cudaErrorNotReady) { usleep(500); }
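The flow I have in mind then looks roughly like the sketch below (kernel and buffer names are placeholders, not the module’s actual identifiers): in the default stream the blocking cudaMemcpy only returns once the kernel has finished, so the copy itself becomes the synchronization point.

    #include <cuda_runtime.h>

    __global__ void sha1_rounds_kernel(unsigned int *out);   /* placeholder for the real kernel */

    /* Sketch: launch, then let the device-to-host copy do the waiting. */
    static void run_batch(unsigned int *h_out, unsigned int *d_out,
                          size_t bytes, int blocks, int threads)
    {
        sha1_rounds_kernel<<<blocks, threads>>>(d_out);
        /* removed: while (cudaEventQuery(evt) == cudaErrorNotReady) { usleep(500); } */
        cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    }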