I want to test GPUs and CUDA for massive computation (2^40 and more), but only with simple instructions: some XORs, some shifts, etc. (all on unsigned char).
So far, the computation on a mid-range CPU is still faster than my CUDA versions (on a Tesla and a GeForce GTX 295), and I want to reverse that.
I've read many things about what to do (or not to do) with CUDA (for example in the Best Practices Guide or on this forum), and here are the points I'm unsure about:
I still don't really know whether there's an issue with computing on "unsigned char" variables. I'd gladly use floats, but for such simple operations (XOR, AND, etc.) I really don't know if that's a good idea. Also, is it correct that a char variable is automatically promoted to int? If so, what's the right way to perform this kind of computation?
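To make the question concrete, here's a toy sketch of what I mean (not my real code; the kernel name and key layout are made up). Since the ALUs work on 32-bit registers anyway, and char operands get promoted to int, I figured packing four unsigned chars into one unsigned int would let a single XOR process four bytes at once:

```
// Toy sketch: XOR a buffer against a key byte, four bytes per thread,
// by treating the data as 32-bit words.
__global__ void xor_words(const unsigned int *in, unsigned int *out,
                          unsigned int key_word, int n_words)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_words)
        out[i] = in[i] ^ key_word;  // key_word = the key byte replicated 4 times
}

// Host side: replicate the key byte into a word, e.g.
//   unsigned int key_word = 0x01010101u * key_byte;
```

Is this word-packing approach the recommended one, or is there a better idiom?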
My test program uses a 256-byte array that I placed in constant memory. I have 20 threads executing the same simple (but computationally expensive) program, all accessing that array. Is this the right thing to do? I'm not sure I understand the concept of the "constant cache".
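For reference, this is roughly my setup (names changed, and this is not my actual table): a 256-entry S-box-style lookup in constant memory, filled once with cudaMemcpyToSymbol. My understanding is that the constant cache broadcasts a read when all threads of a half-warp hit the same address, and serializes it when the addresses differ; please correct me if that's wrong.

```
// Roughly my setup (names changed): a 256-entry lookup table in
// constant memory, read by every thread.
__constant__ unsigned char c_table[256];

__global__ void lookup(unsigned char *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = c_table[data[i]];  // data-dependent index: reads may serialize
}

// Host side, once before launching:
//   unsigned char h_table[256] = { /* ... */ };
//   cudaMemcpyToSymbol(c_table, h_table, sizeof(h_table));
```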
Should I use atomic operations?
I've noticed the possible issues with host/device transfers, so in my example there's no input, just one big output array (see the sketch below).
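Schematically, the host code looks like this (a simplified sketch, not the real thing; bruteforce_kernel and the sizes are placeholders):

```
#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void bruteforce_kernel(unsigned char *out)  // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = (unsigned char)(i ^ 0x5A);  // stand-in for the real computation
}

int main(void)
{
    const int blocks = 256, threads = 256;
    const size_t bytes = blocks * threads;          // one byte per thread
    unsigned char *h_out = (unsigned char *)malloc(bytes);
    unsigned char *d_out;

    cudaMalloc(&d_out, bytes);
    bruteforce_kernel<<<blocks, threads>>>(d_out);  // no input transfer at all
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);  // single copy out

    cudaFree(d_out);
    free(h_out);
    return 0;
}
```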
I've also noticed the difference simple things can make, for example between (var % 256) and (var & 255).
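For the record, the two forms give the same result for unsigned operands, but the AND is a single instruction while the modulo can expand into a longer division sequence (I believe the compiler often rewrites it itself when the operand is unsigned and the divisor is a power of two, but writing the AND explicitly seems the safe habit):

```
unsigned int var = 1000;
unsigned int a = var % 256;  // integer modulo: may expand to a division sequence
unsigned int b = var & 255;  // single bitwise AND; a == b == 232 here
```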
There are probably other things I haven't grasped yet, but if you could give me some answers about what I wrote here (mainly about bitwise computation), that would be very cool :)
Unfortunately, I can't post my actual code, but I work on secret-key cryptography, more precisely on the use of GPUs for cryptanalysis; that's why, for now, I just need very fast massive computation (brute-force style).
Thanks in advance for your answers.