Massive "simple" computation with CUDA

Hi all,

I want to test GPUs & CUDA for massive computation (2^40 operations and more), but only with simple instructions: some XORs, some shifts, etc. (all on unsigned char).

So far, a computation on a middle-class CPU is still faster than my computations with CUDA (on Tesla and GeForce GTX 295), and I want to reverse that.

I’ve read many things about what to do (or not) with CUDA (for example in the Best Practices Guide or in this forum), and here are the questions I still have:

  • I still don’t really know whether there’s an issue with computing on “unsigned char” variables. I’d gladly use floats, but for all this simple stuff (XOR, AND, etc.) I really don’t know if that’s a good idea. Also, is it correct that a char variable will automatically be promoted to an integer? If so, what’s the right way to perform that kind of computation?

  • My test program uses a 256-byte array that I placed in constant memory. I have 20 threads executing the same simple (but computationally expensive) program, all accessing that array. Is this the right thing to do? I’m not sure I understand the concept of the “constant cache”.

  • Should I use atomic operations?

I’ve noted the possible issues with host/device transfers, so in my example there’s no input and just a big array as output.
I’ve also noticed the difference simple things can make, for example between (var % 256) and (var & 255).
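
To make the questions above more concrete, here is a toy sketch of the shape of kernel I’m talking about (this is not my real code; the table “sbox”, the output buffer “out” and the arithmetic are just placeholders): a 256-byte table in constant memory, each thread running a long loop of XORs, shifts and table lookups, and writing one byte of output.

// Toy sketch only: a 256-byte lookup table in constant memory,
// each thread does XOR/shift/table-lookup work and writes one byte.
__constant__ unsigned char sbox[256];          // hypothetical 256-byte table

__global__ void bitwiseKernel(unsigned char *out, unsigned int iterations)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int v = tid;                      // per-thread state kept in a 32-bit register

    for (unsigned int i = 0; i < iterations; ++i) {
        v ^= sbox[v & 255];                    // (v & 255) rather than (v % 256)
        v = (v << 3) ^ (v >> 5);               // some shifts and XORs
    }
    out[tid] = (unsigned char)v;               // one output byte per thread
}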

There are probably other things I haven’t understood yet, but if you could give me some answers about what I wrote here (mainly about bitwise computation), that would be very cool :)
Unfortunately, I can’t post any code. I work on secret-key cryptography, and more precisely on the use of GPUs for cryptanalysis, which is why for now I just need very fast massive computation (brute-force style).

Thanks in advance for your answers.

Only 20 threads? That isn’t anywhere near enough - that won’t even saturate a single warp. Try something more on the order of 2,000 or even 20,000 threads.
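
As a rough sketch, assuming a kernel like the toy one above and a device buffer “d_out” allocated beforehand (the numbers are arbitrary), a launch with 16,384 threads would look like this:

// Rough sketch: 128 blocks x 128 threads = 16,384 threads in flight,
// instead of a single block of 20 threads.
dim3 block(128);
dim3 grid(128);
bitwiseKernel<<<grid, block>>>(d_out, iterations);
cudaThreadSynchronize();    // wait for the kernel to finish (CUDA 2.x-era API)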

If I understand correctly, you’re saying that more threads will change something in my example?
I’d just be executing the same program more times in parallel…

My point is to understand why there’s such a big difference between execution on a CPU and on a GPU for a brute-force program working on unsigned char.

Since my first post I’ve read more papers and guides about CUDA, and I understand the points about memory, bandwidth, etc., but the difference remains really big.
Of course the main idea would be to parallelize my program to better fit the GPU, but for the moment I just want to understand why I get such a big difference. As I understand it, these are just simple bitwise operations, with no memory exchanges.

Maybe I’m missing something ?

Thanks !

Sorry, I didn’t get that. Difference between what and what? And is it a good or bad difference? ;)

If you are talking about the difference between (var % 256) and (var & 255), I don’t really know.

Modulo operations are really slow compared to an “and” operation; however, since 256 is a literal, the compiler should be able to optimise it into an “and” operation anyway…

Unless you have at least 192 active threads per MP, your kernel won’t be able to optimally amortize instruction scheduling/pipeline overheads, which will cause stalls and reduce throughput.

I meant the difference between my program’s execution on the CPU and on the GPU, for the same program.

Ok.

Actually, I thought that if I used my GPU like a CPU (for example, executing a program on one block, with no optimization, and the same program on the CPU), I could get two results on the same scale in terms of execution time between CPU and GPU.

Experience shows that’s apparently wrong, and moreover the gap is big enough to catch my interest :)

So my first idea was that the use of unsigned char could significantly affect execution compared with floats (there are some lines about automatic conversion of char/short into integers in the programming guide).

I just have no example (in the SDK) of a program working on “small” variables like char, something quite common in cryptanalysis…

If you port your code from CPU to GPU without any parallel parts in it, you’ll get horrible performance. The GPU is at its best when you execute many, many threads in parallel with well-designed memory access.

Transferring a large amount of data from the GPU to the CPU will have a big impact on performance as well.

So before looking at “should I use unsigned chars or floats”, you should look at how to go from 20 threads to millions of threads.

Of course millions of threads would be best for the GPU, and I’d like to always use it that way.

But unfortunately, there are some really sequential pieces of code that can’t easily be recoded to fit the GPU architecture…

Easy tasks are boring :)

Hope you will invent some good parallel code for your task and make a paper, a patent or something out of it :)

Or maybe you could share the problem with us, or is it secret for some reason?

When a problem cannot be split up into independent parts (and such problems do exist), then GPGPU is probably not the answer. However, a large set of problems that look like they cannot be split up actually can be when you try a completely different approach.

That’s why I would recommend not recoding the sequential code, but rethinking and recoding the approach.

For now, unfortunately, the problem can’t be shared, for secrecy reasons :)

But it does make sense to rethink the whole approach for the GPU.
I’m just ‘testing’ what can be done before coding a really tricky problem.

Of course I have other ideas for really nice (but complex) cryptanalyses that could fit GPUs, but the bitwise ones are so easy to understand that it’s frustrating to see how tough they are to code on GPUs…

Bitwise operations are not a problem. They are as fast/slow as additions, and making them work on 32-bit or 8-bit values makes no difference (as long as you can live with some bits in the registers idling ;) ).
Shared memory is usually used with 32-bit values too, but using 8-bit data types shouldn’t be that problematic either.
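
If the idle bits bother you, one common trick is to pack four bytes into one 32-bit word and run the XORs on the whole word. A rough sketch, assuming your data can be packed that way (XOR and AND work per byte for free; per-byte shifts or rotates would need extra masking):

// Rough sketch: four unsigned chars packed into one unsigned int per thread,
// so one XOR instruction processes four bytes at once.
__global__ void xorPacked(const unsigned int *in, unsigned int *out,
                          unsigned int key, unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        unsigned int w = in[i];    // four packed bytes
        w ^= key;                  // XORs each byte against the matching key byte
        out[i] = w;
    }
}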

The biggest issues are:

  • the amount of data to transfer between GPU and CPU
  • the number of threads running in parallel. The GPU clock is slower than the CPU clock, so a given single thread will run slower on the GPU. It is the parallelism that counts.

That’s exactly my point !

In my example there’s no input data to transfer and just an array (not so big…) as output.

About the difference between the CPU and GPU: that’s where my questioning began, because the Tesla board specification says the processor clock is 1296 MHz. So for my 20 threads working on bitwise operations, I hoped for an execution time only something like 3x slower than the same execution on a “mid-range” CPU…

Practical experience shows things aren’t that simple…

So what is your practical experience?

You only said “So far, a computation on a middle-class CPU is still faster than my computations with CUDA (on Tesla and GeForce GTX 295), and I want to reverse that” - but by how much? Is the GPU 3x slower, or more?

Anyway, as long as you don’t post any code at all (it could be some simplified case or a simple component not necessarily covered by the secrecy of your work), we can’t help you much, you know…

The processors on the GPU are relatively simple, in-order cores with deep pipelines. (I’m not even sure there is branch prediction.) By simplifying the processor (and sharing the fetch and instruction decode units between groups of 8 processors), NVIDIA was able to fit 240 of them on one die.

I would suggest you not think about the problem in terms of threads at first. The GPU is not optimized for “task parallelism” like a multicore CPU. It is optimized for “data parallelism” instead. If you can model the problem as transformations applied to a vector of data elements (with perhaps some exchange of partial results between elements) and if the number of data elements is large enough (1000 is a lower bound, vastly more is better), then the GPU might be a good fit.
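
A rough sketch of that pattern (“transform” is just a placeholder for whatever per-element work you actually do):

// Rough sketch of the data-parallel pattern: one thread per data element,
// with a grid-stride loop so any element count works with any launch size.
__device__ unsigned char transform(unsigned char x)
{
    return (unsigned char)((x ^ 0x5A) << 1);   // placeholder per-element work
}

__global__ void mapKernel(const unsigned char *in, unsigned char *out,
                          unsigned int n)
{
    unsigned int stride = gridDim.x * blockDim.x;
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += stride)
        out[i] = transform(in[i]);
}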

The CUDA documentation uses terms like “processors” and “threads,” but I have found that this leads people to apply their intuition from CPU multithreading to CUDA, which is usually misleading. The GPU is a throughput machine, and very nonlinear until you actively engage all of its processing elements.