CUDA SUCKS!!! Why <block, thread> cannot be judged by itself

What I meant with 32-bit/parallel is this: before Intel/AMD processors had SSE they could already process 32 bits in parallel; examples are AND, OR, XOR and so forth.
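A minimal illustration of that point (plain C, nothing CUDA-specific, just my own example values): one 32-bit AND or XOR instruction combines all 32 bit positions in a single operation.

```
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t a = 0xF0F0F0F0u;
    uint32_t b = 0x0FF00FF0u;

    /* One instruction each, yet every one of the 32 bit positions
       is combined "in parallel" with the corresponding bit of b. */
    uint32_t and_result = a & b;
    uint32_t xor_result = a ^ b;

    printf("AND: 0x%08X\n", and_result);  /* 0x00F000F0 */
    printf("XOR: 0x%08X\n", xor_result);  /* 0xFF00FF00 */
    return 0;
}
```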

So in a sense Intel/AMD/x86 processors are also “8/16/32/64-bit parallel processors”, but my point is that basically nobody calls them “parallel processors” just because they process bits in parallel.

The same might or might not be said of SSE instructions… it depends on how “complete” such an instruction set is. Concerning SSE: if each word can be individually acted upon in its own logical flow, it becomes more interesting.

Back to the 8/16/32/64-bit example… as far as I know it’s not possible to make the instruction set behave in such a way that each bit can have its own branching code and such. So in a way these bits belong together, and that basically makes it less parallel.

Now how does this translate back to CUDA? All 32 threads of a warp must execute the same instruction… pretty much the same way as the 8/16/32/64 bits or the SSE lanes, if you will… so it’s basically the same thing, though slightly different.

If only 1 thread of each warp were used, I am pretty sure performance would suffer greatly, even though the amount of work would be the same: instead of 32 threads per block there would be 1 thread per block, and 32 blocks instead of 1 block, and so forth. So by increasing blocks and decreasing threads the workload can remain the same, yet I’m pretty sure performance would suffer. That basically indicates that these 32 threads belong together and in some sense cannot really be called “parallel”.

Perhaps defining what “parallel” truly means would be somewhat helpful here. I’d like to think of “parallel” as “free to do whatever it wants”, untangled from all kinds of restrictions: each thread should be able to do what it wants without severe performance penalties/restrictions. For now I remain unconvinced that these performance restrictions are not present; on the contrary, they must be present somehow. The importance of getting the threads per block and such just right basically proves that to some extent.
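Here is a small sketch of the experiment I mean (the kernel name and iteration count are just my own illustration): the same 32 threads of work launched once as <<<1, 32>>> and once as <<<32, 1>>>, timed with CUDA events.

```
#include <cstdio>

// Trivial kernel: each thread does the same small amount of work.
__global__ void busyWork(float *out, int iterations)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (float)idx;
    for (int i = 0; i < iterations; ++i)
        v = v * 1.000001f + 0.5f;
    out[idx] = v;
}

int main()
{
    float *d_out;
    cudaMalloc(&d_out, 32 * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float msWarp = 0.0f, msSplit = 0.0f;

    // 32 threads in one block: they fit in a single warp.
    cudaEventRecord(start);
    busyWork<<<1, 32>>>(d_out, 1 << 20);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&msWarp, start, stop);

    // Same 32 threads of work, but 1 thread per block in 32 blocks:
    // every warp now carries 31 inactive lanes.
    cudaEventRecord(start);
    busyWork<<<32, 1>>>(d_out, 1 << 20);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&msSplit, start, stop);

    printf("<<<1,32>>> : %.3f ms\n", msWarp);
    printf("<<<32,1>>> : %.3f ms\n", msSplit);

    cudaFree(d_out);
    return 0;
}
```

Whether the second launch configuration is slower, and by how much, is exactly the kind of thing the programmer ends up having to measure himself.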

That is also something that is a bit sucky about CUDA… the CUDA API cannot figure out how to run optimally by itself; the programmer has to try and figure this out, and the number of parameters involved is quite large compared to a CPU. Then again, a CPU like Intel’s is more of a “magic black box” full of tricks, hard to predict performance-wise… CUDA is a bit clearer, but not terribly clear. For example (this should probably be in a separate thread, but anyway):

What is actually limiting the bandwidth of my GT 520 in my bandwidth test program?

In other words:

How do I calculate/determine the maximum number of load instructions that can be executed?

What is the bottleneck? Is it the number of instructions per second that can be executed? Is it the memory frequency? The shader frequency? Is that the CUDA frequency?

Can each CUDA core issue its own load instruction? Or are all CUDA cores bottlenecked by the SM they are in?

Also, how many “computation” instructions can be executed per CUDA core per second?

I suspect that there is a difference between the rate at which loads can be issued versus other kinds of instructions…
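For what it’s worth, here is a minimal sketch of the kind of bandwidth test I mean (kernel name and buffer size are just illustrative, not from any official test): a device-to-device copy kernel timed with CUDA events, with effective bandwidth computed as bytes read plus bytes written divided by elapsed time.

```
#include <cstdio>

// Each thread copies one float: one global read + one global write per element.
__global__ void copyKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

int main()
{
    const int n = 1 << 24;                  // 16M floats = 64 MB per buffer
    const size_t bytes = n * sizeof(float);

    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);

    const int threads = 256;
    const int blocks  = (n + threads - 1) / threads;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    copyKernel<<<blocks, threads>>>(d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // One read and one write per element.
    double gbPerSec = (2.0 * bytes) / (ms / 1000.0) / 1e9;
    printf("Elapsed: %.3f ms, effective bandwidth: %.2f GB/s\n", ms, gbPerSec);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Even with a number from a test like that, it is still not obvious which of the limits above (instruction issue rate, memory frequency, shader frequency) is actually the one being hit.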