CUDA use cases: running serial algorithms on composite data

I’ve got sort of a fundamental question about the use cases of CUDA:

(1) besides the obvious use (as far as I can tell from all the documentation, articles, and examples I have seen so far) of having hundreds or thousands of lightweight CUDA threads work together on a single problem (e.g. matrix multiplication, where each thread computes one element), I was wondering whether the following second use case is also intended for CUDA:

(2)

  • take any serial algorithm that operates on some small data set (e.g. multiplying a small matrix) and run it in just a single thread

  • then run the same (still serial) algorithm on all SPs in parallel, but operating on a huge composite data set, where each core works on its assigned small subset; in the above example this would mean multiplying many small matrices independently at the same time (a rough sketch of this follows after the list);

  • provided we have a card with 16 multiprocessors of 8 SPs each… can we then expect 16 x 8 = 128 threads running in true parallel, with a corresponding speed-up? (I somehow heavily doubt that…) Or will memory access patterns across those 128 threads still limit performance drastically?
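
to illustrate, here is a rough sketch of the kind of kernel I have in mind (hypothetical names, made-up 4x4 matrix size):

```cuda
// Hypothetical sketch of use case (2): one thread per small matrix pair.
#define N 4   // made-up matrix size

__global__ void multiplyManySmallMatrices(const float *A, const float *B,
                                          float *C, int numMatrices)
{
    int m = blockIdx.x * blockDim.x + threadIdx.x;  // one matrix pair per thread
    if (m >= numMatrices) return;

    const float *a = A + m * N * N;  // this thread's private inputs
    const float *b = B + m * N * N;
    float       *c = C + m * N * N;

    // The plain serial triple loop, exactly as on the CPU.
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < N; ++k)
                sum += a[i * N + k] * b[k * N + j];
            c[i * N + j] = sum;
        }
}
```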

→ do you think there's any chance of succeeding with use case (2), or is it doomed to fail?

thanks,
michael

This is a completely reasonable approach to increasing the data parallelism of your program. I have applied this principle successfully in one of my own programs.

The main obstacles here are memory layout and divergent code paths. If the serial algorithm has any branching that depends on the input, you will have a problem when threads in the same warp want to go in different directions. You really want every thread to take the same code path regardless of its input, even if this appears to waste some time. It is also important to lay out the data in memory so that the threads of a warp read consecutive elements and get good coalescing. This often requires interleaving the input data.
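
For example, a minimal sketch of the interleaving idea (hypothetical names; assumes one logical array per thread):

```cuda
// Minimal sketch of interleaved layout (hypothetical names).
// Contiguous layout:  data[tid * len + k]       -> warp reads words len apart (uncoalesced)
// Interleaved layout: data[k * numThreads + tid] -> warp reads consecutive words (coalesced)
__global__ void sumInterleaved(const float *data, float *out,
                               int numThreads, int len)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;
    for (int k = 0; k < len; ++k)
        acc += data[k * numThreads + tid];  // stride 1 across the warp
    out[tid] = acc;
}
```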

If those two conditions are met, the only limit on your performance will be either floating-point throughput (# of SPs x clock rate) or memory bandwidth. (In most cases, memory bandwidth turns out to be the bottleneck.)

hi seibert,

glad to hear that.

but actually what I meant was a little bit different:

I am aware of the fact that threads within a warp should not diverge or access uncoalesced memory.

so my question/idea was even simpler: let’s say my serial algorithm diverges a lot and/or uses uncoalesced memory → but instead of running hundreds of threads within one block, I basically run many blocks that each have only a single thread (so the warp size is “effectively 1” instead of 32).

→ will all those blocks run in true parallel? how well can that scale?

→ so the real question is: how many “truly parallel” units are there inside the GPU - is it # of MPs x # of SPs per MP (e.g. 16 x 8 = 128)?

→ if so, it would mean that I could execute 128 blocks (with just 1 thread each), and all of them could take different branches if needed (w/o losing performance, since they are “truly parallel” units)

→ or to put it another way: do these single threads from different blocks also have to use coalesced memory accesses and non-divergent branches ACROSS blocks? (or does this condition only have to be met within individual warps?)

hope I have expressed my problem/“idea” a little more clearly this time.

thanks,

michael

p.s.: I know that this may not fully utilize the GPU’s power… however, if my data set can be organized into 128 chunks, I just want to figure out whether the same operation can be performed on 128 datasets in the same time that a single GPU thread would need for 1 dataset (here I assume that all 128 datasets fit into global memory at the same time).
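
a minimal sketch of the launch I have in mind (hypothetical kernel, placeholder work):

```cuda
// Hypothetical sketch: 128 independent datasets, one block with a SINGLE
// thread per dataset; each block runs the unmodified serial algorithm.
__global__ void serialKernel(const float *in, float *out, int len)
{
    const float *myIn  = in  + blockIdx.x * len;  // this block's dataset
    float       *myOut = out + blockIdx.x * len;
    for (int i = 0; i < len; ++i)                 // placeholder serial work
        myOut[i] = myIn[i] * 2.0f;
}

// intended launch: 128 blocks x 1 thread
// serialKernel<<<128, 1>>>(d_in, d_out, LEN);
```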

Oh, I see what you are asking. Unfortunately, this form of usage will have horrible performance. The scheduling unit on the GPU is the warp, so if you only have 1 thread per block, you will waste 7 of the 8 stream processors in each multiprocessor. Even worse, that one stream processor will only be doing something useful 25% of the time, and you will get no memory coalescing.

You will be able to have multiple blocks running per multiprocessor, but that will only help hide a little memory latency.

So, no, you shouldn’t do this. :) Shoehorning task parallelism by this method into a data parallel architecture will not work well.

BTW, the reason for this is that the GPU has only one instruction decoder per multiprocessor. Note that whole warps can take divergent code paths with no penalty, but unless you can fill up a warp with 32 threads of useful work (all running the same instruction), you will underutilize the GPU significantly.
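
A toy comparison to make the granularity concrete (assuming a warp size of 32):

```cuda
// Branching at warp granularity: every thread of a warp takes the same
// path, so there is no divergence penalty.
__global__ void divergeByWarp(float *x)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x / 32) % 2 == 0)
        x[tid] += 1.0f;
    else
        x[tid] -= 1.0f;
}

// Branching inside a warp: neighboring threads disagree, so the warp
// executes both paths one after the other.
__global__ void divergeInWarp(float *x)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)
        x[tid] += 1.0f;
    else
        x[tid] -= 1.0f;
}
```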

You can’t make use of fewer than 16 threads per block, so for your 128 blocks you’re already at 128 x 16 = 2,048 threads.

And that would be assuming no global memory accesses.

Global memory access has a latency of at least about 400 cycles, so unless you have roughly 10-100 times more threads than the above, your performance will still be horrible - at least compared to the theoretical maximum; it obviously depends on your ratio of calculation to global memory accesses, and uncoalesced accesses count about 16x as expensive (with only one thread per block, all accesses would be uncoalesced, btw).

I managed to reach about CPU speed with 128 blocks of 16 threads each (only 4 doing the calculation, but all 16 used for the memory accesses to get coalescing), but unless this is just a small part of a larger algorithm, you will of course want to be faster than the CPU.
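
Roughly like this - a much simplified sketch, not my actual code; it assumes len is a multiple of 16 and a made-up output layout:

```cuda
// Simplified sketch: 16 threads per block; all 16 do the (coalesced)
// global loads into shared memory, but only 4 do the computation.
__global__ void loadWideComputeNarrow(const float *in, float *out, int len)
{
    __shared__ float buf[16];
    const float *myIn = in + blockIdx.x * len;  // this block's dataset
    float acc = 0.0f;

    for (int base = 0; base < len; base += 16) {
        buf[threadIdx.x] = myIn[base + threadIdx.x];  // 16 threads load together
        __syncthreads();
        if (threadIdx.x < 4)                          // only 4 threads compute
            for (int i = 0; i < 4; ++i)
                acc += buf[threadIdx.x * 4 + i];
        __syncthreads();
    }
    if (threadIdx.x < 4)
        out[blockIdx.x * 4 + threadIdx.x] = acc;      // made-up output layout
}
```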

Then you’re not in a good position. If your algorithm diverges a lot, and you can’t think of a new one, then there’s not much you can do. You’ll lose a good deal of performance, but no more than 16x (and probably less, depending on how much it really diverges). If you use uncoalesced memory, that is easier to work around. You’ve still got tons of registers (up to 128) to use to accelerate your serial code and some smem. If you use them well, you can remove memory bandwidth as the bottleneck. (But to do this really well you often have to make your loops quite convoluted.)
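
For instance, a tiny register-blocking sketch (hypothetical): the working set is staged into a local array indexed only with compile-time constants, so the compiler can keep it entirely in registers:

```cuda
// Sketch: stage a small per-thread working set into registers. Because
// all indices into m[] are known after unrolling, the compiler can keep
// the array in registers instead of local memory.
__global__ void registerBlocked(const float *in, float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float m[16];

    #pragma unroll
    for (int i = 0; i < 16; ++i)        // one pass of global loads
        m[i] = in[tid * 16 + i];

    float acc = 0.0f;
    #pragma unroll
    for (int i = 0; i < 16; ++i)        // all further work out of registers
        acc += m[i] * m[i];

    out[tid] = acc;
}
```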

All, thanks for your replies. Actually, after reading quite a bit of CUDA-related material over the last few weeks, my assumption was also that running multiple instances of a purely serial algorithm on the GPU would give poor performance.

Couple of questions though regarding your comments:

→ how did you come up with that figure of 25% useful? is it because I assumed just 1 thread, and an instruction takes 4 cycles to complete? (so I would only keep a multiprocessor fully busy with 4 x 8 = 32 threads, right?)

→ what exactly are you referring to with “than the above”? it can hardly refer to the 2,048 threads, since multiplying that by 10 or 100 would exceed the theoretical maximum number of threads (12,288 in my case).

→ it was just an assumption; actually, there is a lot of potential for parallelization in the existing algorithm; I was just curious whether running multiple serial versions could lead to good performance at all

→ how did you come up with the figure of “the maximum performance you lose is 16x”? (I would have expected something like the warp size, 32, since 32 threads can run in parallel even when they diverge (provided that I have enough (>32) threads)?)

→ up to 128 registers PER what? afaik, the total number of registers is 8,192 per multiprocessor…

Apart from that - I tried running multiple serial instances of the algorithm at the same time, just to see how many could run simultaneously while the execution time stayed the same as for a single run. This particular algorithm seems to scale up to 64 parallel runs (64 blocks with just 1 thread each); only when using more than 64 blocks at once did the total execution time become significantly worse. However, after your comments I would have assumed that it could only scale up to the number of MPs (16 in my case)… how come it is faster than expected, then?? (I’m only referring to the kernel execution time, no memory reads/writes included there)

Thanks for your input,

Michael

P.S.: note, I have realized that my intended use case is a very bad idea. I’m gonna develop a fully parallel version next…

If your algorithm doesn’t diverge a lot and its accesses can be more or less coalesced, then running multiple instances of a serial algorithm can be an excellent strategy.

I think you’re right - you’ll lose at most 32x performance. Sometimes the granularity is 32 threads, sometimes 16, but for divergence I think it’s 32. (For coalescing, bank conflicts, and cmem serializations, it’s 16.) But you’d need something like a 32-way case statement or 5-deep nested divergent ifs to cause this kind of absolute divergence.
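
A toy worst case for illustration (hypothetical kernel):

```cuda
// Toy worst case: all 32 lanes of a warp take different branches, so the
// warp serializes over every case (~32x slower for this section).
__global__ void fullDivergence(float *x)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    switch (threadIdx.x % 32) {
        case 0:  x[tid] += 1.0f; break;
        case 1:  x[tid] *= 2.0f; break;
        /* ... cases 2 through 30 ... */
        default: x[tid] -= 1.0f; break;
    }
}
```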

128 registers per thread. From your serial thread, you’ve got a max of 128 registers (even if you use 1 thread per block).

That sounds very strange. So you got a linear increase in performance when scaling up to 64 one-thread blocks, and then started to lose significant performance as you increased the number of blocks further?? That’s completely unexpected if it’s true that you’re not performing memory reads/writes inside the kernel. I would expect, as you say, linear scaling up to 16 blocks and then flat performance with additional blocks.

I need to amend that statement slightly. Numerous comments from NVIDIA employees in the forum have suggested that the half-warp (16 threads) is actually the hardware scheduling unit rather than the full warp. However, we have been encouraged in the documentation to think of the full warp of 32 threads as the scheduling unit, because future devices might enforce that in hardware.

Anyway, going with the 32-thread concept (even if it isn’t quite what the hardware does), I get that 25%-useful figure by observing that there are 8 stream processors per multiprocessor. When a multiprocessor runs a warp, it folds the warp up and pipelines 4 threads through each of the 8 stream processors over 4 clock cycles. If the warp has only 1 active thread in it, 7 of those stream processors do nothing at all, and the remaining one executes a single thread plus 3 no-ops behind it - hence 1 useful slot out of 4, or 25%. (If the half-warp is the true scheduling unit, then there would be only one no-op.)

I guess I didn’t express myself clearly enough here. I was trying to say that my measurements include only the kernel execution time and do not include the host-device memory transfers.

However, the kernel itself does use a whole bunch of uncoalesced memory reads/writes. That might explain the weird behavior.

Yes, definitely. When running multiple blocks, one block’s memory accesses overlap with another’s computation, so you can’t measure just one in isolation.

alright, I see. so is it correct then that the GPU can only run #ofMPs blocks concurrently? (e.g. 16 blocks in my case?)

(of course, whenever a block is waiting on a memory access, another block might be scheduled in the meantime to do some work… but still, at any given moment, only #ofMPs blocks are executing - correct?)

That’s right. Within a given cycle, an MP can only be performing one instruction. A lot of the limitations in CUDA are derived from this fact, as well as most of its benefits.
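
For reference, the number of MPs needn’t be hard-coded; the CUDA runtime reports it:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0
    printf("multiprocessors: %d\n", prop.multiProcessorCount);
    return 0;
}
```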