Block and thread scheduling/ordering questions

I have a few questions regarding block and thread scheduling.

  1. I know the CUDA programming guide says the issue order of blocks within a grid and of warps within a block is undefined. I understand this statement in general, but I would like to know whether the issue order on the Quadro FX 5600 or GeForce 8800 GTX implementations is somewhat predictable and, if so, how reliable that predictability is.

  2. In general, once a block begins execution on an SM, can it be preempted and scheduled later? I'm not talking about multiple blocks running on an SM simultaneously, but about one block's state being saved to global memory, sleeping for a while, and then being restored so the block resumes execution on the same SM. Can I assume blocks, once started, run to completion?

  3. Even if the warp issue order is undefined, is there an efficient way to find out which warp is issued first? I need one thread per block to do a global memory write, and for performance reasons I'd like to put that write in the first warp scheduled. Similarly, I'd like a thread in the last warp scheduled to do a global memory read.

  1. Blocks run to completion once started. There is no preemption or state saving.
  2. With all the latency hiding and interleaved execution going on, you cannot possibly discern a performance difference between a write from the "first" warp run and one from the "last" (note that first and last are ill-defined anyway, because all warps in a block run concurrently). Just have threadIdx.x == 0 do the write and everything will be fine.
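
A minimal sketch of what that looks like, assuming a hypothetical per-block flag array in global memory (blockFlags is not from this thread, just an illustration):

__global__ void kernelWithPerBlockWrite(int *blockFlags)
{
    // ... per-thread work for this block ...

    // One thread per block performs the single global-memory write;
    // which warp it happens to land in makes no measurable difference.
    if (threadIdx.x == 0)
        blockFlags[blockIdx.x] = 1;
}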

The best way to develop an algorithm in CUDA is to start by imagining that EVERY SINGLE thread calculates its result at EXACTLY THE SAME TIME. If you can imagine this and develop your algorithm so that it works with this model, you are home free (except for memory access patterns, but that’s a whole other can of worms).

I agree with your development-process abstraction that every thread runs exactly concurrently to get the algorithm working, but now that I've done that I'd like to squeeze out every last bit of performance possible. I consider microarchitecture knowledge fair game once you've done the basic stuff like coalescing, avoiding bank conflicts, etc.

I see 2 big challenges to any interblock sync.

  1. Since blocks are not preempted, a kernel that synchronizes between blocks cannot run more blocks than the device can execute concurrently. This severely limits performance, since the device doesn't really start getting warmed up until you have 200+ blocks.
  2. Global memory bandwidth is only ~70 GB/s and it is very precious. Having a bunch of blocks spin-waiting, constantly reading from global memory, is going to put a dent in that. Arithmetic capacity, on the other hand, is plentiful: you can have many blocks recalculate the same results as others (avoiding inter-block communication) without any change in the running time, since the arithmetic is overlapped with the memory reads. What matters in the end is wall-clock time, not how many calculations you save.

I did just think of an idea for how you could test the block execution order, though. Set up a simple calculation to, say, compute b[i] = a[i] * 2 to keep most of the threads busy; this kernel will be memory bound. Time its execution 10,000 (or more) times as a baseline (with CUDA_PROFILER, so you see gpu_time). Then add one additional block to the calculation that spin-waits until the others are done, time the execution for another 10,000 (or more) runs, and make a histogram of both sets of timings.

The idea is that the earlier the spin-wait block starts, the more memory bandwidth it wastes, translating into a longer kernel execution time. So if the second histogram is broader than the first, you know your block is executed at mostly random times. If the second histogram is shifted from the first but still narrow, you know the spin-wait block runs at some correlated time. You could also test for correlations with the number of blocks or the spin-wait block's location in the grid, but I would guess that is asking for a headache.
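
A rough sketch of such a test kernel, not code from this thread: worker blocks do the memory-bound work and raise a per-block flag in global memory, and one extra block polls those flags until all workers are done. Names like doneFlags and workBlocks are assumptions, doneFlags must be zeroed before the launch, and the spin-wait only terminates if all workBlocks + 1 blocks fit on the device concurrently.

__global__ void doubleWithSpinWait(const float *a, float *b, int n,
                                   volatile int *doneFlags, int workBlocks)
{
    if (blockIdx.x < workBlocks) {
        // The memory-bound baseline: b[i] = a[i] * 2.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            b[i] = 2.0f * a[i];

        // One thread per block raises this block's completion flag.
        __syncthreads();
        if (threadIdx.x == 0)
            doneFlags[blockIdx.x] = 1;
    } else if (threadIdx.x == 0) {
        // The extra block: poll the flags until every worker block is done.
        // The earlier this block is scheduled, the more bandwidth it burns,
        // which is what would stretch the kernel time in the histogram.
        bool allDone = false;
        while (!allDone) {
            allDone = true;
            for (int k = 0; k < workBlocks; ++k)
                if (doneFlags[k] == 0) { allDone = false; break; }
        }
    }
}

Launch it with workBlocks + 1 blocks and compare the timing histogram against the plain version launched with workBlocks blocks.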

Why doesn't the device get warmed up until you have 200+ blocks? If each SM can support a maximum of 768 threads, and each block can have at most 512 threads, why wouldn't 2 * number_of_sms blocks saturate the GPU? Why does it matter where the threads come from? Is it because of __syncthreads()?

In most real application kernels I've written, the sweet spot for performance is a block size of 64 to 128. With a reasonable occupancy of ~50%, that is 3 blocks per multiprocessor = 48 blocks running concurrently on the GPU. It can chew through that tiny number of blocks so fast that the overhead of launching the kernel costs more than the calculation itself. 200+ blocks actually gives the device a problem big enough to make the setup cost worthwhile. Try a few benchmarks yourself: you should see sublinear performance scaling up to ~100 or ~200 blocks, where it transitions into a linear regime.

Of course, there seems to be some overhead at the high end too: 65000+ blocks seem to take much longer to run than they should, but that is a topic for another thread.

Though, in light of your comments, perhaps this entire post should be rephrased to say 10,000+ threads instead of 200+ blocks.

As Mr.Anderson said, the warm-up behaviour is an empirical result. In my personal experience, when launching a fixed number of blocks, peak performance on the 8800 is achieved by feeding each multiprocessor 3 blocks (even at a poor 25% occupancy). And more often than not, it's much more efficient to use the GPU as a streaming processor (one thread per element), with scan and reduce being the only exceptions I have run into. I guess some of this may come down to the GPU being a fine-grained parallel architecture.
To squeeze out the last bit of performance, it may be more effective to redesign the algorithm in a more streaming manner, e.g. in your genome-matching algorithm, calculate the matrix on a per-line basis.
To test block issue order, writing PTX and reading %physid may be the most reliable method.

Can you elaborate on "calculate the matrix on a per-line basis"? I'm still trying to learn how to think in streams. The difficulty I have is that, even within a single line of the matrix, each element depends on the element to its left. If I wanted to split the line across multiple blocks, I'd still have inter-block communication. Am I thinking about this the wrong way?

Sorry, I haven’t thought of it enough.
It should be a “per-diagonal” approach:
Pass id:
1 2 3 4 5
2 3 4 5
3 4 5
4 5
5
First pass computes 1, second pass computes 2, etc.
That would eliminate all inter-thread dependency.
If you have sufficiently many elements (which is likely for genomes), this multi-pass approach is very likely to be faster than a block-sync one.
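
A minimal sketch of that multi-pass idea, assuming an edit-distance-style recurrence in which cell (i, j) depends on its upper, left and upper-left neighbours. The matrix H, its dimensions, and the "+1" scoring are placeholders, not details from this thread.

// One kernel launch per anti-diagonal d; one thread per interior cell on it.
__global__ void diagonalPass(int *H, int width, int height, int d)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    int i = max(1, d - (width - 1)) + t;   // row of this thread's cell
    int j = d - i;                         // column, so that i + j == d
    if (i >= height || j < 1)
        return;

    int up   = H[(i - 1) * width + j];
    int left = H[i * width + (j - 1)];
    int diag = H[(i - 1) * width + (j - 1)];
    // Placeholder recurrence; substitute the real match/mismatch scoring here.
    H[i * width + j] = min(min(up + 1, left + 1), diag + 1);
}

The host then launches one pass per diagonal (after initialising the boundary row and column of H):

for (int d = 2; d <= (width - 1) + (height - 1); ++d) {
    int firstRow = (d - (width - 1) > 1) ? d - (width - 1) : 1;
    int lastRow  = (d - 1 < height - 1) ? d - 1 : height - 1;
    int cells    = lastRow - firstRow + 1;   // interior cells on diagonal d
    int threads  = 256;
    diagonalPass<<<(cells + threads - 1) / threads, threads>>>(d_H, width, height, d);
}

where d_H is the device copy of the matrix.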

What you describe is exactly what we plan to do. At first we have very little parallelism, and as the diagonals get longer we get more parallelism. In our case, once a diagonal exceeds the number of threads in our block, we'll start spanning it across blocks.

E.g., if we had 3 threads per block:

Block 0 will look like:
1 2 3 4 5 6 7 8 9 …
2 3 4 5 6 7 8 9 …
3 4 5 6 7 8 9 …

Block 1 will be
4 5 6 7 8 9 …
5 6 7 8 9 …
6 7 8 9 …

Block 2 will be
7 8 9 …
8 9 …
9 …

So block 1 has to wait until block 0 finishes diag 3 and writes the bottom-most result to global mem before it can consume that result and start its part of diagonal 4.

The alternative would be this, which requires global memory communication in two dimensions instead of just 1, but it doesn’t require any explicit block communication/sync beyond just doing multiple kernel calls.

k0_b0 | k1_b0 | k2_b0 <— Notation: Kernel 2 Block 0
1 2 3 | 4 5 6 | 7 8 9
2 3 4 | 5 6 7 | 8 9 a
3 4 5 | 6 7 8 | 9 a b

k1_b1 | k2_b1 | k3_b0
4 5 6 | 7 8 9 | a b c
5 6 7 | 8 9 a | b c d
6 7 8 | 9 a b | c d e

k2_b2 | k3_b1 | k4_b0
7 8 9 | a b c | d e f
8 9 a | b c d | e f g
9 a b | c d e | f g h

I also believe that, to be safe, I have to call __syncthreads() between diagonals.
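
For what it's worth, here is a rough sketch of that in-block structure, assuming one block's diagonals fit in shared memory. MAX_DIAG, the zero boundary values and the "+1" recurrence are all placeholders rather than anything from this thread.

#define MAX_DIAG 512                  // illustrative cap: one thread per cell of a diagonal

__global__ void stripeDiagonals(int *out, int nDiagonals)
{
    __shared__ int prev2[MAX_DIAG];   // scores on diagonal d-2
    __shared__ int prev1[MAX_DIAG];   // scores on diagonal d-1
    __shared__ int curr [MAX_DIAG];   // diagonal d being computed

    int t = threadIdx.x;              // row index of this thread's cell within a diagonal

    // Illustrative zero boundaries; a real kernel would load them from global memory.
    if (t < MAX_DIAG) { prev2[t] = 0; prev1[t] = 0; curr[t] = 0; }
    __syncthreads();

    for (int d = 2; d < nDiagonals; ++d) {
        if (t <= d && t < MAX_DIAG) {
            int up   = (t >= 1) ? prev1[t - 1] : 0;   // neighbour (t-1, d-t)   on diagonal d-1
            int left = (t <  d) ? prev1[t]     : 0;   // neighbour (t,   d-t-1) on diagonal d-1
            int diag = (t >= 1) ? prev2[t - 1] : 0;   // neighbour (t-1, d-t-1) on diagonal d-2
            // Placeholder recurrence; substitute the real scoring here.
            curr[t] = min(min(up + 1, left + 1), diag + 1);
        }
        __syncthreads();              // diagonal d is finished and visible to every thread

        if (t <= d && t < MAX_DIAG) { // rotate the diagonal buffers
            prev2[t] = prev1[t];
            prev1[t] = curr[t];
        }
        __syncthreads();              // rotation done before starting diagonal d+1
    }

    if (t < MAX_DIAG)
        out[t] = prev1[t];            // hand the last diagonal to the next block/kernel
}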

What I mean is to simply create one thread for each element,
e.g. 1 thread for diagonal 1, 2 threads for diagonal 2, etc.
Just let the number of threads grow with the genome size and ignore the concept of blocks. To avoid excessive kernel launches, you can do the first few thousand diagonals on the CPU. That's likely faster and easier than block-sync.

We'd max out at 512 threads per block. At that point we could have each thread do multiple elements of the diagonal.
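
One way to do that (purely an illustration, not the posters' code) is a simple strided loop, so a single block of up to 512 threads can cover a diagonal of any length:

__global__ void longDiagonal(int *diagScores, int diagLength)
{
    // Each thread handles cells threadIdx.x, threadIdx.x + blockDim.x, ... of the diagonal.
    for (int k = threadIdx.x; k < diagLength; k += blockDim.x) {
        diagScores[k] = k;            // placeholder for the real recurrence on cell k
    }
}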

But since this is a project for a class and is supposed to be a learning experience, we'd like to see if we can exploit more parallelism by spanning blocks (SMs) as well.

Please let us know whether you see any speedup from letting each thread do multiple elements instead of one. Thanks.

Actually, doing it all within a block without having to go to global memory may not be possible at all if the diagonals are too large, since we have to keep the two previous diagonals (plus the one being computed) in shared memory. Assuming we only had to save one word per element, the maximum diagonal size that would let the block fit in 16 KB of shared memory is 1365 (3 diagonals * 4 bytes * 1365 elements = 16380 B). I expect we'll have diagonals on the order of 10K elements, and we'll have to save more than one word per element.

Once a block has to spill to global memory, I’m guessing it won’t be much more overhead to have a different block read the data and continue work.