Impact of control flow on thread performance

I see in section 6.1.1.2 of the CUDA manual that control flow statements like “if” can really reduce throughput if they cause different threads in a warp to have divergent execution paths.

I’d like to understand this issue a little better. Is the control flow problem that each processor within the block might not be executing the same instruction at the same time? Or is the problem that memory reads might not be issued at the same time across the warp and therefore not be coalesced into a larger read?

Specifically I have a kernel in mind which takes a large array of size N = 1 million, exponentiates the arguments, and sums them together. (My application isn’t exactly that, but this is a simplified approximation to the problem.) My current implementation has each thread in a block calculate the sum of 1024 of these exponentials in shared memory; thread 0 in the block then sums the results from shared memory and writes the total to global memory, and the CPU reads back the handful of values stored in global memory to get the final sum.

Since computing the exponential takes many instructions, a simple optimization would be to skip computing the exponential if the argument is less than -50, since in my particular usage, that will contribute essentially nothing to the final answer. The problem with this optimization is that the arguments which are being exponentiated are not in any particular order in memory, so each thread will get out of sync as some threads skip exponentials while others do not. However, the probability of skipping an exponential is roughly uniform, so after 1024 sums, each thread will still reach the __syncthreads() call at nearly the same time.

Will this have a disastrous effect on throughput, and why? (My GPU will arrive soon and I can test myself, but I’m hoping an expert can help me understand the architecture a little better.)
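For concreteness, here is a minimal sketch of the kind of kernel described above (the names, the 32-thread block layout, and the -50 cutoff are illustrative only, not the actual code):

// Illustrative only: each thread accumulates ELEMS_PER_THREAD exponentials
// into its slot of shared memory, skipping negligible terms (the divergent
// branch in question), then thread 0 reduces the per-thread partial sums.
// 32 blocks x 32 threads x 1024 elements covers roughly N = 1 million.
#define THREADS_PER_BLOCK 32
#define ELEMS_PER_THREAD  1024

__global__ void sumExp(const float *args, float *blockSums)
{
    __shared__ float partial[THREADS_PER_BLOCK];

    const unsigned int base =
        (blockIdx.x * THREADS_PER_BLOCK + threadIdx.x) * ELEMS_PER_THREAD;

    partial[threadIdx.x] = 0.0f;
    for (int i = 0; i < ELEMS_PER_THREAD; ++i) {
        float x = args[base + i];
        if (x > -50.0f)                        // the divergent branch
            partial[threadIdx.x] += expf(x);   // skipped when the term is negligible
    }
    __syncthreads();

    if (threadIdx.x == 0) {
        float total = 0.0f;
        for (int i = 0; i < THREADS_PER_BLOCK; ++i)
            total += partial[i];
        blockSums[blockIdx.x] = total;   // CPU reads back one float per block
    }
}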

I have implemented a similar problem and solved this optimization issue by sorting the array, so that similar elements are stored close together. This makes threads that are numbered close together follow the same execution path, which is more resource-efficient.

To answer the “why”: G80 GPUs are made up of a number of multiprocessors. Each multiprocessor has multiple processors, but a shared instruction unit. Therefore if threads simultaneously active on a multiprocessor (a warp) take different branches, they have to be serialized (i.e. the threads that take the “if” get processed together, then the threads that take the “else” get processed together). If the branch granularity, on the other hand, is a multiple of the warp size, then no serialization has to occur, because all threads simultaneously active on the multiprocessor take the same branch.

Does that make sense?

Mark

Yes, the shared instruction unit was the part I was missing. I understand the limitation now. Thanks!

I’m still not sure if I got it right. What I understand so far: if there is a conditional selection of n linear code blocks in one active thread tuple (warp) and the threads take m < n of these blocks, then those m blocks get serialized with the corresponding threads masked.

But this is at the warp level. So am I right to assume that if all threads in warp i take conditional block 0 and all threads in warp j take conditional block 1, then there is no serialisation? But doesn’t that imply that every warp gets its own instruction pointer?

Example: block with 256 threads in 1d:

const unsigned int warpIdx = threadIdx.x >> 5;
const unsigned int warpGroup = warpIdx % 4;

if (warpGroup == 0) do_this();
else if (warpGroup == 1) do_that();
else if (warpGroup == 2) do_something();
else do_anything();

So have I grouped the warps 4k + j (k = 0, 1; j = 0…3) into four groups that work in parallel on four different code blocks without serialisation?

But what happens if there are syncs in do_this() and do_that() but not in the other two blocks?

Or am I totally wrong with this? Until recently I thought one block runs on one multiprocessor and has one instruction pointer for all threads in the block: some threads are masked out because of conditional code, and if all threads in a warp are masked out, the whole warp is skipped.

So I can’t really picture the architectural details of a GPGPU/DPPU: how to mask nested conditionals and serialize things in hardware, and how to do synchronization for reads and coalesced memory access…

This parallel stuff is really confusing. I never got to understand Tomasulo’s superscalar machinery in conventional CPUs, but the G80 seems even harder…

BTW, seibert, rather than having thread 0 sum all 1024 values and write the result to global memory (which will be slow, because single-thread performance is poor), you should have all the threads cooperate to compute the sum. Use a tree-based reduction such as the one in the scalarprod sample in the CUDA SDK.

Also, you could use __expf() to get a really fast exponential, because it maps directly to a hardware intrinsic instruction (of course the multi-cycle exp() will be more accurate, but much more expensive).

Mark
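(For reference, a minimal sketch of the kind of tree-based shared-memory reduction suggested above; the block size, the names, and the use of __expf() are illustrative, and the scalarprod sample in the SDK is the definitive example.)

// Illustrative tree reduction: each thread first accumulates its own terms
// with the fast __expf() intrinsic, then the block halves the number of
// active threads each step until partial[0] holds the block's sum.
// Assumes blockDim.x is a power of two and the kernel is launched with
// blockDim.x * sizeof(float) bytes of dynamic shared memory.
__global__ void sumExpTree(const float *args, float *blockSums, int elemsPerThread)
{
    extern __shared__ float partial[];

    float sum = 0.0f;
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * elemsPerThread;
    for (int i = 0; i < elemsPerThread; ++i)
        sum += __expf(args[base + i]);
    partial[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction: halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        blockSums[blockIdx.x] = partial[0];
}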

Oh sorry, my description was unclear. Each thread accumulates the sum of 1024 exponentials in a register and only writes that sum to shared memory. Thread 0 only has to sum up the 32 floats (one per thread) from shared memory at the end and write the result to global memory. I have 32 blocks, so the CPU only has to read back 32 floats to compute the final sum.

Incidentally, I was absolutely stunned to see how fast the 8800 GTX was. If I compare a CPU implementation with a GPU implementation that does the same work, the 8800 is 40 times faster! Once I compare the best CPU implementation I have (which can intelligently skip a lot of the calculation) to the best GPU implementation (which also skips some steps, but not in the same way), the advantage drops to 8x, but that’s still amazingly good. (I’ll try your __expf() suggestion next to see if the diminished accuracy adversely affects anything.)

I want to say thanks to you and the rest of the engineers at NVIDIA! CUDA has saved my thesis. :)

(Now we just need the 64 bit Linux drivers, and we can start buying a few more of these boards…)

Hi, just to be sure… you say “If the branch granularity, on the other hand, is a multiple of the warp size, then no serialization has to occur”, but could it by any chance be the half-warp size that you mean, because at any time the multiprocessor can only execute a half-warp (16 threads) at a time? It would seem logical that the scheduler would then schedule half-warps…?

On a related matter (the scheduler)… when a warp, or maybe a half-warp, fetches data from global memory, it has to wait, so the scheduler puts it in some wait queue and activates it again when the fetch is complete. According to the tutorial this waiting takes up to 500 clock cycles or something like that. My question is:

While a fetch is in progress, can another warp launch a second request, a third warp a fourth, and so on… or not? Can only one chunk of data be fetched every 500 clock cycles, or is each fetch just always 500 clock cycles late?

Keep on doing a great job!

Seibert,

I would consider the possibility of using blocks of 32 threads, and multiple blocks, for this problem. This eliminates the need for:
a) double buffering (in some problems)
b) __syncthreads() (increased performance)
c) worrying about race conditions among warps (one can always use the volatile keyword to keep shared memory in sync among the threads inside a warp)

This might scale up your performance factor by 1.5x, 2x or more. If you have time, just give it a try.
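(A minimal sketch in the spirit of this suggestion, assuming one warp per 32-thread block and relying on implicit warp-synchronous execution, which was the usual idiom on G80-class hardware; the names are illustrative.)

// Illustrative warp-synchronous version: one warp per block, so no
// __syncthreads() is needed; volatile keeps the compiler from caching the
// shared-memory values in registers between steps.
__global__ void sumExpWarp(const float *args, float *blockSums, int elemsPerThread)
{
    __shared__ volatile float partial[32];

    float sum = 0.0f;
    int base = (blockIdx.x * 32 + threadIdx.x) * elemsPerThread;
    for (int i = 0; i < elemsPerThread; ++i)
        sum += expf(args[base + i]);
    partial[threadIdx.x] = sum;

    // Tree reduction inside the single warp, no explicit synchronization.
    if (threadIdx.x < 16) partial[threadIdx.x] += partial[threadIdx.x + 16];
    if (threadIdx.x <  8) partial[threadIdx.x] += partial[threadIdx.x +  8];
    if (threadIdx.x <  4) partial[threadIdx.x] += partial[threadIdx.x +  4];
    if (threadIdx.x <  2) partial[threadIdx.x] += partial[threadIdx.x +  2];
    if (threadIdx.x == 0) blockSums[blockIdx.x] = partial[0] + partial[1];
}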

Is this generally true? What I mean is: if you only calculate the exp of your global input data and add it to shared memory, the exp has enough time to run while the global memory fetch is in flight, I would guess?

Just wondering if I also have to add __expf and friends to my benchmarking plan…

Dènis

Just wondering, but what is the range of the values in your array? Are they bounded ints, for example between -50 and 40? In that case, you could perhaps try a look-up table: pre-calculate these values and store them in an array (in shared memory, or even a constant global array?) and retrieve the value you need instead of calculating each one.
No guarantee this brainwave won’t turn out to be a brainfart ;-)
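(A minimal sketch of the look-up idea, assuming integer arguments bounded to [-50, 40]; the names and bounds are illustrative.)

// Illustrative only: exp() is precomputed on the host for every possible
// integer argument and stored in constant memory, so the kernel does a
// table fetch instead of computing the exponential.
#define EXP_MIN (-50)
#define EXP_MAX   40
#define EXP_TABLE_SIZE (EXP_MAX - EXP_MIN + 1)

__constant__ float expTable[EXP_TABLE_SIZE];

// Host side, once at start-up:
//   float h_table[EXP_TABLE_SIZE];
//   for (int v = EXP_MIN; v <= EXP_MAX; ++v)
//       h_table[v - EXP_MIN] = expf((float)v);
//   cudaMemcpyToSymbol(expTable, h_table, sizeof(h_table));

// Device side: valid only for EXP_MIN <= v <= EXP_MAX.
__device__ float expLookup(int v)
{
    return expTable[v - EXP_MIN];
}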

Hah, it seems an old thread has been suddenly revived. Thanks for the suggestions, although this project is long past. :)

In this instance, it turns out I had not yet learned an important CUDA lesson: arithmetic speed is not as important as you think. The performance of my code had very little to do with the time taken by calling exp(), and everything to do with global memory latency. The math was effectively free once you factored in the time spent reading from global memory.

In case people are curious: The first solution was to switch to a packed, 16-bit fixed point representation of my data, which had been 32-bit floats, cutting the number of reads in half. (My data elements come from physical measurements which have 4 significant digits and cover a narrow dynamic range.) The next optimization was to cut the reads in half again by taking some of the data which was often repeated in a global memory array, and packing it into a shared memory table. The final fix was to batch calculations, so one loop through global memory could service 10 separate calculations at once.
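(A minimal sketch of the 16-bit fixed-point packing described above; the scale factor and names are illustrative, not the actual code.)

// Illustrative only: two measurements are packed into one 32-bit word, halving
// the global-memory traffic, and unpacked on the GPU with a shift and a scale.
#define FIXED_SCALE 1000.0f   // enough for ~4 significant digits over a narrow range

// Host side: convert a float measurement to signed 16-bit fixed point.
short packFixed(float x)
{
    return (short)(x * FIXED_SCALE);
}

// Device side: unpack both halves of one packed 32-bit word.
__device__ float2 unpackFixed(unsigned int w)
{
    short lo = (short)(w & 0xffffu);
    short hi = (short)(w >> 16);
    return make_float2(lo / FIXED_SCALE, hi / FIXED_SCALE);
}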