Why more threads than thread processors?

I’m a 17 year old, high school student with an interest in mathematics, software and electronics.

I’m attempting to write a GPU equivalent of the “P7Viterbi” code from “HMMER2”, a computational genomics package for a school project (OK, so I am geekier than most students my age).

I read an article in the “Microprocessor Report” that very clearly described CUDA and the architecture of the 8800 series GPUs. However, I’m a bit confused about the usefulness of having more threads than thread processors.

If the GPU has only 128 thread processors, why have more than 128 threads?

If my understanding is correct, wouldn’t the additional threads just be waiting for an available thread processor?

I know why this is useful for traditional OS threads blocked on I/O. However, even if the time to switch threads is zero, I still don’t understand why having more threads than thread processors makes any sense.

Does having a pool of ready but waiting threads mask some memory latencies?

Hopefully, someone will clear up my confusion about threading. Be gentle, I’m not an aged and experienced boffin, after all!


If I were you, I would stay away from optimizing until you have a slow, inefficient, but reliable working version first. This could even be implemented in a scripting language. One side benefit of having a slow, reliable version is that you can use it to test your optimized version.

Yes, that is exactly what it does. Plus, you can scale your program up to newer hardware with more multiprocessors in the future, but only if you already have more threads than processors on your current platform.

Ask and you shall receive.

Knock and the door shall be opened for you.


The GPU is designed to interleave computation and memory accesses. So, while one thread is stalled waiting for its memory read to complete, others can run and perform arithmetic. And the GPU is designed to do this on a massive scale, capable of keeping about 10,000 threads all interleaving at the same time.

Running more than 10,000 threads doesn’t give you any performance penalty either. As soon as some blocks finish, others that are waiting will jump in to fill the empty space. In fact, running more and more threads usually helps performance, as long as you have work for all of them.

As I understand it, it doesn’t really matter that you have 128 thread processors, since they are divided into 16 multiprocessors. And one block, no matter how many threads it has, will run on only one multiprocessor!

Can anyone else confirm that this is true?

One more question. Is the shared memory actually not a memory but a register array?

After all, 4096 * 4 bytes is exactly 16 KB, which is the size of the local shared memory.

This is correct. You should think of threads and blocks as logical entities which are mapped to the physical entities of stream processors and multiprocessors, respectively. You can (and should) have more threads per block than stream processors, and similarly you should have more blocks than multiprocessors.

The main difference between shared memory and the register array is that the shared memory supports indexing, whereas you cannot do that with registers. (At least, not with the instructions we have access to.) In terms of speed, the shared memory and the registers are the same, assuming there are no shared memory bank conflicts. One could easily imagine that in hardware they are implemented in a similar way.

In my experience, shared memory operations are slightly slower than register ones even without any bank conflicts, because they incur instruction overhead to load data from shared memory into registers, where the actual computation is performed (and more instructions are needed to store results back to shared memory). Other than that, I think they’re equivalent.


Hi There!! Geeky skool boy!

I would like to endorse Mr. Anderson’s reply. That captures it all.

“Latency Hiding” is the essence of performance.