A question about the correspondence between warp and core

I know that on the Tesla C2050 a Streaming Multiprocessor (SM) contains 32 cores, and that a warp is composed of 32 threads. My question comes from this Stack Overflow page: “caching - CUDA - Multiprocessors, Warp size and Maximum Threads Per Block: What is the exact relationship?”

In that question, the asker claimed:

“The threads in the same multiprocessor (warp) process the same line of the code and use shared memory of the current multiproccessor.”

and in the accepted answer, the answerer replied:

“No, there can be many more than 32 threads “in flight” at the same time in a single SM.”

My question is this: since there are only 32 cores in one SM, one core can handle one thread at a time, and a warp has 32 threads, it seems obvious that one SM can execute only one warp at a time. Why did the answerer say that “there can be many more than 32 threads ‘in flight’ at the same time in a single SM”? If that were true, a single SM would need many more than 32 cores to run so many threads. Assume the threads contain only simple instructions: no loads/stores, and no special functions like trigonometry. I am very confused. Please help me with this. Thanks a lot.

PS1: please don’t reply that I don’t need to know so many details about the GPU chip. I need to know them because I have to.
PS2: please don’t ask me whether I have read some manual or other material. I have searched almost everywhere on the internet, but nowhere could I find a clear correspondence between warp and core.

Thank you again for your answer.

I think you are wrong when you say that one core can handle only one thread at a time. Execution of instructions is pipelined, and there may be a number of instructions in the same pipeline at a time, including instructions from different threads. So with 32 cores you may get more than 32 threads in flight, such as the 22 warps mentioned on page 68 of the CUDA C Programming Guide that you read.

Thank you vvolkov.

But two problems:
(1) By “pipeline”, do you mean that a core executes a thread for a small interval of time, then switches to another thread and runs it for another small interval, then switches again, so that the overall effect is that one core appears to run multiple threads? But even so, at any particular instant only one thread is running on one core, and in total 32 threads are running on the 32 cores (i.e., on the single SM). To put it another way, is it the same as a single-core CPU running many applications, like MS Word, Internet Explorer, and Visual Studio, on Windows 7?

(2) If my understanding in (1) is right, I will have a lot more questions, but let’s settle (1) first. Thank you.

PS: I didn’t find anything relevant to my question on page 68 of the “CUDA C Programming Guide”, although I did see the “22 warps” on that page.

No, (1) is not what pipelining is.

Suppose you have a circuit that takes 1 ns before a change in the input results in the corresponding change in the output. You can run it at 1 GHz. Pipelining is a trick used to increase the clock rate: you find how far the signal propagates in the first 0.5 ns and latch it into some storage cell. Now, after the first 0.5 ns expires, you can submit a new input to the first half of the circuit without interfering with the previous signal propagating through the second half of the circuit, so the circuit can be clocked at 2 GHz. You’ll have two operations in the pipeline. No switching.
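To make that concrete, here is a toy host-side model (purely illustrative, not how the hardware is programmed): on every clock the second stage finishes the operation that the first stage latched on the previous clock, while the first stage accepts a new operation, so two operations are in flight at once and nothing ever "switches".

#include <stdio.h>

// Toy model of a 2-stage pipeline: stage 1 is the first half of the circuit,
// stage 2 the second half. The integers just stand in for operations.
int main(void)
{
    const int ops[] = {10, 20, 30, 40};           // four independent operations
    const int n = sizeof(ops) / sizeof(ops[0]);
    int stage1 = -1, stage2 = -1;                 // -1 means the stage is empty

    for (int clock = 0; clock < n + 1; ++clock) {
        stage2 = stage1;                          // second half keeps working on the older op
        stage1 = (clock < n) ? ops[clock] : -1;   // first half latches a new op
        printf("clock %d: stage1=%d stage2=%d\n", clock, stage1, stage2);
    }
    return 0;
}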

Also check the Wikipedia article “Instruction pipeline”.

Thank you very much vvolkov.

So, thanks to pipelining, a core can execute multiple threads at the same time. Do the threads running simultaneously in a core all come from the same warp, or from different warps? If the latter is true, must these different warps come from the same block, or may they come from different blocks within the same SM? (I think they can’t come from blocks residing in different SMs.) Thank you again.

Ha-ha. We are getting into the details that NVIDIA is violently against publishing: one way is possibly more efficient than another, and knowing which is which is worth money in this highly competitive market.

Each CUDA chip has a certain compute capability version.

The compute capability version determines the “max resident threads” and “max resident warps” per multiprocessor.

See the documentation for these versions and their maximums.

“In flight” refers to these maximums/storage spaces on the chip.
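For reference, a minimal host-side query (error checking omitted) that prints these limits for device 0; all the field names below are from the standard cudaDeviceProp structure:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                  // device 0

    printf("compute capability       : %d.%d\n", prop.major, prop.minor);
    printf("warp size                : %d\n", prop.warpSize);
    printf("multiprocessors          : %d\n", prop.multiProcessorCount);
    printf("max resident threads/SM  : %d\n", prop.maxThreadsPerMultiProcessor);
    printf("=> max resident warps/SM : %d\n",
           prop.maxThreadsPerMultiProcessor / prop.warpSize);
    return 0;
}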

I’d rather say that “in flight” usually refers to warps that have an instruction in some stage of any of the pipelines.

zzzhhh: Don’t get confused by the term “core”. It is a misleading marketing term that Nvidia has coined to suggest a vast superiority of its GPUs over CPUs, as the number of “cores” on a GPU is so much greater than the number of cores on a CPU.
However, what Nvidia calls a “core” (at least where it has updated the documentation to use the term) is not in any way similar to a CPU core. Instead it refers to what everywhere else is called a floating point unit, or FPU, of which even CPU cores normally have more than one unless they are very old or low-power.

I think of the “CUDA core” as a kind of “deeply pipelined ALU”, since it also executes integer, logical, and branch instructions in addition to the floating point variety. Load/store and special functions are handled by units separate from the CUDA cores.

Even that is a bit of a lie, since the CUDA cores are arranged in groups (16 on Kepler) that can only accept threads from the same warp into the pipeline in a given clock cycle. That is to say: the scheduler cannot send thread 0 to CUDA core 0 and thread 37 to CUDA core 1 in the same clock cycle. So in reality, a CUDA core on Kepler is 1/16 of a pipelined vector ALU, and the vector ALU executes 32-wide SIMD instructions. The SMX on Kepler then has 12 of these vector ALUs, plus some other specialized vector units.

As an aside, I do hope that someone in the basement of NVIDIA R&D is thinking about what happens if you break the currently very rigid thread warp model (which basically derives from limitations of the underlying hardware SIMD implementation). There are a number of simulation applications that would be improved if, for example, there were a __syncthreads()-like execution barrier that would allow all the threads in a block to be reassigned to new warps based on some condition. Imagine being able to write something like this:

...
int choice = select_random_outcome(&rngState);

// This sorts all the threads in the block by the value of choice and 
// slices them in groups of warpSize to be packed into the warps that will be 
// used after this line
__warp_regroup(choice);

if (choice == ABSORB)
    do_absorb();
else if (choice == REFRACT)
    do_refract();
else if (choice == REFLECT)
    do_reflect();
...

Such a feature would mitigate an entire class of branch divergence problems in current code.

Interesting concept, but very expensive to implement, as all register content needs to be moved to its new location.

I don’t think there is much the hardware could do to speed up such operations. That is, I think you could do it just as well on current hardware already, if you implement the thread migration in software.

Yeah, warp regrouping would be most effective if there were a way to do it quickly. I guess the software implementation looks like sorting [choice, threadID] tuples, followed by something equivalent to a function call (pushing all the register state into memory) and a function return (popping all the register state out of memory, but mapped to new thread slots). A generic implementation would require a new PTX instruction, because only ptxas would know what registers need to be preserved.
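For what it's worth, here is a rough sketch of what such a software regrouping might look like inside one block, under the simplifying assumption that only a single float of per-thread state has to follow each thread to its new slot (real code would have to spill everything live). Everything in it (regroup_and_branch, the do_* stubs, BLOCK) is hypothetical illustration, not an existing CUDA feature:

// Hypothetical sketch: regroup the threads of a block by 'choice' with a small
// counting sort over the three outcomes, so that after regrouping, consecutive
// threads - and hence whole warps - mostly take the same branch.
// Assumes blockDim.x == BLOCK and a 1D block.

#define BLOCK 256
enum { ABSORB = 0, REFRACT = 1, REFLECT = 2, NUM_CHOICES = 3 };

__device__ void do_absorb (float s) { /* ... */ }
__device__ void do_refract(float s) { /* ... */ }
__device__ void do_reflect(float s) { /* ... */ }

__device__ void regroup_and_branch(int choice, float state)
{
    __shared__ int   count[NUM_CHOICES];    // how many threads picked each outcome
    __shared__ int   base[NUM_CHOICES];     // first slot of each outcome's range
    __shared__ int   newChoice[BLOCK];      // choices permuted into the new order
    __shared__ float newState[BLOCK];       // state permuted into the new order

    int tid = threadIdx.x;
    if (tid < NUM_CHOICES) count[tid] = 0;
    __syncthreads();

    atomicAdd(&count[choice], 1);           // histogram of the outcomes
    __syncthreads();

    if (tid == 0) {                         // tiny exclusive prefix sum over 3 bins
        base[0] = 0;
        base[1] = count[0];
        base[2] = count[0] + count[1];
    }
    __syncthreads();

    int slot = atomicAdd(&base[choice], 1); // claim a slot in my outcome's range
    newChoice[slot] = choice;               // move the (tiny) thread context there
    newState[slot]  = state;
    __syncthreads();

    // Thread tid now takes over whatever landed in slot tid, so the threads of a
    // warp (mostly) agree on the branch they are about to take.
    choice = newChoice[tid];
    state  = newState[tid];

    if      (choice == ABSORB)  do_absorb(state);
    else if (choice == REFRACT) do_refract(state);
    else                        do_reflect(state);
}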

Another option would be to piggy-back this feature on whatever support there is for preemption on compute capability 3.5 devices, which is used to implement dynamic parallelism. A warp regrouping operation would be a suspend and restore of a block, but with the restoration assignments permuted.

This is something like the COMPRESS instruction on vector processors - see for example http://bitsavers.trailing-edge.com/pdf/cdc/cyber/cyber_200/Programming_Methodology_for_CDC_CYBER_205_Vector_Processor_Aug80.pdf. You give it a data vector and a bit vector marking the data elements you care about, and it returns a new data vector with the selected elements placed next to each other. This way you may avoid moving entire thread contexts and move only the data involved.
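The closest thing you can build in CUDA itself is probably a warp-level compaction out of ballot and population count. A minimal sketch, with purely illustrative names (warp_compress, out, outCount), using the *_sync intrinsics of CUDA 9 and later:

// Warp-level analogue of COMPRESS: every lane has a value and a flag saying
// whether to keep it; kept values end up packed next to each other in 'out',
// with 'outCount' tracking how many have been written so far.
__device__ void warp_compress(float value, bool keep, float *out, int *outCount)
{
    const unsigned FULL = 0xffffffffu;
    int lane = threadIdx.x & 31;                            // lane index within the warp

    unsigned ballot = __ballot_sync(FULL, keep);            // bit i set if lane i keeps its value
    int myOffset    = __popc(ballot & ((1u << lane) - 1));  // kept lanes below me
    int total       = __popc(ballot);                       // kept lanes in the whole warp

    int base = 0;
    if (lane == 0)                                          // one lane reserves space for all
        base = atomicAdd(outCount, total);
    base = __shfl_sync(FULL, base, 0);                      // broadcast the base index

    if (keep)
        out[base + myOffset] = value;                       // selected elements packed together
}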

Migration in software has to use shared memory, which is very slow on Kepler. They are now introducing some non-shared-memory communication mechanisms, such as the in-warp shuffle. Here we are basically talking about shuffling data across a whole thread block.

However, there is a danger of making it too difficult for the programmer to use such new performance features. We have to keep it simple if we want wider adoption. Think of GPU branching and memory gather: they could also be implemented in software using predication masks, vector loads, and shuffles, but, possibly, it is good that we have them in hardware. Maybe devices such as __warp_regroup are a good line of thinking…

@Vasily - why do you say shared memory is very slow on Kepler? Can you please explain that in more detail?

Thanks.

@eyalhir74 - Shared memory on Kepler is certainly slower per CUDA core than on Fermi. According to the GK110 white paper, the shared memory bandwidth has been increased to 256 bytes per clock cycle (for 64-bit reads), which it says is double the bandwidth of Fermi. However, that same bandwidth needs to service 6 times more CUDA cores than Fermi (cc 2.0, that is): roughly 256/192 ≈ 1.3 bytes per clock per core on a GK110 SMX, versus 128/32 = 4 bytes per clock per core on a Fermi SM. So any section of code that is limited by shared memory bandwidth (as this imaginary warp regrouping operation would be) will run slower on Kepler than on Fermi.

Thanks. However, this would also hurt the performance of a whole lot of other code (I actually saw this on the K10 with some SDK samples and image-processing code), such as reductions, convolutions, etc.
What I want to say is that it probably affects a whole lot of other code, so why was this design decision taken for Kepler? What's the benefit? Or is it that the overall performance should still be higher?

Thanks
Eyal

Well, higher bandwidth to shared memory would cost more in transistors and power. I suspect that someone at NVIDIA analyzed a bunch of CUDA applications and concluded that many kernels have a mix of shared memory loads and other instructions (address calculations, actual math, etc), so increasing the number of CUDA cores more than the shared memory bandwidth would be a reasonable tradeoff.

This also helps explain the motivation for the shuffle instruction for Kepler. It takes one of the use cases of shared memory (exchanging values between threads) and handles it inside the CUDA cores, reducing some pressure on the limited shared memory bandwidth.
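As an illustration of that use case, a standard warp-wide sum done entirely with shuffles instead of shared memory might look like this (sketched with the *_sync intrinsic of current toolkits; on Kepler-era CUDA the equivalent was __shfl_down, without the mask argument):

// Sum a value across the 32 threads of a warp without touching shared memory.
__device__ float warp_sum(float v)
{
    const unsigned FULL = 0xffffffffu;
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(FULL, v, offset);  // add the value held by lane (lane + offset)
    return v;                                    // lane 0 ends up with the warp-wide sum
}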

Warp Assignment

  • one thread is assigned to one SP
  • SM has 8 SPs
  • warp has 32 threads
  • so a warp is executed in four steps

It looks to me like the warp size of 32 has been kept for compatibility reasons.

Those slides are even more outdated than this thread!

Only Compute Capability 1.x devices had 8 SPs per SM, and CC 1.x hasn’t been supported by CUDA for quite a while now.

But I guess the thread opener will not be confused by that anyway after more than five years.