Multiprocessors or CUDA Cores

Hi! I’m totally new to CUDA.
I would like to know what the differences are between multiprocessors and CUDA cores. I’m using a GTS 450 with 4 multiprocessors and 192 CUDA cores. In this architecture, how many thread processors can be used?

I understand that blocks are divided into warps (32 threads) and that the hardware switches between warps to execute them in parallel. I would like to know how to get the best performance, and so how to choose good launch parameters for a kernel function (KernelFunc<<<param1,param2>>>).

I know that I can’t use more than 1024 threads in a block (compute capability 2.1) and that a grid can’t contain more than 65535 blocks per dimension. Is it useful to use all the available threads? How should I adapt these parameters?

I’ve found some examples with:
int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
Is that enough to get good performance?
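For reference, the blocksPerGrid formula in that example is just an integer division that rounds up. A minimal host-side sketch of the arithmetic follows; the helper names `blocksForElements` and `idleThreads` are made up for illustration and are not part of the CUDA API.

```cpp
#include <cassert>

// Hypothetical helper: number of blocks needed so that
// blocks * threadsPerBlock >= n (integer division, rounded up).
int blocksForElements(int n, int threadsPerBlock) {
    assert(n >= 0 && threadsPerBlock > 0);
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Hypothetical helper: threads launched beyond element n - 1.
// A kernel launched this way needs a bounds check such as
// `if (i < n)` so these extra threads do nothing.
int idleThreads(int n, int threadsPerBlock) {
    return blocksForElements(n, threadsPerBlock) * threadsPerBlock - n;
}
```

With N = 1000 and 256 threads per block this gives 4 blocks and 24 idle threads in the last block, which is why kernels launched this way usually guard their work with a bounds check.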

I hope you can help me a little. I’m still a student and I’m really interested in CUDA, so if you have any information, do not hesitate!

There is a nice book called “CUDA by Example” by Jason Sanders and Edward Kandrot. I suggest you read that book first. If you are not ready to buy it, then I suggest you read the 2nd and 3rd chapters of the CUDA programming guide before starting to code.

In fact, I have already read the first chapters of this book, though not too deeply of course, and I’ve also written the basic programs explained in the chapters on CUDA programming. But I am more concerned about adapting the parameters to my architecture.

The optimal number of threads in a block depends on your kernel, but between 128 and 256 is usually a good starting point. So it’s a good idea to make it easy to change so you can experiment. You can have up to 65535*65535 blocks in a grid, i.e. a 2-dimensional grid.

It depends on the application: sometimes having millions of threads is the way to go, sometimes having each thread do multiple parts of the problem is better. It is generally best to have at least several thousand threads, but if your application suits having several million then that’s not a worry.

Thanks for your reply! OK, so it depends on my application.

I’ve got another question. I’m quite confused by some of the names. I know that my GPU has 4 multiprocessors, but how many blocks and threads does a multiprocessor hold? I’ve checked with cudaGetDeviceProperties and I found 1024 threads max per block and 65535 blocks per dimension per grid, but do these numbers map onto the architecture? What is a grid in terms of hardware, or what is the link between the grid and the multiprocessor? It may sound simple but I don’t really get it and books don’t say much about it, so please help me.

The whole thread/block/grid abstraction is a bit confusing at first, but keep at it and it’ll click soon!

I think of blocks and grids as more of a conceptual abstraction, which allows you to logically divide up your work so the hardware can come along and conveniently “pick up” chunks (blocks) of work to execute. Based on how many threads you define in your block, the hardware knows exactly how many resources (memory, etc.) need to be free before it can come and grab another “bag of work”.

If your card has 4 Streaming Multiprocessors (SMs), then it can theoretically work on 32 blocks simultaneously (hardware max of 8 blocks per SM). With 1024-thread blocks that would nominally be 32768 threads, although a compute 2.1 SM holds at most 1536 resident threads, so blocks that large cannot all be resident at once. So if you have a very large problem that you divide into a grid of 3200 blocks (remember, this is a conceptual division; not that many blocks will be in flight at any one time), the hardware will load up 32 of the 3200 blocks and start work. As soon as one of those 32 blocks gets done, it picks another from the remaining 3168 blocks. This continues until all 3200 blocks have been worked on. If each block took an identical amount of time to run, then theoretically each SP would swap blocks exactly a hundred times (3200 / 32). In reality, if one of the blocks takes ages to run (e.g. the data in it required many, many iterations for local convergence), then it might occupy its SP while the other 31 SPs went about swapping blocks in and out to get through all the 3200 blocks needing to be run.
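The arithmetic behind this scheduling picture can be checked with a small host-side sketch. It assumes the idealised case described here (every block takes the same time and 8 blocks fit on each SM); `concurrentBlocks` and `wavesNeeded` are hypothetical helper names, not CUDA functions.

```cpp
#include <cassert>

// Hypothetical helper: blocks that can be resident on the whole GPU
// at once, assuming resources allow the per-SM maximum.
int concurrentBlocks(int numSMs, int maxBlocksPerSM) {
    assert(numSMs > 0 && maxBlocksPerSM > 0);
    return numSMs * maxBlocksPerSM;
}

// Hypothetical helper: rounds ("waves") of blocks needed to drain the
// grid if all blocks take the same time. Rounds up, because a partly
// filled final wave still takes a full round.
int wavesNeeded(int totalBlocks, int concurrent) {
    assert(totalBlocks >= 0 && concurrent > 0);
    return (totalBlocks + concurrent - 1) / concurrent;
}
```

For 4 SMs with 8 blocks each, `concurrentBlocks(4, 8)` is 32, and `wavesNeeded(3200, 32)` is 100, so each of the 32 slots processes 100 blocks one after another.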

However, 32 concurrent blocks is the theoretical max, and it assumes that there are sufficient resources in each SM to have 8 blocks in flight. The actual number of blocks in flight on each SM will depend on how hungry each block is for resources such as shared memory and registers. This is where those device properties come into play: shared mem per SM, registers per SM, warps per SM, etc. The hardware basically fills each SM with more blocks (up to a max of 8) until putting one more block in would exceed one or more types of resource, be it registers, shared mem, or warps (this is where the Occupancy Calculator can tell you what the limiting factor is).
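A rough sketch of that "fill until a resource runs out" rule is given below. It is a simplification and not what the Occupancy Calculator actually computes: it ignores the granularity with which the hardware allocates registers and shared memory. The struct and function names are invented for illustration, and the limits in `computeCapability21()` are the per-SM figures for compute capability 2.1 (8 blocks, 1536 threads, 32768 registers, 48 KB shared memory).

```cpp
#include <algorithm>
#include <cassert>

// Per-SM resource limits (hypothetical struct, for illustration only).
struct SMLimits {
    int maxBlocks;       // resident blocks per SM
    int maxThreads;      // resident threads per SM
    int registers;       // 32-bit registers per SM
    int sharedMemBytes;  // shared memory per SM
};

// What one block of the kernel consumes (hypothetical struct).
struct BlockUsage {
    int threads;         // threads per block
    int regsPerThread;   // registers used by each thread
    int sharedMemBytes;  // shared memory used by the block
};

// Assumed limits for a compute capability 2.1 SM.
SMLimits computeCapability21() {
    return {8, 1536, 32768, 48 * 1024};
}

// Blocks that fit on one SM: the smallest of the per-resource limits.
// Simplified: real hardware rounds allocations up to fixed granularities.
int residentBlocksPerSM(const SMLimits& sm, const BlockUsage& b) {
    assert(b.threads > 0 && b.regsPerThread > 0 && b.sharedMemBytes >= 0);
    int byThreads = sm.maxThreads / b.threads;
    int byRegs    = sm.registers / (b.threads * b.regsPerThread);
    int byShared  = (b.sharedMemBytes == 0)
                        ? sm.maxBlocks
                        : sm.sharedMemBytes / b.sharedMemBytes;
    return std::min({sm.maxBlocks, byThreads, byRegs, byShared});
}
```

For example, blocks of 256 threads using 20 registers per thread and no shared memory are limited to 6 per SM by both the thread count and the registers, not to the hardware maximum of 8.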

If you define your blocks such that even one block consumes more of a certain resource (threads, registers, shared mem) than a single SM can sustain, then your kernel will fail to launch.


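    if you assign 1025 threads per block, it will not launch (the limit is 1024 threads per block)

    if you assign 1024 threads, and each thread uses 40 registers (and these don’t spill), it will not launch (that is 40960 registers, more than the 32768 available per SM on compute 2.x)

    if you assign 50KB of shared mem, it will not launch (at most 48KB is available per block)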

The system will NOT automatically cut down the number of threads in a block from what you have defined in order to become launchable. I.e. if you ask for 1025 threads, it will not launch with 1024 in SM0, and “carry one” over to SM1, fill that up with 1023 from the next block, “carry two” over to SM2 and so on… It will just fail to launch.
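The three failure cases above can be written as a simple check. This is only a sketch of the rule under the compute capability 2.x limits (1024 threads per block, 32768 registers per SM, 48 KB shared memory per block); `LaunchLimits`, `computeCapability2x` and `canLaunch` are hypothetical names, and the real runtime reports such failures as a launch error rather than through any function like this.

```cpp
#include <cassert>

// Limits a single block must respect (hypothetical struct).
struct LaunchLimits {
    int maxThreadsPerBlock;
    int registersPerSM;
    int sharedMemPerBlockBytes;
};

// Assumed limits for compute capability 2.x.
LaunchLimits computeCapability2x() {
    return {1024, 32768, 48 * 1024};
}

// True if one block fits within every limit; nothing is trimmed or
// carried over to another SM when it does not.
bool canLaunch(const LaunchLimits& lim, int threadsPerBlock,
               int regsPerThread, int sharedMemBytes) {
    assert(threadsPerBlock > 0 && regsPerThread >= 0 && sharedMemBytes >= 0);
    if (threadsPerBlock > lim.maxThreadsPerBlock) return false;
    if (threadsPerBlock * regsPerThread > lim.registersPerSM) return false;
    if (sharedMemBytes > lim.sharedMemPerBlockBytes) return false;
    return true;
}
```

With these limits, 1024 threads at 32 registers each just fits (exactly 32768 registers), while 1024 threads at 40 registers each does not.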

Does that help?

Yeah, wonderful, I get it!!! Your comment is something that should be added to the programming guide, very useful!! Thanks!! Another thing: what is a CUDA core?

I’ve found that my card is equipped with 4 multiprocessors with 48 units per SM, so it means that I have 192 stream processors. So if I follow your advice, I have 1536 blocks and 1572864 threads which can work simultaneously, if it’s used well of course. I think stream processors and CUDA cores are the same thing, aren’t they?

“Stream processors”, “multiprocessors”, “streaming multiprocessors” and “SMs” are the same thing, CUDA cores are different. So if your card has 4 multiprocessors (aka SMs) and is of compute capability 2.1, it will have 192 CUDA cores and be able to run at most 32 blocks at the same time.

Hi tera,

The nomenclature may have changed since I read it in a text book a while ago, but I was referring to SP = scalar processor, SM = streaming multiprocessor.

I have a C1060 which has 8 SPs in each SM (this is compute 1.3).

You are right in saying if the “S” is used for “streaming” then SP could = stream processor = streaming multiprocessor = SM. This is terribly confusing!

So when I used ‘SP’ above, I meant the scalar (thread) processors within each SM, which might be the equivalent of ‘cores’.

Though I’m not sure what the other 40 of 48 ‘cores’ per compute 2.1 SM are doing if it only picks up a max of 8 blocks?

Or for that matter what the other 24 of 32 ‘cores’ per compute 2.0 SM are doing.

It made sense for a 1.3 SM, which is how I learned it, since it was a 1-to-1 mapping with 8 SPs <–> 8 blocks per SM.

@Frstdies: In fact the two textbooks I used were ‘CUDA By Example’ as mentioned above, and also Kirk & Hwu, ‘Programming Massively Parallel Processors: A Hands-on Approach’.

If you jump on iTunes U you can also pick up either the Stanford GPU course by Jared Hoberock and David Tarjan (CS193G), or the Harvard Extension School GPU course by Hanspeter Pfister (CSCI E-292), or both :) I think they both deal with compute 1.3, which is what I used anyway, but the lessons are really fantastic.

These few lines give a quick review of the hardware.
Hardware Execution
1] The GPU can execute one or more kernel grids (more than one at a time on Fermi only)
2] An SM can execute one or more thread blocks (max can be 8 as of now)
3] CUDA cores and other execution units in the SM execute threads (compute capability 1.x takes 4 cycles to issue a single instruction for a warp, i.e. 32 threads, and 2.0 takes 2 cycles)
4] The SM executes threads in groups of 32 threads called a warp. These warps can come from different blocks, since an SM can host several blocks (i.e. up to 24/32 warps on devices that support at most 768/1024 threads per SM)

The available textbooks talk more about the old architectures than about Fermi, so it is better to read this whitepaper. I found it more useful for understanding the Fermi hardware, and it also cleared up some of my doubts about the older hardware.

Now I’m a bit confused… In CUDA C programming my graphics card is described as having 192 CUDA cores, but on this site people describe it as having 192 stream processors. Tera, you told me that CUDA cores and stream processors must be different, so is it an error on this site? What is the real purpose of the CUDA cores? I would have thought that if my graphics card has a compute capability of 2.1, more than 32 blocks would be running at the same time, so that depends on the number of multiprocessors and not on the CUDA cores. If a GPU has compute capability 2.1, it means the GPU has 48 CUDA cores per multiprocessor. At the beginning I thought that each CUDA core would run a warp (32 threads) simultaneously, but I must be wrong, mustn’t I?

OK, so in a few words: each of those 4 multiprocessors has at most 8 blocks running simultaneously, so 32 blocks in total. Multiprocessors are composed of 48 CUDA cores which execute threads. The SM executes threads in groups of 32 threads.

How many groups of threads are running simultaneously per SM?
How many threads are running simultaneously in each block (32 threads for a warp, or is it 1024)?
How are threads distributed over the CUDA cores? I’m asking this question because I don’t see the point of having that many CUDA cores per SM (48) and using just 8 blocks.

Please tell me if I’m wrong. I’m trying hard to understand all of this, and I’m still a student.

I have to thank you all for helping me!!

A compute 2.1 SM can host up to 1536 threads, which equates to 48 warps of 32 threads.

I’m not sure what the rest of the ‘cores’ are doing when 8 blocks are being worked on. It probably has to do with scheduling, or perhaps the compiler can get other cores to work on different parts of the kernel simultaneously? This is just speculation though.

I’ve never read too much into it, as working with compute 1.3 devices, I always have a 1-to-1 mapping of blocks to SPs (scalar processors).

With your comments, I get an overall understanding of the architecture. Thank you !!

Does my graphics card belong to the Fermi class? The whitepaper provided by Praveen Kulkarni was very interesting. Because of this whitepaper, I get the impression that my GPU is a Fermi, even if it doesn’t have 64 KB of shared memory but just 48 KB.

This picture is very close to a Fermi, don’t you think?

Yes, I believe what you described is Fermi, since GT200 has only 8 SPs per SM.

I have not read the whitepaper, but what you describe as 64KB shared mem is configurable as either 48KB shared mem + 16KB L1 cache, or 16KB shared mem + 48KB L1 cache. You can read about this in Appendix F.4 of the Programming Guide. I guess they are physically the same memory, but since you cannot devote all 64KB to shared mem, the limit is quoted as 48KB.

Sorry, SP or stream processor or scalar processor indeed belongs on the other side and was what Nvidia now calls “core”. So the (in-)equality should have read

“multiprocessors”, “streaming multiprocessors” and “SMs” are one thing, “stream processors”, “scalar processors”, “SPs” or “CUDA cores” are another thing.

No, there is no one-to-one mapping between cores/SPs and blocks, and there never was. The appearance of the number 8 in both cases is a pure coincidence.

The 8 cores/SPs of a 1.x SM take 4 cycles to issue an instruction for one warp of a block (32 threads).

On compute capability 2.x, the cores/SPs of an SM are grouped as 2 (compute 2.0) or 3 (compute 2.1) sets of 16 which take 2 cycles to issue 2 (compute 2.0) or 3 (compute 2.1) instructions from 2 warps.

It’s all described in section 5.2.3 and appendices F.3.1 and F.4.1 of the Programming Guide.
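The cycle counts quoted here follow from dividing the warp size by the number of cores that work on one warp instruction. A small sketch of that arithmetic follows, together with the warps-per-SM figure mentioned earlier in the thread; the helper names are made up, and the model ignores everything else the scheduler does.

```cpp
#include <cassert>

const int kWarpSize = 32;  // threads per warp

// Hypothetical helper: cycles for a group of cores to issue one
// instruction for all 32 threads of a warp (8 cores on compute 1.x,
// a set of 16 cores on compute 2.x).
int cyclesPerWarpInstruction(int coresPerGroup) {
    assert(coresPerGroup > 0 && kWarpSize % coresPerGroup == 0);
    return kWarpSize / coresPerGroup;
}

// Hypothetical helper: resident warps an SM can hold, given its
// maximum number of resident threads.
int warpsPerSM(int maxThreadsPerSM) {
    assert(maxThreadsPerSM >= 0);
    return maxThreadsPerSM / kWarpSize;
}
```

So 8 cores give 4 cycles per warp instruction, a set of 16 cores gives 2, and an SM that holds 1536 threads holds 48 warps.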

So it means that I have 512 threads running at the same time per SM. I thought that I would need at least 1024 threads per SM if I want all the threads in a block (1024) to be able to access shared memory?