The CUDA C Programming Guide is not clear on what CUDA cores are (and how many threads they execute at a time).

Hello,

The “hardware” picture(s) inside the CUDA C Programming Guide are too “virtual”, too “vague”, too “abstract”; not “deep enough”.

It only shows “cores”…

But what is a “core”? Apparently a “streaming multiprocessor” is what is meant by “core”, but this is unclear in the picture.

Also, a “streaming multiprocessor” in reality seems to have multiple “CUDA cores”. This is also not clear from the guide.

Also, a “CUDA core” can only execute one thread at a time. This is also not clear from the guide.

“Warping” is apparently a “CUDA core grouping” technology for more efficiency; this is somewhat clear, but it could be better.

I can understand that the guide tries to remain “general”, because the architectures might change in the future and every GPU could be slightly different.

But giving some examples of “compute 1.x” hardware and “compute 2.x” hardware would be much better.

There are probably presentations on the internet which do show “sub cores” inside “cores”.

What I call a “sub core” is a CUDA core.

Perhaps it could also be called a “thread core”, which simply means it executes one thread at a time.

What I call a “core” is a streaming multiprocessor (a core can have multiple sub cores).

I think the guide needs to be a bit clearer about how the hardware is actually structured, because I have seen many postings on this forum of people getting confused and drawing wrong conclusions.

So that’s why it’s extra important that the guide be totally clear on how the hardware is structured, to take away these confusions and misconceptions ;)

Bye,
Skybuck.


I agree with that. This easily makes one think NVIDIA is hiding something. Someone said “core” is a confusing term, even though it helped make the product a huge market success. It also makes CUDA hard for programmers, especially beginners, to understand. Anyway, NVIDIA did do a lot to make it work and make it better, but there is still a lot to improve.

Besides “cores”, memory is also hard to handle, especially on compute capability 1.x devices. As far as I know, there is no reliable way to dynamically allocate device memory inside device functions on 1.x hardware. Accordingly, in some cases, computation accuracy has to be reduced.

I think there are two reasons for the confusion:

  1. Marketing (in the good and bad sense): Using the term “core” makes people think about CPU cores, and so telling people that you have 512 cores gives them some idea about the number of simultaneous calculations you can do. This is partially good marketing, because it means you can get people’s attention without having to bore them by reading Chapter 2 of the programming guide to them. Of course, the term “core” has lots of baggage from the CPU world, and much programming confusion comes from incorrectly applying those concepts to CUDA hardware.

  2. Evolution in the way that NVIDIA talks about the CUDA architecture. Over the past 4 years, the documentation has slowly changed in the way that the hardware is described. In the beginning, there were SMs (“streaming multiprocessors”) and SPs (“streaming processors”). “Processor” carries much of the same baggage and misconceptions as the term “CUDA core,” but there was also a lot more discussion about how these things related to each other. It is worth going back and reading some of the AnandTech articles about the G80 architecture, which I think helped solidify my understanding of how all the parts related. Things have changed since then, but it provides a good foundation.

For several years, the relationship of SPs to SMs was fixed, since all compute capability 1.x devices basically worked like the G80 description, with some minor changes in memory coalescing, atomics, and double precision. Fermi reorganized everything, and compute capability 2.0 and 2.1 are quite different even from each other. In light of this, there has been a general push in the documentation to deemphasize CUDA cores and to talk about multiprocessors and throughput in terms of warps. I think this is the right idea, because it is clear that people tend to obsess over the CUDA core details and make themselves crazy, without gaining anything in terms of practical programming knowledge. You’ll find that CUDA cores are no longer referenced in very many places in the CUDA Programming Guide, and this might be why the device properties structure doesn’t even bother to include an entry for the number of CUDA cores per multiprocessor.

If you forget about CUDA cores, you can simply think of compute capability 1.x devices as composed of multiprocessors that each complete a warp instruction every 4 clocks. A compute capability 2.0 multiprocessor can complete two warp instructions every two clocks, as long as each instruction is from a different warp. A compute capability 2.1 multiprocessor is the same as 2.0, plus it can dual issue a second, independent instruction from one of the warps, giving it a peak throughput of 3 warp instructions every two clocks.

I do have one request though: As long as NVIDIA is going to call the 32 ALUs inside a Fermi multiprocessor “CUDA cores”, PLEASE PLEASE do not start using the term “core” in random conversations to refer to multiprocessors, even if that is technically more correct. We don’t need to add to the terminology confusion by using the massively overloaded term “core” to refer to two different layers of hardware. That will only make things worse. :)

For a programmer it’s very important to understand what the basic unit of computation is.

Apparently, in CUDA 4.0 this is a “CUDA core”.

If that term is confusing, then name it something else; perhaps “thread core” would be nice.

This means the tiniest possible execution unit, one which can execute a single thread.

Also, I am just getting used to “CUDA core”; I think it has a nice ring/sound to it… because that’s what CUDA is… it can execute a whole bunch of tiny little CUDA threads on its CUDA cores.

The problem/confusion is probably with what a “multiprocessor” is.

Perhaps a better term for it would be a “super CUDA core” or a “multi CUDA core”.

Just to illustrate that it’s something bigger: a bigger part which contains the smaller parts.

Or even simpler and more consistent:

  1. A “CUDA grid processor”: the GPU itself.

  2. A “CUDA block processor”: the streaming multiprocessor.

  3. A “CUDA thread processor”: the CUDA core (scalar processor).