Question about Fermi 2.1 architecture of SM(s) of 48 cores and warps of 32 threads (from a Newbie)

Hi Gang,

Quick introduction:
I am just learning about CUDA C programming.
I am about 1/3 through “Intro to Parallel Programming” from www.udacity.com
I am a new owner (about 3 months) of a GeForce GTX 980 Ti (happily Folding@Home)
I have also been looking at the NVIDIA Programming Guide (ToolKit 7.5)
I have been reading “Professional CUDA(R) C Programming” John Cheng, Max Grossman, Ty McKercher.
And, I just spent a few hours reading:
“Optimizing Parallel Reduction In CUDA” by Mark Harris NVIDIA Developer Technology (see link to PDF)
https://docs.nvidia.com/cuda/samples/6_Advanced/reduction/doc/reduction.pdf

I have been reading about CUDA Warp(s) but I am a little confused about something.
In Fermi (2.0), Kepler (3.x), and Maxwell (5.x), the SM(s) each have a multiple of 32 CUDA Cores:
Fermi (2.0) = 32 Cores/SM
Kepler (3.x) = 192 Cores/SM
Maxwell (5.x) = 128 Cores/SM
However Fermi (2.1) architecture has 48 Cores/SM.
As a Warp is 32 threads and the programmer can count on the threads within a warp to have consecutive
thread id(s) in groups of 32, how are the warp(s) mapped to SMs of 48 cores?
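
A minimal sketch of that grouping, in plain C so it runs on the host: threads in a block are linearized with x varying fastest, and each consecutive run of 32 linear ids forms one warp. The function names (linear_tid, warp_of, lane_of) are hypothetical helpers for illustration, not part of the CUDA API.

```c
#define WARP_SIZE 32  /* warp size on all architectures mentioned here */

/* Linear thread index within a block: x varies fastest, then y, then z.
   dimx and dimy stand in for blockDim.x and blockDim.y. */
unsigned linear_tid(unsigned tx, unsigned ty, unsigned tz,
                    unsigned dimx, unsigned dimy)
{
    return tx + ty * dimx + tz * dimx * dimy;
}

/* Which warp of the block a linear thread index falls in. */
unsigned warp_of(unsigned tid) { return tid / WARP_SIZE; }

/* Position of the thread inside its warp (0..31). */
unsigned lane_of(unsigned tid) { return tid % WARP_SIZE; }
```

For example, in a 16x16 block, thread (0, 2, 0) has linear id 32 and so is lane 0 of warp 1. Note that nothing in this grouping depends on the number of cores per SM; it is defined purely by thread index.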

Mark Harris’s paper talks about “instructions are SIMD synchronous within a warp” on page 21.

The warp is a basic building block in the CUDA execution model.

I know this question is somewhat academic as I probably won’t get access to a Fermi 2.1 board today
unless I purchase a “GeForce GT 610” just to play.

Thanks Chuck

p.s. I would like to have been able to try to search through this Forum for topics that might cover this question, but there is no tool and over 30k topics under CUDA programming and performance.

The Fermi architecture uses two different clocks: the shader clock (hot clock) runs at double the frequency of the regular clock. Each warp is executed over two hot-clock cycles, one half-warp (16 threads) per cycle. The 48 floating-point compute units run at the hot-clock frequency, so they can service 3 half-warps per hot-clock cycle or, put another way, three warps over the span of two hot-clock cycles.
I hope you can see how the math works out just fine :)
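
The arithmetic above can be sketched in a few lines of plain C. This only restates the simplified model from this reply (16 lanes issued per hot clock, one warp over two hot clocks); it ignores warp schedulers, special function units, and load/store units, and the function names are hypothetical.

```c
#define HALF_WARP 16  /* threads issued per hot-clock cycle in this model */
#define WARP_SIZE 32

/* Half-warps an SM can service in one hot-clock cycle. */
unsigned half_warps_per_hot_clock(unsigned cores_per_sm)
{
    return cores_per_sm / HALF_WARP;
}

/* Full warps completed over two hot-clock cycles
   (two hot clocks span one regular clock cycle). */
unsigned warps_per_two_hot_clocks(unsigned cores_per_sm)
{
    return (2 * cores_per_sm) / WARP_SIZE;
}
```

With cores_per_sm = 48 (Fermi 2.1) both functions return 3; with cores_per_sm = 32 (Fermi 2.0) both return 2.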

Hi RoBiK,

Thanks Much.

I was looking around late last night and stumbled onto
: NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf (I may have modified the file name)

And there in the section “SMX Processing Core Architecture” (page 9?) I saw the following
: Similar to GK104 SMX units, the cores within the new GK110 SMX units use the primary GPU clock
: rather than the 2x shader clock. Recall the 2x shader clock was introduced in the
: G80 Tesla‐architecture GPU and used in all subsequent Tesla‐ and Fermi‐architecture GPUs.

So your comment verified that I was correctly interpreting what I had found.
(Which, until just now, I had only speculated.)
Thus a Fermi CUDA Core delivers twice the throughput per regular clock cycle that I had thought.
And only 16 physical cores are required to run a 32-thread Warp, over two hot-clock cycles.

Again, much thanks for making this clear. I may just run out and get that Compute Capability 2.1
Fermi card just to play with.

Chuck