Question about Fermi 2.1 architecture of SM(s) of 48 cores and warps of 32 threads (from a Newbie)

ChuckSommer · December 6, 2015, 6:44am

Hi Gang,

Quick introduction:
I am just learning about CUDA C programming.
I am about 1/3 through “Intro to Parallel Programming” from www.udacity.com
Am a new owner (about 3 months) of a GeForce GTX 980 ti (happily Folding@Home)
I have also been looking at the NVIDIA Programming Guide (ToolKit 7.5)
I have been reading “Professional CUDA(R) C Programming” John Cheng, Max Grossman, Ty McKercher.
And, I just spent a few hours reading:
“Optimizing Parallel Reduction In CUDA” by Mark Harris NVIDIA Developer Technology (see link to PDF)
https://docs.nvidia.com/cuda/samples/6_Advanced/reduction/doc/reduction.pdf

I have been reading about CUDA Warp(s) but I am a little confused about something.
In Fermi (2.0) Kepler(3.x) and Maxwell (5.x) the SM(s) each have a multiple of 32 CUDA Cores:
Fermi (2.0) = 32 Cores/SM
Kepler (3.x) = 192 Cores/SM
Maxwell (5.x) = 128 Cores/SM
However Fermi (2.1) architecture has 48 Cores/SM.
As a Warp is 32 threads and the programmer can count on the threads within a warp to have consecutive
thread id(s) in groups of 32, how are the warp(s) mapped to SMs of 48 cores?

Mark Harris’s paper talks about “instructions are SIMD synchronous within a warp” on page 21.

The warp is a basic building block in the CUDA execution model.

I know this question is somewhat academic as I probably won’t get access to a Fermi 2.1 board today
unless I purchase a “GeForce GT 610” just to play.

Thanks Chuck

p.s. I would like to have been able to try to search through this Forum for topics that might cover this question, but there is no tool and over 30k topics under CUDA programming and performance.

RoBiK · December 6, 2015, 1:19pm

The Fermi architecture uses two different clocks: the shader clock (hot clock) runs at double the frequency of the regular clock. Each warp is executed over two hot clock cycles, one half-warp (16 threads) per cycle. The 48 floating point compute units run at the hot clock frequency, so they can service 3 half-warps per hot clock, or put another way, three full warps over the span of two hot clocks (one regular clock).
I hope you can see how the math works out just fine :)
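
For anyone who wants the arithmetic spelled out, here is a minimal host-side C++ sketch. It assumes the simplified model described above (48 cores treated as three groups of 16, each taking one half-warp per hot clock). The constant and function names are made up for illustration, and the real warp scheduler is more involved than this.

```cpp
// Back-of-the-envelope model of the explanation above, not a hardware spec.
// Assumption: each group of 16 cores takes one half-warp per hot-clock cycle.
constexpr int kWarpSize = 32;                 // threads per warp
constexpr int kHalfWarp = kWarpSize / 2;      // threads issued per hot clock
constexpr int kHotClocksPerRegularClock = 2;  // shader clock is 2x the regular clock

// Half-warps the SM can service in one hot-clock cycle.
constexpr int halfWarpsPerHotClock(int coresPerSM) {
    return coresPerSM / kHalfWarp;
}

// Full warps serviced per regular clock (two hot clocks).
constexpr int warpsPerRegularClock(int coresPerSM) {
    return halfWarpsPerHotClock(coresPerSM) * kHotClocksPerRegularClock
           * kHalfWarp / kWarpSize;
}

// Compute capability 2.1: 48 cores/SM -> 3 half-warps per hot clock,
// i.e. 3 warps per regular clock.
static_assert(halfWarpsPerHotClock(48) == 3, "48 cores = 3 groups of 16");
static_assert(warpsPerRegularClock(48) == 3, "3 warps per regular clock");

// Compute capability 2.0: 32 cores/SM -> 2 warps per regular clock.
static_assert(warpsPerRegularClock(32) == 2, "2 warps per regular clock");
```

The checks are compile-time only, so the snippet compiles cleanly only if the numbers agree with the model. It only covers peak arithmetic throughput and says nothing about how warps are actually picked for issue.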

ChuckSommer · December 6, 2015, 9:10pm

Hi RoBiK,

Thanks Much.

I was looking around late last night and stumbled onto
: NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf (I may have modified the file name)

And there in the section “SMX Processing Core Architecture” (page 9?) I saw the following
: Similar to GK104 SMX units, the cores within the new GK110 SMX units use the primary GPU clock
: rather than the 2x shader clock. Recall the 2x shader clock was introduced in the
: G80 Tesla‐architecture GPU and used in all subsequent Tesla‐ and Fermi‐architecture GPUs.

So your comment verified that I was correctly interpreting what I had found.
( Which I had until just now only speculated ).
Thus the Fermi CUDA Core is twice as powerful as I had thought.
And only 16 physical cores are required to run a 32-thread Warp.

Again Much thanks for making this clear, I may just run out and get that Fermi 2.1 Compute Capability
card just to play with.

Chuck
