Hi Gang,
Quick introduction:
I am just learning about CUDA C programming.
I am about 1/3 of the way through “Intro to Parallel Programming” from www.udacity.com.
I am a new owner (about 3 months) of a GeForce GTX 980 Ti (happily Folding@Home).
I have also been looking at the NVIDIA Programming Guide (CUDA Toolkit 7.5).
I have been reading “Professional CUDA C Programming” by John Cheng, Max Grossman, and Ty McKercher.
And I just spent a few hours reading “Optimizing Parallel Reduction in CUDA” by Mark Harris, NVIDIA Developer Technology (see link to PDF):
https://docs.nvidia.com/cuda/samples/6_Advanced/reduction/doc/reduction.pdf
I have been reading about CUDA warps, but I am a little confused about something.
In Fermi (2.0), Kepler (3.x), and Maxwell (5.x), the SMs each have a multiple of 32 CUDA Cores:
Fermi (2.0) = 32 Cores/SM
Kepler (3.x) = 192 Cores/SM
Maxwell (5.x) = 128 Cores/SM
However, the Fermi (2.1) architecture has 48 Cores/SM, which is not a multiple of 32.
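For reference, here is the quick check I run on my own board (a minimal sketch; the runtime reports the compute capability and SM count, while the Cores/SM figure has to come from a table like the one above, not from the API):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    // The runtime gives compute capability and SM count; Cores/SM
    // is then looked up per architecture from the numbers above.
    printf("%s: compute capability %d.%d, %d SMs\n",
           prop.name, prop.major, prop.minor, prop.multiProcessorCount);
    return 0;
}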
As a warp is 32 threads, and the programmer can count on the threads within a warp having consecutive thread IDs in groups of 32, how are warps mapped onto an SM with 48 cores?
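To make sure I am stating that correctly, here is how I picture the thread-to-warp grouping in code (a minimal sketch of my understanding; the kernel name and the 96-thread launch are just for illustration):

#include <cstdio>
#include <cuda_runtime.h>

// Sketch: each thread derives its warp and lane from its linear
// thread index within the block. warpSize is a built-in variable
// and is 32 on all of the architectures listed above, regardless
// of how many cores the SM has.
__global__ void showWarpMapping()
{
    int tid  = threadIdx.x;     // linear index within the block
    int warp = tid / warpSize;  // which warp the thread belongs to
    int lane = tid % warpSize;  // position within that warp (0..31)

    if (lane == 0)              // one line of output per warp
        printf("block %d: warp %d starts at thread %d\n",
               (int)blockIdx.x, warp, tid);
}

int main()
{
    showWarpMapping<<<1, 96>>>();  // 96 threads -> 3 warps of 32
    cudaDeviceSynchronize();
    return 0;
}

If that sketch is right, warp 1 is always threads 32–63, even on an SM with 48 cores, which is what prompts my question.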
Mark Harris’s paper notes that “instructions are SIMD synchronous within a warp” (page 21).
The warp is a basic building block in the CUDA execution model.
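That SIMD-synchronous property is what lets the paper drop __syncthreads() once only 32 active threads remain. Here is the pattern as I understand it from the paper (a sketch based on my reading; the trivial main, the 128-thread block, and the input values are mine so it compiles and runs on its own):

#include <cstdio>
#include <cuda_runtime.h>

// Warp-synchronous unrolling of the last six reduction steps, per
// pages 21-22 of the paper: once only 32 active threads remain,
// they all sit in one warp and execute in SIMD lockstep, so no
// __syncthreads() is needed between steps. 'volatile' keeps the
// compiler from caching the shared-memory values in registers.
__device__ void warpReduce(volatile float *sdata, int tid)
{
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid +  8];
    sdata[tid] += sdata[tid +  4];
    sdata[tid] += sdata[tid +  2];
    sdata[tid] += sdata[tid +  1];
}

__global__ void reduce(const float *in, float *out)
{
    __shared__ float sdata[128];
    int tid = threadIdx.x;
    sdata[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    // Tree reduction in shared memory until one warp remains
    for (int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid < 32)
        warpReduce(sdata, tid);
    if (tid == 0)
        out[blockIdx.x] = sdata[0];
}

int main()
{
    const int N = 128;            // one block of 128 threads
    float *in, *out;
    cudaMallocManaged(&in,  N * sizeof(float));
    cudaMallocManaged(&out, sizeof(float));
    for (int i = 0; i < N; ++i) in[i] = 1.0f;

    reduce<<<1, N>>>(in, out);
    cudaDeviceSynchronize();
    printf("sum = %.0f (expected %d)\n", *out, N);

    cudaFree(in);
    cudaFree(out);
    return 0;
}

So my question boils down to whether that per-warp lockstep guarantee still holds when the SM has 48 cores rather than a multiple of 32.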
I know this question is somewhat academic, as I probably won’t get access to a Fermi 2.1 board unless I purchase a “GeForce GT 610” just to play with.
Thanks, Chuck
p.s. I would have liked to search this forum for topics that might already cover this question, but there is no search tool, and there are over 30k topics under CUDA Programming and Performance.