Scheduling threads as Warps

I am a beginner in CUDA C. I am working on a toy project to learn CUDA on my GTX 680 card. The GTX 680 architecture has 8 SMXs in the GPU, and each SMX has 192 physical cores, for a total of 1536 CUDA cores. Considering that there are actually 4 warp schedulers in each SMX, and that a warp is 32 threads, this would mean that only 128 threads can be executed in parallel per SMX. Does this mean the remaining 64 cores in the SMX are not used?

Thank you!

All 192 cores will be used.

A good hint is to forget about your specific GPU: good code will run unchanged on your GPU or on the far smaller one in my 2-year-old laptop, both of which will run several blocks (each of 1 or more warps) at once. Generally I write code for a single thread, keeping in mind that when dealing with arrays it's better for adjacent threads to access adjacent cells (adjacent columns on the same row).
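To make that concrete, here is a minimal kernel sketch of the coalesced pattern described above: consecutive threads of a warp touch consecutive elements of the same row of a row-major matrix. The kernel name and launch shape are illustrative, not taken from any SDK sample:

```cuda
// Illustrative kernel: each thread handles one element of one row.
// Because threadIdx.x varies fastest, adjacent threads in a warp access
// adjacent cells A[row*cols + col], A[row*cols + col + 1], ... -> coalesced.
__global__ void scaleRows(float *A, int rows, int cols, float s)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // fastest-varying index
    int row = blockIdx.y;
    if (row < rows && col < cols)
        A[row * cols + col] *= s;  // row-major: warp neighbours are 4 bytes apart
}

// Possible launch: one block row per matrix row, 256 threads per block:
//   dim3 block(256);
//   dim3 grid((cols + block.x - 1) / block.x, rows);
//   scaleRows<<<grid, block>>>(d_A, rows, cols, 2.0f);
```

If instead each thread walked down a column (index `col * rows + row` varying with `threadIdx.x` in the slow dimension), adjacent threads would be a whole column apart in memory and the loads would no longer coalesce.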

I suggest looking at something like the nbody example in the CUDA SDK.

Compute Capability 3.x devices have 4 warp schedulers per SMX. Each warp scheduler can dual-issue instructions, for a total of up to 8 instructions, covering 256 threads, per cycle.

The SM has multiple types of execution units including:

  • Single precision floating point and integer units (192 CUDA cores)
  • Double precision floating point units
  • Special function units
  • Load/store units
  • Texture units
  • Branch units

See the Fermi and Kepler whitepapers for additional information.

“CUDA cores” are a useful marketing measure, but tell you very little about how an SMX executes instructions.

Take a look at Figure 2 in this whitepaper:

http://www.geforce.com/Active/en_US/en_US/pdf/GeForce-GTX-680-Whitepaper-FINAL.pdf

At the top are the 4 schedulers, which pick from the set of available warps (groups of 32 threads) resident on the SMX. They do not pick individual threads for execution. Once each scheduler has selected a warp, its dual-dispatchers then pick up to two independent instructions from that warp to issue to the appropriate pipelines.

Each column of 16 “cores” is a warp-pipeline, which can finish one warp instruction every 2 clocks (though an individual instruction takes something like 10-20 clocks to complete from start to finish). As a result, each of the 12 pipelines only needs a new instruction every 2 clocks to stay full. This means the SMX needs to issue arithmetic instructions from 6 of the 8 dispatchers every clock to keep all the CUDA cores busy. Load/store and special functions are separate pipelines.