Scheduling threads as Warps

I am a beginner in CUDA C. I am working on a toy project to learn CUDA using my GTX 680 card. According to the GTX 680 architecture documentation, the GPU has 8 SMXs and each SMX has 192 physical cores, for a total of 1536 CUDA cores. Since there are only 4 warp schedulers in each SMX, and a warp is 32 threads, it seems that only 4 × 32 = 128 threads can be executed in parallel per SMX. Does this mean the remaining 64 cores in the SMX are not used?

Thank you!

All 192 cores will be used.

A good hint is to forget about your specific GPU. Good code will run without any changes on your GPU or on the far smaller one in my 2-year-old laptop, both of which will run several blocks (each of 1 or more warps) at once.
Generally I write code for a single thread, keeping in mind that when dealing with arrays it's better for adjacent threads to access adjacent cells (adjacent columns on the same row).
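
As a rough illustration of the "adjacent threads, adjacent cells" advice, here is a minimal sketch. The names cellIndex, scaleMatrix and launchScaleMatrix are made up for this example, and it assumes a row-major float matrix that is already in device memory. It is one way to lay out the indexing, not the only one.

```cuda
#include <cuda_runtime.h>

// Row-major index of cell (row, col) in a matrix with `width` columns.
// (Hypothetical helper, usable on both host and device.)
__host__ __device__ inline int cellIndex(int row, int col, int width)
{
    return row * width + col;
}

// Each thread handles one cell. threadIdx.x maps to the column, so the
// 32 threads of a warp touch 32 adjacent cells on the same row.
__global__ void scaleMatrix(float *data, int width, int height, float factor)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height) {
        data[cellIndex(row, col, width)] *= factor;
    }
}

// Hypothetical host-side launcher; d_data must point to device memory.
cudaError_t launchScaleMatrix(float *d_data, int width, int height, float factor)
{
    dim3 block(32, 8);  // 32 threads in x, so each warp covers one row segment
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    scaleMatrix<<<grid, block>>>(d_data, width, height, factor);
    return cudaGetLastError();
}
```

Nothing here depends on the number of SMXs or cores: the grid is sized from the data, and the hardware decides how many blocks run at once.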

I suggest looking at something like the nbody example in the SDK.

Compute Capability 3.x devices have 4 warp schedulers per SMX. Each warp scheduler can dual-issue, so an SMX can issue up to 8 warp instructions per cycle, which is 8 × 32 = 256 thread-instructions per cycle.

Each SMX has several types of execution units, including:

  • Single precision floating point and integer units (192 CUDA cores)
  • Double precision floating point units
  • Special function units
  • Load store units
  • Texture units
  • Branch units

See the Fermi and Kepler whitepapers for additional information.

“CUDA cores” are a useful marketing measure, but tell you very little about how an SMX executes instructions.

Take a look at Figure 2 in this whitepaper:

At the top are the 4 schedulers, which pick from the set of available warps (groups of 32 threads) on the SMX. They do not pick individual threads for execution. Once each scheduler has selected a warp, its two dispatchers pick up to two independent instructions from that warp and issue them to the appropriate pipelines.

Each column of 16 “cores” is a warp-pipeline, which can finish one warp instruction every 2 clocks (although an instruction takes something like 10-20 clocks to complete from start to finish). With 192 cores there are 12 such pipelines, and each one only needs a new instruction every 2 clocks to stay full. This means the SMX needs to issue 12 / 2 = 6 arithmetic instructions per clock, that is, from 6 of the 8 dispatchers, to keep all the CUDA cores busy. Load/store and special functions are separate pipelines.
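
To make the arithmetic above easy to check, here is a small host-only sketch that just encodes the figures quoted in this thread. The constant and function names are invented for illustration, and the 2-clock figure is taken as stated above (it happens to match 32 threads spread over 16 lanes).

```cuda
// Figures quoted in this thread for one Kepler SMX (not queried from hardware).
constexpr int kCoresPerSmx             = 192;
constexpr int kCoresPerPipeline        = 16;  // one column of "cores"
constexpr int kClocksPerWarpInstr      = 2;   // per pipeline, as stated above
constexpr int kWarpSize                = 32;
constexpr int kSchedulers              = 4;
constexpr int kDispatchersPerScheduler = 2;

// Number of 16-wide arithmetic pipelines in the SMX.
constexpr int pipelinesPerSmx() { return kCoresPerSmx / kCoresPerPipeline; }

// Arithmetic warp instructions that must be issued per clock to keep
// every pipeline full.
constexpr int issuesNeededPerClock() { return pipelinesPerSmx() / kClocksPerWarpInstr; }

// Warp instructions the SMX can issue per clock with dual-issue.
constexpr int dispatchersPerSmx() { return kSchedulers * kDispatchersPerScheduler; }

// Thread-instructions per clock if each scheduler issued only one instruction.
constexpr int threadsPerClockSingleIssue() { return kSchedulers * kWarpSize; }

static_assert(pipelinesPerSmx() == 12, "192 / 16");
static_assert(issuesNeededPerClock() == 6, "12 / 2");
static_assert(dispatchersPerSmx() == 8, "4 * 2");
static_assert(threadsPerClockSingleIssue() == 128, "4 * 32");
static_assert(issuesNeededPerClock() <= dispatchersPerSmx(),
              "dual-issue gives enough slots to feed all 192 cores");
```

The 128 in the original question is the single-issue figure. Dual-issue is what provides the extra slots needed to feed the remaining 64 cores.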