I am a beginner in CUDA C. I am working on a toy project to learn CUDA using my GTX 680 card. The GTX 680 architecture has 8 SMXs in the GPU, and each SMX has 192 physical cores, for a total of 1536 CUDA cores. Considering that there are actually 4 warp schedulers in each SMX, this still means that only 4 × 32 = 128 threads can be issued in parallel per SMX. Does this mean the remaining 64 cores in the SMX are not used?
A good hint is to forget about your specific GPU: good code will run unchanged on your GPU or on the far smaller one in my two-year-old laptop, both of which will run several blocks (each of one or more warps) at once.
Generally I write code from the point of view of a single thread, keeping in mind that when dealing with arrays it's better for adjacent threads to access adjacent cells (adjacent columns on the same row), so that memory accesses coalesce.
I suggest looking at something like the nbody example in the CUDA SDK.
At the top are the 4 schedulers, which pick from the set of available warps (groups of 32 threads) on the SMX. They do not pick individual threads for execution. Once each has selected a warp for execution, the dual-dispatchers then pick up to two independent instructions from the warp to issue to the appropriate pipelines. Each column of 16 “cores” is a warp-pipeline, which can finish one warp instruction every 2 clocks (but an instruction takes something like 10-20 clocks to complete from start to finish). As a result, each of the 12 pipelines only needs a new instruction every 2 clocks to stay full. This means the SMX needs to issue arithmetic instructions from 6 of the 8 dispatchers every clock to keep all the CUDA cores busy. Load/store and special functions are separate pipelines.