I want to know if my understanding is correct or not.
Both the GTX580 and GTX680 can execute only 32 threads concurrently per SM, so the GTX580 can execute 512 threads concurrently whereas the GTX680 can only execute 256. But per the CUDA C Programming Guide, Section 5.4.1, the SP math throughput of the GTX680 is 6x that of the GTX580 per SM. So if all other things were equal, we should expect the GTX680 to be 3x the performance of the GTX580. But in reality all other things are not equal, so we end up with about 2x the performance.
We can also think of it this way: on the GTX580, each core handles one thread, but on the GTX680, each thread is handled by six cores.
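For concreteness, here is the arithmetic behind my 3x estimate (assuming 192 SP ops per clock per SM on the GTX680 vs 32 on the GTX580, and 8 SMs vs 16):

    per-SM SP throughput:  192 / 32 = 6x
    number of SMs:         8 / 16   = 0.5x
    expected speedup:      6 * 0.5  = 3x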
This is the wrong way to think about concurrency on CUDA devices. On the CPU, threads are “assigned” to a CPU core for many clock cycles because a context switch between threads has some overhead. On a CUDA device, each thread is statically allocated the registers and shared memory it requires for its entire lifetime, so there is no need for the concept of “switching threads”. Instead, the warp scheduler issues instructions from any of the available warps each clock cycle (or two). As a result, the pipeline of each CUDA core will contain, at any given time, partially executed instructions from 10-20 threads, depending on the compute capability of the device.
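If you want to see that latency hiding in action, here is a minimal sketch (my own illustration, not a benchmark from the Programming Guide; it assumes any CUDA-capable GPU and uses cudaEvent timing, with arbitrary launch sizes and iteration counts). Each thread runs a chain of dependent FMAs, so a single warp cannot keep a core's pipeline full; throughput should climb as more warps become resident per SM and the scheduler has other warps to issue from, then plateau once the pipeline latency is covered.

    // Sketch: dependent-FMA kernel to illustrate latency hiding via resident warps.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void fma_chain(float *out, int iters)
    {
        float x = threadIdx.x * 0.001f;
        for (int i = 0; i < iters; ++i)
            x = x * 1.000001f + 0.5f;   // each FMA depends on the previous one
        out[blockIdx.x * blockDim.x + threadIdx.x] = x;
    }

    int main()
    {
        const int iters = 1 << 16;
        const int blocks = 64;          // enough blocks to cover all SMs
        float *d_out;
        cudaMalloc(&d_out, blocks * 1024 * sizeof(float));

        for (int threads = 32; threads <= 1024; threads *= 2) {
            cudaEvent_t start, stop;
            cudaEventCreate(&start);
            cudaEventCreate(&stop);

            cudaEventRecord(start);
            fma_chain<<<blocks, threads>>>(d_out, iters);
            cudaEventRecord(stop);
            cudaEventSynchronize(stop);

            float ms = 0.0f;
            cudaEventElapsedTime(&ms, start, stop);
            double fmas = (double)blocks * threads * iters;
            printf("%4d threads/block: %8.3f ms, %6.1f GFMA/s\n",
                   threads, ms, fmas / ms / 1e6);

            cudaEventDestroy(start);
            cudaEventDestroy(stop);
        }
        cudaFree(d_out);
        return 0;
    }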
It usually makes more sense to think about CUDA performance in terms of overall instruction throughput, memory throughput, and memory latency rather than individual threads running on CUDA cores. To reduce power consumption in the GTX 680, NVIDIA traded clock rate for die area (mitigated by moving to smaller transistors). The GTX 680 has 66% of the clock rate of the GTX 580, but 3x the CUDA cores, leading to a net single precision floating point instruction throughput of 2x compared to the GTX 580. However, things are even more nuanced with the GTX 680, as the throughput of different classes of instructions has changed. Table 5-1 shows that integer and logical instruction throughput is lower than you would expect from scaling by clock rate and number of CUDA cores. On the flip side, the GTX 680 has 4x as many special function units as the GTX 580, so code that depends on those operations should go faster.
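As a back-of-the-envelope check (assumed specs: 512 CUDA cores at ~1544 MHz shader clock for the GTX 580, 1536 cores at ~1006 MHz for the GTX 680, 2 flops per core per clock via FMA), the peak SP numbers work out to roughly the 2x mentioned above:

    /* Rough peak SP throughput estimate from assumed core counts and clocks. */
    #include <stdio.h>

    int main(void)
    {
        double gtx580 = 512  * 1.544e9 * 2;   /* ~1.58 TFLOP/s */
        double gtx680 = 1536 * 1.006e9 * 2;   /* ~3.09 TFLOP/s */
        printf("GTX 580: %.2f TFLOP/s\n", gtx580 / 1e12);
        printf("GTX 680: %.2f TFLOP/s\n", gtx680 / 1e12);
        printf("Ratio:   %.2fx\n", gtx680 / gtx580);
        return 0;
    }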
The memory bandwidth of the GTX 680 and 580 is about the same, but the 680 has much less L1 and L2 cache relative to its number of CUDA cores, so algorithms that depend on the cache for good performance can also run slower on the 680. I’m not sure how the memory latency in clock cycles compares between the two architectures, although I could imagine that the effective latency is better on the GTX 680 because the clock rate of the CUDA cores has been lowered.
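If you want to sanity-check the bandwidth side on your own cards (both are specified at roughly 192 GB/s), a rough sketch like this uses a timed device-to-device cudaMemcpy as a proxy for streaming bandwidth; the buffer size is arbitrary and real kernels will see somewhat less:

    /* Sketch: device-to-device copy as a rough effective-bandwidth check. */
    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        const size_t bytes = 256 << 20;   // 256 MiB per buffer
        char *src, *dst;
        cudaMalloc(&src, bytes);
        cudaMalloc(&dst, bytes);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        // Each byte is read once and written once, hence the factor of 2.
        printf("Effective bandwidth: %.1f GB/s\n", 2.0 * bytes / ms / 1e6);

        cudaFree(src);
        cudaFree(dst);
        return 0;
    }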