Why do newer-architecture GPUs have lower __syncthreads() throughput?

According to the programming guide:

"Throughput for __syncthreads() is 128 operations per clock cycle for devices of compute capability 3.x, 32 operations per clock cycle for devices of compute capability 6.0, 16 operations per clock cycle for devices of compute capability 7.x as well as 8.x, and 64 operations per clock cycle for devices of compute capability 5.x, 6.1 and 6.2."

I wonder why this happens. Are there any hints?
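For context, here is a rough, untested microbenchmark sketch of what I mean by throughput. The kernel name, ITERS, the single-block launch, and the assumption that the guide counts one barrier operation per participating thread are my own choices, not anything from the documentation:

```
// Untested sketch: get a rough feel for __syncthreads() throughput on one SM.
// ITERS, sync_bench, and the per-thread "operation" accounting are my own
// assumptions; loop overhead makes the result a lower bound.
#include <cstdio>
#include <cuda_runtime.h>

#define ITERS 100000

__global__ void sync_bench(long long *elapsed)
{
    long long start = clock64();
    for (int i = 0; i < ITERS; ++i)
        __syncthreads();                  // all threads reach every barrier
    long long stop = clock64();
    if (threadIdx.x == 0)
        elapsed[0] = stop - start;        // elapsed SM clocks, as seen by thread 0
}

int main()
{
    const int threads = 1024;             // one full block = 32 warps
    long long *d_elapsed, h_elapsed = 0;
    cudaMalloc(&d_elapsed, sizeof(long long));
    sync_bench<<<1, threads>>>(d_elapsed); // single block -> single SM
    cudaMemcpy(&h_elapsed, d_elapsed, sizeof(long long), cudaMemcpyDeviceToHost);
    // Assuming one barrier operation per thread per pass; if the guide counts
    // per warp instead, scale by 32.
    double ops = (double)ITERS * threads;
    printf("~%.1f barrier ops per clock\n", ops / (double)h_elapsed);
    cudaFree(d_elapsed);
    return 0;
}
```

Compiled with nvcc and run on different devices, something along these lines should show the relative differences the table describes, even if the absolute numbers are off because of loop overhead.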

This has to do with SM design. SM design is an exercise in balancing performance against resource utilization and die area, for the codes that the GPU designers have access to.

The cc6.0 GPU SM architecture in some ways can be viewed as “half” of a cc 6.1 or cc 6.2 SM. It has half the cores, for example, and half the warp schedulers (study the corresponding whitepapers). In this case it also appears to have “half the throughput” of the cc6.1 and cc6.2 SMs.

The cc 7.0 SM shares a lot of similarity with the cc6.0 SM.

There are various similarities between the 5.x and 6.1/2 SM designs.

Kepler (3.x) had a “huge” SM design, capable of issuing up to 8 warps in a cycle, as did 5.x and 6.1/2. Effectively, some later GPUs “pared down” the size of the SM. (However, later GPUs also had/have many more SMs per die.)

Beyond that, I think you have to look at the various changes as simply a balancing act of SM die area vs. performance gained, as measured on a suite of codes that the GPU designers consider to be “relevant” or “current” as they are making their SM design choices.

Generally speaking, design decisions for processor microarchitectures have been informed by the use of performance simulations for the past three decades. The rationale for microarchitecture design decisions and details of performance simulation technology are sometimes published in scientific journals, but for industrial designs they more often remain business secrets, i.e. they are not communicated publicly.

Typically these simulations model one or several processor chips in conjunction with attached memory. In the 1990s, trace-driven simulation was the usual approach, where a trace is a sequence of instructions and memory references extracted from interesting parts of interesting applications (for some definition of interesting; for an x86 processor this would typically be a mix of business and HPC applications). Since the early 2000s, execution-driven simulation has become more common, which allows the execution of any existing software. This has advantages in the context of hardware/software co-design, in particular where processors and toolchains evolve together. One major disadvantage is that it is often significantly slower than trace-driven simulation.
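To make the trace-driven idea concrete, here is a deliberately tiny toy, entirely my own sketch and nothing like a production simulator: the “trace” is just a pre-recorded list of memory addresses, and the timing model is a small direct-mapped cache whose hit rate the simulator reports after replaying the trace.

```
// Toy trace-driven simulation (illustrative only): replay a fixed trace of
// memory references through a simple performance model.
#include <cstdint>
#include <cstdio>
#include <vector>

struct DirectMappedCache {
    static const int kLines = 256;         // 256 lines x 64 B = 16 KiB
    static const int kLineBytes = 64;
    uint64_t tags[kLines] = {};
    bool     valid[kLines] = {};
    long hits = 0, misses = 0;

    void access(uint64_t addr) {
        uint64_t line = addr / kLineBytes;
        int idx = (int)(line % kLines);
        uint64_t tag = line / kLines;
        if (valid[idx] && tags[idx] == tag) { ++hits; }
        else { ++misses; valid[idx] = true; tags[idx] = tag; }
    }
};

int main() {
    // A made-up trace: repeated sweeps over an 8 KiB region, standing in for
    // addresses a trace-capture tool would have recorded from a real program.
    std::vector<uint64_t> trace;
    for (int rep = 0; rep < 4; ++rep)
        for (uint64_t a = 0; a < 8192; a += 64)
            trace.push_back(a);

    DirectMappedCache cache;
    for (uint64_t addr : trace)
        cache.access(addr);                // replay the trace through the model

    std::printf("hits=%ld misses=%ld hit rate=%.1f%%\n", cache.hits, cache.misses,
                100.0 * cache.hits / (cache.hits + cache.misses));
    return 0;
}
```

A real performance simulator models pipelines, queues, and DRAM timing rather than a single cache, but the structure is the same: a fixed trace goes in, predicted performance numbers come out, which is also why a trace can never exercise software that did not exist when the trace was captured.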

From personal experience I know that creating simulators that provide accurate performance predictions for future hardware is hard. I found interactions with various kinds of DRAM particularly challenging to model accurately. However, the accuracy of processor and system performance simulators can be validated once the simulated hardware materializes, so the modelling usually becomes increasingly refined over time and the accuracy of the performance simulator(s) improves as design teams progress through architecture generations. Also, the various classes of applications whose performance is tracked usually become more comprehensive and detailed over time.

Therefore significant architectural missteps that sometimes occur early in a series of architecture generations are usually avoided once the performance simulation environment has matured.

Thanks a lot!! That's very helpful for understanding the evolution of the architectures.

That’s very inspirational. Many thanks!!
Is there any synthesis literature on architecture simulators?

I haven’t really kept pace with the latest developments in processor architecture design, so I am probably not the best person to ask.

Synthesis doesn’t really play a role for simulators. Maybe you meant emulators? In the 1990s those were often based on numerous boards full of FPGAs. The limiting factor there wasn’t gate equivalents (only about 5 percent of the FPGA logic gates were used, as I recall), but interconnect. In the early 2000s there was a switch to systems based on specialized DSPs. I have no idea what is being used today. Quickturn was a widely used brand of emulation systems, owned originally by a company of the same name and later by Cadence.

Processor companies tend to use proprietary simulators developed in-house, including performance simulators. However, there is an up-and-coming open-source processor architecture called RISC-V, and I would assume that the ecosystem created around it includes performance simulators. I did a quick one-minute Google search and there seems to be a simulator called gem5. You could trawl Google Scholar to find the latest relevant publications.

ThanksI should check for it.