Latency and low-level performance questions

Hi all,

I’m in the process of assessing the performance of CUDA-based cards for real-life GPGPU applications. For that I have a few questions regarding Kepler’s inner workings:

  1. For the LD/ST units, I understand that they have queues, where two LD requests received on cycles 3 and 4 will be sent to the memory controller one clock apart and (likely) have their data returned one clock apart (say at clocks 503 and 504). Is the queue part of the LD/ST units or of the memory controller?

  2. Arithmetic operations (handled by CUDA cores) are said to take 22-24 cycles. Do the cores have queues like the LD/ST units? Can a core receive an fmul instruction at cycle 10 and then another one at cycle 11? If so, will the operations be handled in parallel, yielding results 22-24 cycles later, say at clock 32 for the first one and clock 33 for the second? Or will they be processed serially, yielding results at clocks 32 and 54? Is this why there are so many cores (192) per SM? Or will a CUDA core be unavailable for new instructions until the current one finishes - e.g. the first instruction is processed at clocks 10-32, the second at clocks 33-55?

  3. What other units in the Kepler chip have queues, and how deep are they? Which units will simply stall the threads until the next clock, or until the previous operation finishes?

  4. Besides arithmetic, what do the CUDA cores handle? For example, where is instruction decode done? What unit is responsible for branch instructions? And how many of these units are there in an SM?

  5. Number of issuing threads - if I understand correctly, for a Kepler chip with 4 warp schedulers, up to 2 independent instructions per warp, 32 threads per warp, and 192 CUDA cores, the number of threads is then 4 * 2 * 32 = 256 per SM at a maximum (more likely 128 for 1 instruction per warp).

Thank you for replying!

I don’t know the answers to all of these questions. I think many of them concern information that is currently unpublished, so answers would most likely be based on experimental analysis or conjecture.

A CUDA SM consists of a variety of functional units. Functional units handle specific types of instructions or operations. There isn’t really a “core” except insofar as “core” means SP unit. However an SP unit would not handle integer operations, for example. Therefore the differing throughputs of various operations will be largely dictated by the number of functional units on the SM that handle that type of operation.

  1. Arithmetic units like the SP (single precision) and DP (double precision) unit can accept a new operation on each clock cycle (Fermi requires a careful definition of clock cycle.) Likewise, they can retire an operation per clock cycle.

  2. For a Kepler SM, the peak instruction issue rate would be 256 per clock, just as you calculated. This would necessarily imply that not all 8 of those instructions could be SP floating point instructions, but in theory up to 6 of them could (6*32 = 192). In practice, such high instruction issue rates are not typically achieved. - Scott Gray may write code that achieves this, but if you survey general CUDA codes running on Kepler, you won’t find this to be the case, typically.
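Written out, the issue-rate arithmetic is just a product of the numbers quoted in this thread (a sketch; the function name is mine):

```cpp
#include <cassert>

// Peak issue rate per Kepler SM, using the figures from this thread:
// 4 warp schedulers, up to 2 independent instructions dispatched per warp
// per clock, 32 threads per warp.
int peak_threads_issued_per_clock() {
    const int schedulers = 4;
    const int instructions_per_scheduler = 2;  // dual issue
    const int warp_size = 32;
    return schedulers * instructions_per_scheduler * warp_size;  // 256
}
```

As noted above, this is a theoretical ceiling; real codes rarely sustain it.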

txbob,
Thanks for your answer.
I seem to remember a picture in some official (?) doc in which a CUDA core was described as containing both SP float and 32-bit integer functions. It must be wrong, then. I see the throughput for e.g. IADD, which is 160 per clock, but I wonder at what latency?
It is helpful on NVIDIA’s part to publish the throughput numbers per clock. However, throughput is only one part of the equation - latency is the other part.
What do you think?

To first order GPUs are throughput machines which cover latency through parallelism. A simplified but useful way of looking at this is that they use zero-overhead thread switching whenever a thread stalls. You just need to have a sufficient number of threads running to cover all the basic latencies. On modern GPUs this means running about 256 threads per SM.
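The "enough threads to cover latency" rule is essentially Little's law: work in flight must equal latency times issue rate. A minimal sketch, with illustrative numbers rather than figures for any specific GPU:

```cpp
#include <cassert>

// Little's law applied to latency hiding: to keep an execution unit fully
// busy, the number of independent operations in flight must equal
// latency (in cycles) times issue width (operations per cycle).
// Both example figures below are illustrative only.
int threads_to_cover_latency(int latency_cycles, int issue_width) {
    return latency_cycles * issue_width;
}
```

For example, an 8-cycle dependent-instruction latency at 32 lanes needs 8 * 32 = 256 threads, which matches the rough figure quoted above.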

On the extreme (“ninja”) end of the programming spectrum, knowing the latencies of various operations, rather than just their relative throughputs, may be helpful. But other than that, knowledge of latencies (which differ by architecture) is not needed. In all likelihood, “ninja” programmers will be coding directly at the machine-language (SASS) level.

For a back-of-the-envelope performance assessment, you would want to start out with a basic roofline model, incorporating things such as memory bandwidth, computational throughput, and interconnect bandwidth. For realistic scenarios, you would want to assume about 75% of the stated theoretical values for both memory bandwidth and computational throughput.
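Such a back-of-the-envelope roofline can be a one-liner. A sketch, where the peak numbers must come from the card's datasheet and the 0.75 factor is the derating suggested above (function and parameter names are mine):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Minimal roofline model: attainable performance is the lesser of the
// (derated) compute roof and the (derated) memory roof scaled by the
// kernel's arithmetic intensity.
double attainable_gflops(double flops_per_byte,     // arithmetic intensity
                         double peak_gflops,        // compute roof
                         double peak_gbytes_per_s,  // memory roof
                         double derate = 0.75) {
    double compute_bound = derate * peak_gflops;
    double memory_bound  = derate * peak_gbytes_per_s * flops_per_byte;
    return std::min(compute_bound, memory_bound);
}
```

For instance, a kernel at 1 flop/byte on a hypothetical 4000 GFLOPS / 250 GB/s card would be memory bound at roughly 187 GFLOPS.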

In general, the compiler adjusts the scheduling of instructions based on their latencies as additional insurance against whole-SM stalls, beyond parallelism. This affects loads in particular. It may also try to batch certain loads to use the queues more effectively. Note that scheduling loads early may occasionally hurt occupancy due to higher register use. As for the finite-length queues in the LD/ST units, the important strategy is to make each load as wide as possible, to maximize the amount of load data pending in the queue. So if processing uchar data, you would want to make sure to process it as uchar4 data. The hardware provides native loads up to 128 bits in width, e.g. double2 for double-precision complex data.
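As a concrete illustration of the widening advice, here is a sketch of a byte-processing kernel that loads uchar4 instead of single bytes (the kernel name and bounds handling are mine; it assumes the data length is a multiple of 4 and the pointers are 4-byte aligned):

```cuda
// Process bytes four at a time: each thread issues one 32-bit load and one
// 32-bit store instead of four 8-bit ones, so the same amount of data
// occupies fewer LD/ST queue entries.
__global__ void add_one_uchar4(const uchar4 *in, uchar4 *out, int n4)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        uchar4 v = in[i];
        v.x += 1; v.y += 1; v.z += 1; v.w += 1;
        out[i] = v;
    }
}
```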

As for branch handling, the only thing worth noting is that there is no branch prediction at all; straight-line execution is assumed. Pipeline hiccups from branches are covered by parallelism, as are other latencies. In my observation, the only case with a noticeable effect is a loop whose body exceeds the ICache size. Because the loop-closing branch is not predicted, the ICache miss when jumping back to the top of the loop shows up as a small but measurable performance degradation (about 3% in the cases I looked at).

njuffa, thanks for your reply!

From some microbenchmarking I did yesterday, SP division has ~200 clocks of latency. These are the kinds of pitfalls I’m trying to avoid. SP div is too slow? Cache division results. That changes the design of the solution.
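"Cache the division result" can be as simple as hoisting a reciprocal out of a loop (a sketch; names are mine, and note the result may differ from true division in the last ulp):

```cpp
#include <cassert>

// If the same divisor is reused, pay for one division and turn the rest
// into multiplies, which pipeline far better than division sequences.
void scale_by_divisor(float *x, int n, float d) {
    const float inv_d = 1.0f / d;  // one expensive division
    for (int i = 0; i < n; ++i)
        x[i] *= inv_d;             // cheap, pipelined multiplies
}
```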

Regarding load width: coalesced loading is understood, i.e. consecutive 32-bit loads. But will it be more efficient to load 64 bits at a time? e.g.

uint64_t val = data[threadIdx.x];

If so, is the change in throughput meaningful?

I don’t understand the 256 number. 256 threads is just 8 warps. Kepler has 4 warp schedulers with up to 2 instructions per clock, so those 256 threads can all be issued in as little as 2 clocks! How is that enough to cover the latencies of arithmetic (24) and L1 (20), let alone global memory latency (400-800)?
From my calculations, I will have to run at the per-SM maximum (2048), and even that does not cover all latencies. I was thinking about 2 blocks of 1024 threads per SM (as block size is limited to 1024). What do you think about that?

BTW: I guess the ICache/constants/parameters have the same latency as L1/shared, right?

Yes, to cover global memory latency typically many threads are needed. The “basic latencies” mentioned are the latencies of on-chip mechanisms, while memory is off-chip. The following paper by V. Volkov provides some good perspective on the interaction between occupancy and performance:

http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf

While the paper is a bit dated, much of its overall outline still applies. I think there are slide decks from presentations by Paulius Micikevicius of NVIDIA that give an updated outlook on the minimum number of threads needed to cover basic latencies, with sufficient detail to see how the numbers were arrived at.

If you measure the maximum achievable throughput for loads of various widths, e.g. using STREAM-like kernels, you can see for yourself that memory throughput keeps increasing as each individual access gets wider. This is because the same amount of data requires more entries in the LD/ST unit queues when each load is narrower, but the depth of the queue is limited. So efficiency is maximized as the access width increases.
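For reference, a pair of STREAM-style copy kernels at two access widths that could be used for such a measurement (timing harness and grid sizing omitted; kernel names are mine):

```cuda
// Same bytes moved, different access widths: copy128 issues a quarter of
// the loads/stores of copy32 for the same data volume, so it consumes
// fewer LD/ST queue entries per byte.
__global__ void copy32(const float *in, float *out, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];          // 32-bit accesses
}

__global__ void copy128(const float4 *in, float4 *out, size_t n4)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n4) out[i] = in[i];         // 128-bit accesses
}
```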

I don’t understand the question about ICache. Kernel arguments are being passed in constant memory since sm_20. On older GPUs there was a separate constant cache with broadcast capability; I believe this has been absorbed into a general read-only cache in Maxwell.

All division operations on the GPU are implemented through compiler-injected instruction sequences. Last I checked, the throughput of the floating-point divisions was competitive with hardware-based implementations in CPUs (considering the ratio of FMA to FDIV throughput). I have never had a need to consider the latencies of these operations in almost 10 years of CUDA programming.
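For intuition about such software division, here is a sketch of the general Newton-Raphson technique, not the actual instruction sequence the CUDA compiler emits (the starting estimate is deliberately crude, standing in for a low-precision hardware approximation):

```cpp
#include <cassert>
#include <cmath>

// Software division sketch: refine a reciprocal approximation with
// Newton-Raphson iterations (r' = r * (2 - d * r)), then multiply.
// The relative error roughly squares with each iteration.
float divide_sw(float x, float d) {
    float r = (1.0f / d) * 0.99f;  // crude initial reciprocal estimate
    r = r * (2.0f - d * r);        // first refinement: ~1e-4 relative error
    r = r * (2.0f - d * r);        // second refinement: ~1e-8 relative error
    return x * r;
}
```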

I have never been the kind of engineer who tries to assess performance based on paper calculations using detailed architectural description; I have always advocated an experimental approach. I would suggest getting a feel for performance by porting some code, writing some prototypes, observing the performance as execution parameters change and using the profiler to provide some guidance as to potential bottlenecks. The Best Practices Guide is another valuable resource for tips on how to write software for the GPU with good performance.

njuffa, thanks again!
I’m familiar with Volkov’s paper. His usage of registers is excellent, but relevant only to data which is independent, e.g. computing output[i] needs only input[i] (± a small fixed offset). It will not work for dependent data: “is today’s temperature (input[i]) 20% higher than the maximum of the last 50 days?”.
Also, using many more registers limits the number of threads you can run on the SM.

I was asking if the icache has the same latency as L1. Same for constant memory.

Your point of view is understood. I’m trying to understand what Kepler (or other Arch.) can provide, and try to utilize it best.

This might be useful: Analyzing GPGPU Pipeline Latency

Hi, is there any NVIDIA documentation where I can read the latency of a single PTX operation?

For example:

How many clock cycles for mul.lo.s32?
How many clock cycles for st.shared.f64?

I don’t think so. Note that it will vary by GPU, and for some instructions may also vary by CUDA version.

I know a bit about computer architecture, so I’ll try to answer a few of the architecture questions

  1. “Is the queue part of the LD/ST units or the memory controller?”
    I’d say there would be bookkeeping structures throughout the entire memory hierarchy to allow memory operations to be performed asynchronously. The LD/ST units in the SMs would probably be similar to those in CPUs. The obvious use is bookkeeping to hold the ID and status of outstanding requests, so the unit knows what to do when it gets an event from the next level of the memory hierarchy. Other uses are store-to-load forwarding (probably not done on throughput-optimized processors) and memory disambiguation.

And they don’t have to be queues. You can reorder memory accesses for parallelism.

  2. “Arithmetic operations are said to take 22-24 cycles.”
    That’s out of date (it was true for G80).

In order processors don’t need “queues” between pipeline stages because almost every stage produces data for the next stage at a fixed rate (no unpredictable delays), with the exception of memory access. So although you could add a queue after instruction decode so the later stages won’t have to stall, I doubt that’s done on multithreaded processors like GPUs because they can just switch to executing another thread, which is more efficient.

  3. I wouldn’t be surprised if the processor does no decoding. If you disassemble the CUBINs, you can see that the instructions are 8 bytes each!

Good. It seems the latency of most common instructions has fallen dramatically, from 24 for G80 to 6 for Maxwell; otherwise you would need a ridiculous number of threads to keep all the units busy. For G80, 192 threads was often good enough, even if every instruction depended on the previous one (minThreads = (latency = 24) * (executionWidth = 8)).

I remember that being a win when reading from system memory (mapped into the GPU address space) due to the long latency of PCIe. In general, for synchronous/round trip operations like fread(), sockets, you need to use large block sizes to compensate for the latency to achieve high throughput (bandwidth delay product).
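The bandwidth-delay product mentioned above is just a multiplication; a sketch with illustrative (not measured) numbers for a PCIe-era link:

```cpp
#include <cassert>
#include <cmath>

// Bandwidth-delay product: the number of bytes that must be in flight
// to keep a link at full throughput despite its latency.
double bytes_in_flight(double bytes_per_second, double latency_seconds) {
    return bytes_per_second * latency_seconds;
}
```

For example, a hypothetical 8 GB/s link with 1 microsecond of latency needs about 8 KB in flight to stay saturated.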