Regarding the Fermi SM organization change: impact on memory latency hiding when going from 8 stream cores per SM to 32

The background: setting caches aside for a moment, GPUs traditionally rely on a large number of concurrently executing threads to hide memory access latency. On the previous GT200 architecture, each SM contains 8 cores, so a warp executes a single primitive instruction in 4 cycles. If an SM maintains 1024 active threads, i.e. 32 warps, it needs 600/4/32 ≈ 4.69 arithmetic instructions per load/store to hide 600 cycles of memory access latency.

Now back to Fermi, which increases the number of cores per SM from 8 to 32, so one warp instruction completes per cycle (actually 2 warps every 2 cycles, due to the dual scheduler). One SM maintains 1.5K active threads (48 warps), so it needs 600/48 = 12.5 arithmetic instructions per load/store to hide 600 cycles of memory latency (in other words, an application with the same 4.69 ratio would spend roughly 2/3 of its time waiting)… This looks like a step backwards for many applications.
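To make the arithmetic above explicit, here is a minimal host-side sketch of the back-of-the-envelope calculation (the 600-cycle latency, the per-warp issue rates and the occupancy figures are the assumptions from this post, not measured values):

```cuda
#include <stdio.h>

// Arithmetic instructions each warp must issue per load/store so that the
// SM always has work while 'latency_cycles' of memory access are in flight.
static double arith_per_mem(double latency_cycles,
                            double cycles_per_warp_instr,
                            double active_warps)
{
    return latency_cycles / (cycles_per_warp_instr * active_warps);
}

int main(void)
{
    // GT200: 8 cores/SM -> 4 cycles per warp instruction, 32 active warps.
    printf("GT200: %.2f arith per load/store\n", arith_per_mem(600.0, 4.0, 32.0));
    // Fermi: 32 cores/SM -> ~1 cycle per warp instruction, 48 active warps.
    printf("Fermi: %.2f arith per load/store\n", arith_per_mem(600.0, 1.0, 48.0));
    return 0;
}
```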

I also noticed that Fermi adds a per-SM L1 and a unified L2 cache to address memory latency. The 48KB L1 averages out to 32 bytes per thread for 1.5K active threads, and the 768KB unified L2 likewise averages out to about 32 bytes per active thread across the chip. Performance could suffer in applications with many cold cache misses (compute bubbles idling on a 600-cycle latency), whereas on the old GT200 architecture the same applications may have had a high enough arithmetic/memory ratio to completely hide those bubbles.
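The cache-per-thread figures are simple division; the sketch below assumes a full 16-SM GF100 and the 48KB L1 / 16KB shared memory split:

```cuda
#include <stdio.h>

int main(void)
{
    const int threads_per_sm = 1536;        // 48 warps of 32 threads
    const int num_sms        = 16;          // assumed full GF100 configuration
    const int l1_bytes       = 48 * 1024;   // per SM, with the 48KB L1 / 16KB shared split
    const int l2_bytes       = 768 * 1024;  // unified L2, shared by the whole chip

    printf("L1 per thread: %d bytes\n", l1_bytes / threads_per_sm);
    printf("L2 per thread: %d bytes\n", l2_bytes / (threads_per_sm * num_sms));
    return 0;
}
```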

My current feeling is that, while most applications will benefit from the caches and the additional concurrent cores, some may be negatively impacted.

I’d like to hear what you have to say about this ;-)

The first question I have is whether that 600-cycle latency figure will hold. Fermi has moved to GDDR5 on a 384-bit bus, which will have tonnes of bandwidth, but probably even higher latency than the GDDR3 memory the GT200 currently uses…

I feel the same way.

This probably means that Fermi requires more loop unrolling. So: fewer threads, more simultaneous pending memory requests per thread, and consequently more register usage per thread.

The architecture trades data-level parallelism for instruction-level parallelism.
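A minimal kernel sketch of that trade-off (the kernel name and the unroll factor of 4 are purely illustrative): each thread issues several independent loads before consuming any of them, so fewer threads can keep the same number of memory transactions in flight, at the cost of more registers per thread.

```cuda
// Hypothetical example: the input holds 4*n floats and each thread reduces
// 4 of them. The four loads have no dependencies between them, so a single
// thread can have several memory transactions in flight (ILP within the
// thread) instead of relying on extra threads (DLP) to hide the latency.
__global__ void sum4(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Issue all loads before using any of the results.
        float a = in[i];
        float b = in[i + n];
        float c = in[i + 2 * n];
        float d = in[i + 3 * n];
        out[i] = (a + b) + (c + d);
    }
}
```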

Also, instruction counts are difficult to compare across two different architectures. The Fermi ISA is slightly more “RISC-like” (whatever that word means after 20+ years of abuse…) and usually requires more instructions for address calculations and the like. This gets worse in 64-bit mode (the device pointer size matches the host pointer size). On the other hand, there are several new instructions (shift+add, insert, extract, multi-word moves to/from shared memory) and some existing ones are more flexible.

600 cycles / 400 ns is huge by any standard. The latency of the bus and memory chips is actually a very small part of it (LSChien made some estimates in these forums).

I believe that if the designers decide that latency does matter after all, they should be able to cut it by a significant margin. But there might be a compelling reason for them not to do so. I don’t know, I never designed a GPU myself. :)

I totally agree with you. And yes, if it allows multiple independent “pending” memory accesses within a warp, loop unrolling could improve things a lot. I believe the old GT200 instruction scheduler works in a simple “memory blocked, switch to the next warp” fashion, in order, and relies on the compiler to schedule in-warp instructions, so GT200 doesn’t have any in-warp ILP. Fermi may have gone to great lengths to support bookkeeping for multiple memory accesses, and at least out-of-order completion.

Anyway, we developers have to resolve memory dependencies across threads ourselves, as we always have. While data speculation on a CPU is not hard to achieve with hardware support (a manageable number of threads, cache coherency, critical sections, etc.), it wouldn’t be feasible for the GPU model and wouldn’t be efficient.
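As a concrete (and purely illustrative) example of resolving such a dependency by hand rather than through hardware speculation: when a thread needs a value loaded by a neighbouring thread, the usual idiom is to stage the data in shared memory and place an explicit barrier before anyone reads it.

```cuda
// Minimal sketch: each thread outputs its right-hand neighbour's element.
// The cross-thread dependency is resolved explicitly with shared memory
// and __syncthreads(), not by any speculative hardware mechanism.
// Assumes blockDim.x <= 256.
__global__ void shift_left(const float *in, float *out, int n)
{
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n)
        tile[threadIdx.x] = in[i];
    __syncthreads();   // every thread's element is visible from here on

    if (i < n) {
        bool has_right = (threadIdx.x + 1 < blockDim.x) && (i + 1 < n);
        out[i] = has_right ? tile[threadIdx.x + 1] : 0.0f;
    }
}
```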

Interesting… how so “abused”? :)

Usually, “more RISC-like” suggests more flexibility to move finer-grained operations around, and less complexity in the execution pipeline, which allows a higher operating frequency. This is especially welcome in an ILP context for CPU design.

I used a DDR2 model to estimate the latency of GDDR3 in these threads:

http://forums.nvidia.com/index.php?showtop…rt=#entry600637

and

http://forums.nvidia.com/index.php?showtop…rt=#entry603432

The latency comes out to about 110.5 core cycles; however, the measured latency on a Tesla C1060 is more than 500 cycles.

Moreover, @Sylvain Collange reminds me that

GT200 and even G80 do already support out-of-order completion of loads from global and texture memory. IIRC, they support up to 6 outstanding memory transactions per warp.

(If you’re curious, here are some hints on the kind of circuitry they might use: http://www.google.com/patents/about?id=vuF3AAAAEBAJ, http://www.google.com/patents/about?id=vDiuAAAAEBAJ)

I was referring to David Patterson’s whitepaper:

http://www.nvidia.com/content/PDF/fermi_wh…NVIDIAFermi.pdf

which claims that “the Fermi architects changed instructions sets completely to a more RISC‐like load/store architecture instead of an x86‐like architecture that had memory‐based operands.”

Now that I’ve done some analysis of the code produced by the CUDA 3.0 beta compiler, I have to disagree with that statement…

First, while Fermi instructions no longer take memory operands from shared memory, they still take operands from constant memory. Global and local memory have always been accessed with separate load/store (actually gather/scatter) instructions since G80.

The instruction words are probably slightly easier to decode. However, I believe the instruction set is getting more and more high-level. The unified memory space, the new SIMT control-flow management system, and the (probable) lack of designated address registers all require less effort from the compiler, but probably have a cost in hardware. This seems completely opposite to the RISC philosophy to me (not implying it is a bad thing; RISC has had its day…)

Anyway, if Prof. David Patterson himself starts abusing the word “RISC”, there is no hope left for the rest of us to make any sense of it. ;)

Are you thinking that the “lack of designated address registers” means it will become possible to dynamically index arrays held in register memory? That would make programming with registers so much easier, without the heavy unrolling and manual coding currently needed to tap into the potential of the register file. It would also make it easier to use fewer threads, as you guys have suggested.

No, quite honestly, I don’t think this has any chance of happening. A dynamically indexable register file in a SIMD processor is just a nightmare for any computer architect. All the more so when there is a need to extract parallelism between instructions from the same warp: just think of how to check read-after-write dependencies between instructions that use dynamic indexing.

However, what you’ve got in Fermi instead is a smaller register file (relative to the number crunching power) and a larger shared memory/cache. So: just use shared memory or local memory to store arrays, and save those registers for very temporary variables. Just as you’d do on a regular CPU…
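A hypothetical sketch of that advice (the kernel name, 8-entry table and layout are my own, not from the thread): a small per-thread table that needs dynamic indexing is kept in shared memory, while registers are left for scalar temporaries.

```cuda
#define TABLE_SIZE 8   // illustrative per-thread table size

// Each thread builds a tiny private histogram in shared memory; the dynamic
// index 'in[i] % TABLE_SIZE' would be awkward (or spilled to local memory)
// if the table lived in registers.
__global__ void histogram8(const unsigned char *in, int *out, int n)
{
    extern __shared__ int table[];                 // blockDim.x * TABLE_SIZE ints
    int *mine = &table[threadIdx.x * TABLE_SIZE];  // this thread's slice

    for (int k = 0; k < TABLE_SIZE; ++k)
        mine[k] = 0;

    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        mine[in[i] % TABLE_SIZE]++;                // dynamic index into shared memory

    for (int k = 0; k < TABLE_SIZE; ++k)
        atomicAdd(&out[k], mine[k]);
}
```

The kernel would be launched with blockDim.x * TABLE_SIZE * sizeof(int) bytes of dynamic shared memory.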

Isn’t analyzing PTX a bit like looking at Java bytecode and trying to determine what the target architecture may or may not look like?

Yes. I completely agree with you.

What I’m talking about is the actual machine code, disassembled from the Cubin. :D