Somewhere I ran across a description of the GPU hardware architecture, and now I can’t find it. Specifically, I’m looking for a description of the C1060. I did find the paper by T. Halfhill, “Parallel processing with CUDA”, and that’s helpful, but I’d love to have a block diagram of the C1060 so I can link the software to what actually happens in the hardware. Any pointers appreciated.
This is generally considered to be about the best analysis of the GT200 architecture.
Thanks, that is exactly what I wanted!
I’m now really confused.
Each SM (Streaming Multiprocessor) has 8 FPUs, or “cores”. The article says there is a maximum of 1024 concurrent threads per SM, but a maximum of 512 threads per block. Each core can execute one block. What I don’t understand is how each core can execute 128 threads simultaneously, let alone 512.
If each FPU can do one FLOP per cycle, and the base clock rate is 1.3 GHz, then 240*1.3e9 = 3.12e11 FLOP/s. I’m missing a factor of 3. Since the core can do a multiply-accumulate in one cycle, that gives me a factor of 2, but I’m still missing a FLOP per cycle.
The point I’m really missing is the connection between thread and FLOP. If I’m doing a matrix multiply, I can make each multiply in the matrix into one thread. I can only do 3 FLOPs per core per cycle, not 512. How the hell can you do 512 concurrent threads per FPU?
The first point of clarification is that “concurrent” or “active” really means scheduled: while there can be 1024 active threads on an SM, only one warp of 32 threads is running at any given time (basically, each of the 8 ALU/FPUs runs each instruction 4 times to service the full warp of threads). The other threads are context switched onto the ALU/FPUs in warps of 32 as the hardware becomes free and their instructions and data become available. Multiple blocks (not one) can be scheduled on the same SM simultaneously. This is the key to memory latency hiding and to the performance of the GPU: lots of active threads with zero-overhead context switching between them.
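To put numbers on that, here is a rough Python sketch of the warp arithmetic, using only the figures quoted in this thread (8 cores per SM, 32 threads per warp, 1024 active threads per SM, 512 threads per block). The constant and function names are made up for illustration, and the block count only considers the thread limit, so treat it as an upper bound rather than what the hardware will actually schedule.

```python
# Figures quoted in this thread for one GT200 SM; the names are illustrative only.
CORES_PER_SM = 8
WARP_SIZE = 32
MAX_THREADS_PER_SM = 1024
MAX_THREADS_PER_BLOCK = 512


def cycles_per_warp_instruction(warp_size=WARP_SIZE, cores=CORES_PER_SM):
    """Cycles the 8 cores need to push one instruction through a full warp."""
    return warp_size // cores


def max_active_warps(max_threads=MAX_THREADS_PER_SM, warp_size=WARP_SIZE):
    """Number of warps that can be scheduled (not running) on one SM."""
    return max_threads // warp_size


def max_blocks_by_thread_limit(threads_per_block, max_threads=MAX_THREADS_PER_SM):
    """Upper bound on blocks per SM from the thread limit alone.

    Assumption: register, shared-memory and any other hardware limits are
    ignored here; they can only lower this number.
    """
    if not 0 < threads_per_block <= MAX_THREADS_PER_BLOCK:
        raise ValueError("threads_per_block must be between 1 and 512")
    return max_threads // threads_per_block


if __name__ == "__main__":
    print("cycles per warp instruction:", cycles_per_warp_instruction())
    print("active warps per SM:", max_active_warps())
    for tpb in (128, 256, 512):
        print(tpb, "threads/block ->", max_blocks_by_thread_limit(tpb), "blocks at most")
```

So a 512-thread block is just 16 warps waiting their turn on the same 8 cores, not 512 things happening at once.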
The hardware can theoretically issue a MAD and an additional MUL per cycle, giving a total of 3 FLOPs per cycle per core.
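As a quick check of the factor of 3, the peak-rate arithmetic can be written out in a few lines of Python. This is only a back-of-the-envelope sketch using the numbers from this thread (240 cores as 30 SMs of 8 cores, 1.3 GHz); the names are made up, and the result is a theoretical ceiling, not a measured figure.

```python
# Numbers taken from this thread; names are illustrative, not an official API.
SM_COUNT = 30          # 240 cores / 8 cores per SM
CORES_PER_SM = 8
CLOCK_HZ = 1.3e9       # the 1.3 GHz clock quoted above


def peak_flops(flops_per_core_per_cycle):
    """Theoretical peak FLOP/s for a given number of FLOPs per core per cycle."""
    return SM_COUNT * CORES_PER_SM * CLOCK_HZ * flops_per_core_per_cycle


if __name__ == "__main__":
    # 1 = single op, 2 = MAD only, 3 = MAD plus the extra MUL
    for label, flops in (("one FLOP", 1), ("MAD", 2), ("MAD + MUL", 3)):
        print(f"{label}: {peak_flops(flops) / 1e9:.0f} GFLOP/s")
```

The one-FLOP case reproduces the 3.12e11 figure from the question, and MAD + MUL is three times that.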
OK, “concurrent” is not “simultaneous”. I was thinking each SM has 8 cores, so it can work on up to 8 blocks, but I wasn’t seeing where the warps fit in. Now I do.
Thanks, the key to keeping the pipes full is understanding how the hardware works. The details of how it executes each instruction 4 times on multiple data would be interesting - the coupling to the register memory is pretty key to that.