Specific questions on GT200 processing abilities; some might require input from an Nvidia employee.

  1. Can different SMs execute different kernels or does the whole GPU execute the same kernel?

  2. Can texture caches be easily used for general-purpose data (and not textures) and hence provide a low-latency data source? (read-only, of course)

  3. All caches are read-only and managed automatically, correct? In contrast, Shared Memory is allocated explicitly by the programmer through variable declarations and can be both read and written, correct?

  4. How far does the SM’s ability to switch between threads go when an instruction is stalled in an SP due to a memory read? Can the SM thread scheduler replace (via a free context switch) a stalled thread with a thread from the whole pool of 1024 threads held by an SM, or only with one from the currently processed warp (32 threads in total)? My guess: only from the warp, since an SM can manage just one branch at a time.

  5. What is the exact SP processing rate? Since one instruction takes 4 cycles, do they produce one result every 4 cycles, or are they pipelined and produce one result every single cycle? (I might have asked this one before, but I don’t feel like I know the answer ;))

  6. Correct me if I’m wrong, but overall GPGPU throughput is not necessarily limited by memory bandwidth but rather by memory latencies?

  1. All SMs run the same kernel.

  2. Yes, it’s common to use them for arbitrary data, especially via 1D textures. Yes, read-only. (A minimal sketch of reading plain linear memory through a 1D texture follows this list.)

  3. Yes, the texture cache is read-only and automatically managed. Shared memory is indeed random read/write access and under your complete control (see the shared-memory sketch below).

  4. Thread switching is done per warp, not per thread. That switching, however, is costless, literally 0 clocks. So a stalled read will stop that warp from being scheduled until its data is ready again; other ready warps are scheduled in the meantime.

  5. The SPs do produce one result every clock, but with 8 SPs per SM, one warp takes 4 passes to evaluate. There is pipelining involved, but the costless scheduling hides it. There are other subtle pipeline issues involving hidden register-memory pipelines, but they’re best summarized as “use at least 192 active threads per SM, and an even number of warps” (the SAXPY sketch at the end of this post uses a launch configuration that follows this rule).

  6. Latency can be a bottleneck, but it is usually NOT the most common one. The massively threaded scheduling hides it in most applications. Every algorithm has different bottlenecks, but memory BANDWIDTH is probably the most common, especially for simpler kernels; the SAXPY sketch below illustrates why.
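
Regarding answer 2, here is a minimal sketch of reading arbitrary linear device memory through a 1D texture, using the era-appropriate legacy texture reference API (deprecated in later CUDA releases). The names texData, scaleViaTexture, devIn and devOut are my own placeholders, not anything from the original discussion:

```cuda
#include <cuda_runtime.h>

// Texture reference bound to plain linear device memory; kernel reads go
// through the read-only texture cache (no cudaArray needed).
texture<float, 1, cudaReadModeElementType> texData;

__global__ void scaleViaTexture(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * tex1Dfetch(texData, i);   // fetched via the texture cache
}

int main()
{
    const int n = 1 << 20;
    float *devIn, *devOut;
    cudaMalloc(&devIn, n * sizeof(float));
    cudaMalloc(&devOut, n * sizeof(float));

    // Bind the texture reference to ordinary linear memory.
    cudaBindTexture(NULL, texData, devIn, n * sizeof(float));

    scaleViaTexture<<<(n + 255) / 256, 256>>>(devOut, n);
    cudaDeviceSynchronize();

    cudaUnbindTexture(texData);
    cudaFree(devIn);
    cudaFree(devOut);
    return 0;
}
```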
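
For answer 3, a kernel-side sketch of programmer-managed shared memory; the kernel and names are illustrative only, and it assumes a block size of 256 with the array length a multiple of 256:

```cuda
__global__ void reverseBlock(float *data)
{
    // Shared memory is declared by the programmer and is fully read/write.
    __shared__ float tile[256];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = data[i];       // write into shared memory
    __syncthreads();                   // make all writes visible to the whole block

    // Read back from shared memory in reversed order within the block.
    data[i] = tile[blockDim.x - 1 - threadIdx.x];
}
```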
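
For answers 5 and 6, a small self-contained SAXPY sketch of my own (not from the original post). The launch uses 256-thread blocks, i.e. 8 warps per block (an even number) and well above 192 active threads per SM once a few blocks are resident; the kernel itself shows why simple kernels end up bandwidth-bound: 12 bytes of memory traffic per element for only a multiply and an add:

```cuda
#include <cuda_runtime.h>

// SAXPY: y = a * x + y.  Per element: 8 bytes read + 4 bytes written for just
// two arithmetic operations, so throughput is set by memory bandwidth.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 22;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));

    // 256 threads per block = 8 warps (an even number of warps); with several
    // blocks resident per SM this is comfortably above 192 active threads.
    const int threadsPerBlock = 256;
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    saxpy<<<blocks, threadsPerBlock>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();

    cudaFree(x);
    cudaFree(y);
    return 0;
}
```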