Warp Scheduling

Hello.

What can cause a warp to stall? I’m interested in how warps are scheduled, and in how to determine how many cycles a warp will need to execute. Initially, for pre-Fermi architectures.

thanks.

Things that will cause a warp to slow down:

* Branch divergence within a warp: if some threads in a warp take one branch and some take another, the time to execute is the time to execute BOTH branches. This is because every thread actually WILL execute both branches; each instruction is just predicated so that only the instructions belonging to a thread’s own branch actually affect any data. (See the first sketch after this list.)

* Memory contention (“hazards”): if two threads in a warp write to the same shared-memory bank in the same instruction, etc. The details are in the performance tuning PDF.
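
To make the divergence point concrete, here’s a minimal sketch (the kernel name and launch shape are my own, not from the thread): even and odd lanes take different branches, so the hardware runs both paths back to back, masking off the inactive lanes each time.

```cuda
// Minimal divergence sketch: within each warp, even lanes take the 'if'
// path and odd lanes take the 'else' path, so the warp executes both
// paths serially, with the inactive half predicated off each time.
__global__ void divergent_kernel(float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    if (threadIdx.x % 2 == 0)
        out[tid] = sinf((float)tid);   // half the warp executes this...
    else
        out[tid] = cosf((float)tid);   // ...then the other half executes this
}
```

If the branch condition were uniform per warp (e.g. based on `threadIdx.x / 32`), every warp would take exactly one path and there would be no penalty.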
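And a sketch of the bank-conflict case (again an illustrative kernel of mine, assuming a pre-Fermi part with 16 banks serviced per half-warp and a 32-thread block):

```cuda
// Bank-conflict sketch (pre-Fermi: 16 four-byte-wide banks, accessed per
// half-warp). A stride-16 index maps every lane to bank (i % 16) == 0,
// a 16-way conflict that turns one shared-memory access into 16.
__global__ void conflict_kernel(float *out)
{
    __shared__ float buf[32 * 16];

    int i = threadIdx.x * 16;      // stride 16: all lanes hit the same bank
    buf[i] = (float)threadIdx.x;   // 16-way conflicted write
    __syncthreads();
    out[threadIdx.x] = buf[i];     // same conflicted pattern on the read
}
```

With a stride of 1 (`buf[threadIdx.x]`), each lane would hit its own bank and the access would complete in a single step.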

Things that will cause a warp to stall:

* Worst of all is global memory access (as opposed to constant and shared). Global memory has a latency of hundreds of cycles. (Fermi has a cache in front of global memory, so only accesses that miss the cache suffer the full penalty.)

(Consider this: the memory clock is a little over twice the core clock, and each global memory request is actually hundreds of individual accesses (one per thread), so the maximum possible throughput is:

memory_bandwidth (in 32-bit words) * mem_clock_to_core_clock_ratio / threads_per_request

On a GTX 460, that’s about 8 * ~2.5 / 336, and that’s the absolute maximum technically possible. So even at 100-cycle latency, you’re pushing the theoretical limit.)
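
Taking those figures at face value (8 being a 256-bit bus measured in 32-bit words, ~2.5 the memory-to-core clock ratio, and 336 the number of cores sharing the bus — my reading of the numbers, not spelled out above), the arithmetic is:

$$\frac{8 \times 2.5}{336} \approx 0.06 \;\text{words per thread per core clock}$$

In other words, even with the latency perfectly hidden, each thread can be fed at most about one 32-bit word every ~17 core clocks, so a kernel that leans on global memory hits the bandwidth wall long before the latency figure matters.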

Anyway, with latency on the order of 100 cycles, this will definitely stall the warp for a long time, so the best thing that can happen here is a switch to another ready warp, which the scheduling hardware will do in this case. But if there are no other warps resident on the multiprocessor that are ready to run, there’s nothing to switch to, so you’re just screwed.

The best way to avoid/reduce this is to maximize “memory access locality” as much as you can, and then pack as much as you can into shared memory.
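
For example, here’s a minimal staging sketch (the kernel and names are illustrative, assuming blockDim.x == TILE): each block loads its tile of global memory exactly once, with coalesced accesses, and all reuse then happens out of shared memory.

```cuda
// Staging sketch: one coalesced global load per element, then all
// further reads come from shared memory instead of global memory.
#define TILE 256

__global__ void smooth3(const float *in, float *out, int n)
{
    __shared__ float tile[TILE + 2];      // tile plus one halo cell per side
    int gid = blockIdx.x * TILE + threadIdx.x;

    // coalesced load of the block's tile
    tile[threadIdx.x + 1] = (gid < n) ? in[gid] : 0.0f;

    // edge threads fetch the halo cells
    if (threadIdx.x == 0)
        tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;
    if (threadIdx.x == TILE - 1)
        tile[TILE + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;

    __syncthreads();

    // three reads per thread, all served from shared memory
    if (gid < n)
        out[gid] = (tile[threadIdx.x] + tile[threadIdx.x + 1]
                    + tile[threadIdx.x + 2]) / 3.0f;
}
```

Without the staging, each thread would issue three global reads instead of one.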

In any case, I recommend taking a look at the performance tuning guide.

I think that several things will cause a warp stall:

  1. A barrier or memory fence.

  2. A scoreboard hazard (e.g. a register dependency; see the sketch after this list).

  3. An I-cache miss.

  4. A memory access when the SM is out of MSHRs (miss-status holding registers).

  5. Am I missing anything?
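
A tiny sketch of items 1 and 2 (my own illustrative kernel, assuming 256-thread blocks): the store into shared memory can’t issue until the global load lands in the register, and the barrier then parks every warp until the whole block arrives.

```cuda
// Stall sketch: the scoreboard holds the '+ 1.0f' until the load of 'a'
// completes (item 2), and __syncthreads() stalls each warp until every
// warp in the block reaches the barrier (item 1).
__global__ void stall_demo(const float *in, float *out, int n)
{
    __shared__ float staged[256];
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    float a = (tid < n) ? in[tid] : 0.0f;   // long-latency global load
    staged[threadIdx.x] = a + 1.0f;         // scoreboard stall: depends on 'a'

    __syncthreads();                        // barrier stall: wait for the block

    if (tid < n)
        out[tid] = staged[threadIdx.x ^ 1]; // read a neighbour's staged value
}
```

With enough other warps resident, the scheduler covers both stalls by issuing from warps that aren’t waiting; with only one warp, the pipeline simply sits idle.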

Obviously things get complicated when you have many of these factors interacting dynamically in a program. See this paper for a model for determining how long a warp will take to execute on Tesla:

http://www.cc.gatech.edu/~hyesoon/hong_isca09.pdf

Probably the very slowest stall is a zero-copy memory read. Admittedly, that’s just the extreme end of the global memory read chain: L1 / L2 / device memory / zero-copy.
I wonder how many GPU clock ticks the PCIe latency is.

Though perhaps __threadfence_system() is even slower?
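
For reference, this is roughly what the zero-copy path looks like (a minimal sketch; sizes and names are illustrative): the kernel dereferences a device pointer that actually maps pinned host memory, so every read crosses the PCIe bus.

```cuda
#include <cuda_runtime.h>

// Each read of 'host_data' inside the kernel is a zero-copy access that
// travels over PCIe rather than the on-board memory bus.
__global__ void read_host(const float *host_data, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        out[tid] = host_data[tid];
}

int main(void)
{
    const int n = 1 << 20;
    float *h_data, *d_alias, *d_out;

    cudaSetDeviceFlags(cudaDeviceMapHost);                   // allow mapped host memory
    cudaHostAlloc((void **)&h_data, n * sizeof(float),
                  cudaHostAllocMapped);                      // pinned + mapped
    cudaHostGetDevicePointer((void **)&d_alias, h_data, 0);  // device view of the buffer
    cudaMalloc((void **)&d_out, n * sizeof(float));

    read_host<<<(n + 255) / 256, 256>>>(d_alias, d_out, n);
    cudaDeviceSynchronize();

    cudaFree(d_out);
    cudaFreeHost(h_data);
    return 0;
}
```

Timing that kernel against the same kernel reading an ordinary device buffer would give a rough answer to the PCIe-latency question.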
