L2 cache difference between Tesla and Xeon Phi - impact ?

As far as I know, there is a big difference in L2 cache size between the Tesla K20 (1.5 MB L2 cache) and the Intel Xeon Phi (28 MB aggregate L2 cache). I am wondering about this big difference: what impact does it have on performance for certain application classes? So, which is the 'better' approach?

You might be interested in seeing this benchmark running on a wider set of applications, testing different areas of the hardware:


As you can see, the Xeon Phi's performance is thus far quite lacking. It gets pretty much annihilated…

The Phi has 512 KB of L2 cache per core. Once a core requests data held elsewhere in the 28 MB aggregate L2, the request has to go out on the ring bus, introducing a ~235-cycle latency and likely a stall of its 4 hyperthreads, which run in round-robin fashion. Basically, there have been some reports of latency issues; keeping 60+ cores working efficiently on a ring bus is a huge challenge, and yeah, cache coherency is difficult.

The story seems to get more interesting when your problem grows beyond the cache size, though. Apparently the Phi does not hide cache-miss latency very well, and according to Intel tech folk I've been talking to, it's very hard to reach over 50% bandwidth utilization. Which is probably why they've equipped it with an insane 320+ GB/s of bandwidth capacity, to compensate.

Anyway, this is a hot topic with many opinions. But I will say that the Phi has a lot left to prove.

I have no experience programming the Phi (really sad they weren’t able to have a consumer line, like NVIDIA w/ the GeForce). However, on paper, it looks like a different granularity tradeoff. I’ll try to quantify that a bit below.

It is somewhat difficult to discuss the differences between the Phi and the Tesla because of the confusingly non-equivalent terminologies of SIMT and SIMD programming. A Phi "core" is not a Tesla "core", and a Phi "thread" is not a Tesla "thread". The translation is roughly:

  • Phi core = Tesla multiprocessor (or SMX)
  • Phi thread = Tesla warp

With that in mind, we can try to compare things between the Phi 7120X and Tesla K20X:

  • Number of “Cores”: Phi has 61 cores and Tesla has 14 multiprocessors.

  • Vector instruction size: The Phi uses 512-bit-wide SIMD instructions. Tesla uses a warp size of 32 "threads", which corresponds to 1024-bit-wide SIMD.

  • # of vector processing units per core/SMX: The Phi has one general-purpose vector processing unit per core, with a throughput of 1 vector instruction per clock. The Tesla SMX has 12 general-purpose pipelines, each composed of 16 "CUDA cores" (giving the canonical 192 "CUDA cores" per SMX), which have a throughput of 1 warp per 2 clocks. Both devices have special pipelines for transcendental functions.

  • # of active "threads" per core/SMX: Each Phi core can have up to 4 active threads, whereas the Tesla SMX can have up to 64 active warps. (Note that "active" is in the simultaneous multithreading sense, i.e. the hardware can assign execution resources to these threads without swapping register contents in and out of memory.)

  • # of registers per "thread": The Phi has a hard partitioning of the 128 vector registers per core, with each of the 4 hardware threads getting 32 registers. Tesla is more flexible, with each warp getting up to 255 "vector registers," although in such a configuration there can only be 8 active warps. With the maximum of 64 active warps, each warp can only use 32 "vector registers."

As you can see, the Tesla is an extremely WIDE architecture, with each multiprocessor handling a lot more computing state than the Phi. Owing to its graphics heritage, the Tesla is designed to hide stalls due to memory latency by switching between a very large pool of active threads. Fully maxed out, a K20X will have 896 active warps, compared to 244 active threads on the Phi. This may be part of the reason that people are having trouble achieving full utilization on the Phi. On the flip side, there are situations where developers underutilize the Tesla because their problem does not map well onto its extremely wide architecture.

I think the finer granularity of the Phi has the potential to be very useful for more branch-divergent problems that can still fit their working set of data into the relatively large per-core L2 cache (and minimize reads to the device DRAM). In particular, I would imagine that various kinds of Monte Carlo algorithms could work very well on the Phi.

Thanks for the insightful thoughts, Seibert!

In the comparison of Phi and GPU cores I sometimes like to think of each SIMD unit on a GPU as a core as it’s basically free to diverge from the rest of the block.

Hence you might say that each Kepler SMX contains 4 cores (4 warp schedulers), giving a maximum of 4×15 = 60 cores (yes, there are 2880-core Keplers now…).

“As you can see, the Tesla is an extremely WIDE architecture, with each multiprocessor handling a lot more computing state than the Phi”

Yes, and as you hint, while the Phi has ~30 MB of cache, Kepler has a much larger register file: 15 × 256 KB ≈ 3.84 MB in total, which gives the possibility of a very large active computing state for hiding latency.

Basic characteristics:
Phi -> huge caches, small register files => issues hiding latencies?
GPUs -> huge register files, small caches => issues with severely random memory access patterns (?)

One thing I can’t find any documentation on is whether there are any atomic memory operations on the Phi. I have made great use of the fast atomic operations on Kepler. It would be unfortunate if the only way to atomically increment a global counter on the Phi was to use a software lock.

I haven’t been able to figure out the “killer app” for Phi yet. One of its distinguishing hardware features is this extremely powerful but complex ring bus for keeping the many many L2 caches coherent. The old Larrabee SIGGRAPH paper goes into considerable detail about this ring bus and the effort needed to keep the data synchronous. So while that’s nifty, I haven’t figured out what application is super-dependent on that feature. Perhaps some kind of transactional database where the Phi cores are both reading and writing the same memory? But for most compute I can think of: matrix solves, Monte Carlo financial computes, Monte Carlo random walks, pixel/surface shading, raytracing, finite element/finite difference… none of them really benefit much from it.

Seibert, you correctly mention that the Phi DOES have a strength with divergent workloads, simply because its working set is narrower (16 lanes per vector instruction, as opposed to CUDA's 32 threads per warp). I think you're right… that may indeed be a feature some apps prefer. But Monte Carlo probably isn't one… though of course Monte Carlo isn't an algorithm, it's a class. Raytracing is usually highly divergent and likely something that WOULD work well on the Phi. But NVIDIA's research scientists literally define the state of the art in raytracing, and Intel has lost much of its raytracing research focus (especially with the departure of Warren Hunt et al.), so I'm not sure we'll see the Phi's full potential for raytracing.

About raytracing: the GTX Titan outperforms the Phi by about 9.8x in this benchmark:

I’m guessing they’ve found ways to avoid a lot of the potential divergence.

I can't really think of any application in my field that would benefit from it either. I've talked to some people working with "big data" who thought it sounded interesting for some of their applications, but when they realized it would only be 1.5-2x faster in the best case, they said "meeh, why bother? We'll just buy a few more Xeons rather than modify our code base."

The types of Monte Carlo algorithms I’m most familiar with are simulations where random numbers are repeatedly used to decide which branch of an if-statement to take. These kinds of problems are naturally very divergent, and it is pretty challenging to come up with ways to avoid such divergence while keeping the answers statistically accurate.

(Aside: I still really would love to have a __warp_regroup(int order) barrier that would allow the threads in a block to reorganize themselves into new warps after a decision point. There are all sorts of evil hardware issues to solve to do something like this, but it would fix the above problem.)

Seibert, yes, I know exactly the kind of "warp regroup" idiom you want! And it'd be fantastic. But there are real architecture problems fighting against it. If the registers themselves were copied and the data shuffled, that's a LOT of data movement, so it's really expensive. And it's something we can do manually now with a prefix scan and fill using shared memory (and we can do it more efficiently, since we know just what state needs to be swapped).

So alternatively we’d like the hardware to do it cheaply… the threads get renamed with the data staying in place. This sounds perfect… except that inside an SMX, the registers themselves are banked. This has no performance cost and reduces the register to SP crossbar complexity by a factor of 32. But if you allowed the registers to have any lane slot, you’d either need an unwieldy crossbar, or use multiple clocks just to send data to the SPs when there’s a bank collision (just like writing to shared.) That would be too expensive to be worthwhile.

There are tons of advanced ideas how to solve the divergence problem in hardware… AllanMac keeps up on a lot of the options.

There's a great paper from HPG this year showing that for raytracing it's actually worth the huge hassle to dump diverged rays (using a prefix scan to group the work) and reload the data, even across new kernel launches. For typical raytrace workloads they found the expensive reordering was still worth it, especially since multiple kernels allowed register usage to be minimized; if it were all in one kernel, you'd get huge register use even though most of the kernel used only a fraction of the registers. This had a lot to do with the renderer's typical divergence workload, which was dominated by shaders of varying size and behavior.

Thanks for the tip on the paper!

One way I could imagine allowing warp reordering in software (so maybe not particularly fast, but it would save typing) would be to abuse the mechanism that allows dynamic parallelism on CC 3.5, but instead to “suspend” and relaunch a block with a new thread ordering. I haven’t inspected the PTX of a kernel with dynamic parallelism to see if this mechanism could be mocked up by a non-NVIDIA person as a proof of concept, but I should take a look…