Time for Nvidia to build Deep Green


16 Tflops out of 800 cpu’s. meh.

How much effort does it take to program a good go playing computer? Is playing strength really just defined by the amount of teraflops that you put in? the article surely made it seem so.

Go has been the last bastion of humans defeating machines–I don’t know much more than that. I’m guessing that creating a move tree is much more computationally intensive than something like chess, so having a lot more brute force to throw at it would make a difference more than it would to chess. But I don’t really know–game AI was never something I did much in.

Chess can be boosted by more compute power, with diminishing returns, but with enough CPU cycles, a decent but not perfect chess algorithm can become unbeatable in practice.

Go is much more of a PATTERN MATCHING search. More compute does help, but the algorithms and heuristics and especially databases tend to make much more difference… it’s definitely an art form, and the answers are unclear. It’s much closer to AI than chess.

Current strong go programs use Montecarlo Techniques. Also the articles are very inaccurate at best about what people says … if you want to know more, you should probably give an eye to the computer-go mailing list


So a random monte carlo sampling should be acceleratable greatly on a GPU, right? Except maybe there are some issues with branch divergence.

Most “intuitive” techniques do indeed imply divergences. (i have tryed some …).

Basically, you have to fill out a matrix, one move at a time, until it’s full. And for that alone, it’s not that easy to build something without divergences. Especially if the “simulations” do not start with the same quantity of free moves to begin with. ( Most algorithms that i know of for go, uses one simulations at a time, integrates the results, and it is used to guide the next simulation). So you probably can go through different branches, picking a lot of good points to make your simulations. (and then batching them together) But you won’t in most cases get the same state to start with. And so they won’t end at the same moment. But then, in go, some filled up moves, can become empty again. And there you may get both a divergence for the checking of that “i have to empty some points” wich is called capture, and because once again, you won’t have synchronized endings.

All in all, (most of) the (montecarlo) programs weren’t made to use synchronized batch of threads well. There certainly are solutions, but you may have to redo all research (on algorithmic design) and parameters tuning from scratch. OR you may live with divergence. I didn’t really studied how the cuda memory model works (yet), but you certainly may get real issues with that too. All in all, for most of my trys, i was under the impression, that the gtx280 32 bits C version wouldn’t do that much better than a overtuned 64 bits x86 + SSE4.1 assembly version running on a quad core.

So i guess that if you want performances, it may be more reasonable to try to get an optimised x86 version first (which i think is far easier). I would be very interested if someone was able to run simulations (and use them) efficiently on a gtx280 card :)

THEN … for high end programs like the one running on the 800 cores, the limitations doesn’t really seems to come from the speed of raw simulations, but more related with memory bandwidth, and synchronization delays. (they soak up large amount of memory for each thread, and then try to synchronize those every once in a time). So maybe there is room there for accelerating the tree lookup and feedback part. One thing is sure : it’s all very tricky and not granted to yield any result.