Hacker News | cbutner's comments

The commentary net inspects the final state of the engine net, but not internal layers.

Deeper introspection is a really important goal, but by the time you make serious progress there, chess is the least of your worries.

I do really like the work people have put into introspection and visualization so far though: DeepDream comes to mind. There was also another great paper or page that I can't find.


It does train on variations too, given the scarcity of data available, so that can hurt accuracy, mood, etc.


The original hope was for this to be a third head on top of the AlphaZero model, but I couldn't think of a way to generate commentary during self-play (such that it would gradually improve), and trying to rotate supervised commentary training into the main schedule ended up hurting both sides because of the disjoint datasets.

So, now the commentary decoder is just trained separately on the final primary model. The previous and current game positions are fed into the primary model, and the outputs are taken from the final convolutional layer, just before the value and policy heads. Then, that data plus the side to play is positionally encoded and fed into a transformer decoder.
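A shape-level sketch of that input pipeline might look like the following. All dimensions here are illustrative assumptions, not the project's actual sizes, and the sinusoidal term is just a stand-in for whatever positional encoding is really used:

```python
import numpy as np

# Hedged sketch of the commentary input pipeline described above.
BOARD = 8        # 8x8 board
CHANNELS = 256   # filters in the final convolutional layer (assumed)

def commentary_inputs(prev_conv, curr_conv, side_to_play):
    """Flatten conv features from the previous and current positions,
    append the side to play as an extra token, and add a positional
    encoding before handing off to the transformer decoder."""
    seq = np.concatenate([
        prev_conv.reshape(BOARD * BOARD, CHANNELS),
        curr_conv.reshape(BOARD * BOARD, CHANNELS),
    ], axis=0)                                   # (128, CHANNELS) token sequence
    side = np.full((1, CHANNELS), side_to_play)  # side-to-play token
    tokens = np.concatenate([seq, side], axis=0)
    positions = np.arange(tokens.shape[0])[:, None]
    return tokens + np.sin(positions / 10000.0)  # sinusoidal encoding stand-in

prev = np.zeros((BOARD, BOARD, CHANNELS))
curr = np.zeros((BOARD, BOARD, CHANNELS))
out = commentary_inputs(prev, curr, side_to_play=1.0)
print(out.shape)  # (129, 256)
```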

It would be better for a search tree/algorithm to be used for commentary too so that tactics could be better understood, but that would need some kind of subjective BLEU equivalent, and metrics like those don't work well for chess commentary.

You can see a diagram of the architecture here: https://chrisbutner.github.io/ChessCoach/high-level-explanat...


I think training this as a separate head on top of a frozen AlphaZero model makes a lot of sense. I don't think anyone has figured out how to do language learning with reinforcement learning.

Actually, I can't figure out from your explanation why you trained the whole network yourself instead of just using Leela's network and training the commentary head on top?

If you wanted to incorporate the search, maybe you could just take the 1800 or so probabilities output by the MCTS and add some layers on top of that before concatenating with the other data fed into the transformer.
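A hypothetical version of that fusion, for illustration only (the ~1800-move vector is from the comment above; the feature width, projection, and sequence shape are made up):

```python
import numpy as np

# Project the MCTS move-probability vector through a small dense layer,
# then concatenate with the positional features fed to the transformer.
rng = np.random.default_rng(0)
N_MOVES = 1800   # approximate size of the move-probability vector
FEATURES = 256   # assumed feature width of the existing inputs

W = rng.normal(scale=0.01, size=(N_MOVES, FEATURES))  # learned projection (stand-in)

def fuse(mcts_probs, position_features):
    projected = np.maximum(mcts_probs @ W, 0.0)       # dense + ReLU
    return np.concatenate([position_features, projected[None, :]], axis=0)

probs = rng.dirichlet(np.ones(N_MOVES))               # fake MCTS visit distribution
feats = rng.normal(size=(129, FEATURES))
fused = fuse(probs, feats)
print(fused.shape)  # (130, 256)
```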

In either case, this is a fantastic project and perhaps an even more impressive write up! Congrats and thank you!


It was partly because I was looking to improve self-play and training tractability on a home desktop with 1 GPU (complete failure), and partly to learn about everything from scratch. I would be interested to see how strong it is with the same search but with Leela's inference backend (for GPU at least) and network.

In terms of search-into-commentary, concatenating like that may be interesting, as long as it can learn to map across - definitely plausible without too much work. I was originally thinking of something more complicated, combining multiple raw network outputs across the tree through some kind of trained weighting, or additional model via recurrence, and punted it.

Ignore my BLEU comment, mixed those up between replies - that was the other potential use of search trees for commentary, an MCTS/PUCT-style alternative to traditional sequential top-k/top-p sampling, once you have logits and are deciding which paragraph to generate.

Thanks!


It is using a full-sized transformer decoder, trained on about 1 million data samples, but with far fewer neural network parameters and training samples than GPT-2 or GPT-3.


I agree with what you're saying. On the flip side, there are multiple rating systems (Elo, Glicko), anchors, playing pools, etc. in use around the place, and I've heard FIDE and CCRL ratings are offset by around 80 points, compared to a difference of about 600-700 between top humans and top engines.
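For context, the standard Elo expected-score formula shows what gaps like those mean in practice:

```python
# Standard Elo expected-score formula.
def expected_score(rating_diff):
    """Expected score for the higher-rated player, given the rating gap."""
    return 1.0 / (1.0 + 10 ** (-rating_diff / 400.0))

print(round(expected_score(80), 3))   # 0.613: an 80-point offset
print(round(expected_score(650), 3))  # 0.977: a 650-point human-vs-engine gap
```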

So for a non-technical audience, I feel like it's easier to give a ballpark that they can understand without having to pull in too much context around Stockfish, CCRL, etc. It may have been better to clarify further in the docs though.

The "Data" document does give the relative Elo breakdown in the appendices.


It does tend to name-drop: often famous names, but also just "Jeff".

And if you spice up the commentary sampling parameters, it gets even more inventive, making up names, and saying that "the rook is pinning Fischer against the king".


Yeah, that's a massive problem with the natural language domain all across machine learning.

Unfortunately it's very difficult to track down training data for chess commentary in the first place, let alone trim down biases. For reference, I was able to gather about 1 million samples, but it really needs a billion.

Hopefully through data augmentation and better general intelligence models we can make better progress on bias issues soon, as that's a huge problem when we start trusting AI models too much in life.


You might be able to kludge a fix to tokenize the output and replace he/him/she/her with they/them/their. It's not as sexy as the engine outputting the correct words, but it should get the job done.
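A minimal sketch of that kludge. It's deliberately rough: it ignores verb agreement ("they takes") and the object/possessive ambiguity of "her", which a real fix would need to handle:

```python
import re

# Map gendered pronouns in generated commentary to singular-"they" forms.
PRONOUNS = {
    "he": "they", "she": "they",
    "him": "them", "her": "them",   # "her" can also be possessive: ambiguous
    "his": "their", "hers": "theirs",
}

def degender(text):
    def swap(match):
        word = match.group(0)
        repl = PRONOUNS[word.lower()]
        return repl.capitalize() if word[0].isupper() else repl

    pattern = r"\b(" + "|".join(PRONOUNS) + r")\b"
    return re.sub(pattern, swap, text, flags=re.IGNORECASE)

print(degender("She takes his bishop."))  # They takes their bishop.
```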


Yes, in this case as long as they still agree when it actually names people, I don't think it would be too difficult. There may be factors I'm not considering though.

Harder would be more general models like GPT-2 and GPT-3.


Singular "they" doesn't care about the gender of the person named, so it should be good.


Appreciate the honesty here. Pretty wild how natural this model feels with 1 million samples.


Sometimes it seems really accurate (like the cherry-picked GIF in the overview docs) and sometimes really off.

I think for the most part, it knows more than it lets on, but finding the right sampling methods (or better yet, generalized search) to generate the best comments is a tough problem because it's difficult to evaluate quality.
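For reference, a toy implementation of top-p (nucleus) sampling, one of the standard methods in this space, over a made-up next-token distribution:

```python
import numpy as np

# Keep only the smallest set of tokens whose cumulative probability
# reaches p, renormalize, and sample from that "nucleus".
def top_p_sample(logits, p=0.9, rng=None):
    rng = rng or np.random.default_rng()
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                     # most likely first
    cumulative = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cumulative, p) + 1]  # smallest nucleus >= p
    kept = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=kept))

logits = np.array([2.0, 1.0, 0.5, -3.0, -5.0])
token = top_p_sample(logits, p=0.9, rng=np.random.default_rng(0))
print(token)  # one of tokens 0-2; the unlikely tail is cut off entirely
```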

There's some info on the sampling methods here: https://chrisbutner.github.io/ChessCoach/high-level-explanat...


Yeah, the auto-linking is just Lichess doing its best, even when the bot's talking nonsense sometimes.

And thank you!


Oh, that message is a little backwards, but the main bot only accepts challenges from 1+0 or 0+1 up to 15+10 time control.

You can challenge https://lichess.org/?user=chesscoachclassical#friend to 30+20.

Unfortunately, neither of them support correspondence.


This took about a year and a half – a little over a year coding in between experiments and training.

It's a chess engine with a primary neural network just like AlphaZero or Leela Chess Zero's, but it adds on a secondary "commentary decoder" network based on Transformer architecture to comment on positions and moves. All of the code and data for training and search is from scratch, although it does use Stockfish code to generate legal moves and manage chess positions.

You can watch it play on Lichess here: https://lichess.org/@/PlayChessCoach/tv or challenge it here: https://lichess.org/?user=PlayChessCoach#friend, and see its commentary in spectator chat. It only plays one game at a time, so you may need to wait a little bit. It's fairly strong (~3450 rating, roughly on par with Stockfish 12 or SlowChess Blitz 2.7), but you can set up a position when challenging it so that it's missing a couple pawns or a piece (Variant: From Position).

I ended up writing much more about it than I expected. If you're into the technical side of chess or machine learning, beyond the linked overview, there's:

High-level explanation: https://chrisbutner.github.io/ChessCoach/high-level-explanat...

Technical explanation: https://chrisbutner.github.io/ChessCoach/technical-explanati... (including code pointers)

Development process: https://chrisbutner.github.io/ChessCoach/development-process... (including timelines, bugs and failures)

Data: https://chrisbutner.github.io/ChessCoach/data.html (including raw measurements and tournament PGN files)

And the code is here: https://github.com/chrisbutner/ChessCoach (C++ and Python, GPLv3 or later)

Happy to answer any questions!


This is a fantastic project. Thanks for sharing!

I had a nice long conversation with two of the authors of [0] at ACL.

One thing we discussed was the reverse problem. That is, as a player, could I give commands to the model and have the engine figure out the moves that would best satisfy them.

This ranges from concrete like "take the black square bishop" (there is still variability like which piece should take it or if it's even possible) to more complex positional stuff like "set up to attack the kingside."

Any thoughts on this line of research?

[0] Automated Chess Commentator Powered by Neural Chess Engine (Zang, Yu & Wan, 2019) https://arxiv.org/pdf/1909.10413.pdf


SentiMATE[1] looks at one of the reverse problems in a way - training an engine on commentary data - although it's not exactly what you're talking about.

I think this line of thinking could eventually lead to automated metrics for commentary evaluation, which could in turn lead to better methods than top-k/top-p for turning a bunch of sequential logits into a sentence or paragraph - basically treat it like MCTS/PUCT also.
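A hedged sketch of that PUCT-style idea applied to decoding, with made-up statistics (the prior comes from the language model's token probabilities; the value would come from some hypothetical commentary-quality signal, which is exactly the missing metric discussed above):

```python
import math

# Score candidate next tokens with a PUCT-style rule instead of
# committing greedily as sequential top-k/top-p sampling does.
C_PUCT = 1.5  # exploration constant (assumed value)

def puct_select(children):
    """children: list of dicts with prior, visit count, and total value."""
    total_visits = sum(c["visits"] for c in children) + 1
    def score(c):
        q = c["value"] / c["visits"] if c["visits"] else 0.0
        u = C_PUCT * c["prior"] * math.sqrt(total_visits) / (1 + c["visits"])
        return q + u
    return max(range(len(children)), key=lambda i: score(children[i]))

children = [
    {"prior": 0.6, "visits": 10, "value": 4.0},  # likely but mediocre token
    {"prior": 0.3, "visits": 2, "value": 1.8},   # promising, under-explored
    {"prior": 0.1, "visits": 0, "value": 0.0},   # barely explored
]
print(puct_select(children))  # 1: the promising, under-explored child wins
```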

The problem is that if you look at high-level commentary - maybe Radjabov-MVL on https://www.chess.com/news/view/2021-champions-chess-tour-fi... (I'm not the best judge, just a quick search) - it's not often possible to predict the move starting with the comment. And if you did, you might end up with very dry metrics and reverse commentary.

But this direction has a lot of potential I think, beyond just chess, into more of an algorithmic/generational support for pure NN-based language models.

[1] https://arxiv.org/pdf/1907.08321.pdf


Where did you source the commentary dataset?


Not the author, but ChessBase sell a product (Megabase) which includes 85,000 annotated games in a more-or-less machine readable format. [0]

To me it's probably OK to train a model on this, at least for hobby purposes, though some GitHub Copilot critics might disagree. And a large part of ChessBase's business model is based on ripping off other people's IP and presenting it as their own [1]. But still, I can see why the author might want to be coy about answering this question.

[0] https://en.chessbase.com/post/new-mega-database-2021

[1] https://lichess.org/blog/YCvy7xMAACIA8007/fat-fritz-2-is-a-r...


Seconded. I looked around the writeup site for a bit and couldn't figure that out. That's arguably the most important piece of info about this project.


Can you have it play more games by giving it less time per turn (~2500 rating is plenty good for an opponent/coach) and playing games concurrently while it waits for the human to play?

How much does a game cost in CPU time/money?

How do I get the commentary for a game I played? Oh, it's in Analysis page.

It plays chess very well, but the commentary is incoherent and doesn't match the game well -- The attacks described are nonsense and the coordinates are wrong. It seems a little confused about which side is which? It thinks a rook can diagonally attack a bishop, and seems to name squares opposite from their actual name.


That's a good idea. A bigger problem than time-slicing is probably GPU/TPU device ownership issues and GPU/TPU memory usage with multiple games going in parallel. There may be some ways to multiplex it intelligently though.

Costs are difficult to work out - it depends on cloud vs. self-hosting, what kind of TPUs/GPUs, how long you're calculating over.

The advantage that classical/NNUE engines have is that they can more easily spread over distributed frameworks like Fishtest.


> the commentary is incoherent and doesn't match the game well

> The attacks described are nonsense and the coordinates are wrong.

Agreed, this looks superficially like commentary on the game, but honestly it doesn't seem more pertinent to the game score than a Markov chain trained on all the commentary would be (presumably this isn't true, and the author started with something like that Markov chain and the current version is way better in terms of some fitness function).

I wonder if there just is not enough training data available. GPT-3 overcomes this by harvesting a ridiculous amount of training data. AlphaZero, and the chess engine here, which is excellent, overcome it by generating their own training data through self play. But that's not applicable to the task of generating commentary.


I'm super impressed with what you've managed to create. Do you have any further plans for this project? Now that it's finished and documented to such an extent, will you try to bring it publicity and actual usage, or was this just a passion project? Thanks


Thank you! I do get that itch to jump in and improve things whenever I see it lose a game, but I don't have further plans (development or commercial) in the near-term. The goal originally was to see whether I liked ML, to decide on my next industry/career move, but there was a lot of "one more month".

I'm actually hopeful that some search techniques such as SBLE-PUCT[1] or better derivations can make their way into other open source projects, but they've had big teams working for a while on similar, often better ideas, so we'll have to see.

[1] https://chrisbutner.github.io/ChessCoach/high-level-explanat...


So, do you like ML?


Haha - I dislike how much of a black box it is, despite the statistical basis (for example, the back and forth on batch normalization rationale). But lots of interesting problems and tech to dig into.


You estimate it’s rated 3400-ish and it loses games????


It loses some games to Stockfish 13 and 14, and Lc0 - rarely at slow time control, and more often at blitz and bullet (actually, it has losses all the way down to Stockfish 9 in blitz).

Partly because of the way it tries to search more widely to avoid tactical traps, it can also be a little sloppy in holding advantages or minimizing losses (this could use some more work and tuning). This ends up making it a little drawish, so it loses less than you'd expect to Stockfish 14, but also doesn't beat up weaker engines as well as Stockfish 14 does.

You can see some of this in the raw tournament results[1]. At 40 moves per 15 minutes, repeating, each engine draws with the ones above and below it, but starts to win and lose at a distance of 2 or 3.

At 5+3 time control, ChessCoach goes 1-0-29 vs. Stockfish 12, but Stockfish 12 is better at beating Stockfish 8-11 than ChessCoach is, so CC ends up between SF11 and SF12 in the end.

On Lichess, where there's no "free time" to get ready for searches, ChessCoach's naïve node allocation/deallocation makes it waste time, and means it can't ponder for very long on the opponent's time - a big opportunity for improvement (it needs a multi-threaded pool deallocator that can feed nodes back to local pools for the long-lived search threads). I think it's also hitting a bug with Syzygy memory mapping that Stockfish works around via reloading every "ucinewgame" (which I don't trigger on Lichess). So, overall, its performance on Lichess is worse.
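The local-pool idea above could be sketched roughly like this. This is a Python stand-in for what would be C++ in the engine; the refill size, node representation, and class names are all made up:

```python
import queue

# Freed nodes go onto a shared queue, and each long-lived search thread
# tops up a local free list from it instead of hitting the allocator.
class NodePool:
    def __init__(self):
        self.shared = queue.SimpleQueue()  # filled by whichever thread frees nodes
        self.local = []                    # per-thread free list (one per thread)

    def free(self, node):
        self.shared.put(node)

    def allocate(self):
        if not self.local:                 # refill the local pool in bulk
            while not self.shared.empty() and len(self.local) < 64:
                self.local.append(self.shared.get())
        return self.local.pop() if self.local else object()  # fresh-node fallback

pool = NodePool()
for i in range(3):
    pool.free({"id": i})
node = pool.allocate()
print(node["id"])  # 2: a recycled node from the shared queue, not a fresh one
```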

Also, you can't read too much into this data - very few games, and no opening book.

[1] https://chrisbutner.github.io/ChessCoach/data.html#appendix-...


> It only plays one game at a time, so you may need to wait a little bit

Why this limitation? Is it fairly computationally expensive to run?


Yes, each bot uses a v3-8 Cloud TPU VM, and tries to be constantly playing a game. The search tree is also very memory-hungry. And right now it's also using the Python API for TensorFlow, which is likely wasting a lot of potential.

Lots of room for improvement!


Could using something like AlphaZero.jl make it more efficient?

https://github.com/jonathan-laurent/AlphaZero.jl


The engine itself is in C++, but it calls in to TensorFlow via Python as a portability/distribution vs. performance trade-off.

Next steps could be using one of Lc0's backends for GPU scenarios, or taking the other side of the trade and using the C++ API for TPU.

There's also your typical CPU and memory optimizations that could be made - some baseline work there but not targeted.


I see. I guess compute-intensive stuff is usually implemented in C++. By the way, if you don't mind, could you share your experience in learning RL? I am struggling through Sutton and Barto's text right now and wondering if I'll progress faster if I just "dive into things." Also, nice project!


I think it always helps to have a project to apply things to as you're learning something, even if it means coming up with something small. While preparing, I found it helpful to read for at least an hour each morning, and then divided the rest of the day into learning vs. "diving in" as I felt like it.

Getting deep into RL specifically wasn't so necessary for me because I was just replicating AlphaZero there, although reading papers on other neural architectures, training methods, etc. helped with other experimentation.

You may be well past this, but my biggest general recommendation is the book, "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow" to quickly cover a broad range of statistics, APIs, etc., at the right level of practicality before going further into different areas (for PyTorch, I'm not sure what’s best).

Similarly, I was familiar with the calculus underpinnings but did appreciate Andrew Ng's courses for digging into backpropagation etc., especially when covering batching.


I found "Foundations of Deep Reinforcement Learning - Theory and Practice in Python" by Laura Graesser and Wah Loon Keng quite helpful, in that it was somewhat like getting an excellent summary course covering about 6 years of RL advancements. I will say that it's quite forthcoming with the math. Anyway, I just wanted to know how they (not sure exactly who did it first, I just meant people with machines) got RL to play Atari Pitfall. So take any recommendation I make with a grain of salt.


I’d like to self host. Will this run with a gpu?


Yes! I haven't done as much testing with GPU, but did validate running with 4x V100s. You just need to adjust the "search_threads" option to the number of GPUs, but set it to at least 2.

Installation for GPU is covered here: https://github.com/chrisbutner/ChessCoach#installation (a little messy, sorry)

