Thursday, November 12, 2020

Testing chess-engines part 2

Corona has a serious impact on our social lives and, for many, on our professional lives too. The damage grows by the day, and no doubt many will find it hard to recover financially or emotionally. However, even in the darkest periods there are positive elements. Some time ago Belgian newspapers reported that high-school students achieved slightly higher grades than in previous years; see e.g. "Higher success-rates in French high-schools despite corona-virus", "No corona-effect at university of Brussels as students perform better at exams than previous year" and "Despite corona success-rate is higher than usual in Leuven".

Nonetheless, students were warned in advance that professors wouldn't make their exams easier. Some experts also thought that missing many classes would prevent students from understanding the courses sufficiently. However, I think that the lack of distraction due to corona probably pushed students to study more than usual.

I think this was a smart decision. I notice that this crisis also creates opportunities to start new projects which in normal times would be hard or even impossible. Among chess-programmers too we see an increase in activity. Progress accelerated this year as we got several important updates from Leela and Stockfish. I find it rather amazing that after all those years we still see so much progress, and I expect most people no longer realize how strong the current best commercial engine has become. Therefore I thought it would be interesting, for once, to visualize this progress in a small graphic.

This youtube-movie shows a more detailed overview of the evolution of the best engines. There exist others but the message is always the same. In a bit more than 3 decades the engines have evolved from very weak to insanely strong and very difficult to grasp for a human.

At TCEC there is even a permanent running gag about how often somebody requests a match between Carlsen and the computer. In my graphic above you can clearly see when the strongest commercial engine definitively surpassed the level of the strongest human. In 2006 it was Rybka, making any further matches between humans and engines futile, and the gap has only increased since then.

Today, testing the best engines only makes sense against each other. Last year I wrote in part 1 that I liked running those tests but that they were too time-consuming, so not something I would repeat often. Naturally the corona-crisis suddenly cleared my calendar and allowed me to pick up this hobby again. In the last year I organized a dozen matches, each consisting of 100 rapid games (15min + 10sec), using different computers and each time newer and stronger engines.

It is hard to deduce from the above table how much the strongest commercial engine progressed in the last year. Therefore I also compared Leela v22 (end of last year) and Leela v26 (now) against Komodo 11 on my new laptop. The result was amazing. Last year I was already impressed by the score of 62.5 - 37.5 in favor of Leela, but this is small beer compared with the new score of 75 - 25 for the more recent Leela-version. That is about 100 TPR extra. In other words it is time to update Leela if you are still working with a version from last year (the best test-results on my computers were achieved by v0.26.1 with network J92-210).
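The "about 100 TPR" figure can be checked with a few lines of Python. This is only a minimal sketch, assuming the standard logistic Elo model in which a score fraction p corresponds to a rating gap of -400 * log10(1/p - 1); the function name `elo_diff` is my own choice for illustration.

```python
import math


def elo_diff(score_fraction: float) -> float:
    """Rating gap implied by a match score under the logistic Elo model.

    score_fraction is the share of points scored, e.g. 0.75 for 75 - 25.
    """
    if not 0.0 < score_fraction < 1.0:
        raise ValueError("score_fraction must be strictly between 0 and 1")
    return -400.0 * math.log10(1.0 / score_fraction - 1.0)


# Match scores against Komodo 11 quoted in the text.
old_gap = elo_diff(0.625)  # Leela v22: 62.5 - 37.5
new_gap = elo_diff(0.75)   # Leela v26: 75 - 25

print(round(old_gap), round(new_gap), round(new_gap - old_gap))
# prints: 89 191 102
```

With only 100 games per match the error margin on such a number is large, so the outcome should be read as a rough indication and nothing more.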

It is trial and error to find the best version of Leela. Some tests with more recent versions performed worse, so you never know in advance whether you should update or not. Anyway it also largely depends on which hardware (graphics card) you are using. That is also why I keep track in my tests of which hardware I used.

We see that my most recent version of Stockfish profits more from my new desktop than Leela does. After I replaced my old laptop last year, I decided last month to also upgrade my old desktop (only 4 years old, but it had a very bad graphics card and I often encountered problems with the memory). I notice Stockfish achieves 100-200% more nodes on my new desktop compared with my new laptop. Leela only gains about 50% more nodes.

So progress happens in both the software and the hardware. Besides, it becomes harder and harder to measure this progress properly. You can also see in my tests that the drawing-rate in my matches keeps getting closer to 100%. This corresponds to what I described in my last article: the closer we get to perfection, the more draws we see. Even using obligatory openings starts to lose its effectiveness.

With the above table I keep track of which openings are interesting and which are not. Green is fine. Orange means that the opening needs to be checked more carefully. Once it is red, the opening needs to be replaced. That happens when the same color wins in 4 consecutive games with the same opening, or when 8 consecutive games with the same opening all end in a draw. After my last match I have to replace 22 out of 50 openings because of those conditions.
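The replacement rule for an opening can be written down as a small check. The sketch below rests on my own assumptions: results are stored per opening in chronological order as the strings "1-0", "0-1" and "1/2-1/2", and a streak anywhere in that history counts. The name `needs_replacement` is hypothetical, and the orange status is left out because its exact condition is not defined above.

```python
from itertools import groupby

WIN_STREAK_LIMIT = 4   # same color wins 4 games in a row
DRAW_STREAK_LIMIT = 8  # 8 draws in a row


def needs_replacement(results):
    """Return True if an opening has turned 'red' under the rule above.

    results: chronological list of "1-0", "0-1" or "1/2-1/2"
    for all games played with one opening (assumed format).
    """
    # groupby splits the history into runs of identical consecutive results.
    for outcome, run in groupby(results):
        length = sum(1 for _ in run)
        if outcome == "1/2-1/2":
            if length >= DRAW_STREAK_LIMIT:
                return True
        elif length >= WIN_STREAK_LIMIT:
            return True
    return False


# Made-up histories, purely for illustration.
print(needs_replacement(["1-0", "1-0", "1-0", "1-0"]))         # True
print(needs_replacement(["1-0", "1-0", "0-1", "1-0", "1-0"]))  # False
print(needs_replacement(["1/2-1/2"] * 8))                      # True
```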

At TCEC Nelson Hernandez and Jeroen Noomen are continuously looking for openings which allow optimal testing of the engines. This is an ever-growing challenge. After my first cycle (4 matches) I only needed to replace 3 openings. After the second cycle I replaced 15 openings, and now there are 22 which aren't useful anymore. I had hoped to see the reverse, since I had already removed the bad openings earlier. In any case the super-final of TCEC Season 19, which Stockfish won by 9 points, was clearly a nice job of selecting interesting openings.

Probably some readers will wonder why I am still organizing those matches. Today you just download Stockfish 12 and you can start the analysis. That is correct for now, but a couple of months ago this version wasn't available yet. I mean that new releases are popping up at a rapid pace and it is very easy to miss the best engines. Last year, until September, I was still using Komodo 11 for my analysis. I believe my analysis is currently 200 points stronger, and even at my level this makes a (modest) difference while preparing for a standard over-the-board game.

During the tests I also noticed that there is only about 60% overlap between the moves recommended by Leela and Stockfish. Stockfish 12 is surely sufficient, but at some moments Leela still gives you extra useful input. Anyway, testing engines is also fun to watch, and this is something which I welcome in times of corona.
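An overlap figure like this can be computed by comparing the first choice of both engines position by position. The snippet is a minimal sketch under my own assumptions: the function name `move_agreement` and the move lists are invented for illustration and are not data from my tests.

```python
def move_agreement(moves_a, moves_b):
    """Fraction of positions in which two engines recommend the same move.

    moves_a, moves_b: equally long lists holding each engine's first
    choice for the same sequence of positions (assumed format).
    """
    if len(moves_a) != len(moves_b):
        raise ValueError("both lists must cover the same positions")
    if not moves_a:
        raise ValueError("at least one position is required")
    same = sum(a == b for a, b in zip(moves_a, moves_b))
    return same / len(moves_a)


# Hypothetical first choices in five positions.
leela = ["e4", "Nf3", "Bb5", "O-O", "Re1"]
stockfish = ["e4", "Nf3", "Bc4", "O-O", "d3"]

print(move_agreement(leela, stockfish))  # 0.6
```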

