Data-mining is a technique used to find patterns (correlations) in large amounts of data. This is useful not only for commercial companies that, for example, want to know whether they can still sell their chessboards to their current target group of old grey men, or whether they need to tap into generation Z. It is not just about finding first-order relationships: second- and even third-order relationships can also be useful.
This blog has already shown numerous examples of thorough research, but I think the host of this blog would certainly have gotten results faster here and there with a more advanced search engine that finds patterns, novelties, winning turns and original statements in a more automatic way.
What we can do today with Chessbase's filters is already fun, and Chessbase is slowly moving towards more functionality, but we are not there yet. Chess Query Language (CQL), which makes much more possible on pgn databases, is a further step, but there are limits to that too. To do real data-mining on chess games, *all* details of the game should be known: not only all moves, but also all pawn formations (on each move), all positional features (such as doubled rooks, bishop pair, immobilized piece, ...), all maneuvers (sham sacrifice, knight on the edge, ...) and all threats (smothered mate, fork, ...), as well as the thinking times used (impact of time trouble!), all data of the players, ... Only then is a uniform query possible, such as the relationship between a won rook endgame (due to the threat of switching to a won pawn endgame), Elo strength, opening, the course of the game and, e.g., the age of the players.
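To give an idea of what extracting such positional features could look like, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the function names are hypothetical, the position is read from a standard FEN string, "bishop pair" means bishops on both square colours, and "doubled rooks" is simplified to two rooks of the same colour on the same file (without checking whether anything stands between them).

```python
def parse_board(fen):
    """Map (file, rank) -> piece letter from the placement field of a FEN string."""
    board = {}
    placement = fen.split()[0]
    for rank_idx, row in enumerate(placement.split("/")):
        file_idx = 0
        for ch in row:
            if ch.isdigit():
                file_idx += int(ch)  # a digit means that many empty squares
            else:
                board[(file_idx, 7 - rank_idx)] = ch
                file_idx += 1
    return board


def has_bishop_pair(board, white=True):
    """True if the side has bishops on both light and dark squares."""
    bishop = "B" if white else "b"
    colours = {(f + r) % 2 for (f, r), piece in board.items() if piece == bishop}
    return len(colours) == 2


def has_doubled_rooks(board, white=True):
    """Simplified check: two rooks of the same colour share a file."""
    rook = "R" if white else "r"
    files = [f for (f, r), piece in board.items() if piece == rook]
    return len(files) != len(set(files))


START = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
board = parse_board(START)
print(has_bishop_pair(board), has_doubled_rooks(board))
```

Running such checks on every position of every game would produce the kind of feature table that a data-mining tool could then correlate with results, ratings or thinking times.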
The question is, of course, whether such a thorough analysis can add anything. Maybe to identify general trends, but in my opinion certainly not to help with game-preparation. Even if you find out that your opponent plays his knight-endgames badly, you are not going to play a second-choice move in the middle-game just to get into a knight-endgame, I think.
Chessbase already shows relations between the opening line and the endgames that typically result from it, and that in itself is very useful. But I got the idea of data-mining when, years ago, I accidentally discovered a very large win percentage in an opening variation in computer games. The position just after the opening was always the same, but White won almost all games. It concerns this line:
[Event "?"]
[Site "?"]
[Date "????.??.??"]
[Round "?"]
[White "?"]
[Black "?"]
[Result "*"]

1. e4 e5 2. d4 exd4 3. Qxd4 Nc6 4. Qe3 Nf6 5. Bd2 Be7 6. Nc3 d5 7. exd5 Nxd5 8. Qg3 Nxc3 9. Bxc3 Bf6 10. Bb5 *
My database of CCRL games contains 38 games with this variation. Black wins 3 of them and there are 11 draws, so White wins 24 (!). The problem in the position is that many engines entered the sequence 10...Qd5 11.Bxf6 gxf6 12.Bxc6+ Qxc6 13.Qg7 Ke7 14.Qxh8, and White usually won. Well, I have to admit that the ratings were usually respected (the higher-rated engine generally won), so this was also an example of statistical coincidence. Such a filter already exists in Chessbase: on a selection you can check which variations (ECO codes in particular) score best (something Lichess also lets you do with your own games).
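To put a number on those 38 games: 24 wins and 11 draws give White 29.5 points out of 38, or roughly 77.6%. The sketch below computes that score and, using the standard Elo expectation formula, the rating gap that would produce such a score on its own. The function names are mine, and I do not have the individual engine ratings at hand, so this only shows how one could check whether a score is "explained" by the ratings.

```python
import math


def score_fraction(wins, draws, losses):
    """Fraction of points scored, counting a draw as half a point."""
    games = wins + draws + losses
    return (wins + 0.5 * draws) / games


def elo_expected(rating_a, rating_b):
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def implied_rating_gap(score):
    """Rating difference for which the Elo model predicts the given score."""
    return -400.0 * math.log10(1.0 / score - 1.0)


white = score_fraction(wins=24, draws=11, losses=3)
gap = implied_rating_gap(white)
print(f"White scores {white:.1%}, matching a rating edge of about {gap:.0f} Elo")
```

If the average rating difference in favour of White in those games is of a similar size, the variation itself explains little; if it is much smaller, the line deserves a closer look.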
Several points are illustrated with this example: 1) there are still very nice things to be found in computer databases; 2) always interpret the results of a filter (a variation that Walter Browne often lost is not necessarily bad, because Browne was a notorious time-trouble addict) - you also have this problem with computer games: some engines have a better time-management algorithm, or are tactically better than the opponent when the thinking time becomes very short.
Regarding statistical coincidence, I would like to add this: once - in the distant past - Fritz and Junior played a real “computer candidates match”. Professor Enrique Irazoqui organized the match in Cadaques (The gospel according to Enrique Irazoqui). The intention was to select a “challenger” to play a match against Kramnik in Bahrain in October 2002 (see Brains in Bahrain and 32-bit op 64 velden). There was a lot of controversy, because Fritz and Junior were handpicked by Chessbase, and other engines (Rebel, Hiarcs, Shredder and other (sub-)top engines from that period) were simply ignored. Junior started that match with 5 wins over Fritz, but Fritz levelled the match over 24 games and won the play-off. The games themselves can hardly be found on the internet, but the reports are fortunately still there: twic339, Kramnik versus Deep Fritz 2002 and games.onlinesupplement2.
In other words, if the sample size had not been 24 games, but only six or twelve, the result of the match would have been different. There was already a lot of discussion during the match about the settings of Junior and Fritz to explain the 5-0 start, and even more so when Fritz levelled the match. Imagine the reaction had this happened in a match between two humans. Hence the criticism of the ever-shorter World Championship matches: players no longer take any risks, because once a player is in the lead in a match over 12 games, for example, it is only a matter of guarding that lead (which was once Fischer's great criticism of a match with a fixed number of games). Many World Championship matches (and long tournaments) have shown that fitness, for example, is also an element of a player's strength. Rubinstein was a diesel, while other players just weakened if it took "too long". But now we are already a long way from the starting point.
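How much the match length matters can be estimated with a small simulation. The sketch below is a rough model under my own assumptions: every game is independent, with fixed probabilities for a win and a draw (the 30% / 40% used in the example are invented, not measured from Fritz and Junior). It counts how often the leader after a short match differs from the outcome after the full 24 games.

```python
import random


def play_match(n_games, p_win, p_draw, rng):
    """Return the running margin (wins of A minus wins of B) after each game."""
    margin = 0
    margins = []
    for _ in range(n_games):
        r = rng.random()
        if r < p_win:
            margin += 1          # player A wins
        elif r >= p_win + p_draw:
            margin -= 1          # player B wins
        margins.append(margin)   # a draw leaves the margin unchanged
    return margins


def sign(x):
    """-1, 0 or 1: who leads (0 means the match is level)."""
    return (x > 0) - (x < 0)


def disagreement_rate(short_len, full_len, p_win, p_draw, trials=20000, seed=1):
    """Share of simulated matches where the standing after short_len games
    (A ahead, level, B ahead) differs from the standing after full_len games."""
    rng = random.Random(seed)
    differ = 0
    for _ in range(trials):
        margins = play_match(full_len, p_win, p_draw, rng)
        if sign(margins[short_len - 1]) != sign(margins[-1]):
            differ += 1
    return differ / trials


for short in (6, 12):
    rate = disagreement_rate(short, 24, p_win=0.3, p_draw=0.4)
    print(f"After {short} games the standing differs from the final one in {rate:.0%} of matches")
```

The exact percentages depend entirely on the assumed probabilities, but the exercise makes the point about sample size concrete: between two evenly matched players, a short match frequently crowns a different winner than a long one.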