lichess.org

Is Eval by Time enough to determine Elo?

It feels a bit like the article jumps into the middle of things. I would really like to have "eval by time" explained first.
Yeah, basically it's the graph that's drawn after engine analysis, right? You'd think this would allow you to at least pick out low-rated players, whose graph constantly switches from -8 to +11 to M-5 and back to +5, then M-3 again. Games of titled players only very rarely look like that.
Hypothesis: It is possible to tell a player's title just by examining (time + win%) over a short series of games. It might not be as simple as doing a linear regression though; you might need to model:
- blunders in "winning/drawn/lost" positions with/without time pressure in opening/middlegame/endgame
- mistakes in "winning/drawn/lost" positions with/without time pressure in opening/middlegame/endgame
- inaccuracies in "winning/drawn/lost" positions with/without time pressure in opening/middlegame/endgame
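A rough sketch of how the buckets in the list above could be counted for one player. Everything numeric here is an assumption for illustration: the win% drop thresholds for blunder/mistake/inaccuracy, the 70/30 cutoffs for "winning"/"lost", the 10-second time-pressure limit, and the ply-based phase boundaries are all made up, and `bucket_errors` is a hypothetical helper, not something from the article.

```python
from collections import Counter

# Illustrative thresholds (assumptions): win-percentage points lost by the mover.
SEVERITY = [(30.0, "blunder"), (20.0, "mistake"), (10.0, "inaccuracy")]

def severity(win_drop):
    """Label a move by how many win% points the mover lost, or None if small."""
    for threshold, label in SEVERITY:
        if win_drop >= threshold:
            return label
    return None

def position_state(win_before):
    """Coarse state of the position before the move, from the mover's view."""
    if win_before >= 70.0:
        return "winning"
    if win_before <= 30.0:
        return "lost"
    return "drawn"

def phase(ply):
    """Very crude phase split by ply number (assumption, not real phase detection)."""
    if ply <= 20:
        return "opening"
    if ply <= 60:
        return "middlegame"
    return "endgame"

def bucket_errors(moves, pressure_seconds=10.0):
    """moves: iterable of (ply, win_before, win_after, clock_seconds) for one
    player, with win percentages given from that player's point of view."""
    counts = Counter()
    for ply, win_before, win_after, clock in moves:
        label = severity(win_before - win_after)
        if label is None:
            continue
        pressure = "time_pressure" if clock < pressure_seconds else "no_pressure"
        counts[(label, position_state(win_before), pressure, phase(ply))] += 1
    return counts
```

The resulting counter gives one feature per (severity, state, pressure, phase) combination, which could then be fed to whatever model is being tried.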
Maybe changes in eval would be easier for the model to process than the eval itself.
I think that using 1+0 bullet games to train a model is problematic because of the sheer amount of noise. Generally the quality of the games is pretty low, and at short, non-increment time controls, there will be the time scramble phase where even strong players will make moves that are objectively bad to try to flag the opponent (not to mention other confounding issues like strong players making bad blunders, or weaker players accidentally making really strong moves). The fact that the model is relying heavily on opponent rating shows that the model is "cheating" because I would imagine that most games take place between opponents whose ratings are at least somewhat similar.

No guarantees any of this will help, but you might consider:
(1) only analyzing the portions of the games where the players have more than 10 seconds left, or some other arbitrary threshold above the time-scramble chaos phase;
(2) if such a data set exists, analyzing 1+1 bullet games instead of 1+0 bullet games;
(3) removing the portion of the game after one side is completely winning. A player rated 2000 in bullet will probably checkmate a lone king with 2 queens using premoves the exact same way a GM would;
(4) removing some or all of the opening moves of a game, because players following theory will make good moves in roughly equivalent time, so it's a possible confounding factor in determining the Elo of a player (i.e. in bullet, weaker players might appear stronger than they actually are in the opening phase because they're following theory, or playing an "automatic" system like the London, Benko, King's Indian Attack setup, etc., where they play the first 10 or so moves fairly quickly and make objectively strong moves).
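Suggestions (1), (3) and (4) could be sketched as a single filter over the moves of a game. The input layout `(ply, clock_seconds, eval_pawns)` and the defaults (10 seconds, 20 plies of opening, a 5-pawn "completely winning" cutoff) are assumptions picked for illustration; only the 10-second figure comes from the text above.

```python
def filter_moves(moves, min_clock=10.0, skip_plies=20, decided_eval=5.0):
    """Keep only the moves that are less likely to be noise.

    moves: list of (ply, clock_seconds, eval_pawns), in game order, where
    clock_seconds is the mover's remaining time and eval_pawns is the
    engine evaluation after the move.
    """
    kept = []
    for ply, clock, ev in moves:
        if abs(ev) >= decided_eval:
            break       # (3) stop once one side is completely winning
        if ply <= skip_plies:
            continue    # (4) drop the opening moves
        if clock <= min_clock:
            continue    # (1) drop moves made in the time scramble
        kept.append((ply, clock, ev))
    return kept
```

Cutting the game at the first decided position is one possible reading of (3); in bullet the eval can swing back, so dropping only the individual decided positions instead would be another reasonable choice.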
Have you tried an RNN, a 1-dimensional CNN, or other sequential neural network architectures to process the time series?
Evaluation variability between moves should be a strong indication of both players' Elo (large blunders on both sides are something more experienced players should not make).

How about adding handcrafted features to your regression model: std(eval), or even mean(abs(eval_n - eval_(n-1))) to measure the differences between consecutive moves?
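A minimal sketch of those two features, assuming the evals are given in pawns as a plain list. Clamping to ±10 is my own assumption so that mate scores or huge evals don't dominate; `volatility_features` is a hypothetical name.

```python
from statistics import fmean, pstdev

def volatility_features(evals, cap=10.0):
    """Return std(eval) and mean(abs(eval_n - eval_(n-1))) for one game.

    evals: engine evaluations in pawns, one per ply. Values are clamped
    to [-cap, cap] first (an assumption, to keep extreme scores bounded).
    """
    clamped = [max(-cap, min(cap, e)) for e in evals]
    deltas = [abs(b - a) for a, b in zip(clamped, clamped[1:])]
    return {
        "eval_std": pstdev(clamped) if clamped else 0.0,
        "mean_abs_delta": fmean(deltas) if deltas else 0.0,
    }
```

A quiet game gives small values for both, while a graph that jumps between -8 and +11 gives large ones, which is the pattern described earlier in the thread.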

Given the models you use, it might be interesting to explicitly add features such as "time to mistake in opening": basically, how long it takes for the eval bar to swing beyond +1 or -1, indicating a mistake by one side.
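One way this feature might be computed, measuring "how long" both in plies and in total seconds spent. The input lists and the name `first_swing` are assumptions for the sake of the example.

```python
def first_swing(evals, move_times, threshold=1.0):
    """Find the first ply where the eval leaves the [-threshold, threshold] band.

    evals: engine evaluation in pawns after each ply.
    move_times: seconds spent on each ply (same length as evals).
    Returns (ply, seconds_elapsed) with a 1-based ply, or None if the
    eval never swings that far.
    """
    elapsed = 0.0
    for ply, (ev, spent) in enumerate(zip(evals, move_times), start=1):
        elapsed += spent
        if abs(ev) > threshold:
            return ply, elapsed
    return None
```

Games where the result is None need some fill-in value (for example the game length) before the feature goes into a regression.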

Finally, you could count the number of moves played after the engine has detected a forced mate; a high count would be indicative of lower Elo.

Now, if you want to add the "time dimension", I am sure you could add features such as "average time per move", or even break this down per game phase.
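A small sketch of the time features, assuming you have one player's remaining clock after each of their moves (starting with the initial clock). The phase boundaries by move number are arbitrary assumptions, and both function names are hypothetical.

```python
from statistics import fmean

def time_spent(clocks, increment=0.0):
    """Seconds spent on each move, from one player's remaining-clock readings.

    clocks: remaining seconds, starting with the initial clock, then the
    reading after each of the player's moves. The increment is added back
    because the clock gains it after every move.
    """
    return [prev + increment - cur for prev, cur in zip(clocks, clocks[1:])]

def mean_time_by_phase(spent, opening_moves=10, middlegame_moves=30):
    """Average time per move overall and per (crudely defined) game phase."""
    phases = {
        "overall": spent,
        "opening": spent[:opening_moves],
        "middlegame": spent[opening_moves:middlegame_moves],
        "endgame": spent[middlegame_moves:],
    }
    # None marks a phase the game never reached.
    return {name: (fmean(v) if v else None) for name, v in phases.items()}
```

For a 1+0 game the increment stays at 0; for 1+1 you would pass `increment=1.0`.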

It seems to me a bit of work on the feature end could go a long way in your case. The rule of thumb usually is, "if a human can see a pattern, ML algos should be able to see it too".

Good luck!
Usually you want to predict Elo from how many inaccuracies/mistakes the player makes. Your model doesn't attempt to do that, does it? For that you need a different dataset - not the graph of eval, but the graph of eval change *after one specific player's turn*.
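To make that concrete, here is a sketch of turning the plain eval graph into per-player eval changes. It assumes the evals are in pawns from White's point of view, one per ply, with White moving first; treating improvements as zero loss is my own simplification, and `per_player_losses` is a hypothetical name.

```python
def per_player_losses(evals, start_eval=0.0):
    """Split the eval graph into the eval lost by each player on their own moves.

    evals[i] is the engine evaluation (pawns, White's point of view) after
    ply i + 1. start_eval is the evaluation of the starting position.
    """
    losses = {"white": [], "black": []}
    prev = start_eval
    for i, ev in enumerate(evals):
        if i % 2 == 0:
            # White just moved: a drop in eval is White's loss.
            losses["white"].append(max(0.0, prev - ev))
        else:
            # Black just moved: a rise in eval is Black's loss.
            losses["black"].append(max(0.0, ev - prev))
        prev = ev
    return losses
```

Each list can then be thresholded into inaccuracies/mistakes/blunders for that one player, rather than mixing both players' errors in a single graph.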