I've been reviewing the literature on how Elo and similar rating systems work. I have some observations:

1. Because we can choose which bots fight which other bots, the simplest approach is to pick a random sample (with replacement) of opponents for a challenger to fight one round each against. From that we get its win rate, plus a confidence bound on that win rate; so your global win rate is 75% +/- 3%, say, and we can control the worst-case +/- factor by increasing the sample size. Also, as Peter pointed out, winning doesn't have to be binary: you could score an 80% win after 100k cycles because you control 80% of the biomass in the sim, and the overall win rate can factor that in pretty easily.
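A quick sketch of that sampling scheme (the bot names, the 75% true win rate, and the `challenger_beats` callback are all made up for illustration; the interval is the usual normal approximation for a proportion):

```python
import math
import random

def sample_win_rate(challenger_beats, opponents, n_matches, rng):
    """Estimate a challenger's global win rate from a random sample
    (with replacement) of opponents, plus a 95% confidence margin."""
    wins = 0.0
    for _ in range(n_matches):
        opp = rng.choice(opponents)
        wins += challenger_beats(opp)  # can be fractional, e.g. biomass share
    p = wins / n_matches
    # Normal-approximation 95% interval; the worst case is at p = 0.5,
    # so the margin is bounded above by 0.98 / sqrt(n_matches).
    margin = 1.96 * math.sqrt(p * (1 - p) / n_matches)
    return p, margin

# Hypothetical league: challenger truly wins 75% of rounds against anyone.
rng = random.Random(42)
opponents = [f"bot_{i}" for i in range(50)]
beats = lambda opp: 1.0 if rng.random() < 0.75 else 0.0
p, margin = sample_win_rate(beats, opponents, 2000, rng)
```

So hitting a +/- 3% bound just means picking `n_matches` large enough that the worst-case margin `0.98 / sqrt(n)` drops below 0.03 (a bit over a thousand rounds).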

Your final win rate could be your rating: it's basically an unbiased estimate of your actual win rate if we ran an infinite number of rounds against all other bots, and it orders bots the same way the Elo ratings would. The disadvantage is that the win rate will change over time as new bots are added to the league, so your rating is not constant. But every time a challenger is added we only need to run the rounds necessary for it to get its own global win rate.

If we anchored a bot at a specific Elo (the animal minimalis equivalent has, say, 1000 Elo) we could probably back out Elo ratings from the relative win percentages, I think. Something something math math.
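For the "math math" part, one concrete option: if we adopt the standard logistic Elo curve (a modeling choice, not something our sim dictates), the head-to-head win rate against the anchor converts directly into a rating.

```python
import math

ANCHOR_ELO = 1000  # e.g. the animal minimalis equivalent

def elo_vs_anchor(win_rate_vs_anchor):
    """Convert a head-to-head win rate against the anchor bot into an
    Elo rating, using the standard logistic Elo curve:
        P(win) = 1 / (1 + 10 ** ((elo_anchor - elo) / 400))
    Solving that for elo gives the formula below."""
    p = win_rate_vs_anchor
    return ANCHOR_ELO + 400 * math.log10(p / (1 - p))

# A bot that beats the anchor 75% of the time:
rating = elo_vs_anchor(0.75)  # about 1191
```

Note this only pins down ratings relative to the anchor from direct matches; chaining through intermediate bots (or the least-squares approach in point 2) is what makes the whole league consistent.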

We'd have to rerun the leagues after every new version, though, since the stored win rates come from matches against the old versions and are no longer valid. But that could form the seasons Panda was talking about.

2. There are N choose 2 ways to pair off N bots. Each pairing (A, B) produces a probability that A beats B, call it P(A > B) (which is just A's win rate for the A-B match). Taking the inverse CDF of the standard normal distribution (call it Phi_inv) gives a system of N choose 2 equations we can least-squares solve for Elo ratings: Phi_inv(P(A > B)) = (s_A - s_B) / (sqrt(2) * Beta), where s_A and s_B are the Elo ratings of A and B, and Beta is the standard deviation of a single bot's performance (assuming every bot has the same variance, which is a big assumption but makes the math easier). That's basically what Elo is trying to approximate incrementally. But we have the computing power to calculate it directly.
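A toy version of that solve, in pure Python (three made-up bots with one anchored at 1000, Beta = 200 is an arbitrary choice, and the pairwise probabilities are generated from known "true" ratings so we can check the recovery):

```python
import math
from statistics import NormalDist

BETA = 200.0                       # assumed common performance std dev
SCALE = math.sqrt(2) * BETA
phi = NormalDist()                 # standard normal; phi.inv_cdf is Phi_inv

# Made-up "true" ratings; A is the anchor, fixed at 1000.
true = {"A": 1000.0, "B": 1100.0, "C": 1250.0}

def p_win(si, sj):
    """P(i beats j) under the shared-variance normal performance model."""
    return phi.cdf((si - sj) / SCALE)

# One equation per pair: Phi_inv(P(i > j)) * SCALE = s_i - s_j.
# With s_A fixed, the unknowns are x = [s_B, s_C].
rows, rhs = [], []
for i, j in [("A", "B"), ("B", "C"), ("A", "C")]:
    d = phi.inv_cdf(p_win(true[i], true[j])) * SCALE
    coef = {"B": 0.0, "C": 0.0}
    base = 0.0
    for name, sign in ((i, 1.0), (j, -1.0)):
        if name == "A":
            base += sign * 1000.0  # anchor moves to the constant side
        else:
            coef[name] += sign
    rows.append([coef["B"], coef["C"]])
    rhs.append(d - base)

# Normal equations (A^T A) x = A^T b, solved 2x2 by Cramer's rule.
ata = [[sum(r[a] * r[b] for r in rows) for b in (0, 1)] for a in (0, 1)]
atb = [sum(r[a] * v for r, v in zip(rows, rhs)) for a in (0, 1)]
det = ata[0][0] * ata[1][1] - ata[0][1] * ata[1][0]
s_B = (atb[0] * ata[1][1] - ata[0][1] * atb[1]) / det
s_C = (ata[0][0] * atb[1] - atb[0] * ata[1][0]) / det
```

With exact probabilities the recovery is exact (s_B = 1100, s_C = 1250); with sampled win rates the least-squares solve averages out the noise across all N choose 2 equations, which is where it beats chaining pairwise conversions. For a real league you'd feed the measured win rates in and use a proper linear algebra library for the solve.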

I'm working through the articles on TrueSkill and Glickman right now; they might be doing something more clever. At the very least they factor in confidence intervals. But because we can run exactly the rounds we want and no more, and can choose how matches are paired, I think we can do a lot more global optimizing and a lot less incremental updating.

...

But yes, in principle I think the statistical approach is pretty compelling. I think that's the way to go for sure.