League stats tests


Numsgil:
After several hours of intensive googling and reading my old stats book, here's what I've come up with:

Supposing you have two outcomes (Bot A wins, or bot B wins) and you'd like to test to see if one bot is statistically winning more than the other one, this is the formula you'd use:

Bot A wins if 2 * #wins - #rounds > Z * sqrt(#rounds)

If you want to know how far a bot currently is from winning a match, use the following formula:

Bot A needs more than (Z * sqrt(#rounds) + #rounds) / 2 wins in order to be declared the winner.

where Z is the z-score for your desired confidence level (the most common choice being Z = 2, which corresponds to about 95%).

This formula only works if #rounds >= 5, since it relies on the normal approximation to the binomial distribution, which breaks down for smaller samples.
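Here's the same test as a quick Python sketch (my own code, not anything from the DB source; the function names are made up):

--- Code: ---
from math import floor, sqrt

def leader_is_significant(wins, rounds, z=2.0):
    """True if bot A's lead is statistically significant.

    Under the null hypothesis that the bots are evenly matched,
    wins ~ Binomial(rounds, 0.5): mean rounds/2, sd sqrt(rounds)/2,
    so the test is just 'lead exceeds z standard deviations'.
    """
    return 2 * wins - rounds > z * sqrt(rounds)

def wins_needed(rounds, z=2.0):
    """Smallest win count that passes the test above."""
    return floor((z * sqrt(rounds) + rounds) / 2) + 1

# Example: out of 100 rounds at Z = 2 a bot needs 61 wins,
# since (2 * sqrt(100) + 100) / 2 = 60 exactly.
print(wins_needed(100))                # 61
print(leader_is_significant(61, 100))  # True
print(leader_is_significant(60, 100))  # False
--- End code ---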

The good news is that the way the league matches work right now is not only statistically and mathematically correct, it's probably the most reasonable trade-off between correctness and number of rounds.  The current league uses Z = 2, which corresponds to a confidence level of 95%.  My hat's off to whoever figured out the statistically valid number of rounds to run at the dawn of DB time.  Most of the stuff I double-check in the code turns out not to be correct, so I'm impressed.

To expand the current feature set, I would suggest offering the following preset confidence levels:
* Most out of X trials - Simply award a win to the bot that wins the most out of a preset number of trials.  There are still stats you can do on this, but I haven't played with it yet.
* One Standard Deviation - Corresponds to a Z of 1 and a confidence level of about 68%.  This would probably be the best for casual runs of the league.
* Two Standard Deviations - What we currently have.
* Four Standard Deviations - Corresponds to a Z of 4 and a confidence level of 99.993%.
* Professional Test - Corresponds to a Z of 4.4172 and a confidence level of 99.999%.  I believe this is the confidence level used by manufacturing companies when they test whether a lot is defective or not.  This is really overkill, but you'd be very confident in your results.

An interesting point about confidence levels: it's not that you'll be picking the wrong bot as victor 5% of the time, it's that you'll be prematurely ending the match 5% of the time.  I think.  This stuff gets really screwy in what you can and can't conclude.  But my point is that you're very unlikely to declare an inferior bot the winner.

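To make those presets concrete, here's roughly what each Z would demand out of 100 rounds (a sketch reusing the wins_needed() function from above; the Z values are the ones from the list):

--- Code: ---
from math import floor, sqrt

def wins_needed(rounds, z):
    return floor((z * sqrt(rounds) + rounds) / 2) + 1

for label, z in [("one sd", 1.0), ("two sd", 2.0),
                 ("four sd", 4.0), ("professional", 4.4172)]:
    print(f"{label:>12}: {wins_needed(100, z)} wins out of 100")

# one sd: 56, two sd: 61, four sd: 71, professional: 73
--- End code ---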
Last, here's the nice article I found to help me through the stats: Checking if a coin is fair.

Jez:
Thanks for figuring that all out again Nums,

You certainly managed to do it in a lot less time than it took originally.

Matches used to be run as just the best of five rounds before this formula was added.

Jez:
Posted by Griz, moved here by Jez;

I don't disagree with any of the above Nums ...
but these are not the points I have been trying
to bring to attention.

1st:
at what point ... how many rounds ...
are you going to be willing to call a draw a draw?
at what point are we going to stop trying to determine
a winner, and understand that our calculations are
telling us that this is a statistical draw?
1000 rounds? 10000?
statistical analysis isn't just about declaring a winner ...
it is much more about declaring if a winner can or can
not be chosen, eh?
if we are obsessed with having one or the other 'Win' ...
then we are not fully utilizing the information our statistical
analysis is providing.
what is this analysis telling you when the number of
rounds goes to a large number without it being able
to declare a clear winner?
it's telling us ... it is a statistical DRAW.
so what is that point? how many rounds?
look at the formula ...
the higher the number of rounds ...
the less and less likely it is that one or other of the
bots in question, if already close in wins, will ever gain
enough rounds over the other for a decision to be made.
the higher the number of rounds, the higher the 'gap'
between wins/losses must be.
so at some point, there is really no point in continuing.
what is that point?
certainly there must be a formula for that somewhere.

iow ...
how many times do you flip the coin before you can
determine it is not 'weighted' ... is fair, is 50/50?
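There is such a formula, though it answers a slightly different question: you can never prove a coin is exactly fair, only that any bias is smaller than some threshold you choose in advance.  A sketch of the standard normal-approximation power calculation (the specific Z values and win rates below are just example assumptions, not anything from the league code):

--- Code: ---
from math import ceil, sqrt

def rounds_for_draw_call(p, z_alpha=2.0, z_beta=0.84):
    """Rounds needed to detect a true win rate p (vs. a fair 0.5)
    with ~95% confidence (z_alpha = 2) and ~80% power (z_beta = 0.84).
    If this many rounds pass with no winner, call the match a draw."""
    n = (z_alpha * 0.5 + z_beta * sqrt(p * (1 - p))) ** 2 / (p - 0.5) ** 2
    return ceil(n)

# Smaller edges take far more rounds to detect:
print(rounds_for_draw_call(0.55))  # ~805 rounds to spot a 55/45 bot
print(rounds_for_draw_call(0.60))  # ~200 rounds to spot a 60/40 bot
--- End code ---

So "how many rounds until it's a draw" depends entirely on how small an edge you still care about; if no winner emerges within that budget, the honest answer is "statistical draw".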

2nd:
and more important, imo:
that a large error introduced by the initial positioning in the bots'
order will render all of your efforts at these precise calculations
null and void anyway.
if a bot is 'disallowed' from entering a contest with a worthy
opponent, which might require such a lengthy battle ...
and statistical analysis ...
then it does him no good to know he or his opponent
will not be 'falsely chosen'.
he's already been locked out.

so this stuff is all fine ... keep it ...
I don't have a problem with that ...
it's neat stuff ...
AS LONG AS ...
I, as User ... have the option of selecting
maximum number of rounds or control over Z ...
if that is what I want to do.

and then please take a step back from your numbers ...
and take an objective look at the way leagues are currently run
and see what needs to be done so that these efforts you are making
made here with the in depth analysis ...
won't be stomped on by some random arbitrary factor in the way
the league order is initially made.

can you dig it?

shvarz:
Jez (or Griz?), Nums's post exactly answers your questions.  It is all about "how accurate do you want to be in calling a winner?"  If all you want to do is get a rough idea, then running 5 matches would be enough.  If you want to be able to detect the tiniest differences in bots' fitness, then you need to run many, many matches.  The smaller the difference and the more certain you want to be that your result is correct, the more matches you'll have to run.


Nums, one point of caution:  The way leagues are run now destroys all your calculations.  They are run on the principle "run until one of the bots is declared a winner or until the maximum number of matches is reached".  The first part, "run until one of the bots is declared a winner", is really not the right way to do that.  In fact, it is so bad that in scientific circles it is equivalent to falsifying the data and will get your papers withdrawn and your reputation ruined.  Statistics are simply not done on the principle "repeat the experiment until you get the result".

The calculations that you describe are based on the idea that you determine in advance how many repeats you are going to do.  So, if you want to do it right, decide how accurate you want your measurement to be, calculate the necessary number of matches, and always run that number of matches for all bots you test.
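A quick simulation makes this point vivid (my own sketch, not the league code): pit two perfectly equal "bots" against each other and stop as soon as the Z = 2 test declares a winner.  The nominal error rate is 5%, but peeking after every round inflates it enormously:

--- Code: ---
import random
from math import sqrt

def sequential_false_positive(max_rounds=500, z=2.0, trials=10_000):
    """Fraction of matches between two EQUAL bots in which the
    'run until a winner is declared' rule crowns a false winner."""
    false_wins = 0
    for _ in range(trials):
        wins = 0
        for n in range(1, max_rounds + 1):
            wins += random.random() < 0.5   # fair coin: equal bots
            if n >= 5 and abs(2 * wins - n) > z * sqrt(n):
                false_wins += 1             # a winner that shouldn't exist
                break
    return false_wins / trials

print(sequential_false_positive())  # far above the nominal 0.05
--- End code ---

Fixing the number of rounds in advance, as described above, restores the advertised error rate.  (Proper sequential designs do exist, but they use wider stopping thresholds than a flat Z = 2.)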

Jez:

--- Quote from: shvarz ---Jez (or Griz?),
--- End quote ---
My bad, it was posted by Griz; I haven't managed to figure out how to move an individual post out of a thread yet though.

I am interested in the points you have made, partly because you were around when this was implemented, but mostly because I moved in this direction thanks to the statistical analysis of results being something I remembered from that little bit of A-level biology I studied.  It is something that you are much more qualified to comment on.

I am most interested in
--- Quote ---The first part "run until one of the bots is declared a winner" is really not the right way to do that.  In fact, it is so bad that in scientific circles it is equivalent to falsifying the data and will get your papers withdrawn and reputation ruined.
--- End quote ---
Is it the difference between a coin-tossing experiment proving a coin to be weighted, as opposed to proving that the 'number of birds returning to nest in a tree each year' is down to more than chance?

If we are looking at it from the 'coin tossing' angle, as the bots might be considered, how is
--- Quote ---run(ing) until one of the bots is declared a winner
--- End quote ---
or until the (weighted coin) hypothesis is proved correct, wrong? (Forgiving the 95% accuracy figure.)

If you did do it with a fixed total sample size, what sample size would you use? Or, to paraphrase Griz, "at which point is it a draw?"
Bearing in mind that some of the competitions are vs. bots that can't survive on their own, would it always be necessary to run a minimum sample size (of, say, 500 matches)?
