Author Topic: League stats tests  (Read 7664 times)

Offline Numsgil

  • Administrator
  • Bot God
  • *****
  • Posts: 7742
    • View Profile
League stats tests
« on: January 03, 2007, 01:34:44 PM »
After several hours of intensive googling and reading my old stats book, here's what I've come up with:

Suppose you have two outcomes (bot A wins, or bot B wins) and you'd like to test whether one bot is statistically winning more than the other. This is the test you'd use:

Bot A wins if 2 * #wins - #rounds > Z * sqrt(#rounds)

If you want to know how far a bot currently is from winning a match, use the following formula:

Bot A needs at least (Z * sqrt(#rounds) + #rounds) / 2 wins (equivalently, a winning fraction of at least 1/2 + Z / (2 * sqrt(#rounds))) in order to be declared the winner.

where Z is the number of standard deviations corresponding to your chosen confidence level (the most common choice being Z = 2, which corresponds to roughly 95%).

This formula only works if #rounds >= 5.
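
As a rough sketch of that test in plain Python (wins_needed and is_winner are just made-up names, not the actual league code, and I'm treating the boundary case as a win so the threshold comes out as a whole number of wins):

Code:
import math

def wins_needed(rounds, z=2.0):
    # Smallest number of wins that puts bot A's lead at or beyond
    # z standard deviations of a fair coin (z = 2 is roughly 95%)
    return math.ceil((z * math.sqrt(rounds) + rounds) / 2)

def is_winner(wins, rounds, z=2.0):
    # True when 2 * wins - rounds >= z * sqrt(rounds)
    return 2 * wins - rounds >= z * math.sqrt(rounds)

print(wins_needed(10))   # 9 wins out of 10 at Z = 2
print(is_winner(9, 10))  # True
print(is_winner(6, 10))  # False: too close to call at Z = 2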

The good news is that the way league matches work right now is not only statistically and mathematically sound, it's probably the most reasonable trade-off between correctness and number of rounds.  The current league uses Z = 2, which corresponds to a confidence level of 95%.  My hat's off to whoever figured out the statistically valid number of rounds to run at the dawn of DB time.  Most of the stuff I double-check in the code turns out not to be correct, so I'm impressed.

To expand the current feature set, I would suggest offering the following preset confidence levels for rounds (see the sketch after the list):
  • Most out of X trials - Simply award a win to the bot that wins the most out of a preset number of trials.  There are still stats you can do on this, but I haven't played with it yet.
  • One Standard Deviation - Corresponds to a Z of 1, and a confidence level of about 68%.  This would probably be the best for casual runs of the league.
  • Two Standard Deviations - What we currently have.
  • Four Standard Deviations - Corresponds to a Z of 4, and a confidence level of 99.993%.
  • Professional Test - Corresponds to a Z of 4.4172 and a confidence level of 99.999%.  I believe this is the level used by manufacturing companies when they test whether a lot is defective.  This is really overkill, but you'd be very confident in your results.
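In code these presets would amount to little more than a table of Z values (hypothetical names below; the "most out of X trials" mode would skip the test entirely):

Code:
# Hypothetical preset table; Z values as in the list above
LEAGUE_PRESETS = {
    "one_std_dev":   1.0,     # ~68% confidence, casual runs
    "two_std_devs":  2.0,     # ~95% confidence, current behaviour
    "four_std_devs": 4.0,     # ~99.99% confidence
    "professional":  4.4172,  # ~99.999% confidence, overkill
}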
An interesting point about confidence intervals: it's not that you'll be picking the wrong bot as victor 5% of the time, it's that you'll be prematurely ending the match 5% of the time.  I think.  This stuff gets really screwy in terms of what you can and can't conclude.  But my point is that you're very unlikely to declare an inferior bot the winner.

Last, here's the nice article I found to help me through the stats: Checking if a coin is fair.
« Last Edit: January 03, 2007, 01:35:41 PM by Numsgil »

Offline Jez

  • Bot Overlord
  • ****
  • Posts: 788
    • View Profile
League stats tests
« Reply #1 on: January 03, 2007, 02:39:59 PM »
Thanks for figuring that all out again Nums,

You certainly managed to do it in a lot less time than it took originally.

Matches used to be run as just the best of five rounds before this formula was added.
If you try and take a cat apart to see how it works, the first thing you have in your hands is a non-working cat.
Douglas Adams

Offline Jez

  • Bot Overlord
  • ****
  • Posts: 788
    • View Profile
League stats tests
« Reply #2 on: January 03, 2007, 04:54:26 PM »
Posted by Griz, moved here by Jez;

I don't disagree with any of the above Nums ...
but these are not the points I have been trying
to bring to attention.

1st:
at what point ... how many rounds ...
are you going to be willing to call a draw a draw?
at what point are we going to stop trying to determine
a winner, and understand that our calculations are
telling us that this is a statistical draw?
1000 rounds? 10000?
statistical analysis isn't just about declaring a winner ...
it is much more about declaring if a winner can or can
not be chosen, eh?
if we are obsessed with having one or the other 'Win' ...
then we are not fully utilizing the information our statistical
analysis is providing.
what is this analysis telling you when the number of
rounds goes to a large number without it being able
to declare a clear winner?
it's telling us ... it is a statistical DRAW.
so what is that point? how many rounds?
look at the formula ...
the higher the number of rounds ...
the less and less likely it is that one or other of the
bots in question, if already close in wins, will ever gain
enough rounds over the other to ever have a decision made.
the higher the number of rounds, the higher the 'gap'
between wins/losses must be.
so at some point, there is really no point in continuing.
what is that point?
certainly there must be a formula for that somewhere.

iow ...
how many times do you flip the coin before you can
determine it is not 'weighted' ... is fair, is 50/50?

2nd:
and more important, imo:
that a large error introduced by the initial positioning in the bots'
order will render all of your efforts at these precise calculations
null and void anyway.
if a bot is 'disallowed' from entering a contest with a worthy
opponent, which might require such a lengthy battle ...
and statistical analysis ...
then it does him no good to know he or his opponent
will not be 'falsely chosen'.
he's already been locked out.

so this stuff is all fine ... keep it ...
I don't have a problem with that ...
it's neat stuff ...
AS LONG AS ...
I, as User ... have the option of selecting
maximum number of rounds or control over Z ...
if that is what I want to do.

and then please take a step back from your numbers ...
and take an objective look at the way leagues are currently run
and see what needs to be done so that the efforts you are making
here with the in-depth analysis ...
won't be stomped on by some random arbitrary factor in the way
the league order is initially made.

can you dig it?
« Last Edit: January 03, 2007, 06:20:48 PM by Jez »
If you try and take a cat apart to see how it works, the first thing you have in your hands is a non-working cat.
Douglas Adams

Offline shvarz

  • Bot God
  • *****
  • Posts: 1341
    • View Profile
League stats tests
« Reply #3 on: January 03, 2007, 05:45:08 PM »
Jez (or Griz?), Nums's post exactly answers your questions.  It is all about "how accurate do you want to be in calling a winner?"  If all you want is a rough idea, then running 5 matches would be enough.  If you want to be able to detect the tiniest differences in bots' fitness, then you need to run many, many matches.  The smaller the difference and the more certain you want to be that your result is correct, the more matches you'll have to run.


Nums, one point of caution:  The way leagues are run now destroys all your calculations.  They are run based on principle "run until one of the bots is declared a winner or until the maximum number of matches is reached".  The first part "run until one of the bots is declared a winner" is really not the right way to do that.  In fact, it is so bad that in scientific circles it is equivalent to falsifying the data and will get your papers withdrawn and reputation ruined.  Statistics are simply not done on the principle "repeat experiment until you get the result".

The calculations that you describe are based on the idea that you determine in advance how many repeats you are going to do.  So, if you want to do it right, decide how accurate you want your measurement to be, work out the necessary number of matches, and always run that number of matches for all the bots you test.
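
As a sketch of that procedure (fixed_sample_match and run_one_round are made-up names, nothing to do with the actual league code): fix the number of rounds first, run all of them, and only apply the test once at the end.

Code:
import math, random

def fixed_sample_match(run_one_round, rounds, z=2.0):
    # Decide the number of rounds up front, run them all, test once at the end
    wins_a = sum(1 for _ in range(rounds) if run_one_round())
    lead = 2 * wins_a - rounds
    if abs(lead) >= z * math.sqrt(rounds):
        return "A" if lead > 0 else "B"
    return "draw"

# Toy run: bot A wins a simulated round 55% of the time
print(fixed_sample_match(lambda: random.random() < 0.55, rounds=100))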
« Last Edit: January 03, 2007, 05:45:39 PM by shvarz »
"Never underestimate the power of stupid things in big numbers" - Serious Sam

Offline Jez

  • Bot Overlord
  • ****
  • Posts: 788
    • View Profile
League stats tests
« Reply #4 on: January 03, 2007, 06:51:20 PM »
Quote from: shvarz
Jez (or Griz?),
My bad, it was posted by Griz; I haven't managed to figure out how to move an individual post from a thread yet though.

I am interested in the points you have made, partly because you were around when this was implemented, but mostly because I moved in this direction since the statistical analysis of results was something I remembered from the little bit of A-level biology that I studied. It's also something you are much more qualified to comment on.

I am most interested in
Quote
The first part "run until one of the bots is declared a winner" is really not the right way to do that.  In fact, it is so bad that in scientific circles it is equivalent to falsifying the data and will get your papers withdrawn and reputation ruined.
Is that the difference between using the coin-tossing experiment to prove that a coin is weighted, as opposed to proving that the 'number of birds returning to nest in a tree each year' is down to more than chance?

If we are looking at it from the 'coin tossing' angle, as the bots might be considered to be, how is
Quote
run(ing) until one of the bots is declared a winner
or until the (weighted coin) hypothesis is proved correct, wrong? (forgiving the 95% accuracy figure)

If you did do it with a fixed total sample size, what sample size would you use? Or, to paraphrase Griz, "at which point is it a draw?"
Bearing in mind that some of the competitions are against bots that can't survive on their own, would it always be necessary to run a minimum sample size (of, say, 500 matches)?
If you try and take a cat apart to see how it works, the first thing you have in your hands is a non-working cat.
Douglas Adams

Offline Numsgil

  • Administrator
  • Bot God
  • *****
  • Posts: 7742
    • View Profile
League stats tests
« Reply #5 on: January 03, 2007, 08:56:54 PM »
Griz, I'll do some figuring on how to determine if a round is truly a statistical draw.  You're right: eventually it should be possible to say that something really is a draw, that the bots are indistinguishable from truly 50/50.

Also, league matches are unbiased, but the way in which league matches are organized and interpreted may not be.  But that's really a different issue, isn't it?

shvarz, after doing all this work I noticed what you mention.  This is sort of fudging the proper way of doing things.  The problem is that running it in a more scientific manner would probably require more rounds than we're willing to do.

I'm open to suggestions for removing bias where possible.

Offline Griz

  • Bot Overlord
  • ****
  • Posts: 608
    • View Profile
League stats tests
« Reply #6 on: January 03, 2007, 09:23:58 PM »
Quote from: shvarz
Jez (or Griz?), Nums's post exactly answers your questions.  It is all about "how accurate do you want to be in calling a winner?"  If all you want is a rough idea, then running 5 matches would be enough.  If you want to be able to detect the tiniest differences in bots' fitness, then you need to run many, many matches.  The smaller the difference and the more certain you want to be that your result is correct, the more matches you'll have to run.
that's exactly right ...
but no, he didn't answer my questions ...
I had no questions about how statistics or these equations work.
the question was about the 'application' of these methods ...
which you go on to address:
Quote
Nums, one point of caution:  The way leagues are run now destroys all your calculations.  They are run based on principle "run until one of the bots is declared a winner or until the maximum number of matches is reached".  The first part "run until one of the bots is declared a winner" is really not the right way to do that.  In fact, it is so bad that in scientific circles it is equivalent to falsifying the data and will get your papers withdrawn and reputation ruined.  Statistics are simply not done on the principle "repeat experiment until you get the result".
right on again.
Quote
The calculations that you describe are based on the idea that you determine in advance how many repeats you are going to do.  So, if you want to do it right, decide how accurate you want your measurement to be, work out the necessary number of matches, and always run that number of matches for all the bots you test.
right. so in the case of 100 rounds, and Z of 2 ...
a bot would have to win 60 of the 100 to be able to
declare it the winner with 95% confidence.
now if it only won 55, you could still declare it the
winner, but not with that level of confidence.
your options then are to either use a lower Z ...
a Z of 1 allowing you to call 55 a winner ...
with a confidence level of 68% ...
or call it a draw.
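just to double-check those numbers against the formula in the first post ... quick throwaway python, nothing official:

Code:
import math
for z in (2.0, 1.0):
    print(z, math.ceil((z * math.sqrt(100) + 100) / 2))
# Z = 2 -> 60 wins, Z = 1 -> 55 wins (out of 100 rounds)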
不知
~griz~
   "The selection of Random Numbers is too important to be left to Chance"
The Mooj  a friend to all humanity

Offline EricL

  • Administrator
  • Bot God
  • *****
  • Posts: 2266
    • View Profile
League stats tests
« Reply #7 on: January 03, 2007, 09:33:08 PM »
Quote from: Numsgil
I'm open to suggestions for removing bias where possible.
I'm going to stay out of the statistics debate, having built few combat bots myself, other than to say that I personally have no issues with the fact that a bot may have to employ multiple strategies to defeat multiple bots ahead of it on the ladder - strategies which the higher-ranked bots perhaps never required given their earlier genesis.

One thing I have noticed, however, that can greatly impact the results of a contest is inequities in the random layout of veggies and contestants.  Since there are so few starting bots, their starting positions relative to each other and to the veggies in the sim can have a huge impact on their ability to utilize veggy energy to gain nrg or numbers, and thus how well they perform in the contest.  One bot may actually be demonstrably better than the other, but the margin may be slim enough that a bad start will inevitably cost it the round.  It is not statistically improbable to get 5 bad starts in a row, which can obviously lead to bogus rankings.

I might suggest we hard-code certain restrictions, such as starting contestants in very specific locations on opposite sides of the field and placing the initial veggies exactly equidistant from the competitors, thereby giving neither contestant a positional advantage at the beginning of a round.
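
As a rough illustration of the kind of layout I mean (made-up field size and helper name, not the actual DarwinBots setup code):

Code:
def symmetric_layout(width, height, n_veggies):
    # Combatants mirrored across the vertical centre line of the field
    bot_a = (width * 0.25, height * 0.5)
    bot_b = (width * 0.75, height * 0.5)
    # Veggies on the centre line are automatically equidistant from both bots;
    # their heights could still be randomized each round to vary the start
    veggies = [(width * 0.5, height * (i + 1) / (n_veggies + 1))
               for i in range(n_veggies)]
    return bot_a, bot_b, veggies

print(symmetric_layout(9000, 6000, 5))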
Many beers....

Offline Griz

  • Bot Overlord
  • ****
  • Posts: 608
    • View Profile
League stats tests
« Reply #8 on: January 03, 2007, 09:47:49 PM »
Quote from: Numsgil
Griz, I'll do some figuring on how to determine if a round is truly a statistical draw.  You're right: eventually it should be possible to say that something really is a draw, that the bots are indistinguishable from truly 50/50.
well see shvarz's and my post above ...
and more below.
Quote
Also, league matches are unbiased, but the way in which league matches are organized and interpreted may not be.  But that's really a different issue, isn't it?
well yes ...
but it can, and does, indeed affect the ranking ...
certainly when doing a league rerun or initial setup ...
and it can still stop a bot from advancing due to just one other bot having his number.
we don't need to do it that way.
however, I also must admit ...  so what?
what are we really using these rankings to determine anyway?
it has little to do with DB as a sim ...
and more to do with Bot Designing.
Quote
shvarz, after doing all this work I noticed what you mention.  This is sort of fudging the proper way of doing things.  The problem is that running it in a more scientific manner would probably require more rounds than we're willing to do.
well, it may actually reduce them quite a bit.
consider:
we can still go with it set up as is ...
do an initial 5 rounds ... I mean most matches are pretty lopsided ...
one bot winning all 5 ... so no problem there.
and for those matches that need to be extended to determine
a winner, we can leave that alone as well ...
but as I have been trying to suggest all along ...
we might have to put a cap on the max number of rounds.
just as an example ... call it 40 rounds. with a Z of two ...
a bot would have to win 27 of the 40 rounds to be called
a winner with 95% confidence, yes?  or 24 with a Z of one.
so the rounds do get extended, just as they do now ...
but upon reaching 40 rounds ... stop. halt. enough.
either call it a draw or give it to the dude with the
most rounds won, realizing our confidence is being
compromised somewhat ...
[it doesn't mean we are wrong] ... and move on.
that would eliminate all these ridiculously long matches ...
and I believe, actually reduce the time required to run
a league.
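
something like this, as a rough sketch in python ... made-up names, not the real league code ...
and as shvarz points out, stopping early this way eats into the stated confidence unless Z is bumped up:

Code:
import math

def capped_match(run_one_round, min_rounds=5, max_rounds=40, z=2.0):
    # Extend the match until one bot's lead clears z * sqrt(rounds),
    # but never past max_rounds; then call it a draw (or most wins).
    # Checking after every round here just for simplicity.
    wins_a = rounds = 0
    while rounds < max_rounds:
        wins_a += 1 if run_one_round() else 0   # True = bot A took the round
        rounds += 1
        if rounds >= min_rounds:
            lead = 2 * wins_a - rounds
            if abs(lead) >= z * math.sqrt(rounds):
                return "A" if lead > 0 else "B"
    return "draw"   # or: "A" if 2 * wins_a > max_rounds else "B"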

Quote
I'm open to suggestions for removing bias where possible.

great

now ...
out of all this ...
playing around with leagues and seeing what
makes them tick ...
I've learned a lot ...
and that's what it's all about anyway ...
imo, ime.
不知
~griz~
   "The selection of Random Numbers is too important to be left to Chance"
The Mooj  a friend to all humanity

Offline shvarz

  • Bot God
  • *****
  • Posts: 1341
    • View Profile
League stats tests
« Reply #9 on: January 03, 2007, 09:57:13 PM »
I am not a big statistical wiz, I just picked up things here and there and can operate them using some common sense.  When I have a serious question, I go and bug some statistician.

What follows is really just a basic primer on statistics, but it seems like some people could use that.

What we are dealing with here is a basic fair-coin story.  For any statistics you have to formulate your "null hypothesis", which in our case is "two bots are exactly the same and neither is better than the other".  

It is impossible to prove this hypothesis, because the difference between bots may be infinitesimally small.  You can run match after match and bots will go head to head, but there is just no limit where you can stop and say "I am sure they are the same".  By the same token, you cannot be 100% certain that one bot is better than the other - the results that you got may have been a random fluke.

So this all comes down to "how certain do you want to be that one bot is better than the other"?  Is it 75% certain, 95% certain, 99% certain, or 99.99999999% certain?  In science, 95% was chosen (rather arbitrarily) as the minimal required level of certainty needed to reject the null hypothesis.

Actually, most commonly people talk about a "p value", which should be below 0.05.  Let me explain what a p-value really means, because people are often confused about that.  Let's say you did an experiment and got a p-value of 0.05.  This means that if the null hypothesis is true and you repeated your experiment many, many times, then in 5% of all experiments the results would look at least as extreme as they do in your experiment.  It is NOT a level of certainty.  The level of confidence is a more complicated thing, which has to do with standard deviations of a normal distribution.  I don't have time or desire to go into that.

The link that Nums posted is very useful and it even has the thing we need here.  They say
Quote
If a maximum error of 0.01 is desired, how many times should the coin be tossed?

    n = Z^2 / (4 * E^2) = Z^2 / (4 * 0.01^2) = 2500 * Z^2

    n = 2500 at the 68.27% level of confidence (Z = 1)
    n = 10000 at the 95.45% level of confidence (Z = 2)
    n = 27225 at the 99.90% level of confidence (Z = 3.3)

So, there you go.  All you need to decide is how accurate you want your measurement to be and how confident you want to be that you called the match correctly, and that will give you the necessary number of rounds to run.
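
If you want to play with those numbers, the same formula takes a couple of lines of Python (just an illustration with a made-up function name, not anything from the league code):

Code:
import math

def rounds_needed(max_error, z):
    # Coin-toss sample size: n = Z^2 / (4 * E^2)
    return math.ceil(z ** 2 / (4 * max_error ** 2))

print(rounds_needed(0.01, 1))    # 2500
print(rounds_needed(0.01, 2))    # 10000
print(rounds_needed(0.01, 3.3))  # 27225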

Nums, there are ways to test the significance "as you go" and stop once you reach a certain level.  The idea is essentially the same, but you have to require much higher levels of confidence in your analysis.  You can probably find them if you google for "prospective studies".
"Never underestimate the power of stupid things in big numbers" - Serious Sam

Offline Griz

  • Bot Overlord
  • ****
  • Posts: 608
    • View Profile
League stats tests
« Reply #10 on: January 03, 2007, 10:03:32 PM »
Quote from: EricL
I'm going to stay out of the statistics debate, having built few combat bots myself, other than to say that I personally have no issues with the fact that a bot may have to employ multiple strategies to defeat multiple bots ahead of it on the ladder - strategies which the higher-ranked bots perhaps never required given their earlier genesis.

Quote
One thing I have noticed, however, that can greatly impact the results of a contest is inequities in the random layout of veggies and contestants.  Since there are so few starting bots, their starting positions relative to each other and to the veggies in the sim can have a huge impact on their ability to utilize veggy energy to gain nrg or numbers, and thus how well they perform in the contest.  One bot may actually be demonstrably better than the other, but the margin may be slim enough that a bad start will inevitably cost it the round.  It is not statistically improbable to get 5 bad starts in a row, which can obviously lead to bogus rankings.
yes. I have noticed this as well ... part of my concern about actually having
better control over veggie re-population, esp max pop.
some bots go into sort of a holding pattern and don't move about much ...
and if the repopulation/reproduction of veggies doesn't happen to occur
near them, they eventually starve.
I forget which bots ... but whether or not a veggie appeared near them ...
was the deciding factor in whether they died out or went on to win.
Quote
I might suggest we hard-code certain restrictions, such as starting contestants in very specific locations on opposite sides of the field and placing the initial veggies exactly equidistant from the competitors, thereby giving neither contestant a positional advantage at the beginning of a round.
yes. that's something to think about ...
otherwise the 'random' element can easily eclipse our best efforts at
making the rounds statistically valid.
that 'random' thing can be a very large elephant loose in the room.
I wondered before if we shouldn't also use a given 'seed' when running leagues ...
as an effort to get as much repeatability into it as possible.

btw Eric ...

I've been noticing large numbers of certain bots and veggies that are tied
together ... suddenly 'leaping' in mass from one place to another ...
a big gob of them say in the lower half of the screen ...
suddenly being transported, as a whole organism, to the top ...
or similarly from right to left. sometimes they go back and forth
a number of times.
it is as if one of the tied bots strays below the bottom of the screen ...
and then brings the whole gob with it when it is relocated [wrapped]
to the top, rather than just that bot being wrapped. must have something
to do with ties.
不知
~griz~
   "The selection of Random Numbers is too important to be left to Chance"
The Mooj  a friend to all humanity

Offline Numsgil

  • Administrator
  • Bot God
  • *****
  • Posts: 7742
    • View Profile
League stats tests
« Reply #11 on: January 03, 2007, 11:23:35 PM »
Quote from: Griz
yes. I have noticed this as well ... part of my concern about actually having
better control over veggie re-population, esp max pop.
some bots go into sort of a holding pattern and don't move about much ...
and if the repopulation/reproduction of veggies doesn't happen to occur
near them, they eventually starve.
I forget which bots ... but whether or not a veggie appeared near them ...
was the deciding factor in whether they died out or went on to win.

I'm not so keen to immediately dismiss this element of randomness.  It's true that simpler bots tend to starve to death at times, but more advanced bots usually have an active search system in place that either makes the bot move around looking for food, or reproduces and spreads out as many bots as possible to find some veggies.  Why should we coddle bots that can hardly survive?

You should only see a bot get an unfair selection of veggies near it in all 5 rounds about 1 match in every 32 (a roughly even chance of a bad start each round gives (1/2)^5 = 1/32, or about 3% of the time).  I think that's reasonable, but if we want to, I would extend the minimal number of rounds instead of setting things up nice and easy for the bots.

Quote
I've been noticing large numbers of certain bots and veggies that are tied
together ... suddenly 'leaping' in mass from one place to another ...
a big gob of them say in the lower half of the screen ...
suddenly being transported, as a whole organism, to the top ...
or similarly from right to left. sometimes they go back and forth
a number of times.
it is as if one of the tied bots strays below the bottom of the screen ...
and then brings the whole gob with it when it is relocated [wrapped]
to the top, rather than just that bot being wrapped. must have something
to do with ties.

This is a "feature".  Basically, the "wrapping" from left to right or up to down is really just a teleporter.  And to prevent problems from having multibots straddle these borders, the entire multibot is teleported.

It wouldn't take that much effort to handle this better; it's just never been done.

Offline Jez

  • Bot Overlord
  • ****
  • Posts: 788
    • View Profile
League stats tests
« Reply #12 on: January 04, 2007, 03:54:53 AM »
I'm also not keen to change the randomness of bots' start positions. As PY said when the idea was suggested before: "I'll just design bots that don't have such a random 'starting' search then."
If you think it is a problem, it might be better to look for other solutions, such as increasing the number of veg while giving them less energy; something that has already been done once, successfully, to help some of the bots that had problems with the relative size difference of the sim now that bots change size.
If you try and take a cat apart to see how it works, the first thing you have in your hands is a non-working cat.
Douglas Adams

Offline EricL

  • Administrator
  • Bot God
  • *****
  • Posts: 2266
    • View Profile
League stats tests
« Reply #13 on: January 04, 2007, 11:20:01 AM »
It's not really a big issue for me, but just to make my position clear: I have no objection to the starting positions of bots or veggies being random and different each round.  My objection is to them being inequitable.  I will make the unsubstantiated claim that in closely matched contests, starting position inequities can dwarf combatant competence in determining contest winners, dramatically increasing the number of rounds needed to determine a statistical winner.  Even a few cycles' advantage in reaching a veggy, or a positional advantage that provides for an initial 2-on-1 engagement, will influence the outcome given how few starting bots we are talking about.  He who strikes first may not need to strike again.  My main point is that were we to remove these inequities, determining a statistically valid winner would require fewer rounds in general and contest results would likely be more repeatable.

It would not be too difficult to modify the current league code to position combatants and veggies in such a way that positional advantages were eliminated while still ensuring no two contests had the same starting conditions.  Failing this, increasing the starting number of combatants would help reduce (but not eliminate) inequities, since a single bot would represent a smaller percentage of your total force.
Many beers....

Offline Light

  • Bot Destroyer
  • ***
  • Posts: 245
    • View Profile
League stats tests
« Reply #14 on: January 04, 2007, 12:17:12 PM »
Would eliminating the bots' ability to repro reduce inequalities? It would make the fights based more on bot strength than numerical superiority. It would be interesting to see the results from a league based on this idea. I like EricL's main point; it sounds like a good solution to the problem if it can be implemented without too much trouble.