
League stats tests


Numsgil:
Griz, I'll do some figuring on how to determine if a round is truly a statistical draw.  You're right, eventually it should be possible to say that something really is a draw, that the bots are indistinguishable from truly 50/50.
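
For what it's worth, here's a rough sketch in Python of what a "statistical draw" test could look like.  The z, the margin, and the function name are all arbitrary choices of mine, not anything built into DB:

--- Code: ---
import math

def is_statistical_draw(wins, n, z=2, margin=0.05):
    # Call it a draw when the z-level confidence interval for the
    # win rate sits entirely inside 0.5 +/- margin (an equivalence
    # test; the margin is a knob we'd have to agree on).
    p = wins / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return (0.5 - margin <= p - half_width) and (p + half_width <= 0.5 + margin)

print(is_statistical_draw(510, 1000))  # True: 51% over 1000 rounds
print(is_statistical_draw(55, 100))    # False: 100 rounds can't rule out an edge
--- End code ---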

Also, League matches are unbiased, but the way in which league matches are organized and interpreted may not be.  But that's really a different issue, isn't it?

shvarz, after doing all this work I noticed the issue you mention.  This is sort of fudging the proper way of doing things.  The problem is that running it in a more scientific manner would probably require more rounds than we're willing to do.

I'm open to suggestions for removing bias where possible.

Griz:

--- Quote from: shvarz ---Jez (or Griz?), Nums's post exactly answers your questions.  It is all about "how accurate do you want to be in calling a winner?"  If all you want to do is get a rough idea, then running 5 matches would be enough.  If you want to be able to detect the tiniest differences in bots' fitness, then you need to run many, many matches.  The smaller the difference and the more certain you want to be that your result is correct, the more matches you'll have to run.
--- End quote ---
that's exactly right ...
but no, he didn't answer my questions ...
I had no questions about how statistics or these equations work.
the question was about the 'application' of these methods ...
which you go on to address:

--- Quote ---Nums, one point of caution:  The way leagues are run now destroys all your calculations.  They are run based on the principle "run until one of the bots is declared a winner or until the maximum number of matches is reached".  The first part, "run until one of the bots is declared a winner", is really not the right way to do that.  In fact, it is so bad that in scientific circles it is equivalent to falsifying the data and will get your papers withdrawn and your reputation ruined.  Statistics are simply not done on the principle "repeat the experiment until you get the result".
--- End quote ---
right on again.

--- Quote ---The calculations that you describe are based on the idea that you determine in advance how many repeats you are going to do.  So, if you want to do it right, decide how accurate you want your measurement to be, work out the necessary number of matches, and always run that number of matches for all bots you test.
--- End quote ---
right. so in the case of 100 rounds, and Z of 2 ...
a bot would have to win 60 of the 100 to be able to
declare it the winner with 95% confidence.
now if it only won 55, you could still declare it the
winner, but not with that level of confidence.
your options then are to either use a lower Z ...
a Z of 1 allowing you to call 55 a winner ...
with a confidence level of 68% ...
or call it a draw.
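
in Python, that arithmetic looks something like this (just a sketch; the function name and the rounding up are my own choices):

--- Code: ---
import math

def wins_needed(n, z):
    # Smallest win count whose win fraction exceeds 0.5 by z
    # standard errors of a fair coin over n rounds.
    std_err = math.sqrt(0.25 / n)
    return math.ceil(n * (0.5 + z * std_err))

print(wins_needed(100, 2))  # 60 -> ~95% confidence
print(wins_needed(100, 1))  # 55 -> ~68% confidence
--- End code ---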

EricL:

--- Quote from: Numsgil ---I'm open to suggestions for removing bias where possible.
--- End quote ---
I'm going to stay out of the statistics debate, having built few combat bots myself, other than to say that I personally have no issues with the fact that a bot may have to employ multiple strategies to defeat multiple bots ahead of it on the ladder - strategies which the higher-ranked bots perhaps never required, given their earlier genesis.

One thing I have noticed, however, that can greatly impact the results of a contest is inequity in the random layout of veggies and contestants.  Since there are so few starting bots, their starting positions relative to each other and to the veggies in the sim can have a huge impact on their ability to utilize veggy energy to gain nrg or numbers, and thus how well they perform in the contest.  One bot may actually be demonstrably better than the other, but the margin may be slim enough that a bad start will inevitably cost it the round.  It is not statistically improbable to get 5 bad starts in a row, which can obviously lead to bogus rankings.

I might suggest we hard-code certain restrictions, such as starting contestants in very specific locations on opposite sides of the field and placing the initial veggies exactly equidistant from the competitors, thereby giving neither contestant a positional advantage at the beginning of a round.
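
Purely as an illustration, the layout rule could look something like this.  The coordinates and the mirror-image placement are my own assumptions, not anything in DB's actual settings code:

--- Code: ---
import random

def symmetric_layout(width, height, n_veggies):
    # Contestants at fixed spots on opposite sides of the field.
    bot_a = (0.1 * width, 0.5 * height)
    bot_b = (0.9 * width, 0.5 * height)
    # Place veggies in mirrored pairs through the field centre, so
    # each contestant starts the same distance from every food source.
    veggies = []
    for _ in range(n_veggies // 2):
        x, y = random.uniform(0, width), random.uniform(0, height)
        veggies.append((x, y))
        veggies.append((width - x, height - y))
    return bot_a, bot_b, veggies
--- End code ---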

Griz:

--- Quote from: Numsgil ---Griz, I'll do some figuring on how to determine if a round is truly a statistical draw.  You're right, eventually it should be possible to say that something really is a draw, that the bots are indistinguishable from truly 50/50.
--- End quote ---
well see shvarz's and my post above ...
and more below.

--- Quote ---Also, League matches are unbiased, but the way in which league matches are organized and interpreted may not be.  But that's really a different issue, isn't it?
--- End quote ---
well yes ...
but it can, and does, indeed affect the ranking ...
certainly when doing a league rerun or initial setup ...
and it can still stop a bot from advancing due to just one other bot having his number.
we don't need to do it that way.
however, I also must admit ...  so what?
what are we really using these rankings to determine anyway?
it has little to do with DB as a sim ...
and more to do with Bot Designing.

--- Quote ---shvarz, I noticed after doing all this work what you mention.  This is sort of fudging the proper way of doing things.  The problem is that running it in a more scientific manner would probably require more rounds than we're willing to do.
--- End quote ---
well, it may actually reduce them quite a bit.
consider:
we can still go with it set up as is ...
do an initial 5 rounds ... I mean most matches are pretty lopsided ...
one bot winning all 5 ... so no problem there.
and for those matches that need to be extended to determine
a winner, we can leave that alone as well ...
but as I have been trying to suggest all along ...
we might have to put a cap on the max number of rounds.
just as an example ... call it 40 rounds. with a Z of two ...
a bot would have to win 27 of the 40 rounds to be called
a winner with 95% confidence, yes?  or 24 with a Z of one.
so the rounds do get extended, just as they do now ...
but upon reaching 40 rounds ... stop. halt. enough.
either call it a draw or give it to the dude with the
most rounds won, realizing our confidence is being
compromised somewhat ...
[it doesn't mean we are wrong] ... and move on.
that would eliminate all these ridiculously long matches ...
and I believe, actually reduce the time required to run
a league.
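
as a sketch, the whole scheme might look like this.  play_round is a stand-in for whatever runs one DB round, and note shvarz's warning below: testing as you go eats into the stated confidence, so the real Z would need to be stricter to keep a true 95%:

--- Code: ---
import math

def run_match(play_round, z=2, min_rounds=5, max_rounds=40):
    # play_round() returns 'A' or 'B' (whoever won that round).
    # Run rounds until one bot's lead is z standard errors past
    # 50/50 or the cap is hit, then call it on wins (or a draw).
    wins = {'A': 0, 'B': 0}
    for n in range(1, max_rounds + 1):
        wins[play_round()] += 1
        if n >= min_rounds:
            threshold = math.ceil(0.5 * n + z * math.sqrt(0.25 * n))
            for bot in ('A', 'B'):
                if wins[bot] >= threshold:
                    return bot, n
    if wins['A'] == wins['B']:
        return 'draw', max_rounds
    return max(wins, key=wins.get), max_rounds
--- End code ---

with these defaults a 5-0 sweep ends the match on the spot (the threshold at n=5, Z=2 works out to 5 wins), and 27 of 40 is exactly the cutoff above.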


--- Quote ---I'm open to suggestions for removing bias where possible.
--- End quote ---

great

now ...
out of all this ...
playing around with leagues and seeing what
makes them tick ...
I've learned a lot ...
and that's what it's all about anyway ...
imo, ime.

shvarz:
I am not a big statistics whiz; I just picked up things here and there and can operate them using some common sense.  When I have a serious question, I go and bug some statistician.

What follows is really just a basic primer on statistics, but it seems like some people could use that.

What we are dealing with here is a basic fair-coin story.  For any statistics you have to formulate your "null hypothesis", which in our case is "two bots are exactly the same and neither is better than the other".  

It is impossible to prove this hypothesis, because the difference between bots may be infinitesimally small.  You can run match after match and bots will go head to head, but there is just no limit where you can stop and say "I am sure they are the same".  By the same token, you cannot be 100% certain that one bot is better than the other - the results that you got may have been a random fluke.

So this all comes down to "how certain do you want to be that one bot is better than the other?"  Is it 75% certain, 95% certain, 99% certain, or 99.99999999% certain?  In science, the 95% level was chosen (rather arbitrarily) as the minimal level of certainty required to reject the null hypothesis.  Actually, most commonly people talk about a "p-value", which should be below 0.05.

Let me explain what a p-value really means, because people are often confused on that.  Let's say you did an experiment and got a p-value that equals 0.05.  This means that if the null hypothesis is true and you repeated your experiment many, many times, then in 5% of all experiments the results would look the way they do in your experiment.  It is NOT a level of certainty.  The level of confidence is a more complicated thing, which has to do with standard deviations of a normal distribution.  I don't have time or desire to go into that.
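
If you have scipy handy (version 1.7 or later, which has binomtest), the exact version of this calculation is a one-liner.  The 60-of-100 record here is just an example:

--- Code: ---
from scipy.stats import binomtest

# Under the null hypothesis (identical bots, p = 0.5), how
# surprising is a 60-40 record over 100 rounds?
result = binomtest(60, n=100, p=0.5, alternative='two-sided')
print(result.pvalue)  # ~0.057, just above the 0.05 cutoff
--- End code ---

Note that the Z=2 rule of thumb calls 60 out of 100 significant, while the exact test puts it right on the borderline - a good reminder that these cutoffs are approximations.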

The link that Nums posted is very useful, and it even has the thing we need here.  They say:

--- Quote ---If a maximum error of 0.01 is desired, how many times should the coin be tossed?

    n = Z^2 / (4 E^2) = Z^2 / (4 × 0.01^2) = 2500 Z^2

    n = 2500 at the 68.27% level of confidence (Z = 1)
    n = 10000 at the 95.45% level of confidence (Z = 2)
    n = 27225 at the 99.90% level of confidence (Z = 3.3)
--- End quote ---

So, there you go.  All you need to decide is how accurate you want your measurement to be and how confident you want to be that you called the match correctly, and the formula will give you the necessary number of rounds to run.
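
The formula itself is a one-liner in code.  The helper name and the second example (a coarser ±0.1 error, probably more realistic for league play) are mine:

--- Code: ---
def rounds_needed(max_error, z):
    # Coin-toss sample size: n = Z^2 / (4 * E^2)
    return z ** 2 / (4 * max_error ** 2)

print(rounds_needed(0.01, 2))  # 10000 rounds for +/-1% at ~95%
print(rounds_needed(0.10, 2))  # 100 rounds for +/-10% at ~95%
--- End code ---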

Nums, there are ways to test significance "as you go" and stop once you reach a certain level.  The idea is essentially the same, but you have to require much higher levels of confidence in your analysis.  You can probably find it if you google for "prospective studies".
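
A minimal sketch of one such test-as-you-go scheme, Wald's sequential probability ratio test (my example, not something from the leagues; p1, alpha, beta, and play_round are all placeholder assumptions):

--- Code: ---
import math

def sprt(play_round, p1=0.65, alpha=0.05, beta=0.05, max_rounds=200):
    # H0: fair coin (p = 0.5) vs H1: bot A wins with probability p1.
    # alpha and beta are the tolerated false-positive and
    # false-negative rates.  play_round() returns 'A' or 'B'.
    lower = math.log(beta / (1 - alpha))  # accept H0 at or below this
    upper = math.log((1 - beta) / alpha)  # accept H1 at or above this
    llr = 0.0                             # running log-likelihood ratio
    for n in range(1, max_rounds + 1):
        if play_round() == 'A':
            llr += math.log(p1 / 0.5)
        else:
            llr += math.log((1 - p1) / 0.5)
        if llr >= upper:
            return 'A is better', n
        if llr <= lower:
            return 'no detectable edge', n
    return 'undecided', max_rounds
--- End code ---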
