Author Topic: Greven's DNA structure!  (Read 5274 times)

Offline Greven

  • Bot Destroyer
  • ***
  • Posts: 345
    • View Profile
Greven's DNA structure!
« on: May 25, 2005, 11:37:24 AM »
Post not finished YET!!

This post will be a little technical.

I have criticised the DNA structure and DB in general, and I promised to make up something and here is my proposal. It is not fully finished! But what I want, is to discuss it in detail and hear ideas/critics from all DB'ers, because this is very big thing to begin changing.

My main quest for this were to creat a DNA language/structure that:
  • allowed junk DNA.
  • not destroying any existing bots.
  • making mutations work better and implentment them easier(not saying they dont work now, read on!!!)
  • preserving most of DB as it is now.
You must understand that I dont say this is better than the current system and it is [span style=\'font-size:14pt;line-height:100%\']not [/span] perfect, far from perfect, but I like to get opinions, and if you can argue that the current is better, so be it, it is just thought as a suggestion/discussing topic, not a: IF-IT-CANT-BE-IMPLEMENTED-AS-IS-DONT-USE-IT, I am open to constructive critcism.

I will use words like:
  • genome/genotype to identify an entire DNA sequence (Bot DNA)
  • DNA letter or just DNA as a single instruction (like the store command)
  • phenotype as the way the bots ending genome is like (I know this is not what phenotype actually mean, but it is the best word to use right now.)
  • DNAS is short for DNA Structure!

I will try to argue for my point of view, but still there maybe things I havent thought of.

The genome and mutations:

The genome for a bot now is stored in an array. In my DNAS the genome is made up by a single string!
After a bot birth the genome is read and implemented by the main DB program, it will end up in a array, the phenotype.

Example:
We now have a genome looking like this:
"ABDDhFEKFLjgFHFKDADDFDFEIREUROIEr"
(When I mean a single instruction(DNA letter) I mean 'A' or 'B' etc.)

Why:
With mutations it will be alot easier. Instead of having a delete gene, insert gene etc., we only need 3 (maybe 4) different mutation:
  • Insert
    Can insert a single instruction or an entire substring within the genome. This one can have two different functions: inserting instructions already in the genome (and letting the point mutation introduce new instructions) or insert all possible instructions.
  • Delete
    Can delete a single instuction or an entire substring within the genome.
  • Point
    This only works on a single DNA instruction, subtituting it with another DNA instruction.
  • (Revert)
    Works on a substring within the genome, and revert it: example:
    we have "ABCDEFGH", the substring "BCD" will mutate and we get "DCB", the entire genome then look like this: "ADCBEFGH"
Why:
This opens up for junk DNA or DNA we have never seen before in DB, like a ADD in the condition part of the gene, instructions outside genes!
Conditions within conditions! With also get ride of the peculiar genome with spaces in it, I mean empty spaces, I have seen such in DB. (I do not know if it is fixed, never bother to report it, maybe it was an earlier version cant remember).
And we also get a completly new recombinations. You must have in mind, that the gene it self is not a unit anymore (as it is in some arbitary way in the current system), it emerges from the combinations of the DNA instructions.

The phenotype:
The phenotype is the actual behavior or the part of the genome that is executed in the bot.
When a bot is born, the program read through the entire genome and it will decide, through rules we have decide, if there should be a condition within a condition or if it can have a '='-sign within the executed part of the gene or DNA outside the genes should be executed etc. etc. This is all yet to be decided.

If we dont what something (condition within condition) the program just ignores this and only the parts we want gets into the phenotype (the array). The genome is not touched at all, and it is the genome that is passed on to the offspring, but the phenotype that is executed.
It all ends up in a array as it is now.

Why:
This is interconnected with the genome. But it allows us the make certain rules about the execution of the bot. Ex. We dont want a store in the condition part etc., the it is not expressed in the phenotype of the bot, but in is still there, able to act as junk DNA.
It also makes the possiblity that the genome and the phenotype is different from each other, which it is in real life, and just a single insert or delete mutation may be the rise to new and interesting species of bots, because of all the junk DNA it is now possible to have.

The DNA it self
(All in this section, is mainly arbitary picked anything can be used!)

Because DB arciteture is mainly based on direct numbers in the DNA (as opposed to other AL simulations like Avida, but it is difficult to compared DB with Avida),
I could not write an entire DNA system without numbers and still live up to the goals I did set for this system. Therefore the downfalls of this systems lies in the DNA it self.

A sysvar is actualy only a way to make it more readable and a pointer to a specific number, therefore every bot could be written without sysvars.

Say we now have the letter A-Z (26) and the letters a-z (26) as symbols, for different DNA instruction.

The letters A-J is the numbers 0-9. The letter Zis the flow command cond, Y is start, X is < and n is a seperator for numbers, then we could have something like:

Code: [Select]
cond
10 50 >
start
this will end up into something like: BAnFAXY

I hope you get the idea.

But say we have the following genome (in DB language):
Code: [Select]
cond
10 add 50 >
start

then phenotype will be (again in DB language)::
Code: [Select]
cond
10 50 >
start
(If we what no add instruction in the condition part)

The again we could have a-j is the numbers *0-*9, and we could decide that the first number is what the number actual is, so aBC is *012 = *12 etc.

The downfalls:
The main downfall I can see in this system is the DNA. The numbers can change dramaticly without logic, a '0001' could mutate to become 9001, but then again maybe this could help evolution further, I dont know, and it is the natural selection / the evolutions job to find the fittest! And remember that not all mutations are good for the survival of the organism.

This system is not perfect, and if implemented it could endup showing that this system really sucks.

I have relied heavily on a few books I have read about the topic and my own experince with developing AL simulations.

Overall you can still write the old code, and use old bots, because we will creat a small routine to convert the code to the DNA instructions so the program understands it.

Please comment.
« Last Edit: May 25, 2005, 11:42:50 AM by Greven »
10010011000001110111110100111011001101100100000110110111000011101011110010110000
011000011000001100010110010111101001110100110010111100101000001000001111001011101
001101001110011011010011100011110100111000011101100100000100110011010011100110110
010110000011100111101001110110111101011101100110000111101001101001110111111011101
01100100000111010011010001100001110111010000010001001000010100001

Offline shvarz

  • Bot God
  • *****
  • Posts: 1341
    • View Profile
Greven's DNA structure!
« Reply #1 on: May 25, 2005, 11:59:14 AM »
I like it.  It goes along the lines of junk DNA that Nums was proposing a while back, just gives it a more technical side.  Here is what I wrote back then:
Quote
The way I see it, the system should work like this:

we have a "gene execution" flag that can be set to 0, +1, or -1

true condition sets the flag to +1
false condition sets the flag to -1

- when flag is set to +1, program scans for first available "start" and executes the gene, flag is set to 0

- when flag is set to -1, program scans for first available "else" and executes commands after it, then sets flag to 0

- when flag is set to 0, program does not execute neither "start" nor "else", just scans for the next "condition"

This way we can have "cond", "start" and "else" parts in any order we want.

The idea of substituting the genotype with phenotype is good, as it would save the processor cycles.  Also, it would make the phenotype much more readable.

The "string vs array" thing is fine with me.  This is purely a programmer's issue that should not have major effects on the way the program works.
"Never underestimate the power of stupid things in big numbers" - Serious Sam

Offline PurpleYouko

  • Bot God
  • *****
  • Posts: 2556
    • View Profile
Greven's DNA structure!
« Reply #2 on: May 25, 2005, 12:17:59 PM »
I like it too.
Particularly that the genotype will contain possible junk DNA but for the life of the robot it will only see the phenotype which will be the executable code presently used in DB.

We will obviously have to make it work such that saving the robot's DNA actually saves the file as the genotype version. We will probably need to make a small stand-alone decoder utility also, so that we can convert genotype to phenotype and back while coding a robot.

Might even make file transfer quicker and more efficient since the text file of the genotype will be considerably smaller than the existing DNA file.

I would  also like to see the idea that Shvarz outlined above, implemented. As he said, this would allow us to have a much less formalized structure of DNA writing that could lead to some very interesting behaviours.

Good start  B)
There are 10 kinds of people in the world
Those who understand binary.
and those who don't

:D PY :D

Offline Numsgil

  • Administrator
  • Bot God
  • *****
  • Posts: 7742
    • View Profile
Greven's DNA structure!
« Reply #3 on: May 25, 2005, 03:02:02 PM »
Okay, this is me just listing the major points to help clarify it:

Current Version:
DNA is an array of elements that are of the type:

type as integer
value as integer

where type is wether it's a number, control command, etc. and value is which particular type or control command it is.

Greven's idea:

DNA is a string of characters where each letter is a DNA element, where type and value are both determined by the same value.

Okay, now my critique based on the above summary:

1.  If each command is based on a letter, then you are limited to 48 letters.  Or 256 (actually less, since not all character codes are printable) if you use all possible character codes.  I don't know if that's a problem now, but if it ever does become a problem (we add more stuff to the language), then we are in a huge way in trouble.  The only way to fix it would be to change to either two character codes (which is basically what it is now) or change to an integer array (basically what it is now).

2.  You said that in the current version each gene is an arbitrary unit to the program.  However, this is an upper level distinction.  In the bowels of the program it's just an array.

3.  Using a string makes manipulation of the DNA easier (strings have built in insertion and deletion routines).

I see your idea as more a paradigm to how to approach the existing DNA structure in DB than a reason to drastically change the basic structure.  That is, we keep the array of two integer elements.

We get rid of distinctions between conditions, bodies, and outside of bodies.  That's a good idea.  We, as you say, have major types of mutations:
  • Insertion
  • Deletion
  • Single Point Change
  • Duplication
(I don't know that reversal is either realistic or particularly useful, but we can argue that point seperately.)

(Note that deletion and duplication should be allowed to work on large tracks of DNA as well).

However, we should have a subsection within each that controls what type of command (the type of the DNA element) can be inserted/deleted/changed (you, the user, don't have to mess with this if you don't want, unless you want greater control over the mutations).
« Last Edit: May 25, 2005, 03:11:37 PM by Numsgil »

Offline shvarz

  • Bot God
  • *****
  • Posts: 1341
    • View Profile
Greven's DNA structure!
« Reply #4 on: May 25, 2005, 03:28:03 PM »
OK, I have no idea what all those arrays and strings and what-not.

I'll just add that reversion is a very reasonable type of mutations.
"Never underestimate the power of stupid things in big numbers" - Serious Sam

Offline Numsgil

  • Administrator
  • Bot God
  • *****
  • Posts: 7742
    • View Profile
Greven's DNA structure!
« Reply #5 on: May 25, 2005, 03:37:19 PM »
Okay.

Things like reversion and deletion and reversal all can (or need to) operate over lengths of DNA.  How do we pick how large the event will be?

Offline Carlo

  • Bot Destroyer
  • ***
  • Posts: 122
    • View Profile
Greven's DNA structure!
« Reply #6 on: May 25, 2005, 03:46:16 PM »
A few comments on this. Maybe it's not all clear to me, and there are parts that I haven't well understood. Anyway, I'll play as usual the role of the actual darwinbots advocate  :D

>allow junk DNA.

Junk dna is _already_ allowed, as anybody can see just by looking at the code of some evolved bot. Gene duplications and condition mutations can easily create tons of junk genes.

Quote
The genome for a bot now is stored in an array. In my DNAS the genome is made up by a single string!

Which is exactly the same thing. Strings _are_ arrays of characters. Coding the double value (type, value) of the db array to a single value (byte) string would not change things very much. In fact, each actual dna position can be converted in a three characters sequence. Surely two chars for the value (as 16 bits are needed, and a char is 8 bits) plus a few bits for the type.

Now, I have to explain one thing. There is nothing easier than just mutate randomly the dna array (or string), putting conditions and instructions at random. You can just randomly insert, delete, duplicate and mutate values wherever you like to.
The problem is just that, as we're working on a pc which has limited power, we have to try to favour meaningful, effective mutations. This is (among other things) why  I have chosen the genes structure. Because it seemed more easy to mutate without destroying completely the functionality of dnas. And this is also why there are various mutation routines, some acting on whole genes, some _trying_ to deal with complete statements (that is, an instruction with its parameters), some acting just on conditions, and so on. I repeat, you can easily throw away these mutation routines in favour of a single routine which simply creates random mutations without trying to preserve a working dna structure. It is more simple, and more fair: but probably much less effective from the point of view of evolution.

The trick of having a phenotype (which, by the way, has nothing to do with the true concept of phenotype, which darwinbots _already_ has - just think at genes that are unexpressed because their effect is covered by the effect of another gene) is elegant from the point of view of programming - saves calculation time- but it is just an optimization trick, as Shvarz already noted. For example, actual dnas aren't executed as one may expect, that is by scanning the dna for conds, but instead a list of conds is built up at robot's birth, and is checked at each cycle. Optimization, it doesn't really changes the way the things are.

Finally, I don't really get the point of the last part - but maybe I haven't understood what you're saying. Seems you're proposing a token language for the dnas, with instructions expressed with a single letter (that is, a single array position) and numbers expressed in a positional way. Well, in the actual DB instructions already are tokenized (that is, cond isn't really the "cond" string, but is represented by a single position in the dna array). And, well, using a positional notation for numbers would allow breaking a number into its digits by inserting among them, for instance, a new instruction. But this is nothing you can't already do with the appropriate mutation routine: just create a routine which searches for numbers, calculates their decimal digitas, and breaks them the way you want. Nothing easier.
« Last Edit: May 25, 2005, 03:54:07 PM by Carlo »

Offline Greven

  • Bot Destroyer
  • ***
  • Posts: 345
    • View Profile
Greven's DNA structure!
« Reply #7 on: May 26, 2005, 10:37:42 AM »
Replying Nums posts:

Quote
1. If each command is based on a letter, then you are limited to 48 letters. Or 256 (actually less, since not all character codes are printable) if you use all possible character codes. I don't know if that's a problem now, but if it ever does become a problem (we add more stuff to the language), then we are in a huge way in trouble. The only way to fix it would be to change to either two character codes (which is basically what it is now) or change to an integer array (basically what it is now).

Right now we only have 20 commands, when numbers need to be include we have 0-9 and *0-*9, this total up to 40 different letters. Remember we have 26 letters in the english alphabet, with capitalized we get 52, with the numbers 0-9 we get 62 possibly letters to code from. Remember that a sysvar is only a number, and therefore adding direct commands we have 12 possible slots, which should be enough.

Quote
2. You said that in the current version each gene is an arbitrary unit to the program. However, this is an upper level distinction. In the bowels of the program it's just an array.

This I know, but what I mean is a bit complicated to explain. In some ways there is noway that there could exist commands outside genes, what I would really like to have.

Quote
I see your idea as more a paradigm to how to approach the existing DNA structure in DB than a reason to drastically change the basic structure. That is, we keep the array of two integer elements.

What I tried to explain in my first post is that I could not invent a entire new DNA structure/system without destroying much of what we have now, and that I didnt want to, therefore it is only meant to optimize some few things in what we have now.

Quote
Things like reversion and deletion and reversal all can (or need to) operate over lengths of DNA. How do we pick how large the event will be?
This is still open for discussion!

The phenotype is still a array like the DNA is now, but the genotype will be in a string, mainly to optimize mutations, and to have junk DNA.
« Last Edit: May 26, 2005, 10:39:11 AM by Greven »
10010011000001110111110100111011001101100100000110110111000011101011110010110000
011000011000001100010110010111101001110100110010111100101000001000001111001011101
001101001110011011010011100011110100111000011101100100000100110011010011100110110
010110000011100111101001110110111101011101100110000111101001101001110111111011101
01100100000111010011010001100001110111010000010001001000010100001

Offline Greven

  • Bot Destroyer
  • ***
  • Posts: 345
    • View Profile
Greven's DNA structure!
« Reply #8 on: May 26, 2005, 12:11:27 PM »
Replying to Carlo's post:

Quote
Which is exactly the same thing. Strings _are_ arrays of characters
I know strings are just arrays, but in VB strings have built in commands to modify a string, this will make it easier to build mutation routines, which are faster and more reliable, and less prone to bugs etc.

Regarding mutation and the gene structure.
What is a meaningfull mutation? Do evolution know what a meaningfull mutation is? No, but evolution should and certainly would select against non-meaningfull mutations, if mutations do schewup the DNA of a daughter bot, it will not survive and wont reproduce.

In reallife an entire gene aint suddenly added to a genome or deleted, instead a sequence of the DNA is insert. But I do understand the gene structure and why it is there, but it is evolutions work to develop these, I dont think it is more effecient to have the current system (but I dont say my system is better), because of the strict DNA structure, new thing could evolve that we havent thought of, but this must stand its test of time.

The to the phenotype, as I wrote this phenotype definition is much more complex than I do it in my system, but it is close enough here in DB .
But I do not agree on DB has a some sort of phenotype. Yes it is a optimiztion 'trick' but still in allowes much more of the genome to be unexpressed, now everything is express (or nearly everything). In my system mutations can work meaningfull or not, on the genome without altering the behavior of the bot. In the current system we do also have this as you pointed out, but it is in a much more strict sense than mine, which is much broader.

You must understand than 'my' phenotype is actual what DB is now (the current system with the genome etc.), I just make a direct border. Much of the current system is kept in the phenotype, but before a phenotype exist, we have a genome and a function to determine what the phenotype should allow.

And to the last point:
Quote
Well, in the actual DB instructions already are tokenized (that is, cond isn't really the "cond" string, but is represented by a single position in the dna array
This I already know, but the genome is in a array, and that is what my point is about.

I do [span style=\'font-size:21pt;line-height:100%\']NOT[/span] say my system is better. It could endup as I said before, that it is worse, but then so be it! If it is better... well let DarwinBots evolve  ;)
10010011000001110111110100111011001101100100000110110111000011101011110010110000
011000011000001100010110010111101001110100110010111100101000001000001111001011101
001101001110011011010011100011110100111000011101100100000100110011010011100110110
010110000011100111101001110110111101011101100110000111101001101001110111111011101
01100100000111010011010001100001110111010000010001001000010100001

Offline Numsgil

  • Administrator
  • Bot God
  • *****
  • Posts: 7742
    • View Profile
Greven's DNA structure!
« Reply #9 on: May 26, 2005, 12:45:47 PM »
1.  I think that the cost/benefit ratio for switching to a string is just too high.  Having the current system makes every DNA bit, numbers included, a single cohesive unit, which makes understanding and manipulating the DNA easier.

The only benefits to strings (concatenation, insertion) all can be created for an array, it just takes some more effort.  If you get good and solid routines to build on, an array can be just as versatile as a string.

2.  When Carlo was talking about 'useful' mutations, he just meant that some mutations are very unlikely to be useful.  For instance, when changing a sysvar into another, if you just pick a random sysvar you'll get much less useful mutations that selecting a close sysvar.  That's what he was getting at, there are ways to shortcut around obviously deleterious mutations, or mutations less likely to produce favorable results.

For my compelte thoughts on this subject, read My new mutations paradigm.

3.  You can put commands between genes.  Try it out.  So the ability to have Junk DNA already exists.  When we talk about adding Junk DNA, what we are really talking about is 'floating' the genes.  We mean having the ability for the condition and body of a gene to move around independantly, and the start and end markers for both to move around independantly.

4.  There are more than 20 commands.  You have the flow commands, logic operators, actual commands, and comparisons.

Your system would work, but the question is is it worth the effort?  I don't see the structural changes you suggested as adding anything to the program from the bots' point of view.  It just makes DNA manipulation in-program slightly easier.

Offline Carlo

  • Bot Destroyer
  • ***
  • Posts: 122
    • View Profile
Greven's DNA structure!
« Reply #10 on: May 26, 2005, 02:47:16 PM »
Quote
I know strings are just arrays, but in VB strings have built in commands to modify a string, this will make it easier to build mutation routines, which are faster and more reliable, and less prone to bugs etc.

Well, there are already in DB code a bunch of routines dealing with dnas. They provide all the functionality of the three or four (!) strings handling functions that VB offers, but they also are dedicated for dna handling, which means that they provide specific functionality. Please note that recreating the basic features of the string functions that VB provides, in an absolutely reliable way, requires probably one hour of trivial programming.

More, speed is simply a non issue, as mutation routines are called relatively rarely.

Quote
What is a meaningfull mutation?

Well, with meaningful mutation I mean a mutation which produces a meaningful code. For instance, if you have the VB code:

For x=1 to 100
  print "xyz"
next x

then, the mutation

x=1 to 100
  print "xyz"
next x

is NOT a meaningful mutation, because it breaks the syntax of the code.

If you have the DB code

cond *.eye5 > 0 start ... stop

and you mutate in

cond > > 0 start ... stop

this is not a meaningful mutation, because the parser (probably) won't be able to execute it.

So, meaningful mutations are mutations which produce an executable code. Nature works differently, but with at max a few thousands individuals in a microscopic environment, you understand that saving cpu cycles to  execute only well formed dna codes is a good idea.

Said that, I think we could make the language more flexible, allowing for example, operators into conds, etc. The idea is that the more mutations have chances to be meaningful, the better. For instance, we could allow something like

cond *.eye5 3 add 5 > start .... cond *.eye3 5 > start ... stop

where everything can be everywhere, a cond evaluates to true if there is, say, something >0 into the stack top, the execution just starts after a valid cond and stops only when a stop in encountered.

Quote
But I do not agree on DB has a some sort of phenotype. Yes it is a optimiztion 'trick' but still in allowes much more of the genome to be unexpressed, now everything is express (or nearly everything).

I meant to say that what is called "phenotype" in real life is something darwinbots already, to some degree, shows. If you have a gene putting 0 in var x, and after that another gene putting 5 in the same var, the effect will be of completley hiding the first gene. But the gene is there. Cut off the second gene, and the effect of the first will appear again. This is phenotype.
What you are proposing is not phenotype: is simply a trick to speed up dna execution: something that has interest for programmers, not from the point of view of evolution.
« Last Edit: May 26, 2005, 02:48:39 PM by Carlo »

Offline Numsgil

  • Administrator
  • Bot God
  • *****
  • Posts: 7742
    • View Profile
Greven's DNA structure!
« Reply #11 on: May 26, 2005, 03:54:20 PM »
As Carlo said, mutations are not the bottleneck of the program.

The main bottleneck is still the code that checks what bots a bot can see.  This probably takes 25% of the execution time at least, especially as the number of bots increases.  I've been working on lowering this from time to time, but that's where the problem is.

If anyone really cares, I can make a list of what functions take how long, in order.

Offline Carlo

  • Bot Destroyer
  • ***
  • Posts: 122
    • View Profile
Greven's DNA structure!
« Reply #12 on: May 26, 2005, 05:07:00 PM »
Quote
The main bottleneck is still the code that checks what bots a bot can see.

Don't know if it has already been done, but maybe a buckets system would improve things. If thing are still as I programmed them, there is a list of robots which is kept ordered along the x coord, so that checking for collisions and view requires only to check for the nearest robots along the x coordinate. This means that if you have large fields, you may have to check for tons of bots which are way too far along the y coord to see if they are in range for view.
The idea of buckets is simple. Just throw away the ordered list and put in ints place a grid of lists of orbots, each of them linked to the near ones. When a robot moves outside a square of the field, it is removed from a list and put on the neighbour list. When you have to check for view and collisions, you just have to check, for each robot, the robots which are in the same list and in a range of near lists. This means no longer checking for robots which are far away along the Y axis.
« Last Edit: May 26, 2005, 05:07:50 PM by Carlo »

Offline Numsgil

  • Administrator
  • Bot God
  • *****
  • Posts: 7742
    • View Profile
Greven's DNA structure!
« Reply #13 on: May 26, 2005, 05:17:28 PM »
That's not a bad idea.  I'll look at how much work it would be to implement.

I've mostly just been finding ways to streamline the x ordering of the bots and comparisons.  An entirely new structure (a KD tree also comes to mind) might speed up things considerably.

Of course, I've never actually programmed a large and complex system like that yet (that's next semester), so I'd have to dinker around with it for a while.