Darwinbots Forum

Code center => Darwinbots Program Source Code => Topic started by: Botsareus on February 07, 2014, 02:08:30 PM

Title: Need fastest possible code to compare dna
Post by: Botsareus on February 07, 2014, 02:08:30 PM
I am currently using the original DNA structure:
Code: [Select]
  DNA() As block            ' program array
...
Type block
  tipo As Integer
  value As Integer
End Type

All my attempts to write a fast algorithm in vb6 to compare and calculate how different two DNAs are where futile.

I need a fast algorithm to compare two DNAs. If it has to be low level, that is fine. As long as I can call an API to it. This kind of DLL may be useful for DB3 as well, that is why I posted here.

Also, if Percentage to the nearest Whole Number is faster the Floating Point, please do Percentage. That is fine.

Thank you Numsgil, I hope you can pick your brain from your awesome video game programming experience and pull out something for me.
Title: Re: Need fastest possible code to compare dna
Post by: JRaccoon on February 07, 2014, 02:19:43 PM
My first thought is some kind of modified two character Levenshtein distance. I can expand on this process if you tell me more about what you consider similar DNA.

1) Should the block be considered distinct if just the value is different and the tipo is the same.
2) Should the block be considered distinct if just the tipo is different and the value is the same.
3) Should the block be considered distinct if either the tipo or the value is the same edit: different.

The nice thing about the Levenshtein distance algorithm is you should be able to find a very efficient solution in almost any language.
Title: Re: Need fastest possible code to compare dna
Post by: Botsareus on February 07, 2014, 02:44:34 PM
The block should be considered distinct if either the tipo or the value is different. A change in any of these data points constitutes a different dna statement. There could be runs of identical DNA and runs of different DNA. I attempted to port windiff logic as much as possible just by guessing.
Title: Re: Need fastest possible code to compare dna
Post by: JRaccoon on February 07, 2014, 03:10:42 PM
Okay, that's what I would have guessed. Instead of Strings (which can be thought of as character arrays), you'll have the two complete DNA sequences as your strings. The only major modification you should have to make to the standard algorithm is instead of comparing a single character, you'll have to compare two Integers and only consider that block (character) equivalent if both tipo and value are the same.

Another option, if you have access to double-wide data types, would be to store both the tipo (What does this stand for?) and the value in a single Double (or whatever it might be called) and then you would only have to compare a single data point. This approach would likely be less efficient though.

I'd offer up some pseudo-code, but I don't really know Visual Basic which is the language of choice for DarwinBots if memory (and your code sample) serves. For the standard String comparison though, Rosetta Code has an example. http://rosettacode.org/wiki/Levenshtein_distance#Visual_Basic_.NET
I can't really comment on the efficiency of that implementation but it might be enough to get you started.
Title: Re: Need fastest possible code to compare dna
Post by: Botsareus on February 07, 2014, 03:16:57 PM
Quote
tipo (What does this stand for?)

We brake up dna into command types. Some examples are:

*.sysvar, .sysvar, basic command( add, subtract, etc.), advanced command (angle, distance, etc.)



I wonder if there is any low level tricks to make it faster though.
Also, the code returns a matrix of strings not a ratio.
Also, why are we doing "Math.Min" sounds like data modification to me.

Thanks for your help btw.
Title: Re: Need fastest possible code to compare dna
Post by: JRaccoon on February 07, 2014, 03:21:13 PM
So that's just Spanish for type then? I'm guessing that dates all the way back to Carlo's original code.

I was trying to interpret it as some special abbreviation, and couldn't come up with anything.
Title: Re: Need fastest possible code to compare dna
Post by: Botsareus on February 07, 2014, 03:26:12 PM
Italian actually  :P
Title: Re: Need fastest possible code to compare dna
Post by: Botsareus on February 07, 2014, 03:34:34 PM
Anyway, I am going to read up on Levenshtein distance on Wikipedia. If I still do not know what to do with the resulting matrix after that, I'll be asking.
I still want to get Numsgil view on these though.
Title: Re: Need fastest possible code to compare dna
Post by: JRaccoon on February 07, 2014, 05:01:13 PM
A couple of points about the edit distance produced by the Levenshtein algorithm:

I looked up and there is a double-wide integral type for Visual Basic (LONG and ULONG), so my suggestion to simplify the algorithm (with at least a small hit to performance) stands as an alternative. Each block could become a single value for comparison (I have virtually no experience with Visual Basic so I apologize if the code below is gibberish):
Code: [Select]
Dim nucleic As Long
nucleic = tipo << 32          ' Edit: You will probably need to cast tipo as a long for this to work, based on a cursory glance at the MSDN pages
nucleic = nucleic Or value
Title: Re: Need fastest possible code to compare dna
Post by: Botsareus on February 07, 2014, 05:11:36 PM
<< is not supported in vb



See picky. I kinda get what I am looking at. What I don't understand is what to do with this data to get a ratio of how different two strings are. Where 100% will be the same and 0% completely different (as an example.)  I also suspect there is optimization that can be done based on the type of result I need. Thx.


edit: Had to remember to actually upload my picky
Title: Re: Need fastest possible code to compare dna
Post by: JRaccoon on February 07, 2014, 05:24:59 PM
I highlighted the final edit distance in the picture. To convert to a percent difference, I think the most intuitive way would be to run through the algorithm and then do: percentage = edit_distance/size_of_longest_array * 100 percentage = (1 - edit_distance/size_of_longest_array) * 100

So in the kitten:sitting example you'd get:
Code: [Select]
percentage = (1 - 3/7)*100 = ~58%
And in the Saturday:Sunday example you'd get:
Code: [Select]
percentage = (1 - 3/8)*100 = ~63%
Edit: Fixed the example to correspond to your request. Percentage now represents similarity rather than difference.
Title: Re: Need fastest possible code to compare dna
Post by: JRaccoon on February 07, 2014, 05:34:50 PM
<< is not supported in vb

This MSDN page (http://msdn.microsoft.com/en-us/library/7haw1dex(v=vs.120).aspx) told me that it is present in all versions of VB dating back to VS2005. I guess you can't always trust what you read on the internet.  :blink:
Title: Re: Need fastest possible code to compare dna
Post by: Botsareus on February 07, 2014, 05:54:54 PM
Wow, this is really cool. I will have to play with it later. Thank you for your help.
Title: Re: Need fastest possible code to compare dna
Post by: Peter on February 07, 2014, 05:55:46 PM
Isn't VB6 older than that.  :P
Title: Re: Need fastest possible code to compare dna
Post by: Botsareus on February 07, 2014, 06:00:08 PM
1998. But don't forget that as Testlund pointed out, it took us 10 years to reach alpha.
Title: Re: Need fastest possible code to compare dna
Post by: JRaccoon on February 07, 2014, 08:00:08 PM
Oh gosh. I forgot that you were developing in the "ancient" VB6. Of course that makes more sense with the values in DB having a range of -32,768 to 32,767.

Luckily, "<< 16" can be replicated by multiplying by 2^16 (Using 16 instead of 32 because VB6 Integers are half the size of VB.NET Integers).

Code: [Select]
Dim nucleic As Long
nucleic = tipo * 2^16          ' Edit: You will still probably need to cast tipo as a long for this to work
nucleic = nucleic Or value

And just so that I'm clear, combining the tipo and value isn't required for the algorithm to work. You can also compare them individually and if either are different, increase the local edit distance.

I also remembered about the Damerau?Levenshtein distance which can handle transposition of adjacent elements. I don't think you'd want this in your particular case, but I thought I'd throw it out there.
Title: Re: Need fastest possible code to compare dna
Post by: Botsareus on February 08, 2014, 11:08:31 AM
I'll play with both versions and see which one is faster.

You never lose love in developing in your first programming language, especially as a hobby.  :P

Title: Re: Need fastest possible code to compare dna
Post by: Numsgil on February 08, 2014, 01:50:22 PM
I believe there's already something for sex repro to match up how similar two DNAs are.  It does a recursive greedy search to try and match identical parts of the DNA together.

Levenshtein distance is also a solid choice.  It's O(n^2), so you don't want to be calling it for every DNA pair every frame, but it should be fast enough.
Title: Re: Need fastest possible code to compare dna
Post by: Botsareus on February 09, 2014, 10:17:29 PM
Quote
I believe there's already something for sex repro to match up how similar two DNAs are.

There is, I wrote it (remember?). It is slow as hell though. How much more speed will I get from Levenshein? I will find that one out.

edit: Here is how it looks like in good old vb6 if anyone is interested  ;)

Code: [Select]
'si = start index, ei = end index, iinc = layer
Private Sub FindLongestSequences(ByRef rob1() As block2, ByRef rob2() As block2, si1 As Integer, ei1 As Integer, si2 As Integer, ei2 As Integer, ByVal timee As Long)
'Step1 What index range is smaller?
Dim searchlen As Integer
searchlen = ei1 - si1
If ei2 - si2 < searchlen Then searchlen = ei2 - si2
'Step2 Recrusivelly sweep from largest to shortest searchlen until match is found
Dim mylen As Integer
For mylen = (searchlen + 1) To 1 Step -1

If Timer - 32 > timee Then Exit Sub 'safe

    'Step2A The sweep itself
    Dim sweep1 As Integer
    Dim sweep2 As Integer
   
    For sweep1 = si1 To ei1 - (mylen - 1)
        For sweep2 = si2 To ei2 - (mylen - 1)
       
            'the match algo
            Dim lenloop As Integer
            Dim allmatch As Boolean 'are all values the same for this sweep?
            allmatch = True
            For lenloop = 0 To mylen - 1
                If rob1(lenloop + sweep1).tipo <> rob2(lenloop + sweep2).tipo Or rob1(lenloop + sweep1).value <> rob2(lenloop + sweep2).value Then
                    allmatch = False
                    Exit For
                End If
            Next
            If allmatch Then
            'match is found, goto step3
                iinc = iinc + 1
                For lenloop = 0 To mylen - 1
                    rob1(lenloop + sweep1).match = iinc
                    rob2(lenloop + sweep2).match = iinc
                Next
                GoTo step3
            End If
        Next
    Next
Next
Exit Sub
step3:
'find lefthand subsequance
If sweep1 > si1 And sweep2 > si2 Then FindLongestSequences rob1, rob2, si1, sweep1 - 1, si2, sweep2 - 1, timee
'find righthand subsequance
If sweep1 + (mylen - 1) < ei1 And sweep2 + (mylen - 1) < ei2 Then FindLongestSequences rob1, rob2, sweep1 + mylen, ei1, sweep2 + mylen, ei2, timee
End Sub
Title: Re: Need fastest possible code to compare dna
Post by: PhiNotPi on February 10, 2014, 08:08:44 PM
Look at the "possible improvements" section of the wiki article for the Wagner-Fischer algorithm (which computes the Levenshtein edit distance).  Wiki article: http://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorithm (http://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorithm)

It says that we could potentially make substantial improvements to speed, depending on what we want.  For example, if we have an upper bound on the distance, then we can substantially reduce the amount of time.  Also, using lazy evaluation apparently gives a substantial speed improvement (details here, although I don't completely understand them: http://www.csse.monash.edu.au/~lloyd/tildeStrings/Alignment/92.IPL.html).

On a somewhat unrelated note, if the DNA comes from robots of the same species (they are both mutated forms of the same original DNA) then we can use the mutations history of the robots to compare how related they are.  For instance, if robot A has X mutations and robot B has Y mutations, but the first Z number of mutations are the same for both, the edit distance could be a maximum of (X+Y-2*Z).  We would have to make some small adjustments so that "run" mutations would count as multiple edits. 

If the robots are closely related, then we can use that upper bound to reduce the time needed to calculate edit distance.

Assuming I read the wiki correctly, here is an example.  Let's say that robot A has 600 bp and 34 mutations (with run mutations counting as multiple mutations).  Robot B has 590 bp and 36 mutations. The first 15 mutations are the same.  Normally, the matrix would contain 600*590=354000 entries.  Since we have the upper bound of (34+36-2*15)=40, we can then use the modified algorithm to only evaluate part of the matrix.  In the end, we find that only 600*(2*40+1)=48600 entries need to be computed with the modification.  It is an order-of-magnitude speedup.
Title: Re: Need fastest possible code to compare dna
Post by: Botsareus on February 11, 2014, 12:01:47 PM
TY