Author Topic: Need fastest possible code to compare dna  (Read 1304 times)

Offline JRaccoon

  • Bot Neophyte
  • *
  • Posts: 10
    • View Profile
Re: Need fastest possible code to compare dna
« Reply #15 on: February 07, 2014, 08:00:08 PM »
Oh gosh. I forgot that you were developing in the "ancient" VB6. Of course that makes more sense with the values in DB having a range of -32,768 to 32,767.

Luckily, "<< 16" can be replicated by multiplying by 2^16 (Using 16 instead of 32 because VB6 Integers are half the size of VB.NET Integers).

Code: [Select]
Dim nucleic As Long
nucleic = tipo * 2^16          ' Edit: You will still probably need to cast tipo as a long for this to work
nucleic = nucleic Or value

And just so that I'm clear, combining the tipo and value isn't required for the algorithm to work. You can also compare them individually and if either are different, increase the local edit distance.

I also remembered about the Damerau?Levenshtein distance which can handle transposition of adjacent elements. I don't think you'd want this in your particular case, but I thought I'd throw it out there.

Offline Botsareus

  • Society makes it all backwards - there is a good reason for that
  • Bot God
  • *****
  • Posts: 4483
    • View Profile
    • DJ Paul Kononov
Re: Need fastest possible code to compare dna
« Reply #16 on: February 08, 2014, 11:08:31 AM »
I'll play with both versions and see which one is faster.

You never lose love in developing in your first programming language, especially as a hobby.  :P


Offline Numsgil

  • Administrator
  • Bot God
  • *****
  • Posts: 7704
    • View Profile
Re: Need fastest possible code to compare dna
« Reply #17 on: February 08, 2014, 01:50:22 PM »
I believe there's already something for sex repro to match up how similar two DNAs are.  It does a recursive greedy search to try and match identical parts of the DNA together.

Levenshtein distance is also a solid choice.  It's O(n^2), so you don't want to be calling it for every DNA pair every frame, but it should be fast enough.

Offline Botsareus

  • Society makes it all backwards - there is a good reason for that
  • Bot God
  • *****
  • Posts: 4483
    • View Profile
    • DJ Paul Kononov
Re: Need fastest possible code to compare dna
« Reply #18 on: February 09, 2014, 10:17:29 PM »
Quote
I believe there's already something for sex repro to match up how similar two DNAs are.

There is, I wrote it (remember?). It is slow as hell though. How much more speed will I get from Levenshein? I will find that one out.

edit: Here is how it looks like in good old vb6 if anyone is interested  ;)

Code: [Select]
'si = start index, ei = end index, iinc = layer
Private Sub FindLongestSequences(ByRef rob1() As block2, ByRef rob2() As block2, si1 As Integer, ei1 As Integer, si2 As Integer, ei2 As Integer, ByVal timee As Long)
'Step1 What index range is smaller?
Dim searchlen As Integer
searchlen = ei1 - si1
If ei2 - si2 < searchlen Then searchlen = ei2 - si2
'Step2 Recrusivelly sweep from largest to shortest searchlen until match is found
Dim mylen As Integer
For mylen = (searchlen + 1) To 1 Step -1

If Timer - 32 > timee Then Exit Sub 'safe

    'Step2A The sweep itself
    Dim sweep1 As Integer
    Dim sweep2 As Integer
   
    For sweep1 = si1 To ei1 - (mylen - 1)
        For sweep2 = si2 To ei2 - (mylen - 1)
       
            'the match algo
            Dim lenloop As Integer
            Dim allmatch As Boolean 'are all values the same for this sweep?
            allmatch = True
            For lenloop = 0 To mylen - 1
                If rob1(lenloop + sweep1).tipo <> rob2(lenloop + sweep2).tipo Or rob1(lenloop + sweep1).value <> rob2(lenloop + sweep2).value Then
                    allmatch = False
                    Exit For
                End If
            Next
            If allmatch Then
            'match is found, goto step3
                iinc = iinc + 1
                For lenloop = 0 To mylen - 1
                    rob1(lenloop + sweep1).match = iinc
                    rob2(lenloop + sweep2).match = iinc
                Next
                GoTo step3
            End If
        Next
    Next
Next
Exit Sub
step3:
'find lefthand subsequance
If sweep1 > si1 And sweep2 > si2 Then FindLongestSequences rob1, rob2, si1, sweep1 - 1, si2, sweep2 - 1, timee
'find righthand subsequance
If sweep1 + (mylen - 1) < ei1 And sweep2 + (mylen - 1) < ei2 Then FindLongestSequences rob1, rob2, sweep1 + mylen, ei1, sweep2 + mylen, ei2, timee
End Sub
« Last Edit: February 09, 2014, 10:39:01 PM by Botsareus »

Offline PhiNotPi

  • Bot Builder
  • **
  • Posts: 64
    • View Profile
Re: Need fastest possible code to compare dna
« Reply #19 on: February 10, 2014, 08:08:44 PM »
Look at the "possible improvements" section of the wiki article for the Wagner-Fischer algorithm (which computes the Levenshtein edit distance).  Wiki article: http://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorithm

It says that we could potentially make substantial improvements to speed, depending on what we want.  For example, if we have an upper bound on the distance, then we can substantially reduce the amount of time.  Also, using lazy evaluation apparently gives a substantial speed improvement (details here, although I don't completely understand them: http://www.csse.monash.edu.au/~lloyd/tildeStrings/Alignment/92.IPL.html).

On a somewhat unrelated note, if the DNA comes from robots of the same species (they are both mutated forms of the same original DNA) then we can use the mutations history of the robots to compare how related they are.  For instance, if robot A has X mutations and robot B has Y mutations, but the first Z number of mutations are the same for both, the edit distance could be a maximum of (X+Y-2*Z).  We would have to make some small adjustments so that "run" mutations would count as multiple edits. 

If the robots are closely related, then we can use that upper bound to reduce the time needed to calculate edit distance.

Assuming I read the wiki correctly, here is an example.  Let's say that robot A has 600 bp and 34 mutations (with run mutations counting as multiple mutations).  Robot B has 590 bp and 36 mutations. The first 15 mutations are the same.  Normally, the matrix would contain 600*590=354000 entries.  Since we have the upper bound of (34+36-2*15)=40, we can then use the modified algorithm to only evaluate part of the matrix.  In the end, we find that only 600*(2*40+1)=48600 entries need to be computed with the modification.  It is an order-of-magnitude speedup.
« Last Edit: February 10, 2014, 08:11:08 PM by PhiNotPi »
I am biased neither towards nor against any single mathematical constant.

Offline Botsareus

  • Society makes it all backwards - there is a good reason for that
  • Bot God
  • *****
  • Posts: 4483
    • View Profile
    • DJ Paul Kononov
Re: Need fastest possible code to compare dna
« Reply #20 on: February 11, 2014, 12:01:47 PM »
TY