### Author Topic: Need fastest possible code to compare dna  (Read 2320 times)

#### JRaccoon

• Bot Neophyte
• Posts: 10
##### Re: Need fastest possible code to compare dna
« Reply #15 on: February 07, 2014, 08:00:08 PM »
Oh gosh. I forgot that you were developing in the "ancient" VB6. Of course that makes more sense with the values in DB having a range of -32,768 to 32,767.

Luckily, "<< 16" can be replicated by multiplying by 2^16 (Using 16 instead of 32 because VB6 Integers are half the size of VB.NET Integers).

Code: [Select]
`Dim nucleic As Longnucleic = tipo * 2^16          ' Edit: You will still probably need to cast tipo as a long for this to worknucleic = nucleic Or value `
And just so that I'm clear, combining the tipo and value isn't required for the algorithm to work. You can also compare them individually and if either are different, increase the local edit distance.

I also remembered about the Damerau?Levenshtein distance which can handle transposition of adjacent elements. I don't think you'd want this in your particular case, but I thought I'd throw it out there.

#### Botsareus

• Society makes it all backwards - there is a good reason for that
• Bot God
• Posts: 4483
##### Re: Need fastest possible code to compare dna
« Reply #16 on: February 08, 2014, 11:08:31 AM »
I'll play with both versions and see which one is faster.

You never lose love in developing in your first programming language, especially as a hobby.

#### Numsgil

• Bot God
• Posts: 7715
##### Re: Need fastest possible code to compare dna
« Reply #17 on: February 08, 2014, 01:50:22 PM »
I believe there's already something for sex repro to match up how similar two DNAs are.  It does a recursive greedy search to try and match identical parts of the DNA together.

Levenshtein distance is also a solid choice.  It's O(n^2), so you don't want to be calling it for every DNA pair every frame, but it should be fast enough.

#### Botsareus

• Society makes it all backwards - there is a good reason for that
• Bot God
• Posts: 4483
##### Re: Need fastest possible code to compare dna
« Reply #18 on: February 09, 2014, 10:17:29 PM »
Quote
I believe there's already something for sex repro to match up how similar two DNAs are.

There is, I wrote it (remember?). It is slow as hell though. How much more speed will I get from Levenshein? I will find that one out.

edit: Here is how it looks like in good old vb6 if anyone is interested

Code: [Select]
`'si = start index, ei = end index, iinc = layerPrivate Sub FindLongestSequences(ByRef rob1() As block2, ByRef rob2() As block2, si1 As Integer, ei1 As Integer, si2 As Integer, ei2 As Integer, ByVal timee As Long)'Step1 What index range is smaller?Dim searchlen As Integersearchlen = ei1 - si1If ei2 - si2 < searchlen Then searchlen = ei2 - si2'Step2 Recrusivelly sweep from largest to shortest searchlen until match is foundDim mylen As IntegerFor mylen = (searchlen + 1) To 1 Step -1If Timer - 32 > timee Then Exit Sub 'safe    'Step2A The sweep itself    Dim sweep1 As Integer    Dim sweep2 As Integer        For sweep1 = si1 To ei1 - (mylen - 1)        For sweep2 = si2 To ei2 - (mylen - 1)                    'the match algo            Dim lenloop As Integer            Dim allmatch As Boolean 'are all values the same for this sweep?            allmatch = True            For lenloop = 0 To mylen - 1                If rob1(lenloop + sweep1).tipo <> rob2(lenloop + sweep2).tipo Or rob1(lenloop + sweep1).value <> rob2(lenloop + sweep2).value Then                    allmatch = False                    Exit For                End If            Next            If allmatch Then            'match is found, goto step3                iinc = iinc + 1                For lenloop = 0 To mylen - 1                    rob1(lenloop + sweep1).match = iinc                    rob2(lenloop + sweep2).match = iinc                Next                GoTo step3            End If        Next    NextNextExit Substep3:'find lefthand subsequanceIf sweep1 > si1 And sweep2 > si2 Then FindLongestSequences rob1, rob2, si1, sweep1 - 1, si2, sweep2 - 1, timee'find righthand subsequanceIf sweep1 + (mylen - 1) < ei1 And sweep2 + (mylen - 1) < ei2 Then FindLongestSequences rob1, rob2, sweep1 + mylen, ei1, sweep2 + mylen, ei2, timeeEnd Sub`
« Last Edit: February 09, 2014, 10:39:01 PM by Botsareus »

#### PhiNotPi

• Bot Builder
• Posts: 64
##### Re: Need fastest possible code to compare dna
« Reply #19 on: February 10, 2014, 08:08:44 PM »
Look at the "possible improvements" section of the wiki article for the Wagner-Fischer algorithm (which computes the Levenshtein edit distance).  Wiki article: http://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorithm

It says that we could potentially make substantial improvements to speed, depending on what we want.  For example, if we have an upper bound on the distance, then we can substantially reduce the amount of time.  Also, using lazy evaluation apparently gives a substantial speed improvement (details here, although I don't completely understand them: http://www.csse.monash.edu.au/~lloyd/tildeStrings/Alignment/92.IPL.html).

On a somewhat unrelated note, if the DNA comes from robots of the same species (they are both mutated forms of the same original DNA) then we can use the mutations history of the robots to compare how related they are.  For instance, if robot A has X mutations and robot B has Y mutations, but the first Z number of mutations are the same for both, the edit distance could be a maximum of (X+Y-2*Z).  We would have to make some small adjustments so that "run" mutations would count as multiple edits.

If the robots are closely related, then we can use that upper bound to reduce the time needed to calculate edit distance.

Assuming I read the wiki correctly, here is an example.  Let's say that robot A has 600 bp and 34 mutations (with run mutations counting as multiple mutations).  Robot B has 590 bp and 36 mutations. The first 15 mutations are the same.  Normally, the matrix would contain 600*590=354000 entries.  Since we have the upper bound of (34+36-2*15)=40, we can then use the modified algorithm to only evaluate part of the matrix.  In the end, we find that only 600*(2*40+1)=48600 entries need to be computed with the modification.  It is an order-of-magnitude speedup.
« Last Edit: February 10, 2014, 08:11:08 PM by PhiNotPi »
I am biased neither towards nor against any single mathematical constant.

#### Botsareus

• Society makes it all backwards - there is a good reason for that
• Bot God
• Posts: 4483
##### Re: Need fastest possible code to compare dna
« Reply #20 on: February 11, 2014, 12:01:47 PM »
TY