Author Topic: Need fastest possible code to compare dna  (Read 9488 times)

Offline Botsareus

  • Society makes it all backwards - there is a good reason for that
  • Bot God
  • *****
  • Posts: 4483
    • View Profile
Need fastest possible code to compare dna
« on: February 07, 2014, 02:08:30 PM »
I am currently using the original DNA structure:
Code: [Select]
  DNA() As block            ' program array
...
Type block
  tipo As Integer
  value As Integer
End Type

All my attempts to write a fast algorithm in vb6 to compare and calculate how different two DNAs are where futile.

I need a fast algorithm to compare two DNAs. If it has to be low level, that is fine. As long as I can call an API to it. This kind of DLL may be useful for DB3 as well, that is why I posted here.

Also, if Percentage to the nearest Whole Number is faster the Floating Point, please do Percentage. That is fine.

Thank you Numsgil, I hope you can pick your brain from your awesome video game programming experience and pull out something for me.

Offline JRaccoon

  • Bot Neophyte
  • *
  • Posts: 10
    • View Profile
Re: Need fastest possible code to compare dna
« Reply #1 on: February 07, 2014, 02:19:43 PM »
My first thought is some kind of modified two character Levenshtein distance. I can expand on this process if you tell me more about what you consider similar DNA.

1) Should the block be considered distinct if just the value is different and the tipo is the same.
2) Should the block be considered distinct if just the tipo is different and the value is the same.
3) Should the block be considered distinct if either the tipo or the value is the same edit: different.

The nice thing about the Levenshtein distance algorithm is you should be able to find a very efficient solution in almost any language.
« Last Edit: February 07, 2014, 02:50:41 PM by JRaccoon »

Offline Botsareus

  • Society makes it all backwards - there is a good reason for that
  • Bot God
  • *****
  • Posts: 4483
    • View Profile
Re: Need fastest possible code to compare dna
« Reply #2 on: February 07, 2014, 02:44:34 PM »
The block should be considered distinct if either the tipo or the value is different. A change in any of these data points constitutes a different dna statement. There could be runs of identical DNA and runs of different DNA. I attempted to port windiff logic as much as possible just by guessing.
« Last Edit: February 07, 2014, 02:49:06 PM by Botsareus »

Offline JRaccoon

  • Bot Neophyte
  • *
  • Posts: 10
    • View Profile
Re: Need fastest possible code to compare dna
« Reply #3 on: February 07, 2014, 03:10:42 PM »
Okay, that's what I would have guessed. Instead of Strings (which can be thought of as character arrays), you'll have the two complete DNA sequences as your strings. The only major modification you should have to make to the standard algorithm is instead of comparing a single character, you'll have to compare two Integers and only consider that block (character) equivalent if both tipo and value are the same.

Another option, if you have access to double-wide data types, would be to store both the tipo (What does this stand for?) and the value in a single Double (or whatever it might be called) and then you would only have to compare a single data point. This approach would likely be less efficient though.

I'd offer up some pseudo-code, but I don't really know Visual Basic which is the language of choice for DarwinBots if memory (and your code sample) serves. For the standard String comparison though, Rosetta Code has an example. http://rosettacode.org/wiki/Levenshtein_distance#Visual_Basic_.NET
I can't really comment on the efficiency of that implementation but it might be enough to get you started.
« Last Edit: February 07, 2014, 03:13:53 PM by JRaccoon »

Offline Botsareus

  • Society makes it all backwards - there is a good reason for that
  • Bot God
  • *****
  • Posts: 4483
    • View Profile
Re: Need fastest possible code to compare dna
« Reply #4 on: February 07, 2014, 03:16:57 PM »
Quote
tipo (What does this stand for?)

We brake up dna into command types. Some examples are:

*.sysvar, .sysvar, basic command( add, subtract, etc.), advanced command (angle, distance, etc.)



I wonder if there is any low level tricks to make it faster though.
Also, the code returns a matrix of strings not a ratio.
Also, why are we doing "Math.Min" sounds like data modification to me.

Thanks for your help btw.

Offline JRaccoon

  • Bot Neophyte
  • *
  • Posts: 10
    • View Profile
Re: Need fastest possible code to compare dna
« Reply #5 on: February 07, 2014, 03:21:13 PM »
So that's just Spanish for type then? I'm guessing that dates all the way back to Carlo's original code.

I was trying to interpret it as some special abbreviation, and couldn't come up with anything.

Offline Botsareus

  • Society makes it all backwards - there is a good reason for that
  • Bot God
  • *****
  • Posts: 4483
    • View Profile
Re: Need fastest possible code to compare dna
« Reply #6 on: February 07, 2014, 03:26:12 PM »
Italian actually  :P

Offline Botsareus

  • Society makes it all backwards - there is a good reason for that
  • Bot God
  • *****
  • Posts: 4483
    • View Profile
Re: Need fastest possible code to compare dna
« Reply #7 on: February 07, 2014, 03:34:34 PM »
Anyway, I am going to read up on Levenshtein distance on Wikipedia. If I still do not know what to do with the resulting matrix after that, I'll be asking.
I still want to get Numsgil view on these though.

Offline JRaccoon

  • Bot Neophyte
  • *
  • Posts: 10
    • View Profile
Re: Need fastest possible code to compare dna
« Reply #8 on: February 07, 2014, 05:01:13 PM »
A couple of points about the edit distance produced by the Levenshtein algorithm:
  • It will find the distance to transform a string not only with substitutions, but also insertions and deletions. The edit distance between ACG and ATCG is 1 because one insertion is required for the first sequence. I don't know if that is the desired functionality for you.
  • The final result of the computation is a matrix that contains all of the edit distances for each of the substrings. This allows you at the end to find the exact path that generated the complete shortest distance. The final edit distance will be found at I_MAX, J_MAX where those are the lengths of the arrays. You may not need this extra functionality, which means there are optimizations for the space complexity of the algorithm.

I looked up and there is a double-wide integral type for Visual Basic (LONG and ULONG), so my suggestion to simplify the algorithm (with at least a small hit to performance) stands as an alternative. Each block could become a single value for comparison (I have virtually no experience with Visual Basic so I apologize if the code below is gibberish):
Code: [Select]
Dim nucleic As Long
nucleic = tipo << 32          ' Edit: You will probably need to cast tipo as a long for this to work, based on a cursory glance at the MSDN pages
nucleic = nucleic Or value
« Last Edit: February 07, 2014, 05:08:13 PM by JRaccoon »

Offline Botsareus

  • Society makes it all backwards - there is a good reason for that
  • Bot God
  • *****
  • Posts: 4483
    • View Profile
Re: Need fastest possible code to compare dna
« Reply #9 on: February 07, 2014, 05:11:36 PM »
<< is not supported in vb



See picky. I kinda get what I am looking at. What I don't understand is what to do with this data to get a ratio of how different two strings are. Where 100% will be the same and 0% completely different (as an example.)  I also suspect there is optimization that can be done based on the type of result I need. Thx.


edit: Had to remember to actually upload my picky

Offline JRaccoon

  • Bot Neophyte
  • *
  • Posts: 10
    • View Profile
Re: Need fastest possible code to compare dna
« Reply #10 on: February 07, 2014, 05:24:59 PM »
I highlighted the final edit distance in the picture. To convert to a percent difference, I think the most intuitive way would be to run through the algorithm and then do: percentage = edit_distance/size_of_longest_array * 100 percentage = (1 - edit_distance/size_of_longest_array) * 100

So in the kitten:sitting example you'd get:
Code: [Select]
percentage = (1 - 3/7)*100 = ~58%
And in the Saturday:Sunday example you'd get:
Code: [Select]
percentage = (1 - 3/8)*100 = ~63%
Edit: Fixed the example to correspond to your request. Percentage now represents similarity rather than difference.
« Last Edit: February 07, 2014, 05:31:58 PM by JRaccoon »

Offline JRaccoon

  • Bot Neophyte
  • *
  • Posts: 10
    • View Profile
Re: Need fastest possible code to compare dna
« Reply #11 on: February 07, 2014, 05:34:50 PM »
<< is not supported in vb

This MSDN page told me that it is present in all versions of VB dating back to VS2005. I guess you can't always trust what you read on the internet.  :blink:
« Last Edit: February 07, 2014, 05:37:09 PM by JRaccoon »

Offline Botsareus

  • Society makes it all backwards - there is a good reason for that
  • Bot God
  • *****
  • Posts: 4483
    • View Profile
Re: Need fastest possible code to compare dna
« Reply #12 on: February 07, 2014, 05:54:54 PM »
Wow, this is really cool. I will have to play with it later. Thank you for your help.

Offline Peter

  • Bot God
  • *****
  • Posts: 1177
    • View Profile
Re: Need fastest possible code to compare dna
« Reply #13 on: February 07, 2014, 05:55:46 PM »
Isn't VB6 older than that.  :P
Oh my god, who the hell cares.

Offline Botsareus

  • Society makes it all backwards - there is a good reason for that
  • Bot God
  • *****
  • Posts: 4483
    • View Profile
Re: Need fastest possible code to compare dna
« Reply #14 on: February 07, 2014, 06:00:08 PM »
1998. But don't forget that as Testlund pointed out, it took us 10 years to reach alpha.