Algorithm for Text Subsets 
Author Message
 Algorithm for Text Subsets

I have a portion of a large Perl program where I have a hash with
multiple values associated with each key. Lets say I have the following
values associated with a key called "Gene":

1. actgattgaacca
2. gattg
3. actggtaaccggttaacctt

The number of values varies from key to key, by the way.

What I am trying to do with the above group is to eliminate any line of
text whose pattern is contained in another. In the above example line
number 2 would be removed because its pattern is found in the middle of
line 1.

The problem I'm having is figuring our how you compare each line to
every other line, move to the next line, compare with all the rest,
etc., etc...

Thanks for the help.

-Dan



Fri, 09 Sep 2005 22:15:50 GMT  
 Algorithm for Text Subsets

Quote:
> I have a portion of a large Perl program where I have a hash with
> multiple values associated with each key. Lets say I have the following
> values associated with a key called "Gene":

> 1. actgattgaacca
> 2. gattg
> 3. actggtaaccggttaacctt

> The number of values varies from key to key, by the way.

> What I am trying to do with the above group is to eliminate any line of
> text whose pattern is contained in another. In the above example line
> number 2 would be removed because its pattern is found in the middle of
> line 1.

> The problem I'm having is figuring our how you compare each line to
> every other line, move to the next line, compare with all the rest,
> etc., etc...

Do it the other way round.  First sort the strings descending by length.
Start with an empty set of strings to keep.  For each original string,
see if it is contained in one of the strings you have already collected.
If it isn't, add it to the collection, else drop it.  In code:


        qw( actgattgaacca gattg actggtaaccggttaacctt);




    }

Note that index() returns -1 if the substring can't be found.

Anno



Fri, 09 Sep 2005 22:54:33 GMT  
 Algorithm for Text Subsets
Thanks! I will give it a go tomorrow.

Your rapid response is quite appreciated
-Dan


Quote:

> > I have a portion of a large Perl program where I have a hash with
> > multiple values associated with each key. Lets say I have the following
> > values associated with a key called "Gene":

> > 1. actgattgaacca
> > 2. gattg
> > 3. actggtaaccggttaacctt

> > The number of values varies from key to key, by the way.

> > What I am trying to do with the above group is to eliminate any line of
> > text whose pattern is contained in another. In the above example line
> > number 2 would be removed because its pattern is found in the middle of
> > line 1.

> > The problem I'm having is figuring our how you compare each line to
> > every other line, move to the next line, compare with all the rest,
> > etc., etc...

> Do it the other way round.  First sort the strings descending by length.
> Start with an empty set of strings to keep.  For each original string,
> see if it is contained in one of the strings you have already collected.
> If it isn't, add it to the collection, else drop it.  In code:


>         qw( actgattgaacca gattg actggtaaccggttaacctt);




>     }

> Note that index() returns -1 if the substring can't be found.

> Anno



Sat, 10 Sep 2005 03:03:55 GMT  
 Algorithm for Text Subsets

Quote:

> I have a portion of a large Perl program where I have a hash with
> multiple values associated with each key. Lets say I have the
> following values associated with a key called "Gene":

> 1. actgattgaacca
> 2. gattg
> 3. actggtaaccggttaacctt

> The number of values varies from key to key, by the way.

> What I am trying to do with the above group is to eliminate any line
> of text whose pattern is contained in another. In the above example
> line number 2 would be removed because its pattern is found in the
> middle of line 1.

> The problem I'm having is figuring our how you compare each line to
> every other line, move to the next line, compare with all the rest,
> etc., etc...

If you don't mind the order being rearranged, then code like the
following should work:

while( my ($key, $value) = each %bighash ) {
   my $keep = "";

      ($keep .= $val) .= " " if index($keep, $val) == -1;
   }

Quote:
}

If you need the kept items to be in the same order as they originally
were in (other than not having the now eliminated items), then try
something like this:

while( my ($key, $value) = each %bighash ) {
   my $keep = "";

   foreach my $index (
      sort { length $$values[$b] <=> length $$values[$a] }
      0 .. $#$values ) {
      if( index( $keep, $$values[$index] ) == -1 ) {
         ($keep .= $$values[$index]) .= " ";

      }
   }

Quote:
}

[both untested]

--





Sat, 10 Sep 2005 03:51:13 GMT  
 Algorithm for Text Subsets
I'm wondering if this code works if two strings are the same length, yet
contain different characters...

-Dan

Quote:


> > I have a portion of a large Perl program where I have a hash with
> > multiple values associated with each key. Lets say I have the following
> > values associated with a key called "Gene":

> > 1. actgattgaacca
> > 2. gattg
> > 3. actggtaaccggttaacctt

> > The number of values varies from key to key, by the way.

> > What I am trying to do with the above group is to eliminate any line of
> > text whose pattern is contained in another. In the above example line
> > number 2 would be removed because its pattern is found in the middle of
> > line 1.

> > The problem I'm having is figuring our how you compare each line to
> > every other line, move to the next line, compare with all the rest,
> > etc., etc...

> Do it the other way round.  First sort the strings descending by length.
> Start with an empty set of strings to keep.  For each original string,
> see if it is contained in one of the strings you have already collected.
> If it isn't, add it to the collection, else drop it.  In code:


>         qw( actgattgaacca gattg actggtaaccggttaacctt);




>     }

> Note that index() returns -1 if the substring can't be found.

> Anno



Sat, 10 Sep 2005 16:07:42 GMT  
 Algorithm for Text Subsets
Unfortunately, the code below does not work properly. I've got the
sorting of the keys in descending length to work, but neither Anno's or
your algorithm for removing a string pattern that is found in another
string works.

I'll keep at it...
-Dan

Quote:


> > I have a portion of a large Perl program where I have a hash with
> > multiple values associated with each key. Lets say I have the
> > following values associated with a key called "Gene":

> > 1. actgattgaacca
> > 2. gattg
> > 3. actggtaaccggttaacctt

> > The number of values varies from key to key, by the way.

> > What I am trying to do with the above group is to eliminate any line
> > of text whose pattern is contained in another. In the above example
> > line number 2 would be removed because its pattern is found in the
> > middle of line 1.

> > The problem I'm having is figuring our how you compare each line to
> > every other line, move to the next line, compare with all the rest,
> > etc., etc...

> If you don't mind the order being rearranged, then code like the
> following should work:

> while( my ($key, $value) = each %bighash ) {
>    my $keep = "";

>       ($keep .= $val) .= " " if index($keep, $val) == -1;
>    }

> }

> If you need the kept items to be in the same order as they originally
> were in (other than not having the now eliminated items), then try
> something like this:

> while( my ($key, $value) = each %bighash ) {
>    my $keep = "";

>    foreach my $index (
>       sort { length $$values[$b] <=> length $$values[$a] }
>       0 .. $#$values ) {
>       if( index( $keep, $$values[$index] ) == -1 ) {
>          ($keep .= $$values[$index]) .= " ";

>       }
>    }

> }

> [both untested]

> --






Sat, 10 Sep 2005 18:14:15 GMT  
 Algorithm for Text Subsets
A thousand pardons, Benjamin. There was simply a typographical error in
your code.

It works quite fine right now!
-Dan

Quote:

> Unfortunately, the code below does not work properly. I've got the
> sorting of the keys in descending length to work, but neither Anno's or
> your algorithm for removing a string pattern that is found in another
> string works.

> I'll keep at it...
> -Dan



> > > I have a portion of a large Perl program where I have a hash with
> > > multiple values associated with each key. Lets say I have the
> > > following values associated with a key called "Gene":

> > > 1. actgattgaacca
> > > 2. gattg
> > > 3. actggtaaccggttaacctt

> > > The number of values varies from key to key, by the way.

> > > What I am trying to do with the above group is to eliminate any line
> > > of text whose pattern is contained in another. In the above example
> > > line number 2 would be removed because its pattern is found in the
> > > middle of line 1.

> > > The problem I'm having is figuring our how you compare each line to
> > > every other line, move to the next line, compare with all the rest,
> > > etc., etc...

> > If you don't mind the order being rearranged, then code like the
> > following should work:

> > while( my ($key, $value) = each %bighash ) {
> >    my $keep = "";

> >       ($keep .= $val) .= " " if index($keep, $val) == -1;
> >    }

> > }

> > If you need the kept items to be in the same order as they originally
> > were in (other than not having the now eliminated items), then try
> > something like this:

> > while( my ($key, $value) = each %bighash ) {
> >    my $keep = "";

> >    foreach my $index (
> >       sort { length $$values[$b] <=> length $$values[$a] }
> >       0 .. $#$values ) {
> >       if( index( $keep, $$values[$index] ) == -1 ) {
> >          ($keep .= $$values[$index]) .= " ";

> >       }
> >    }

> > }

> > [both untested]

> > --






Sat, 10 Sep 2005 18:23:57 GMT  
 Algorithm for Text Subsets

Quote:
> I'm wondering if this code works if two strings are the same length, yet
> contain different characters...

What code?  If you have a question, please put it in context.

Quote:
> -Dan



[TOFU snipped]

Anno



Sat, 10 Sep 2005 20:21:11 GMT  
 
 [ 8 post ] 

 Relevant Pages 

1. Need Search Engine like algorithm to scan text

2. algorithm to htmlify text/plain-list ?

3. REQ: algorithm for DES::encrypt/decrypt of text strings

4. Small subset of Perl....

5. make subset of the search result

6. Perl tuning/speed question: all subsets of a set of length k -- keeping the inital order

7. Simple Subset Question

8. newbie Q:create a subset

9. alphabetical subset of hash keys

10. all unique subsets of a list

11. Subset Question

12. Spacing algorithm

 

 
Powered by phpBB® Forum Software