Schwarzian ways for your Suffix arrays 
Author Message
 Schwarzian ways for your Suffix arrays

A "suffix array" of a string is the sorted list of its suffixes.
They're extremely useful for searches (I see "binary search for camel
in a suffix array of _Hamlet_" in my mind's eye...).  [Manber, Myers]
give serious (efficient) algorithms for building suffix arrays.

Here's the trivial (and *extremely* inefficient) way to do it in Perl,
courtesy of the Schwarzian transform.  The "map" is only needed to
print it nicely, of course (storing the actual suffixes is *not* done
in practical applications).

<lindt 102 [10:08] ~ >perl -de 42

$ perl -de 42

Loading DB routines from perl5db.pl version 1.01
Emacs support available.

Enter h or `h h' for help.

main::(-e:1):   42
  DB<1> $t = "the sixth sick sheikh's sixth sheep's sick"

  DB<2> x map { substr $t, $_ } (sort { (substr $t,$a) cmp (substr $t,$b) }  (0..length($t)-1))
0  ' sheep\'s sick'
1  ' sheikh\'s sixth sheep\'s sick'
2  ' sick'
3  ' sick sheikh\'s sixth sheep\'s sick'
4  ' sixth sheep\'s sick'
5  ' sixth sick sheikh\'s sixth sheep\'s sick'
6  '\'s sick'
7  '\'s sixth sheep\'s sick'
8  'ck'
9  'ck sheikh\'s sixth sheep\'s sick'
10  'e sixth sick sheikh\'s sixth sheep\'s sick'
11  'eep\'s sick'
12  'eikh\'s sixth sheep\'s sick'
13  'ep\'s sick'
14  'h sheep\'s sick'
15  'h sick sheikh\'s sixth sheep\'s sick'
16  'h\'s sixth sheep\'s sick'
17  'he sixth sick sheikh\'s sixth sheep\'s sick'
18  'heep\'s sick'
19  'heikh\'s sixth sheep\'s sick'
20  'ick'
21  'ick sheikh\'s sixth sheep\'s sick'
22  'ikh\'s sixth sheep\'s sick'
23  'ixth sheep\'s sick'
24  'ixth sick sheikh\'s sixth sheep\'s sick'
25  'k'
26  'k sheikh\'s sixth sheep\'s sick'
27  'kh\'s sixth sheep\'s sick'
28  'p\'s sick'
29  's sick'
30  's sixth sheep\'s sick'
31  'sheep\'s sick'
32  'sheikh\'s sixth sheep\'s sick'
33  'sick'
34  'sick sheikh\'s sixth sheep\'s sick'
35  'sixth sheep\'s sick'
36  'sixth sick sheikh\'s sixth sheep\'s sick'
37  'th sheep\'s sick'
38  'th sick sheikh\'s sixth sheep\'s sick'
39  'the sixth sick sheikh\'s sixth sheep\'s sick'
40  'xth sheep\'s sick'
41  'xth sick sheikh\'s sixth sheep\'s sick'

References: [Manber, Myers] Udi Manber & Gene Myers, _Suffix arrays: A
new method for on-line string searches_, available from
http://www.*-*-*.com/

--

Compugen Ltd.          |Tel: +972-2-6795059 (Jerusalem) \  NEW IMPROVED URL!
72 Pinhas Rosen St.    |Tel: +972-3-7658520 (Main office)`--------------------
Tel-Aviv 69512, ISRAEL |Fax: +972-3-7658555   http://www.*-*-*.com/ ~ariels



Sun, 11 Nov 2001 03:00:00 GMT  
 Schwarzian ways for your Suffix arrays

[Posted and Mailed]

Quote:
> Here's the trivial (and *extremely* inefficient) way to do it in Perl,
> courtesy of the Schwarzian transform.  

Not only is the Schwartzian Transform inappropriate here, but you
didn't use it anyway.

You are supposed to use the Schwartzian Transform (ST) when you have a
list of data that you want sorted by some criterion which should *not*
appear in the result.  Here you are sorting a list of numbers
according to the alphabetic order of the suffixes that they represent.
If you wanted to end up with a list of numbers, it would make sense to
use the ST.  But if you want to end up with a list of suffixes instead
of a list of numbers, it is faster and simpler to just generate the
list of strings and sort that directly.  Instead of your example:

Quote:
>   DB<2> x map { substr $t, $_ } (sort { (substr $t,$a) cmp (substr $t,$b) }  
>               (0..length($t)-1))

Just do the obvious thing and say

            sort map { substr $t, $_ } (0..length($t)-1));

Your version makes 382 calls to substr instead of 42.  

If you wanted to end up with a list of numbers instead of a list of
suffixes, the ST would make sense, but you didn't use it.  The point
of the ST is to avoid those 382 calls to `substr'.  An ST version of
this sort will avoid calling `substr' 382 times, by generating a list
of items like these:

        ['the sixth sick sheikh\'s sixth sheep\'s sick', 0]
        ['he sixth sick sheikh\'s sixth sheep\'s sick', 1]
        ...
        ['sick', 38]
        ['ick', 39]
        ['ck', 40]
        ['k', 41]

Then it sorts them by the string parts, throws the string parts away,
and keeps the numbers:

        map { $_[1] }
          sort { $a->[0] <=> $b->[0] }
            map { [ substr($t, $_), $_  ] }
              (0..length($t)-1);

I have a short article about this at

        http://www.plover.com/~mjd/perl/TPC/1998/Hardware-notes.html#Schwartz...



Sun, 11 Nov 2001 03:00:00 GMT  
 Schwarzian ways for your Suffix arrays

Quote:

> [Posted and Mailed]
> > Here's the trivial (and *extremely* inefficient) way to do it in Perl,
> > courtesy of the Schwarzian transform.  

> Not only is the Schwartzian Transform inappropriate here, but you
> didn't use it anyway.

This is absolutely correct, and I should apologise for posting with
what mind I may have disengaged.

To my defence I can only say that I also got the name wrong, so I
didn't tarnish Randall Schwartz's reputation.

Quote:
> [explanation of The True Way snipped]

--

Compugen Ltd.          |Tel: +972-2-6795059 (Jerusalem) \  NEW IMPROVED URL!
72 Pinhas Rosen St.    |Tel: +972-3-7658520 (Main office)`--------------------
Tel-Aviv 69512, ISRAEL |Fax: +972-3-7658555  http://3w.compugen.co.il/~ariels


Mon, 12 Nov 2001 03:00:00 GMT  
 Schwarzian ways for your Suffix arrays

# To my defence I can only say that I also got the name wrong, so I
# didn't tarnish Randall Schwartz's reputation.

That's OK, you got his first name wrong here, too.  :)

--

%PGPKey = ('B76E72AD', [1024, '0824090B CE73CA10  1FF77F13 8180B6B6'])



Mon, 12 Nov 2001 03:00:00 GMT  
 
 [ 4 post ] 

 Relevant Pages 

1. suffix of associate array's file.

2. Different ways of initialising a two dimensional array

3. Safe Ways of Passing Arrays When Using Strict Local Variables

4. Perl suffix tree module/function?

5. proper suffix for perl program names

6. Problem: IIS 4 doesn't pass URL path suffix to scripts

7. Date Suffix Routine - 1st, 2nd, 3rd...

8. Unfamiliar with .z suffix files...

9. file type suffix explanation docs?

10. is .pl suffix legit ?

11. Seeking advice on semantics (pros / cons and best ways of) using constants across modules

12. fast ways to count substrings

 

 
Powered by phpBB® Forum Software