Parsing RTF 

I'm interested in writing a small RTF parser in Perl.  Given rtf text,  
like that shown below, what is the best way to extract tokens from the  
text?

{\rtf0\ansi{\fonttbl\f0\fswiss Helvetica;}
\margl120
\margr120
{{\attachment0 telephonedirectory2.wp

}

\pard\tx533\tx1067\tx1601\tx2135\tx2668\tx3202\tx3736\tx4270\tx4803\tx5337
\f0\b0\i0\ul0\fs36 This is the body of the message.

Keywords like \rtf and \tx (tab settings) are followed by numbers, and
as can be seen, keywords don't need to be separated by spaces.  At the
moment, I just want to extract the keywords, but later I anticipate
wanting to do more.

For those who don't know anything about RTF (Rich Text Format), all
keywords begin with a \ and groups of keywords (like stylesheets) are
enclosed in {}.
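
To illustrate the keyword syntax described above, here is a minimal sketch of pulling out just the keywords with one global regex. The name rtf_keywords is hypothetical, and the pattern assumes a keyword is a backslash, a run of letters, and an optional (possibly negative) number. Other backslash sequences such as \\ or \{ are consumed and skipped so that an escaped backslash followed by letters is not mistaken for a keyword.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return every control word in $rtf as a
# [name, param] pair; param is undef when the keyword has no number.
sub rtf_keywords {
    my ($rtf) = @_;
    my @found;
    # First alternative: letters plus optional numeric parameter.
    # Second alternative: any other escaped character, which is skipped.
    while ($rtf =~ /\\(?:([a-zA-Z]+)(-?[0-9]+)?|.)/gs) {
        push @found, [$1, $2] if defined $1;
    }
    return @found;
}

# Part of the sample from the post; single quotes keep the backslashes literal.
my $sample = '{\rtf0\ansi{\fonttbl\f0\fswiss Helvetica;}' . "\n" . '\margl120';
for my $kw (rtf_keywords($sample)) {
    my ($name, $param) = @$kw;
    print defined $param ? "$name $param\n" : "$name\n";
}
```

This ignores group structure and plain text entirely, so it only covers the "just extract the keywords" case.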

-Mike



Sat, 04 Mar 1995 11:52:14 GMT  

Quote:

>I'm interested in writing a small RTF parser in Perl.  Given rtf text,  
>like that shown below, what is the best way to extract tokens from the  
>text?

Here's some Perl code that parses RTF and writes output that a Lisp
parser could digest.

#!/usr/local/bin/perl
#

while(<>){
    while($_ ne ''){
        if(s/^\{//){    # open {
            print "( ";
        }elsif(s/^\}//){ # close }
            print ")\n";
        }elsif(s/^\\//){ # control sequence
            if(s/^([a-zA-Z]+)(-?[0-9]*) ?//){ # control word
                if($2 ne ''){ # with parameter
                    print "($1 $2) ";
                }else{
                    print $1, " ";
                }
            }else{ # special control sequence
                if(s/^\'//){ # hex encoded char
                    s/..//;
                    print "#x$& ";
                }elsif(s/^[:{}\\]//){ # single char escape
                    print &lisp_string($&);
                }elsif(s/^\|//){
                    print "rtfFormula ";
                }elsif(s/^\~//){
                    print "rtfNoBrkSpace ";
                }elsif(s/^\_//){ # non-breaking hyphen
                    print "rtfNoBrkHyphen ";
                }elsif(s/^-//){ # optional hyphen
                    print "rtfNoReqHyphen ";
                }elsif(s/^[\n\r]//){
                    print "par ";
                }elsif(s/^\*//){
                    print "rtfOptDest ";
                }else{
                    s/.//;
                    warn "look this one up: $& ", ord($&);
                }
            }
        }else{
            if(s/^\t//){
                print "TAB ";
            }else{
                s/^[^\t\\{}]+// && print &lisp_string($&);
            }
        }
    }

}

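sub lisp_string{
    my($s) = @_;    # work on the argument, not on the rest of the input line in $_

    $s =~ s/\n//g;
    return '' if $s eq '';
    $s =~ s/(["\\])/\\$1/g;    # escape quotes and backslashes for Lisp
    return '"' . $s . '" ';
}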
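
For the "do more later" part of the question, one option is to keep the tokens in a nested structure instead of printing them as you go. The following is a minimal sketch of that idea, not part of the script above: parse_rtf and the hash keys word, param and symbol are hypothetical names. It assumes each {...} group becomes an array ref, and it makes no attempt to decode \'hh hex characters (the quote is stored as a symbol and the two hex digits end up as text).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: parse RTF into nested array refs, one per {...} group.
# Control words become { word => ..., param => ... } hashes, other escapes
# become { symbol => ... } hashes, and plain text is stored as strings.
sub parse_rtf {
    my ($rtf) = @_;
    my $root  = [];
    my @stack = ($root);              # innermost open group is $stack[-1]
    pos($rtf) = 0;
    while (pos($rtf) < length($rtf)) {
        if ($rtf =~ /\G\{/gc) {       # open a new group
            my $group = [];
            push @{ $stack[-1] }, $group;
            push @stack, $group;
        } elsif ($rtf =~ /\G\}/gc) {  # close the current group
            die "unbalanced }\n" if @stack == 1;
            pop @stack;
        } elsif ($rtf =~ /\G\\([a-zA-Z]+)(-?[0-9]+)? ?/gc) {
            push @{ $stack[-1] }, { word => $1, param => $2 };
        } elsif ($rtf =~ /\G\\(.)/gcs) {
            push @{ $stack[-1] }, { symbol => $1 };
        } elsif ($rtf =~ /\G([^\\{}]+)/gc) {
            (my $text = $1) =~ tr/\r\n//d;   # raw newlines are not content
            push @{ $stack[-1] }, $text if length $text;
        } else {
            die "dangling backslash at end of input\n";
        }
    }
    die "unclosed group\n" if @stack > 1;
    return $root;
}

my $tree = parse_rtf('{\rtf0\ansi{\b1 Hi}}');
print scalar(@{ $tree->[0] }), " items in the outer group\n";
```

The /gc modifiers keep the match position when a pattern fails, so each branch tries the same spot in turn without chopping up the string.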



Sun, 05 Mar 1995 02:22:41 GMT  
 