Text Parsing - Parse::RecDescent or another method? 

I've got a Perl Journal Tutorial on the Parse::RecDescent module in front of
me, but it's not making much sense right now. I'd like some advice on how
best to proceed.

I'm trying to take a file in the format:

BEGIN:VCARD
VERSION:2.1
N:Pryce, Chris
TEL;PREF;WORK: (763) xxx-xxxx
FN:Pryce;Chris
ADR;PREF;WORK:;;1234 Elm Street\nApt. 2;Smalltown;MN;55555
TEL;CELL: (612) xxx-xxxx
END:VCARD

There would be one or more vCards per file.

I'd like to end up with an array of hashrefs.

$card[0] = (
            {ADR => {
                        WORK => {   #deleting the PREF...
                                   street1=>"1234 Elm Street",
                                   street2=>"Apt. 2"
                                   city=>"Smalltown",
                                   State=>"MN",
                                   PostalCode=>"55555"
                                 }
                    }
             },
            {TEL => {
                        WORK=>"(763) 475-0010", #deleting the PREF...
                        CELL=>"(612) 790-1965"
                    }
            },

Etc...

The parser must keep whitespace (except for newlines) on the right side of
the colon intact.

I was thinking a RecDescent parser would be the best, but I've no idea where
to start. Is this the best approach? Or would a series of regular
expressions be better?

I searched CPAN for a module to parse vCards but didn't find one. Is there
one that I missed?

Thanks,

cp



Fri, 20 Aug 2004 20:34:06 GMT  

Quote:

>I've got a Perl Journal Tutorial on the Parse::RecDescent module in front of
>me, but it's not making much sense right now. I'd like some advice on how
>best to proceed.

>I'm trying to take a file in the format:

>BEGIN:VCARD
>VERSION:2.1
>N:Pryce, Chris
>TEL;PREF;WORK: (763) xxx-xxxx
>FN:Pryce;Chris
>ADR;PREF;WORK:;;1234 Elm Street\nApt. 2;Smalltown;MN;55555
>TEL;CELL: (612) xxx-xxxx
>END:VCARD

That format looks too simple to me to warrant a
multiline/recursive-capable parser.

Quote:
>There would be one or more vCards per file.

Even then.

Quote:
>I'd like to end up with a array of hashrefs.

>$card[0] = (
>            {ADR => {
>                        WORK => {   #deleting the PREF...
>                                   street1=>"1234 Elm Street",
>                                   street2=>"Apt. 2"
>                                   city=>"Smalltown",
>                                   State=>"MN",
>                                   PostalCode=>"55555"
>                                 }
>                    }
>             },
>            {TEL => {
>                        WORK=>"(763) 475-0010", #deleting the PREF...
>                        CELL=>"(612) 790-1965"
>                    }
>            },

>Etc...

Wow. Hold it. Make up your mind. $card[0] cannot contain two hashes each
with just one item.

I think

        $card[0] = { ADR => { ... }, TEL => ..., N => ..., ... }

will do.

Quote:
>The parse must keep white space (except for new lines) on the right side of
>the colon intact.

No problem.
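
For what it's worth, a minimal sketch of why: split with a limit of 2 cuts only at the first colon and leaves the rest of the line alone, so the leading space in the value survives. The sample line comes from the data above; the variable names are only for illustration.

```perl
use strict;
use warnings;

my $line = "TEL;PREF;WORK: (763) xxx-xxxx\n";
chomp $line;                        # drop only the trailing newline

# LIMIT of 2: cut at the first colon only; any later colon stays in the value
my ($name, $value) = split /:/, $line, 2;

print "[$name]\n";                  # [TEL;PREF;WORK]
print "[$value]\n";                 # [ (763) xxx-xxxx]
```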

Quote:
>I was thinking a RecDescent parser would be the best, but I've no idea where
>to start. Is this the best approach? Or would a series of regular
>expressions be better?

I think I'd go for line-by-line processing. Even split might do.

The following code appears to produce a result close to what you ask for:


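        # NOTE: the archive dropped every line containing "@"; lines marked
        # "restored" are reconstructed from context, not the original text.
        my(@card, $record);                         # restored
        while(<>) {
            if ((my $new=/^BEGIN:VCARD$/) .. /^END:VCARD$/) {
                push @card, $record = {} if $new;   # restored: new card on BEGIN
                chomp;
                my($name, $value)= split /:/, $_, 2;
                next if $name =~ /^(?:BEGIN|END|VERSION)$/;
                # restored: drop PREF; the last parameter becomes the key
                my @path = grep { $_ ne 'PREF' } split /;/, $name;
                my $key = pop @path;
                my $p = $record;
                foreach (@path) {                   # restored
                    $p = $p->{$_} ||= {};
                }
                if (@path && $path[0] eq 'ADR') {   # restored; exact test is a guess
                    my %r;
                    # restored; field names assumed from the vCard 2.1 ADR layout
                    @r{qw(pobox extended street city state zip country)} =
                      split /;/, $value, -1;
                    $p->{$key} = \%r;
                } elsif($key eq 'FN') {
                    my %r;
                    @r{qw(last first)} = split /;/, $value;   # restored; names assumed
                    $p->{$key} = \%r;
                } else {
                    $p->{$key} = $value;
                }
            }
        }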

--
        Bart.



Fri, 20 Aug 2004 23:01:50 GMT  


Quote:
>> I'm trying to take a file in the format:

>> BEGIN:VCARD
>> VERSION:2.1
>> N:Pryce, Chris
>> TEL;PREF;WORK: (763) xxx-xxxx
>> FN:Pryce;Chris
>> ADR;PREF;WORK:;;1234 Elm Street\nApt. 2;Smalltown;MN;55555
>> TEL;CELL: (612) xxx-xxxx
>> END:VCARD

> That format looks too simple to me to warrant a
> multiline/recursive-capable parser.

>> There would be one or more vCards per file.

> Even then.

>> I'd like to end up with a array of hashrefs.

>> $card[0] = (
>>            {ADR => {
>>                        WORK => {   #deleting the PREF...
>>                                   street1=>"1234 Elm Street",
>>                                   street2=>"Apt. 2"
>>                                   city=>"Smalltown",
>>                                   State=>"MN",
>>                                   PostalCode=>"55555"
>>                                 }
>>                    }
>>             },
>>            {TEL => {
>>                        WORK=>"(763) 475-0010", #deleting the PREF...
>>                        CELL=>"(612) 790-1965"
>>                    }
>>            },

>> Etc...

> Wow. Hold it. Make up your mind. $card[0] cannot contain two hashes each
> with just one item.

I was thinking of accessing the values as $card[0]->{ADR}{WORK},
$card[0]->{TEL}{WORK}

Did I goof up the definition? Is that a workable format? Or too complex?

Quote:
> I think

> $card[0] = { ADR => { ... }, TEL => ..., N => ..., ... }

> will do.

[snip code reference]

I appreciate the code reference, I am trying it on for size now.

cp



Fri, 20 Aug 2004 23:33:53 GMT  


Quote:

>> I'm trying to take a file in the format:

>> BEGIN:VCARD
>> VERSION:2.1
>> N:Pryce, Chris
>> TEL;PREF;WORK: (763) xxx-xxxx
>> FN:Pryce;Chris
>> ADR;PREF;WORK:;;1234 Elm Street\nApt. 2;Smalltown;MN;55555
>> TEL;CELL: (612) xxx-xxxx
>> END:VCARD

> That format looks too simple to me to warrant a
> multiline/recursive-capable parser.

>> There would be one or more vCards per file.

> Even then.

>> I'd like to end up with a array of hashrefs.

>> $card[0] = (
>>            {ADR => {
>>                        WORK => {   #deleting the PREF...
>>                                   street1=>"1234 Elm Street",
>>                                   street2=>"Apt. 2"
>>                                   city=>"Smalltown",
>>                                   State=>"MN",
>>                                   PostalCode=>"55555"
>>                                 }
>>                    }
>>             },
>>            {TEL => {
>>                        WORK=>"(763) 475-0010", #deleting the PREF...
>>                        CELL=>"(612) 790-1965"
>>                    }
>>            },

>> Etc...

[snip]

> I think I'd go for line by line processing. Even split might do.

> The next code appears to produce a result close to what you ask for:


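> # NOTE: the archive dropped every line containing "@"; lines marked
> # "restored" are reconstructed from context, not the original text.
> my(@card, $record);                         # restored
> while(<>) {
>    if ((my $new=/^BEGIN:VCARD$/) .. /^END:VCARD$/) {
>        push @card, $record = {} if $new;   # restored: new card on BEGIN
>        chomp;
>        my($name, $value)= split /:/, $_, 2;
>        next if $name =~ /^(?:BEGIN|END|VERSION)$/;
>        # restored: drop PREF; the last parameter becomes the key
>        my @path = grep { $_ ne 'PREF' } split /;/, $name;
>        my $key = pop @path;
>        my $p = $record;
>        foreach (@path) {                   # restored
>            $p = $p->{$_} ||= {};
>        }
>        if (@path && $path[0] eq 'ADR') {   # restored; exact test is a guess
>            my %r;
>            # restored; field names assumed from the vCard 2.1 ADR layout
>            @r{qw(pobox extended street city state zip country)} =
>              split /;/, $value, -1;
>            $p->{$key} = \%r;
>        } elsif($key eq 'FN') {
>            my %r;
>            @r{qw(last first)} = split /;/, $value;   # restored; names assumed
>            $p->{$key} = \%r;
>        } else {
>            $p->{$key} = $value;
>        }
>    }
> }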

It is very close, and thank you. However, as the data above shows, some
records have more than one TEL (a WORK, a CELL, sometimes a FAX and OTHER).

The code is making one TEL field, set to whatever the last one was. I'm not
sure how to fix that. I'll keep working on it though...
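
For reference, a minimal sketch of one way to keep every TEL line: key the inner hash by the type parameter. It assumes the type (WORK, CELL, FAX, ...) is the last parameter left of the colon and that PREF can be dropped; the variable names are only illustrative.

```perl
use strict;
use warnings;

my %record;
for my $line ("TEL;PREF;WORK: (763) xxx-xxxx", "TEL;CELL: (612) xxx-xxxx") {
    my ($name, $value) = split /:/, $line, 2;
    my @parts = grep { $_ ne 'PREF' } split /;/, $name;   # ("TEL", "WORK")
    my $type  = pop @parts;                                # "WORK" or "CELL"
    my $field = shift @parts;                              # "TEL"
    $record{$field}{$type} = $value;                       # one slot per type
}

# Both numbers survive because each type has its own key under TEL.
printf "%-4s => '%s'\n", $_, $record{TEL}{$_} for sort keys %{ $record{TEL} };
```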

cp



Sat, 21 Aug 2004 00:42:50 GMT  

Quote:

>> Wow. Hold it. Make up your mind. $card[0] cannot contain two hashes each
>> with just one item.

>I was thinking of accessing the values as $card[0]->{ADR}{WORK},
>$card[0]->{TEL}{WORK}

>Did I goof up the definition? Is that a workable format? Or too complex?

I think my version will do just that.

        $card[0]->{ADR}{WORK}

or

        $card[0]{ADR}{WORK}

is an array item that is a hash (reference) with "ADR" as a key, which
contains a hash (reference) with "WORK" as a key. So:

        $card[0] = { ADR => { WORK => { ... } } };

is one way to describe or produce the structure.
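
A quick sketch to check the two spellings against each other. The values are made up for the example, borrowing key names from the original post.

```perl
use strict;
use warnings;

my @card;
$card[0] = {
    ADR => { WORK => { city => "Smalltown", State => "MN" } },
    TEL => { WORK => " (763) xxx-xxxx", CELL => " (612) xxx-xxxx" },
};

# The arrow between adjacent subscripts is optional, so both forms
# reach the same inner hash.
my $with_arrow    = $card[0]->{ADR}{WORK};
my $without_arrow = $card[0]{ADR}{WORK};

print "same hash\n" if $with_arrow == $without_arrow;    # compares the references
print join(", ", sort keys %{ $card[0]{TEL} }), "\n";     # CELL, WORK
```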

--
        Bart.



Sat, 21 Aug 2004 02:28:16 GMT  


Quote:
>> I was thinking of accessing the values as $card[0]->{ADR}{WORK},
>> $card[0]->{TEL}{WORK}

>> Did I goof up the definition? Is that a workable format? Or too complex?

> I think my version will do just that.

> $card[0]->{ADR}{WORK}

> or

> $card[0]{ADR}{WORK}

> is an array item that is a hash (reference) with "ADR" as a key, which
> contains a hash (reference) with "WORK" as a key. So:

> $card[0] = { ADR => { WORK => { ... } } };

> is one way to describe or produce the structure.

Yes it does. I goofed up the data slightly. Your version works as expected
with the data set I gave you. Thank you. I've adapted it slightly to be more
flexible, and it will cover me for now.

cp



Sat, 21 Aug 2004 16:06:32 GMT  
[snip]

Quote:
>         while(<>) {
>             if ((my $new=/^BEGIN:VCARD$/) .. /^END:VCARD$/) {


I'm not so sure that assigning to $new in the left side of scalar ..
is such a good idea.  It's too much like:
   my $x if 0;

To see this:
perl -e"for(1..5) { my $x if 0; print ++$x }"
12345
perl -e"for(1..5) { if((my$x=1) .. !$_) {print ++$x}}"
21234
perl -e"for(1..5) { if(1 or my $x=0) {print ++$x} }"
12345

I'm not quite sure why my .. example produces 21234 rather than
23456, but it does show a good reason not to depend on a my() on either
side of scalar ..

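I would suggest changing it to:
            if (my $new=(/^BEGIN:VCARD$/ .. /^END:VCARD$/)) {
                # next line was stripped by the archive; restored from context
                push @card, $record = {} if $new == 1;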
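
To make the difference concrete, here is a small sketch (the sample lines are made up) of what the scalar range operator hands back when the whole expression is assigned: a sequence number that starts at 1 on the first line of each range, with "E0" appended to the last one. Testing for == 1 therefore picks out exactly the BEGIN lines.

```perl
use strict;
use warnings;

my @lines = ("junk", "BEGIN:VCARD", "N:Pryce, Chris", "END:VCARD", "junk",
             "BEGIN:VCARD", "N:Smith, Joe", "END:VCARD");

my (@seq, $cards);
for (@lines) {
    # Assign the value of the whole range expression, not just the left match.
    if (my $n = (/^BEGIN:VCARD$/ .. /^END:VCARD$/)) {
        push @seq, $n;
        $cards++ if $n == 1;      # 1 marks the first line of each range
    }
}

print "@seq\n";      # 1 2 3E0 1 2 3E0
print "$cards\n";    # 2
```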

--
print reverse( ",rekcah", " lreP", " rehtona", " tsuJ" )."\n";



Tue, 24 Aug 2004 00:56:27 GMT  


Quote:
> I would suggest changing it to:
>           if (my $new=(/^BEGIN:VCARD$/ .. /^END:VCARD$/)) {


Except that breaks the code.

if (my $new=(/^BEGIN:VCARD$/ .. /^END:VCARD$/)) {

Or ..

if (my $new=(/^BEGIN:VCARD$/) .. (/^END:VCARD$/) ){

# produces an array of hash references, where each line in the vCard file is
# a separate hash reference.

 if ((my $new=/^BEGIN:VCARD$/) .. /^END:VCARD$/) {

# produces an array of hash references where each vCard is a hash reference
# of hash references. That is the desired effect, since there can be more
# than one vCard per file.

--------------
#!/usr/local/bin/perl -w

use strict;
use Data::Dumper;

# parameters that we don't care about for now
my %exclude = map{ $_=>1 } qw(  DOM PREF PARCEL POSTAL INTERNET internet
                                ENCODING=QUOTED-PRINTABLE NGW
QUOTED-PRINTABLE);


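    # NOTE: the archive dropped every line containing "@"; lines marked
    # "restored" are reconstructed from context, not the original text.
    my (@card, $record);                        # restored

    while(<DATA>) {
        if ((my $new=/^BEGIN:VCARD$/) .. /^END:VCARD$/) {
            push @card, $record = {} if $new;   # restored: new card on BEGIN
            chomp;
            my($name, $value)= split /:/, $_, 2;
            next if !$value;
            next if $name =~ /^(?:BEGIN|END|VERSION|X-.*|UID)$/;
            # restored: skip excluded parameters; the last one left is the key
            my @path = grep { !$exclude{$_} } split /;/, $name;
            my $key = pop @path;
            my $p = $record;
            foreach (@path) {                   # restored
                $p = $p->{$_} ||= {};
            }
            if($key eq 'FN') {
                my %r;
                @r{qw(last first)} = split /;/, $value;   # restored; names assumed
                $p->{$key} = \%r;
            } else {
                $p->{$key} = $value;
            }
        }
    }

    print Dumper(\@card);   # restored; assumed, since Data::Dumper is loaded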

__DATA__
BEGIN:VCARD
VERSION:2.1
N:Pryce, Chris
TEL;PREF;WORK: (xxx) xxx-xxxx
FN:Pryce;Chris
ADR;PREF;WORK:;;1234 Elm Street\nApt. 2;Smalltown;MN;55555
TEL;CELL: (xxx) xxx-xxxx
END:VCARD

BEGIN:VCARD
VERSION:2.1
N:Smith, Joe
TEL;PREF;WORK: (xxx) xxx-xxxx
TEL;FAX: (xxx) xxx-xxxx
FN:Pryce;Chris
ADR;PREF;WORK:;;4321 Main Street\nApt. 1;Smalltown;MN;55555
TEL;CELL: (xxx) xxx-xxxx
END:VCARD



Tue, 24 Aug 2004 16:28:10 GMT  

Quote:



> > I would suggest changing it to:
> >           if (my $new=(/^BEGIN:VCARD$/ .. /^END:VCARD$/)) {

> Except that breaks the code.

> if (my $new=(/^BEGIN:VCARD$/ .. /^END:VCARD$/)) {

> Or ..

> if (my $new=(/^BEGIN:VCARD$/) .. (/^END:VCARD$/) ){

> # produces an array of hash references, where each line in the vCard
> # file is a separate hash reference.

Have you tested this?  Note that I changed "if $new" to "if $new == 1".
If you forgot to make that change when testing, then you will of course
get the error you describe.

Here's some similarly structured code, using the "if $new == 1"

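perl -ne'
BEGIN{$/=\1}
if(my $new = (/</../>/)) {
    # body stripped by the archive; restored from context, names are guesses
    push @x, $y = [] if $new == 1;
    push @$y, $_;
}
END{ print "$_ @$_\n" for @x }
'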
abcde<fghi>jkl<mnop>qrs
^D
ARRAY(0x177f174) < f g h i >
ARRAY(0x17755c8) < m n o p >

See how well it works?

But if one leaves it as "if $new;", then one gets the broken behavior
that you describe:

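perl -ne'
BEGIN{$/=\1}
if(my $new = (/</../>/)) {
    # body stripped by the archive; restored from context, names are guesses
    push @x, $y = [] if $new;
    push @$y, $_;
}
END{ print "$_ @$_\n" for @x }
'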
abcde<fghi>jkl<mnop>qrs
^D
ARRAY(0x177f174) <
ARRAY(0x177f24c) f
ARRAY(0x1775598) g
ARRAY(0x17755e0) h
ARRAY(0x1775628) i
ARRAY(0x1775670) >
ARRAY(0x17757cc) <
ARRAY(0x1771c20) m
ARRAY(0x1771c44) n
ARRAY(0x1771c68) o
ARRAY(0x1771c8c) p

--
print reverse( ",rekcah", " lreP", " rehtona", " tsuJ" )."\n";



Wed, 25 Aug 2004 06:06:31 GMT  

Quote:

>Have you tested this?  Note that I changed "if $new" to "if $new == 1".
>If you forgot to make that change when testing, then you will of course
>get the error you describe.

You would have done better to change the name of the variable. My $new is a
flag, indicating whether this is a new group. Yours isn't, at all. A good
name would be $loopcount.

--
        Bart.



Wed, 25 Aug 2004 11:26:09 GMT  

Quote:

> Have you tested this?  Note that I changed "if $new" to "if $new == 1".
> If you forgot to make that change when testing, then you will of course
> get the error you describe.

Ah. Now it makes sense. I did miss that.

--
cp

remove 'nospam' to reply



Thu, 26 Aug 2004 16:01:54 GMT  
 