Help please extracting data from a word document 

Hello,

I have several big Word documents I need to extract data from. First I
just want to try something simple; after that, it should be easy to add
more. The following string appears regularly in the Word document:
"Display Coordinate System Name:"
It is followed by some number of spaces, then the string I want to
extract, which can be a mix of letters and numbers. Everything is on the
same line, and the string to extract is about 10 characters long.

I know almost nothing about Perl. What should I put in my
extract.pl file?
Also, I am not 100% sure how to execute the file.
Is it: extract.pl -o input.doc output.txt ?

Thanks a lot for your help

Pierrot



Wed, 23 Mar 2005 02:31:11 GMT  

Do you have Perl installed on your system? Which OS are you running?

The untested code below assumes that you are running Linux/Unix with the
Perl executable at /usr/bin/perl

extract.pl:

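#!/usr/bin/perl
use strict;
use warnings;

# Open the file named by the first command-line argument for reading.
open(my $fh, '<', $ARGV[0]) or die "cannot open $ARGV[0]: $!\n";
while (<$fh>) {
      chomp;
      # Capture whatever follows the label and the whitespace after it.
      if (/Display Coordinate System Name:\s+(.*)/) {
           print "Extracted string: $1\n";
      }
}
close $fh;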

On *nix, you can run this script from the command prompt (make it
executable first with chmod +x extract.pl):
$ ./extract.pl input.doc

This will print the extracted strings to your screen. To redirect them to a
file, enter the following:
$ ./extract.pl input.doc > output.txt

On Windows, you do something similar:
C:\>perl extract.pl input.doc
or
C:\>perl extract.pl input.doc > output.txt

PS: Pick up a Perl book.
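
Since the value is described as a single run of letters and numbers, a slightly tighter variant could capture only the first non-whitespace token instead of the rest of the line. This is only a minimal sketch: the sub name extract_names and the sample lines are made up for illustration and are not part of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return every value that follows the label.
# \S+ stops at the first whitespace, so trailing text or a stray
# carriage return is not captured.
sub extract_names {
    my @lines = @_;
    my @names;
    for my $line (@lines) {
        if ($line =~ /Display Coordinate System Name:\s+(\S+)/) {
            push @names, $1;
        }
    }
    return @names;
}

# Made-up sample lines standing in for the converted document.
my @sample = (
    "Some heading\n",
    "Display Coordinate System Name:      FRAME01A\n",
    "Display Coordinate System Name:   WCS2 (default)\r\n",
);

print "Extracted string: $_\n" for extract_names(@sample);
```

With real input, the lines read from the file would be passed in place of @sample.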



Wed, 23 Mar 2005 04:32:16 GMT  

See tutorials 16 and 17:

http://savage.net.au/Perl-tuts-1-30.html

--
Cheers
Ron Savage

http://savage.net.au/index.html



Wed, 23 Mar 2005 04:35:37 GMT  


Well, I didn't catch your mention of "Word documents". The script doesn't
work on MS Word documents directly, as they are compound files stored in a
binary object format (Structured Storage). You might want to either check
out Win32::OLE to drive Word through OLE automation, or convert these MS
Word documents to plain text files and then run the aforementioned script
on them.


Wed, 23 Mar 2005 04:50:47 GMT  

In fact, I am running Win XP and ActiveState ActivePerl 5.6.1.
OK, thanks for your help; I will try to get that working. I have also
downloaded the tutorials from Ron, so I should be able to do it!

Thanks very much

Pierrot



Wed, 23 Mar 2005 18:48:28 GMT  

That works! Thanks. In fact, I first open the Word doc with Notepad
(Win XP/ActivePerl) and save it as a txt file, which makes things simpler;
after that, your code works. Now I have another question:

Now that I can extract the string next to "Display Coordinate System
Name:" on the same line, suppose I have:

3.1 Frame Definition
       X        Y        Z
      65.2   -103.8    204.9

3.2 another Frame that i do not want
       X        Y        Z
      33.3   89.56    -562.0

How should I proceed to extract the 3 numbers that are between
paragraphs 3.1 and 3.2?

I first tried the code below, which should catch even all 6 of them, just
to test whether I can catch the numbers, but it did not catch anything:

if (/X        Y        Z\d+(.*)+\d+(.*)+\d+(.*)/) {
            print "Extracted string: $1  $2  $3\n";
       }

Any help is really appreciated. Thanks a lot in advance

Pierrot
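
One likely reason the attempt above catches nothing: the while loop reads one line at a time, so a pattern that starts on the "X Y Z" line and continues onto the numbers line never sees both at once. The pattern also allows no whitespace between "Z" and the first digit. A possible approach is to remember which section the loop is in and match the numbers line on its own. The sketch below assumes headings look like "3.1 Some title" and that the values sit on one line as three signed decimals; the sub name frame_xyz and the $num pattern are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A signed decimal number such as 65.2, -103.8 or 204.
my $num = qr/-?\d+(?:\.\d+)?/;

# Hypothetical helper: return the X, Y, Z values found under the
# heading whose number is $section (for example '3.1').
sub frame_xyz {
    my ($section, @lines) = @_;
    my $in_section = 0;
    for my $line (@lines) {
        # A numbered heading such as "3.1 Frame ..." opens or closes the section.
        if ($line =~ /^\s*(\d+\.\d+)\s+[A-Za-z]/) {
            $in_section = ($1 eq $section);
            next;
        }
        # Inside the wanted section, take the first line made of three numbers.
        if ($in_section && $line =~ /^\s*($num)\s+($num)\s+($num)\s*$/) {
            return ($1, $2, $3);
        }
    }
    return;    # section or numbers not found
}

# Sample lines modelled on the excerpt in the post.
my @sample = (
    "3.1 Frame Definition\n",
    "       X        Y        Z\n",
    "      65.2   -103.8    204.9\n",
    "\n",
    "3.2 another Frame that i do not want\n",
    "       X        Y        Z\n",
    "      33.3   89.56    -562.0\n",
);

my @xyz = frame_xyz('3.1', @sample);
print "Extracted: @xyz\n" if @xyz;
```

With a real file, the lines read from the file handle would be passed in place of @sample. The "X Y Z" line is simply skipped because it matches neither pattern.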




Wed, 23 Mar 2005 21:50:29 GMT  
 