Batch conversion of MS-WORD *.DOC files 

Has anyone had any experience of converting DOC files into semi-readable
ascii/text using awk/sed?

I know that 'strings' or 'od' can help, but I would like to
improve the formatting that they produce and also remove
the 'rubbish' at the head/tail of each file.

Thanks
Mark
--
Mark Katz
ISPC, London - Innovation in data-delivery tools
Tel: (44) 181-455 4665, Fax (44) 181-458 9554
** Visit our website on http://www.*-*-*.com/ **



Mon, 07 Aug 2000 03:00:00 GMT  


% Has anyone has any experience of converting DOC files into semi-readable
% ascii/text using awk/sed

There's an application called something like doc2x, which converts
DOC files into semi-readable text. Offhand, I don't know where to
get it, but you might try a web search.
--

Patrick TJ McPhee
East York  Canada



Wed, 09 Aug 2000 03:00:00 GMT  

>% Has anyone has any experience of converting DOC files into semi-readable
>% ascii/text using awk/sed

>There's an application called something like doc2x, which converts
>DOC files into semi-readable text. Offhand, I don't know where to
>get it, but you might try a web search.

It is called WORD2X. It is easily found on the web, but it seems to be
LINUX (or perl?) based.

Thanks for all the other direct responses from awk'ers. Two other useful
leads were

ELSER and LAOLA

but they too may have been LINUX/UNIX based using platforms/languages
that are available on DOS/PC

I'll keep you all posted

Mark
--
Mark Katz
ISPC, London - Innovation in data-delivery tools
Tel: (44) 181-455 4665, Fax (44) 181-458 9554
** Visit our website on http://www.efiche.com/efiche **



Fri, 11 Aug 2000 03:00:00 GMT  

Pipe the *.doc file through something like the following.  Please
post any improvements.  This is not even close to perfect.

                                        -- Wayne Bergeron

#!/usr/bin/gawk -f
#
# Quick and dirty way to get rid of control stuff in a Microsoft
# Word document file.
#
BEGIN   {
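#
#       Build a lookup table mapping each single-byte character to its
#       numeric code (0-255).  Run with LC_ALL=C so that codes above 127
#       stay single-byte characters.
#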
        for (i=0; i<256; i++) {
                c       = sprintf("%c", i);
                chr[c]  = i;
                }
        }
#
#       Handle each record.
#
        {
        gsub(/\000+/, "");    # Get rid of null characters
        gsub(/\377+/, "");    # Get rid of -1 characters
#
#       Change CTRL-K to newlines.
#
        gsub(/\013/, "\n");
#
#       Keep only printable ASCII (32-126) and control codes 7-14
#       (which include tab and newline); drop everything else.
#
        line    = "";
        l       = length($0);
        for (i=1; i<=l; i++) {
                c = substr($0, i, 1);
                n = chr[c];
                if ((n>=32 && n<=126) || (n>=7 && n<=14))
                        line = line c;
                }
        $0 = line;
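#
#       Drop runs of 3 or more control, punctuation or space characters.
#       (gawk releases before 4.0 need --re-interval for {3,} to work.)
#
        gsub(/[[:cntrl:][:punct:][:space:]]{3,}/, "", $0);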
        print;
        }
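
For the batch side of the question, a loop along the following lines might be a starting point. It is only a sketch: it assumes the script above has been saved as unword.awk (a made-up name), that gawk is on the PATH, and that the files end in lower-case .doc. The function name convert_docs and the AWK override variable are also just illustrative.

```shell
#!/bin/sh
# Sketch only: convert every *.doc in a directory to a .txt beside it.
# "unword.awk" is a hypothetical file name for the gawk script above.

convert_docs() {
    dir=${1:-.}                  # directory holding the .doc files
    script=${2:-unword.awk}      # filter script (hypothetical name)
    awk_bin=${AWK:-gawk}         # set AWK=... to use a different awk

    for f in "$dir"/*.doc; do
        [ -f "$f" ] || continue  # skip when the glob matched nothing
        out="${f%.doc}.txt"
        # LC_ALL=C makes awk treat the input as single bytes, which
        # the byte-value lookup table in the script relies on.
        LC_ALL=C "$awk_bin" -f "$script" < "$f" > "$out" ||
            echo "failed: $f" >&2
    done
}
```

Calling `convert_docs /path/to/docs` would then leave one .txt next to each .doc. Files named *.DOC in upper case would need a second glob.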


Sat, 12 Aug 2000 03:00:00 GMT  
 