[ANNOUNCE] Gutenberg Text to HTML converter

I've attached a script that converts text files from the Project
Gutenberg archives to HTML. I wrote this so I could read Gutenberg
texts on my Palm using Plucker.

I wrote this while watching the Canada/USA Olympic hockey final so it
didn't take too long. I read Chapter 2 of the Perl book all of last
week. I have tested this on a couple of Gutenberg files on Solaris
and Linux with perl 5.6.1. I don't know how it will fare with perl
under Windows.

I need people to test this script with as many files as possible and
see if anything breaks. Send me the filename you tested it with and
what didn't work. This script is not complete; there's more to be
done.

This script has limited heuristics. It will NOT format plays or poems
from the Gutenberg Archives, only paragraph texts.  It also assumes
that the input file has MSDOS formatting (CR/LF line termination) as
most of the files seem to have.

I welcome comments and improvements from the perl gurus.

---cut-here------------------------------------------------------------------
#!/usr/local/bin/perl5.6.1
# -*-cperl-*-
# gut: takes a project Gutenberg text file and converts it to html.
# Reads from stdin and outputs to stdout.
# Usage: gut <file.txt >file.html

# Luis Fernandes <elf <at> ee.ryerson.ca>
# $Id: gut,v 1.1 2002/02/24 19:47:32 elf Exp elf $

# This script is released under the GPL.

# Notes: This script has limited heuristics. It will NOT format plays
# or poems from the Gutenberg Archives, only paragraph texts.  It also
# assumes that the input file has MSDOS formatting (CR/LF line
# termination) as most of the files seem to have.

# TODO: Example ftp session has to be formatted with <pre>, "Chapter"
# text should be formatted with <h3>.

# Historical note: This script was written while watching the 2002
# Olympics hockey game between the US and Canada. It's 4-2 Canada at
# 17:02 EST. Just as I finished writing the preceding phrase, Canada
# scored again; it's now 5-2. We win gold!

# We define a series of arrays based on the formatting we wish to
# do. The arrays contain lines to be matched and formatted accordingly.

# This is an array containing known lines in a PG text file that we
# want to format as 2nd level headers.  IMPORTANT: ANY META CHARACTERS
# CONTAINED IN THESE STRINGS MUST BE ESCAPED.

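# NOTE: the declaration line is missing from the archived post; the
# name @h2 is a placeholder.
@h2=(
         "\*\*Welcome To The World of Free Plain Vanilla Electronic Texts\*\*",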
         "Information about Project Gutenberg",
         "Information prepared by the Project Gutenberg legal advisor",
         "The Legal Small Print",
         "\*\*Etexts Readable By Both Humans and By Computers, Since 1971\*\*",
         "\*\*\*\*\*These Etexts Are Prepared By Thousands of Volunteers!\*\*\*\*\*",
         "\*These Etexts Prepared By Hundreds of Volunteers and Donations\*",
);

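#This is an array containing known lines in a PG text file that we
#want to format as bold.
# NOTE: the declaration line is missing from the archived post; the
# name @bold is a placeholder.
@bold=(
        "We need your donations more than ever!",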
        "Please take a look at the important information in this header.",
);

#This is an alist of known ASCII sequences in a PG text file that we
#want to replace with HTML sequences.
%special=(
                  # replace 5 consecutive blank lines with a horizontal rule
                  "\r\n\r\n\r\n\r\n\r" => "\n<hr>\n",

                  # replace 2 consecutive ^Ms with a <p>
                  "\r\n\r" => "<p>\n",

                  # replace 3 blank lines with a wide spacer.
#                 "\r\n\r\n\r\n" => "\n<p><br><\/p>\n",                
);

$/="";                                                        # gulp mode

printf("<html>\n<body>\n");               # header

while(<>){

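  # Format 2nd level headers. (The loop header is missing from the
  # archived post; @h2 is a placeholder name.)
  foreach $k (@h2){
        s/\Q$k\E/<h2>$k<\/h2>/;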
  }

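   # Format bold. (The loop header is missing from the archived post;
   # @bold is a placeholder name.)
   foreach $k (@bold){
        s/\Q$k\E/<b>$k<\/b>/;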
  }

   #special formatting
   foreach $k (keys(%special)){
         s/\Q$k\E/$special{$k}/g;
   }

   printf("$_\n");

}

#trailer
printf("<hr>\n");
printf("</body>\n</html>\n");


Thu, 12 Aug 2004 23:59:18 GMT  

groups list is trimmed of palm group.

  LF> #!/usr/local/bin/perl5.6.1

no -w

no use strict
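
To illustrate what the missing pragmas would buy, here is a minimal sketch (not part of either post). It compiles a tiny snippet under "use strict" and shows the misspelled variable being rejected. The variable names $count and $cuont are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Compile a tiny snippet under "use strict". The misspelled variable
# ($cuont) is hypothetical; it stands in for any typo in a longer script.
my $compiled = eval q{
    use strict;
    my $count = 1;
    $cuont = 2;    # typo: never declared with "my"
    1;
};
my $error = $@;

print $compiled ? "compiled\n" : "strict refused it: $error";
```

Without "use strict" the same snippet would compile and silently create a second, global variable.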

  LF> # This is an array containing known lines in a PG text file that we
  LF> # want to format as 2nd level headers.  IMPORTANT: ANY META CHARACTERS
  LF> # CONTAINED IN THESE STRINGS MUST BE ESCAPED.

wrong. you escape them below in the s/// commands with \Q. more to the
point any chars that are parsed in double quotes need to be escaped. but
even easier would be to use single quotes. and in either case you don't
need to escape * (especially since you handle that later).
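
A short sketch of the quoting point, assuming a made-up sample line: inside double quotes "\*" is just "*", so the escaped and unescaped strings are identical, and \Q...\E does the regex escaping at match time.

```perl
use strict;
use warnings;

# In a double-quoted string "\*" is just "*", so these two are identical.
my $double = "\*\*Welcome\*\*";
my $single = '**Welcome**';
my $same   = $double eq $single ? 1 : 0;

# \Q...\E escapes the metacharacters at match time, so the stored string
# can stay unescaped.
my $line = 'xx **Welcome** yy';    # made-up sample input
(my $html = $line) =~ s/\Q$single\E/<h2>$single<\/h2>/;

print "same=$same\n$html\n";
```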


  LF>         "\*\*Welcome To The World of Free Plain Vanilla Electronic Texts\*\*",
  LF>         "Information about Project Gutenberg",
  LF>         "Information prepared by the Project Gutenberg legal advisor",
  LF>         "The Legal Small Print",
  LF>         "\*\*Etexts Readable By Both Humans and By Computers, Since 1971\*\*",
  LF>         "\*\*\*\*\*These Etexts Are Prepared By Thousands of Volunteers!\*\*\*\*\*",
  LF>         "\*These Etexts Prepared By Hundreds of Volunteers and Donations\*",
  LF> );

  LF> #This is an alist of known ASCII sequences in a PG text file that we
  LF> #want to replace with HTML sequences.
  LF> %special=(
  LF>                  # replace 5 consecutive blank lines with a horizontal rule
  LF>                  "\r\n\r\n\r\n\r\n\r" => "\n<hr>\n",

  LF>                  # replace 2 consecutive ^Ms with a <p>
  LF>                  "\r\n\r" => "<p>\n",

  LF>                  # replace 3 blank lines with a wide spacer.
  LF> #                "\r\n\r\n\r\n" => "\n<p><br><\/p>\n",                
  LF> );

  LF> $/="";                                                       # gulp mode

paragraph mode (never heard it called gulp mode).

that should be localized and not changed globally.
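
A minimal sketch of localizing the change, using a hypothetical helper read_paragraphs() and in-memory sample strings. It also hints at why the script may behave like a "gulp": in CR/LF input there are no two consecutive "\n" characters, so paragraph mode appears to return the whole text as one record, at least when perl does not translate line endings on input.

```perl
use strict;
use warnings;

# Read all records from a string in paragraph mode. The change to $/ is
# confined to this sub by "local". read_paragraphs() is a made-up helper.
sub read_paragraphs {
    my ($text) = @_;
    open my $fh, '<', \$text or die "cannot open in-memory file: $!";
    local $/ = "";    # paragraph mode, restored when the sub returns
    my @records = <$fh>;
    close $fh;
    return @records;
}

my @unix = read_paragraphs("one\n\ntwo\n");          # LF line endings
my @dos  = read_paragraphs("one\r\n\r\ntwo\r\n");    # CR/LF line endings

printf "LF input: %d record(s), CR/LF input: %d record(s)\n",
    scalar @unix, scalar @dos;
```

After the call, $/ is back to its default, so code elsewhere in the program still reads line by line.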

  LF> printf("<html>\n<body>\n");              # header

why use printf if you never use a printf format? use plain print. this
is perl, not c.
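
One further reason to prefer print, sketched here with a made-up input line: printf and sprintf treat their first argument as a format, so a stray percent sign in the text is interpreted as a conversion rather than printed literally.

```perl
use strict;
use warnings;

my $text = "100% done\n";    # made-up line containing a percent sign

# printf/sprintf treat their first argument as a format, so "% d" here
# is parsed as a conversion with no matching argument.
my $formatted = do { no warnings; sprintf($text) };

print $text;                       # prints the line untouched
print "sprintf changed the text\n" if $formatted ne $text;
```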

  LF> while(<>){

  LF>   # Format 2nd level headers.

  LF>                s/\Q$k\E/<h2>$k<\/h2>/;
  LF>   }

what if there is more than one of those per paragraph? add the /g
modifier.
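
A small sketch of the difference, using a made-up paragraph that contains the same header text twice:

```perl
use strict;
use warnings;

my $k    = 'The Legal Small Print';
my $para = "$k ... $k";            # made-up paragraph with two matches

(my $once = $para) =~ s/\Q$k\E/<h2>$k<\/h2>/;     # first match only
(my $all  = $para) =~ s/\Q$k\E/<h2>$k<\/h2>/g;    # every match

# Count how many <h2> tags each version ended up with.
my $count_once = () = $once =~ /<h2>/g;
my $count_all  = () = $all  =~ /<h2>/g;
print "without /g: $count_once, with /g: $count_all\n";
```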

  LF>    printf("$_\n");
  LF> }

  LF> #trailer
  LF> printf("<hr>\n");
  LF> printf("</body>\n</html>\n");

more useless uses of printf.

uri

--

-- Stem is an Open Source Network Development Toolkit and Application Suite -
----- Stem and Perl Development, Systems Architecture, Design and Coding ----
Search or Offer Perl Jobs  ----------------------------  http://jobs.perl.org



Fri, 13 Aug 2004 00:54:19 GMT  
 