comparing an input file with an output file 

I have a text file with 900 words (each lower case, one word per line)
that I want to compare to the contents of an html file.  When any of
those 900 words are found in the html file, I want to replace them
with an html tag.

For instance, if the word stomach is found I want to replace it with
<a href=javascript:def('$stomach')>stomach</a>

Does anybody know how to do this?

Thanks! -Neil



Thu, 22 Nov 2001 03:00:00 GMT  


Quote:
>I have a text file with 900 words (each lower case, one word per line)
>that I want to compare to the contents of an html file.  When any of
>those 900 words are found in the html file, I want to replace them
>with an html tag.

>For instance, if the word stomach is found I want to replace it with
><a href=javascript:def('$stomach')>stomach</a>

>Does anybody know how to do this?

Write an awkscript and use the gsub function to make the changes.

The awkscript would read the word_list into an array, and then read
the lines of the html file.

FILENAME=="word_list" {word[++word_count]=$0;next}

A for loop could then do the replacements in the line, looping over
the word-list array and finally printing the line.

The gsub function in your case would look something like this:

{for(i=1;i<=word_count;i++){
    replace=prefix "" word[i] "" midfix "" word[i] "" suffix
    gsub(word[i],replace)}
print}

where word_count is the number of words in the word array loaded
from the word_list file.

prefix, midfix, and suffix would be strings assigned in a BEGIN
section, probably using sprintf, which allows the quotes to be
put into the strings by printing them as characters.

I'll leave this for you to do, as well as working out any problems
there might be with the code snippets I showed.  :-)

The awkscript would be invoked like this:

    gawk -f my_awkscript word_list html_file
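
Putting the pieces above together, a minimal sketch might look like the
following. The sample files and their contents are made up for
illustration, the '$' in the requested tag is assumed to be a
placeholder marker and is left out, and NR == FNR is swapped in for the
FILENAME test so the script does not depend on the word file's name.

```sh
# Hypothetical sample inputs in a scratch directory.
tmp=$(mktemp -d)
printf 'stomach\nliver\n' > "$tmp/word_list"
printf '<p>The stomach and the liver.</p>\n' > "$tmp/page.html"

cat > "$tmp/link.awk" <<'EOF'
BEGIN {
    q = sprintf("%c", 39)              # a single quote, built from its character code
    prefix = "<a href=javascript:def(" q
    midfix = q ")>"
    suffix = "</a>"
}
# First file named on the command line: load the word list.
NR == FNR { word[++word_count] = $0; next }
# Remaining files: try every word on every line, then print the line.
{
    for (i = 1; i <= word_count; i++)
        gsub(word[i], prefix word[i] midfix word[i] suffix)
    print
}
EOF

result=$(awk -f "$tmp/link.awk" "$tmp/word_list" "$tmp/page.html")
printf '%s\n' "$result"
rm -rf "$tmp"
```

Note that gsub treats each word as a regular expression and matches it
anywhere, so a listed word that also occurs inside the inserted markup
(say "href") or inside a longer word would be replaced there too.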

Chuck Demas
Needham, Mass.

--
  Eat Healthy    |   _ _   | Nothing would be done at all,

  Die Anyway     |    v    | That no one could find fault with it.



Thu, 22 Nov 2001 03:00:00 GMT  

Quote:
Chuck Demas writes:


>>I have a text file with 900 words (each lower case, one word per line)
>>that I want to compare to the contents of an html file.  When any of
>>those 900 words are found in the html file, I want to replace them
>>with an html tag.

>>For instance, if the word stomach is found I want to replace it with
>><a href=javascript:def('$stomach')stomach
...
>Write an awkscript and use the gsub function to make the changes.

>The awkscript would read the word_list into an array, and then read
>the lines of the html file.

>FILENAME=="word_list" {word[++word_count]=$0;next}

>A for loop could then do the replacements in the line, looping over
>the word-list array and finally printing the line.

>The gsub function in your case would look something like this:

>{for(i=1;i<=word_count;i++){
>    replace=prefix "" word[i] "" midfix "" word[i] "" suffix
>    gsub(word[i],replace)}
>print}

<snip>

This is WAY INEFFICIENT! For each line of the HTML file, it tries to substitute
every one of the ~900 entries in word_list into $0. It would run far quicker if
you instead checked each word in each line of the HTML file to see whether it's
in word_list (you may need to add code to skip preexisting tags), and performed
the substitution only if it is.

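# Assumes word[] is indexed by the words themselves, e.g. word[$0] = 1.
{
  for (i = 1; i <= NF; ++i) {
    w = $i
    if (w ~ /[^A-Za-z]/) gsub(/[^A-Za-z]/, "", w)
    if (w in word) {
      replace = prefix "" w "" midfix "" w "" suffix
      gsub(w, replace, $i)
    }
  }
  print   # emit the (possibly modified) line
}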

The 'if (w ~ . . .' line is needed to remove non-letters from each field, such
as ordinary punctuation left attached to a word when FS has its default value.

On the assumptions that (1) the average number of words per line in the HTML
file is 30 or fewer and (2) that the HTML file is sufficiently large that
translating the script code is a small portion of total run time, this revision
should lead to significantly faster (10-fold speedup or more) execution time.
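
For anyone who wants to try the per-field lookup, here is a minimal self-contained
sketch. The file names and sample text are invented, the words are stored as array
indices so that 'in' can find them, and the '$' from the requested tag is assumed
to be a placeholder and dropped.

```sh
# Hypothetical sample inputs in a scratch directory.
tmp=$(mktemp -d)
printf 'stomach\nliver\n' > "$tmp/word_list"
printf 'The stomach, not the heart.\n' > "$tmp/page.html"

cat > "$tmp/fields.awk" <<'EOF'
# First file: store each word as an array index so that "in" works.
NR == FNR { word[$0] = 1; next }
{
    for (i = 1; i <= NF; ++i) {
        w = $i
        gsub(/[^A-Za-z]/, "", w)       # strip punctuation stuck to the field
        if (w in word)                 # one hash lookup instead of ~900 gsub calls
            sub(w, "<a href=javascript:def('" w "')>" w "</a>", $i)
    }
    print                              # assigning to $i rebuilds $0 using OFS
}
EOF

result=$(awk -f "$tmp/fields.awk" "$tmp/word_list" "$tmp/page.html")
printf '%s\n' "$result"
rm -rf "$tmp"
```

Because assigning to a field rebuilds the line, runs of blanks collapse to single
spaces, which is normally harmless in HTML. The lookup is case-sensitive, so "The"
would not match a lower-case "the" in the list.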

If this question had been asked of perl, then loading the entire file into a
single text variable and iterating through word_list would be a viable
approach. I suppose if one set FS = RS = sprintf(" %c ", 0), gawk could do
pretty much the same thing with nonpathological HTML files.
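
A plainer way to get the same whole-file effect, without the FS/RS trick, is to
append every HTML line to one string and loop over word_list once in END. This is
only a sketch with invented sample data, and it holds the entire file in memory.

```sh
# Hypothetical sample inputs in a scratch directory.
tmp=$(mktemp -d)
printf 'stomach\nliver\n' > "$tmp/word_list"
printf 'stomach\nliver and stomach\n' > "$tmp/page.html"

cat > "$tmp/slurp.awk" <<'EOF'
# First file: load the word list in order.
NR == FNR { word[++n] = $0; next }
# Remaining files: collect the whole text in one variable.
{ text = text $0 "\n" }
END {
    # One pass over the word list, each gsub covering the entire text.
    for (i = 1; i <= n; i++)
        gsub(word[i], "<a href=javascript:def('" word[i] "')>" word[i] "</a>", text)
    printf "%s", text
}
EOF

result=$(awk -f "$tmp/slurp.awk" "$tmp/word_list" "$tmp/page.html")
printf '%s\n' "$result"
rm -rf "$tmp"
```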



Thu, 22 Nov 2001 03:00:00 GMT  

Quote:
> I have a text file with 900 words (each lower case, one word per line)
> that I want to compare to the contents of an html file.  When any of
> those 900 words are found in the html file, I want to replace them
> with an html tag.

> For instance, if the word stomach is found I want to replace it with
> <a href=javascript:def('$stomach')>stomach</a>

I'd read the wordfile in a BEGIN statement:

BEGIN \
{
  while((getline <"wordfile")>0)
  {
    words[$0] = 1
  }
}

And then in the main loop:

{
  for(i in words)   #  i takes each index, i.e. each word
    if($0 ~ i)      #  if the line contains this word
      gsub(i,"<a href=javascript:def('$" i "')>" i "</a>")
  print             #  output the line, changed or not
}
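
One caveat with matching each word as a pattern is that it also hits inside
longer words ("stomach" inside "stomachache"). A possible variation, sketched
here with invented sample data, reads the word file in BEGIN as above but walks
each line one alphabetic token at a time and links only whole tokens found in
the list. The wordfile variable passed with -v is an assumption of this sketch,
and the '$' in the requested tag is treated as a placeholder and dropped.

```sh
# Hypothetical sample inputs in a scratch directory.
tmp=$(mktemp -d)
printf 'stomach\nache\n' > "$tmp/wordfile"
printf 'My stomach has a stomachache.\n' > "$tmp/page.html"

cat > "$tmp/tokens.awk" <<'EOF'
BEGIN {
    # Read the word list up front; each word becomes an array index.
    while ((getline line < wordfile) > 0)
        words[line] = 1
    close(wordfile)
}
{
    out = ""
    rest = $0
    # Peel off one run of letters at a time, copying everything else unchanged.
    while (match(rest, /[A-Za-z]+/)) {
        tok = substr(rest, RSTART, RLENGTH)
        out = out substr(rest, 1, RSTART - 1)
        if (tok in words)
            out = out "<a href=javascript:def('" tok "')>" tok "</a>"
        else
            out = out tok
        rest = substr(rest, RSTART + RLENGTH)
    }
    print out rest
}
EOF

result=$(awk -v wordfile="$tmp/wordfile" -f "$tmp/tokens.awk" "$tmp/page.html")
printf '%s\n' "$result"
rm -rf "$tmp"
```

This still does not know about HTML structure, so a listed word that appears
inside an existing tag or attribute would be linked as well.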

Regards...
                Michael


Thu, 22 Nov 2001 03:00:00 GMT  
 