Handy way to id the end of file [revisited] 

Quote:

> The following script is untested and is intended only to demonstrate
> a possible approach to solving your problem:

> #!/usr/bin/awk -f

> BEGIN {
>     from_pattern = "^From:.*Putnam"
>     xref_pattern = "^Xref:.*ding2"
>     file_pattern = "\\.fmt"
> }

> {
>     if (FNR == 1) {
>         in_header  = 1
>         in_matched = 0
>         from_line  = ""
>         xref_line  = ""
>         file_line  = ""
>     } else if (in_header && /^$/) {
>         in_header  = 0
>         if (from_line && xref_line) {
>             in_matched = 1
>         }
>     }

>     if (in_header) {  # in header section
>         if ($0 ~ from_pattern) {
>             from_line = $0
>         } else if ($0 ~ xref_pattern) {
>             xref_line = $0
>         }
>     } else if (in_matched) {  # in body of matched message
>         if ($0 ~ file_pattern) {
>             if (! file_line) {
>                 print FILENAME, from_line
>                 print FILENAME, xref_line
>             }
>             file_line = $0
>             print FILENAME, file_line
>         }
>     } else {  # in body of unmatched message
>         # nextfile  # GNU awk extension to skip to next file
>     }
> }

> You should be able to follow the logic of this simple state machine.
> Notice the lack of any arithmetic operations (e.g., incrementation).
> What is there to count? This program is about state, not math.

I should be able to, it's true, but I'm still having trouble digesting a
couple of things.

1)

As Patrick M. explained in an earlier post:

  "if (filename)
  is short for
  if (filename != "")"

So in the code above:

Quote:
>     } else if (in_matched) {  # in body of matched message
>         if ($0 ~ file_pattern) {
>             if (! file_line) {

What is that last line short for?
 if (! file_line != "")  ?

2)

It seems to be indicated from the thread about this script that it
needs an END clause to process the last file.

When I run a facsimile of this against 3 files known to have all the
ingredients, it prints the information from them all, so it is seemingly
already processing the last file.



Thu, 16 Jan 2003 03:00:00 GMT  

Quote:


> > The following script is untested and is intended only to demonstrate
> > a possible approach to solving your problem:

> > #!/usr/bin/awk -f

> > BEGIN {
> >     from_pattern = "^From:.*Putnam"
> >     xref_pattern = "^Xref:.*ding2"
> >     file_pattern = "\\.fmt"
> > }

> > {
> >     if (FNR == 1) {
> >         in_header  = 1
> >         in_matched = 0
> >         from_line  = ""
> >         xref_line  = ""
> >         file_line  = ""
> >     } else if (in_header && /^$/) {
> >         in_header  = 0
> >         if (from_line && xref_line) {
> >             in_matched = 1
> >         }
> >     }

> >     if (in_header) {  # in header section
> >         if ($0 ~ from_pattern) {
> >             from_line = $0
> >         } else if ($0 ~ xref_pattern) {
> >             xref_line = $0
> >         }
> >     } else if (in_matched) {  # in body of matched message
> >         if ($0 ~ file_pattern) {
> >             if (! file_line) {
> >                 print FILENAME, from_line
> >                 print FILENAME, xref_line
> >             }
> >             file_line = $0
> >             print FILENAME, file_line
> >         }
> >     } else {  # in body of unmatched message
> >         # nextfile  # GNU awk extension to skip to next file
> >     }
> > }

> > You should be able to follow the logic of this simple state machine.
> > Notice the lack of any arithmetic operations (e.g., incrementation).
> > What is there to count? This program is about state, not math.

> I should be able to, it's true, but I'm still having trouble digesting a
> couple of things.

> 1)

> As Patrick M. explained in an earlier post:

>   if (filename)
>   is short for
>   if (filename != "")

And I cautioned in a follow-up to Patrick's post that these two
conditional expressions are NOT strictly the same, and that the
latter form is slightly better as it will correctly handle a file
named "0", whereas the first form will not.

Quote:
> So in the code above:
> >     } else if (in_matched) {  # in body of matched message
> >         if ($0 ~ file_pattern) {
> >             if (! file_line) {

> What is that last line short for?
>  if (! file_line != "")  ?

No. It's not "short" for anything. It's simply a test of the Boolean
value of the variable file_line.

You seem a bit confused by the Boolean value (or "truth" value) of
strings in awk. In awk, a string is TRUE if it is NOT empty, and
FALSE if it IS empty. This is a very handy and often-used feature
of the language, and one you should become more familiar with.
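
A minimal sketch of these truth values, assuming a POSIX-style awk on
the PATH. The shell function name truth_demo and the awk variable names
are made up for illustration. It also shows the one case where
"non-empty" is not the whole story: a value that arrives as input and
looks like the number 0 tests false, which is the kind of case the
caution about a file named "0" points at.

```shell
truth_demo() {
    # Feed one input line containing just "0" so $0 can be tested too.
    printf '0\n' | awk '
    {
        empty = ""          # empty string
        word  = "abc"       # non-empty string
        zero  = "0"         # string constant, not input data

        print "empty string: "      (empty     ? "true" : "false")
        print "non-empty string: "  (word      ? "true" : "false")
        print "string constant 0: " (zero      ? "true" : "false")
        print "input line 0: "      ($0        ? "true" : "false")
        print "unset variable: "    (never_set ? "true" : "false")
    }'
}
truth_demo
```

The "unset variable" case is why the script never has to initialize
file_line to anything special before the first FNR == 1 reset: a
variable that has never been assigned already tests false.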

When you examine what happens when the test is true, and also the
context in which the test occurs, it should be fairly obvious to
you what the test is for:

    ...
    if (in_header) {  # in header section
        ...
    } else if (in_matched) {  # in body of matched message
        if ($0 ~ file_pattern) {
            if (! file_line) {  # < < < < < < < < < < < < < < < < < < <
                print FILENAME, from_line
                print FILENAME, xref_line
            }
            file_line = $0
            print FILENAME, file_line
        }
    } else {  # in body of unmatched message
        ...
    }
    ...

At the point in the program where the test occurs (marked by arrows
above), we know that (1) the current line is within the body of a
matched message (i.e., a message that has both a header matched by
from_pattern and a header matched by xref_pattern); and (2) the
current line matches the regular expression pattern in the variable
file_pattern. So the only thing we do NOT yet know is if the current
line is the first occurrence of such a line within the current
message (i.e., file). And we need to know this because we only want
to print the values of from_line and xref_line ONCE per message,
right? So we want to print those saved header lines at the first
occurrence of a matched line within a matched message.

You might be confused by the use of the value of the variable
file_line for the purpose of testing ordinal value ("first" or
"subsequent"). Ok, let's dispense with that variable and create
another flag variable, this one named "at_first_matched_file_line":

    ...
        if (FNR == 1) {
            at_first_matched_file_line = 1  # initialize to TRUE at the
            ...                             # beginning of each message
        }
        ...
    } else if (in_matched) {  # in body of matched message
        if ($0 ~ file_pattern) {
            if (at_first_matched_file_line) {   # the explicit variable name
                print FILENAME, from_line       # makes this test obvious
                print FILENAME, xref_line
                at_first_matched_file_line = 0  # set to FALSE
            }
            print FILENAME, $0
        }
    ...

Is this clearer?
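
For anyone who wants to run the fragments above as a whole, here is one
possible assembly. It is a sketch, not the script as originally posted:
it uses pattern-action rules with "next" instead of one big if/else
chain, and the sample message, the file name "msg" and the function
name state_demo are invented for the demonstration.

```shell
state_demo() {
    tmp=$(mktemp -d) || return 1
    (
        cd "$tmp" || exit 1
        # Hypothetical message: two headers, a blank line, a two-line body.
        printf '%s\n' 'From: H. Putnam' 'Xref: host ding2:1' '' \
            'see report.fmt' 'and notes.fmt' > msg
        awk '
        BEGIN {
            from_pattern = "^From:.*Putnam"
            xref_pattern = "^Xref:.*ding2"
            file_pattern = "\\.fmt"
        }
        FNR == 1 {                      # first line of each file: reset state
            in_header = 1; in_matched = 0
            from_line = xref_line = ""
            at_first_matched_file_line = 1
        }
        in_header && /^$/ {             # blank line ends the header section
            in_header = 0
            in_matched = (from_line != "" && xref_line != "")
            next
        }
        in_header {                     # remember the header lines we want
            if ($0 ~ from_pattern)      from_line = $0
            else if ($0 ~ xref_pattern) xref_line = $0
            next
        }
        in_matched && $0 ~ file_pattern {
            if (at_first_matched_file_line) {   # print headers once per file
                print FILENAME, from_line
                print FILENAME, xref_line
                at_first_matched_file_line = 0
            }
            print FILENAME, $0
        }
        ' msg
    )
    rm -rf "$tmp"
}
state_demo
```

With the sample message, the two saved header lines are printed once,
followed by each body line that matches file_pattern.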

Quote:
> 2)

> It seems to be indicated from the thread about this script that it
> needs an END clause to process the last file.

You only need an END clause if you have end-of-all-input termination
processing to do. What termination processing is there to do in
the script above? None. You might also ask if you have end-of-file
termination processing to do. Do you? No. You only have initialization
code executing at the beginning of each file (wherever FNR == 1). The
reason this is true is because I conveniently ignored a detail--though
not entirely. Read on.

Quote:
> When I run a facsimile of this against 3 files known to have all the
> ingredients, it prints the information from them all, so is seemingly
> already processing the last file.

Yep. But it's not printing a trailing record terminator token, is it?


Quote:
> Printing a trailing record terminator token ("-- ") is simply a
> matter of testing the status of the previous message each time you
> read a new file and at the end of all input. I've already posted
> examples of idioms to do just that.

If, after all that has been posted, you cannot do this, then you're
either not trying hard enough to learn to program or you're trying
too hard to learn to not program.

--
Jim Monty

Tempe, Arizona USA



Fri, 17 Jan 2003 03:00:00 GMT  

Quote:

>      if (FNR == 1) {
>          at_first_matched_file_line = 1  # initialize to TRUE at the
>          ...                             # beginning of each message
>      }
>      ...
>  } else if (in_matched) {  # in body of matched message
>      if ($0 ~ file_pattern) {
>          if (at_first_matched_file_line) {   # the explicit variable name
>              print FILENAME, from_line       # makes this test obvious
>              print FILENAME, xref_line
>              at_first_matched_file_line = 0  # set to FALSE
>          }
>          print FILENAME, $0
>      }
>  ...

> Is this clearer?

That is much easier to follow... thanks.  I think I'm finally catching
on to how these newly appearing variables can be used to make decisions.
The mere words `Boolean value' had thrown me for a bit of a loop.

You've been very patient with the explanations, and I appreciate that.
You have also given full and detailed information in answer to each
query, and I appreciate that too.  I may not have digested the
information very well, but I am catching on now.

You may not appreciate how poorly prepared for this kind of reasoning
and interconnected meanings a lifetime of heavy construction work
and a 9th grade education can leave one.

Quote:

> > 2)

> > It seems to be indicated from the thread about this script that it
> > needs an END clause to process the last file.

> You only need an END clause if you have end-of-all-input termination
> processing to do. What termination processing is there to do in
> the script above? None. You might also ask if you have end-of-file
> termination processing to do. Do you? No. You only have initialization
> code executing at the beginning of each file (wherever FNR == 1). The
> reason this is true is because I conveniently ignored a detail--though
> not entirely. Read on.

The very first response to my original query was the source of this
confusion on my part.

Quote:
> Two idioms come to mind that you may find useful:
>   FNR == 1 && FNR != NR {
>       # do end-of-file processing for first through second-to-last files
>       # if there are multiple files
>   }

That line... " first through second to last"  indicates the last file
will not be processed.  This technique was quietly dropped as the
thread progressed.  It wasn't totally clear to me if it still
obtained, but I see now that it doesn't.

It was replaced by code more appropriate for this usage:

Quote:
>    if (FNR == 1) {
>        in_header = 1
>    } else if (in_header && /^$/) {
>        in_header = 0
>    }

Aside: An interesting note is that the above first method would give
       some inaccurate results, when xargs is used as in the script
       below.  Apparently xargs passes a chunk of files (around 500)
       and the next chunk zeroes out NR.  I ran a test with an NR
       counter in place, it zeroed out at between 460 and 480
       repeatedly over 42,800 incoming files.

Quote:

> > Printing a trailing record terminator token ("-- ") is simply a
> > matter of testing the status of the previous message each time you
> > read a new file and at the end of all input. I've already posted
> > examples of idioms to do just that.

> If, after all that has been posted, you cannot do this, then you're
> either not trying hard enough to learn to program or you're trying
> too hard to learn to not program.

That part was not a problem.  I see how to do it.  I guess I don't
really see why it needs to be a `trailing' separator.  I only included
it to make human scanning a little easier and prepending it to the
first printing of FILENAME seems to do that just as well, including if
output is `appended' to a file.

I've changed some things that seem to have improved the look of the
output.  It will probably look clumsy and oafish to you, but the heart
is still your clear and useful code.

The actual working program now is a shell script that calls awk.  It
seemed handier to use the positional shell variables $1 $2 etc to
set the initial regular expressions.
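
One common way to hand a shell positional parameter to awk is the -v
option.  The sketch below assumes that approach (the real search.sh may
do it differently); pass_demo and body_pattern are hypothetical names.
Because awk processes escape sequences in a -v value, a backslash meant
for the regular expression has to be doubled by the time awk sees it.

```shell
pass_demo() {
    # $1 is a regular expression supplied by the caller.
    printf '%s\n' 'for (i = 0; i < n; i++)' 'before lunch' |
        awk -v body_pattern="$1" '$0 ~ body_pattern { print NR ": " $0 }'
}

# Single quotes keep the shell from touching the backslashes; awk then
# turns \\( into \( , which the regex engine reads as a literal paren.
pass_demo 'for *\\('
```

Only the first sample line is reported, since "before lunch" contains
"for" but no opening paren after it.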

This search tool is quite useful, and since it is fully based on
regular expressions, unlike glimpse or freeWAIS (database indexing
tools), it can conduct very exact searches or be used to retrieve a
full header by setting a bland RE like "^Subject: ".

It is somewhat slow, especially by comparison to freeWAIS, but since it
uses no indexing that seems OK and normal.  There is no fuzzy searching
with this tool.  Once the regular expressions are adjusted to suit a
query, it is very precise.

It's not all that slow either.  It processed 42,800 files in a little
over a minute.  Not bad considering the overhead of tracking down 3 REs
in each file.

The working code is called like this:

search.sh  "Head-RE" "Head-RE" "Body-RE"   /path/directory

(I left the comments in place in case anyone is willing to slog
through it and cares to comment or sees something incorrect... it may
be a bit lengthy)

 8<-----------

#!/bin/sh
## Name = search.sh
## Disclaimer: This tool makes no claim to being fast, but it is capable
## of precise full regular expression based searches.


## from Newsgroup comp.lang.awk and comp.unix.questions.  Any
## clumsiness, errors or blunderings are my own.

##    The comments below try to give enough information so that anyone
##    with a little unix experience can use this tool

## This script is designed to search mail and news messages in `one
## message per file' format. It will NOT work on spool style files.

## The script expects to be given 3 regular expressions and a
## `/path/directory' on the command line in this format:

##    $ search.sh  "RE1" "RE2" "RE3" /path/directory

## Example:
## $ search.sh "^Subject:.*whiskey" "^Message-ID: " "drunken.*brawl" ~/Mail

## Three regular expressions are required or no hits will be returned.

## A `find' command is run on the directory given (~/Mail) and finds
## only numeric file names.

## Above example will return hits if:  A file contains `whiskey' in the
## Subject line.  Has a Message-ID: line and has a line in the body
## containing the string `drunken (plus any number of any characters
## [except newline]) brawl'

## Hits are only returned if all three regular expressions match in
## one file.  Regular expression 1 and 2 are expected to be things
## found in message headers, and will not match if found in the message
## body.  Similarly, regular expression 3 is expected to be something
## found in the message body.  Like grep, it will only match on a
## single line.

## If the user is not looking for exact information in both header
## regular expressions, then use one or both to retrieve the
## information in that header line by using a bland regular expression
## like "^Message-ID: " in the example.

## Output:
## Each file that matches causes the script to output several lines.
## The command below is searching for examples of `for' loops used in
## scripts, under the `comp' hierarchy (of my limited collection).  It
## retrieves the filename, Newsgroup name and Message ID for quick
## retrieval, if one wants to view the whole message. Plus, it
## retrieves the instances in matched messages of the `body' regular
## expression.

## Anatomy of an example:

##  $ search.sh "^Newsgroups: comp\.(unix|lang)\.(questions|shell|awk)" \
##    "^Message-ID: " "for *\\\(" /bak/n2m/comp/
## ( should be all one line on the command line)

## Note regular expressions are enclosed in double quotes.  The first
## regular expression will find only messages with `Newsgroups: '
## header lines that contain (on my collection) comp.unix.questions,
## comp.unix.shell or comp.lang.awk as the first entry.

## The second regular expression will retrieve any information in a
## header line that begins with Message-ID: (useful for quick retrieval
## of full message)

## The third regular expression will find any instance of `for' followed
## by zero or more spaces, followed by an opening paren, provided the
## first two regular expressions matched.

## Random samples of the output:

## --
## /bak/n2m/comp/lang/awk/11751
## 11751 Newsgroups: comp.lang.awk

## 11751:39:for *\(:awk '{ for (i=3;i<=NF;i++) printf $i" ";print ""}' junk

## __
## /bak/n2m/comp/lang/awk99/1349
## 1349 Newsgroups: comp.lang.awk,comp.unix.shell,comp.unix.questions

## 1349:26:for *\(:awk '{for (i=1; i<=MAX1; i++)
## 1349:28:for *\(: awk '{for (j=1; j<=MAX2; j++)
## 1349:30:for *\(:  awk '{for (k=1; k<=MAX3; k++)

## --
## /bak/n2m/comp/unix/shell/67685
## 67685 Newsgroups: comp.unix.shell

## 67685:46:for *\(:       for(i=1;i<=l;i++){

## First line of output is the full file name

## Second line is the last component of filename and the information
## captured by the first regular expression.

## Third line contains last component of filename and the information
## retrieved by the second regular expression.

## Fourth and subsequent line format contains: last part of
## filename:line number:regexp used:information captured by third
## regular expression

## NOTE: The third (body) regexp may occur more than once in a matched
## message and will be printed accordingly.

## It's handy having the regular expression reprinted there so you can
## get a look at what awk is using compared to what was inserted.  If
## we were looking for examples of `for loops' we've hit the jackpot.
## There were hundreds in the 4556 lines of output

## Notes: to search for certain non-alphanumeric characters like
## `|',`(',`\' (i.e., characters that have special meaning in regular
## expressions) they can either be triple escaped `\\\(' or bracketed
## and escaped `[\(]'.

## IMPORTANT NOTE: GNU find does not traverse symlinks by default.  The
##  -follow flag must be used if the directories have symlinks

find $4 -name '*[0-9]' -follow ...




Fri, 17 Jan 2003 03:00:00 GMT  

Quote:

> It's not all that slow either.  It processed 42,800 files in a little
> over a minute.  Not bad considering the overhead of tracking down 3 RE
> in each file.

I'd say it's much better than "not bad", and most of the time is
probably consumed by your hard disk reading all these files.

How big are these files? If they were only 1 KB each, you would have
42.8 MB.  And these small files are not located in one nearly contiguous
chunk on your hard disk; they are usually scattered all over the place,
so you really can't hope to read them at, say, 10 MB per second.

Maybe you want to try a

time (find . -name '*[0-9]' -follow | xargs cat ;) >/dev/null

in that same directory and tell us how long it took.

Regards...
                Michael



Sat, 18 Jan 2003 03:00:00 GMT  

Quote:


> > It's not all that slow either.  It processed 42,800 files in a little
> > over a minute.  Not bad considering the overhead of tracking down 3 RE
> > in each file.

> I'd say it's much better than "not bad", and most of the time is
> probably consumed by your harddisk reading all these files.

> How big are these files? If they were only 1 KB each, you had 42.8 MB.
> And these small files are not located in one nearly contiguous chunk on
> your harddisk, they are usually scattered all over the place, so you
> really can't hope that you can read these files with say 10 MB per
> second.
> time (find . -name '*[0-9]' -follow | xargs cat ;) >/dev/null

The command you listed must not be available on my bash setup.
   time (find . -name '*[0-9]' -follow | xargs cat ;) >/dev/null
   bash: syntax error near unexpected token `(find'

Here is some more accurate information:

The top directory weighs in at 189 MB, all Usenet posts of various
sizes.
   `du -sh /bak/n2m'
189M    /bak/n2m

Timing the search tool with these RE:
  (no -follow flag needed)
  date;search.sh "^Newsgroup: comp\.lang\.awk" "^Subject: "\
   "for *\\\(" /bak/n2m >/dev/null 2>&1  && date

  Tue Aug  1 17:44:21 PDT 2000
  Tue Aug  1 17:45:35 PDT 2000
Total: 1 minute and 14 seconds

Timing the suggested command:
  (no -follow flag needed)
  $ date;find /bak/n2m -name '[0-9]*' | xargs cat >/dev/null 2>&1 && date

  Tue Aug  1 17:57:45 PDT 2000
  Tue Aug  1 17:58:45 PDT 2000
Total: 1 minute



Sat, 18 Jan 2003 03:00:00 GMT  

Quote:

...
> >   FNR == 1 && FNR != NR {
> >       # do end-of-file processing for first through second-to-last files
> >       # if there are multiple files
> >   }

> That line... " first through second to last"  indicates the last file
> will not be processed.  This technique was quietly dropped as the
> thread progressed.  It wasn't totally clear to me if it still
> obtained, but I see now that it doesn't.

This clause will not do _end-of-file processing_ for the last file.
The contents of the last file are still processed (as contents).
This clause fires on the first line of each file other than the first;
thus on file 2 it does end-of-file processing for file 1, on file 3 for file 2,
... on the last file for file N-1.  It can't do end-of-file processing
for the last file because there would have to be a file after
the last file, which is obviously impossible.  Thus you must put
end-of-file processing for the last file in the END block; you can
just have both of those call a function that does the real work.
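
A small sketch of that arrangement, for the general case where
end-of-file processing really is needed.  The helper end_of_file(), the
variable prev, the two sample files and the wrapper eof_demo are all
made up for the demonstration; the "processing" here is just printing a
trailer line.

```shell
eof_demo() {
    tmp=$(mktemp -d) || return 1
    (
        cd "$tmp" || exit 1
        printf 'a1\na2\n' > one
        printf 'b1\n'     > two
        awk '
        function end_of_file(name) {        # the shared "real work"
            print "-- end of " name
        }
        # First line of every file except the first: the PREVIOUS file
        # has just ended, so finish it off now.
        FNR == 1 && FNR != NR { end_of_file(prev) }
        { prev = FILENAME; print }
        # Nothing follows the last file, so END has to finish it off.
        END { if (NR > 0) end_of_file(prev) }
        ' one two
    )
    rm -rf "$tmp"
}
eof_demo
```

Note that awk reads no records from an empty file, so an empty file
never triggers the FNR == 1 rule and gets no trailer in this sketch.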

As noted elsewhere your application doesn't need end-of-file processing,
only beginning-of-file processing, which you got with FNR==1 ...

...

Quote:
> Aside: An interesting note is that the above first method would give
>        some inaccurate results, when xargs is used as in the script
>        below.  Apparently xargs passes a chunk of files (around 500)
>        and the next chunk zeroes out NR.  I ran a test with an NR
>        counter in place, it zeroed out at between 460 and 480
>        repeatedly over 42,800 incoming files.

What xargs foo does is break its stdin into chunks and execute
foo once for each chunk.  Each execution of an awk script has
all variables (including NR) completely separate from other executions.
xargs exists because most Unices have a limit on the amount of
argument data that can be passed to (one execution of) a program,
but often it is more efficient to do "several" (more than one,
but less than all) files or other arguments in each execution.
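
The effect is easy to reproduce on a small scale.  In this sketch the
-n 2 option forces xargs to pass at most two file names per invocation,
standing in for the argument-size limit that produced chunks of roughly
460 to 480 files in the report above; the file names and the wrapper
xargs_demo are invented for the demonstration.

```shell
xargs_demo() {
    tmp=$(mktemp -d) || return 1
    (
        cd "$tmp" || exit 1
        # Four one-line files.
        for f in f1 f2 f3 f4; do
            echo "line" > "$f"
        done
        # xargs starts a separate awk process for each pair of names,
        # and each process counts NR from zero again.
        printf '%s\n' f1 f2 f3 f4 |
            xargs -n 2 awk 'END { print "records seen by this awk process:", NR }'
    )
    rm -rf "$tmp"
}
xargs_demo
```

Two awk processes run, and each one reports only the two records it
read itself; neither ever sees an NR of 4.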

--
- David.Thompson 1 now at worldnet.att.net



Thu, 23 Jan 2003 03:00:00 GMT  
 