Newbie question -- to make an awk program act on a collection of files 

Hi.

This is perhaps more of a shell question than an awk specific one, but I'm
fairly new to both, and am still struggling for a solution after quite a
while of googling for "(awk|shell) tutorial multiple files" etc.

I have a working awk script ("cleantest.awk") which I can invoke on one file
at a time:

BEGIN {}
NR<2 {print" " > "../results_new/"FILENAME;}
NR>13 && $8!="tartimis:offset" {print > "../results_new/"FILENAME;}
NR>13 && $8=="tartimis:offset" {print $1,$2,$3,$4,$5,$6,$7 >
"../results_new/"FILENAME;}
END {}

and it creates a copy of the file with the same filename in a different
directory.

What I want to do is to make this act on a hundred or so files at a time,
creating a hundred or so new files with the same names in the directory
"results_new".

I've tried things like

find . -name "*.txt" | awk -f cleantest.awk [resulting file is a list of
filenames  :o( ]
cat `find . -name "*.txt"` | awk -f cleantest.awk [resulting file contains
all the text I want, but in one huge file called "-" instead of lots of
little ones]

If somebody could supply me with a solution, or better yet a link to a good
tutorial which will explain why said solution works and mine doesn't, it'd
be a big help.

Thanks aplenty.

Tim



Sat, 21 May 2005 18:01:07 GMT  



Quote:
> Hi.

> This is perhaps more of a shell question than an awk specific one, but I'm
> fairly new to both, and am still struggling for a solution after quite a
> while of googling for "(awk|shell) tutorial multiple files" etc.

> I have a working awk script ("cleantest.awk") which I can invoke on one file
> at a time:

> BEGIN {}
> NR<2 {print" " > "../results_new/"FILENAME;}
> NR>13 && $8!="tartimis:offset" {print > "../results_new/"FILENAME;}
> NR>13 && $8=="tartimis:offset" {print $1,$2,$3,$4,$5,$6,$7 >
> "../results_new/"FILENAME;}
> END {}

> and it creates a copy of the file with the same filename in a different
> directory.

> What I want to do is to make this act on a hundred or so files at a time,
> creating a hundred or so new files with the same names in the directory
> "results_new".

Your awk problem is the difference between NR and FNR.
FNR is set on a per file basis, NR is an overall count.
(And you do not need those empty BEGIN and END blocks.)
And FILENAME is "-" when you pipe to awk's standard input,
as you have discovered.
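
To see the difference for yourself, here is a small sketch. The scratch directory and the names one.txt and two.txt are made up for the illustration; the program prints both counters for every input line.

```shell
# Two tiny sample files in a scratch directory (names are hypothetical).
tmpdir=$(mktemp -d)
printf 'a\nb\n' > "$tmpdir/one.txt"
printf 'c\nd\n' > "$tmpdir/two.txt"

# NR keeps counting across files; FNR restarts at 1 for each file.
counts=$(cd "$tmpdir" && awk '{ print FILENAME, NR, FNR }' one.txt two.txt)
printf '%s\n' "$counts"
# one.txt 1 1
# one.txt 2 2
# two.txt 3 1
# two.txt 4 2
```

A test such as NR>13 is therefore true for every line of the second file, while FNR>13 skips the first thirteen lines of each file.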

However, this is not the right way to go about this.
What is wrong with cp? cp file1 file2 file3 ... fileN directory

And btw the pattern after find . -name does need quoting, as you did, so
that the shell will not expand the *.txt in your working directory;
single quotes work just as well as double quotes here.

What do you really want to do?

John.



Sat, 21 May 2005 19:05:18 GMT  

Quote:

> What do you really want to do?

> John.

Thanks for the tips; I'll try to describe the problem I'm attempting to
solve a bit better.

Quote:

> However, this is not the right way to go about this.
> What is wrong with cp? cp file1 file2 file3 ... fileN directory

The awk script is meant to "clean up" a collection of about 100 text files
(between 1K and 100K each) which are generated as the results of an
application test procedure. Ideally I would overwrite the original files
with the modified version, but I had trouble doing that (as well!) and
that's how I ended up at the "create a new file in a different place with
the same name" option.

As I said, it works on a single file; if I do:

     awk -f cleantest.awk results1.txt

I get a new file generated called results1.txt that has all the right things
stripped out of it. What's missing is the right thing to type to do the same
operation on every text file in my subdirectory tree rather than one at a
time.

Thanks again for your help.

Tim



Sat, 21 May 2005 19:19:09 GMT  



Quote:

> > What do you really want to do?

> > John.

> Thanks for the tips; I'll try to describe the problem I'm attempting to
> solve a bit better.

> The awk script is meant to "clean up" a collection of about 100 text files
> (between 1K and 100K each) which are generated as the results of an
> application test procedure. Ideally I would overwrite the original files
> with the modified version, but I had trouble doing that (as well!) and
> that's how I ended up at the "create a new file in a different place with
> the same name" option.

> As I said, it works on a single file; if I do:

>      awk -f cleantest.awk results1.txt

> I get a new file generated called results1.txt that has all the right things
> stripped out of it. What's missing is the right thing to type to do the same
> operation on every text file in my subdirectory tree rather than one at a
> time.

Try the -exec option to find to run the same command on each found file
(one at a time), or change your script to use FNR rather than NR. Then
you can use "find ... |xargs awk -f cleantest.awk" where xargs bundles
up the filenames it gets from find (or ls or wherever) and gives them
to awk which is more efficient.
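
A rough sketch of the two approaches, run in a scratch tree so nothing real is touched. The directory layout and file names are invented for the example, and the one-line awk programs only report counts; they stand in for cleantest.awk.

```shell
# Scratch tree with two sample files (all names are hypothetical).
tmpdir=$(mktemp -d)
mkdir "$tmpdir/sub"
printf 'x\ny\n' > "$tmpdir/a.txt"
printf 'z\n' > "$tmpdir/sub/b.txt"
cd "$tmpdir" || exit 1

# -exec ... \; starts one awk per file, so NR restarts each time.
per_file=$(find . -name '*.txt' -exec awk 'END { print FILENAME, NR }' {} \; | sort)

# xargs hands many names to a single awk, so only FNR restarts per file.
bundled=$(find . -name '*.txt' -print | xargs awk 'FNR == 1 { files++ } END { print files, NR }')

printf '%s\n%s\n' "$per_file" "$bundled"
# ./a.txt 2
# ./sub/b.txt 1
# 2 3
```

Note that plain "find | xargs" splits names on whitespace, so this form assumes file names without spaces.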

Once you have got your script right, and are confident you will never
need your original files again, you can have your script write to
a temporary file, then copy that new (temporary) file over the original
one.
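
Something along these lines would do the temporary-file step. It is only a sketch: the sample file, the ".tmp" suffix and the stand-in filter ($1 == "keep") are assumptions for the demonstration, and in real use the filter would be your own script writing to standard output.

```shell
# Scratch directory with one sample file (hypothetical name and content).
work=$(mktemp -d)
printf 'keep 1\ndrop me\nkeep 2\n' > "$work/sample.txt"

for f in "$work"/*.txt; do
  tmp="$f.tmp"
  # Replace the original only if awk finished without error.
  if awk '$1 == "keep"' "$f" > "$tmp"; then
    mv -f "$tmp" "$f"
  else
    rm -f "$tmp"
  fi
done

cat "$work/sample.txt"
# keep 1
# keep 2
```

Using mv rather than cp means the original is replaced in one step, so an interrupted run does not leave a half-written file under the original name.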

John.



Sat, 21 May 2005 19:32:08 GMT  
Hello,

Quote:

> I have a working awk script ("cleantest.awk") which I can invoke on one file
> at a time:

> BEGIN {}
> NR<2 {print" " > "../results_new/"FILENAME;}
> NR>13 && $8!="tartimis:offset" {print > "../results_new/"FILENAME;}
> NR>13 && $8=="tartimis:offset" {print $1,$2,$3,$4,$5,$6,$7 >
> "../results_new/"FILENAME;}
> END {}

The first and last lines do nothing, so they should not be there.

I suppose you call the script like this:

        awk -f cleantest.awk file.txt

You can simply do this:

        awk -f cleantest.awk *.txt

When you have, say, 10 files, this will work; the FILENAME variable will
automatically change.  There is one problem: NR is counted globally,
so the beginning of the second file won't have NR<2.  The solution is to use
FNR instead of NR:

FNR<2 {print" " > "../results_new/"FILENAME;}
FNR>13 && $8!="tartimis:offset" {print > "../results_new/"FILENAME;}
FNR>13 && $8=="tartimis:offset" {print $1,$2,$3,$4,$5,$6,$7 >
"../results_new/"FILENAME;}

If you have more files, awk may run out of open output files.
You have to tell awk when you are done with a particular output file;
close() takes the same name that was used after the ">":

FNR == 1 { if (old_filename != "") close("../results_new/" old_filename) }
END { close("../results_new/" old_filename) }
{ old_filename = FILENAME }
FNR<2 {print" " > "../results_new/"FILENAME;}
FNR>13 && $8!="tartimis:offset" {print > "../results_new/"FILENAME;}
FNR>13 && $8=="tartimis:offset" {print $1,$2,$3,$4,$5,$6,$7 >
"../results_new/"FILENAME;}

Then  awk -f cleantest.awk *.txt  will work.
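
If you want to try the multi-file version without touching real data, a sketch like the following may help. The scratch directories, the generated 15-line files and the variable name out are all made up for the demonstration, and the "tartimis:offset" rule is left out to keep it short.

```shell
# Scratch layout: results/ holds the inputs, results_new/ receives the output.
work=$(mktemp -d)
mkdir "$work/results" "$work/results_new"
cd "$work/results" || exit 1

# Generate two 15-line sample files (hypothetical names).
for f in r1.txt r2.txt; do
  awk -v name="$f" 'BEGIN { for (i = 1; i <= 15; i++) print name, "line", i }' > "$f"
done

awk '
  # On the first line of each file, close the previous output and pick the new one.
  FNR == 1 { if (out != "") close(out); out = "../results_new/" FILENAME }
  FNR < 2  { print " " > out }   # placeholder first line, as in the original script
  FNR > 13 { print > out }       # everything after line 13 of the current file
' r1.txt r2.txt

cat ../results_new/r2.txt
```

Each output file should end up with three lines: the placeholder line followed by lines 14 and 15 of its own input.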

Let me propose also this change to the code:

FNR == 1 { if (old_filename != "") close("../results_new/" old_filename) }
END { close("../results_new/" old_filename) }

FNR<2 { $0 = ""     # $0=" " Do you really need the space?
}

FNR>13 && $8=="tartimis:offset" { NF=7 }

FNR<2 || FNR>13 {   # lines 2 to 13 are still skipped
  print > ("../results_new/" FILENAME)
  old_filename = FILENAME
}

Quote:
> find . -name "*.txt" | awk -f cleantest.awk [resulting file is a list of
> filenames  :o( ]

find prints the filenames, awk reads them on the input.
It's roughly the same as if you did this:

find . -name "*.txt" >list.txt
awk -f cleantest.awk list.txt

Quote:
> cat `find . -name "*.txt"` | awk -f cleantest.awk [resulting file contains
> all the text I want, but in one huge file called "-" instead of lots of
> little ones]

awk reads concatenated files from standard input.  It's roughly the same as

cat *.txt >all.txt
awk -f cleantest.awk all.txt

The resulting file has name "-", not "all.txt" as there is no "all.txt",
the data just go through pipe to std. input of awk.  If awk reads from std.
input, it uses FILENAME=="-" since no real name is available.

Quote:
> If somebody could supply me with a solution, or better yet a link to a good
> tutorial which will explain why said solution works and mine doesn't, it'd
> be a big help.

OK, I hope this helped.

There are also shell solutions, eg.:

for f in *.txt; do awk -f cleantest.awk "$f"; done

Stepan Kasal



Sat, 21 May 2005 19:57:40 GMT  


Quote:
>Hi.

>This is perhaps more of a shell question than an awk specific one, but I'm
>fairly new to both, and am still struggling for a solution after quite a
>while of googling for "(awk|shell) tutorial multiple files" etc.

>I have a working awk script ("cleantest.awk") which I can invoke on one file
>at a time:

>BEGIN {}
>NR<2 {print" " > "../results_new/"FILENAME;}
>NR>13 && $8!="tartimis:offset" {print > "../results_new/"FILENAME;}
>NR>13 && $8=="tartimis:offset" {print $1,$2,$3,$4,$5,$6,$7 >
>"../results_new/"FILENAME;}
>END {}

>and it creates a copy of the file with the same filename in a different
>directory.

>What I want to do is to make this act on a hundred or so files at a time,
>creating a hundred or so new files with the same names in the directory
>"results_new".

(BTW, I'm assuming platform == "Unix", since you are invoking "find", but
these days, one basically assumes Windoze unless explicitly specified.  The
times, they are a changin'...)

Two things I try to avoid in my day to day life:
        1) Using redirection in AWK (if I can help it)
        2) Using the "find" cmd in Unix (if I can help it)

So, how about:

$ cat foo.awk
NR<2 { print " " }
NR>13 { if ($8=="tartimis:offset") print $1,$2,$3,$4,$5,$6,$7
        else print }
$ for i in *.txt
> do gawk -f foo.awk $i > ../results_new/$i
> done
$

A couple of other notes:
        1) The "for i in *.txt" loop as written will fail if you have
           filenames with spaces in them, because $i is unquoted; writing
           "$i" in both places avoids that.  Check the shell group for the
           usual incantations.
        2) If your real goal is just to remove "tartimis:offset" from the
           string, you might use the sub() function instead.



Sat, 21 May 2005 21:46:26 GMT  
Thanks for your help and your suggestions. In fact, what your last
suggestion shows is that I should have clearly stated the problem I'm trying
to solve before worrying about how to solve it, either with awk or
otherwise. I shall attempt to do so now, if you don't mind bearing with me.

As I said in an earlier message, the test procedure for an application in
development results in the creation of about 100 text files. To tidy the
files up a bit before using them, what I want to do is:
  (1) remove everything up to and including the line which ends with the
string "todays_date: 20021203"; this is currently the first thirteen lines,
hence the "NR>13" in my script, but if it can be done by detecting the
string, this would be better
  (2) in any line which contains "tartimis:offset" (which always appears in
the 8th space-separated field), delete this string and the rest of the line
Ideally I want to launch the process and have it tidy up the files "in
situ", without creating a second copy of them all.

By the way, I'm actually using a Win2k box, but it's got Cygwin on it, hence
the availability of find, awk, et al.

Maybe awk isn't the best tool for the job... I'd be grateful for help in
single syllable words, or for a link to a good tutorial to sort me out.

Thanks again.

Tim





Sat, 21 May 2005 22:43:24 GMT  
(My comments are below)



Quote:
>Thanks for your help and your suggestions. In fact, what your last
>suggestion shows is that I should have clearly stated the problem I'm trying
>to solve before worrying about how to solve it, either with awk or
>otherwise. I shall attempt to do so now, if you don't mind bearing with me.

>As I said in an earlier message, the test procedure for an application in
>development results in the creation of about 100 text files. To tidy the
>files up a bit before using them, what I want to do is:
>  (1) remove everything up to and including the line which ends with the
>string "todays_date: 20021203"; this is currently the first thirteen lines,
>hence the "NR>13" in my script, but if it can be done by detecting the
>string, this would be better

This is usually done like this:
--- Cut Here ---
flag || (flag = /todays_date: 20021203/) {next}
--- Cut Here ---

Quote:
>  (2) in any line which contains "tartimis:offset" (which always appears in
>the 8th space-separated field), delete this string and the rest of the line

This is usually done like this:
--- Cut Here ---
{ sub(/tartimis:offset.*/,"") }
--- Cut Here ---
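
A quick way to try this on the command line is sketched below. The two sample lines are invented, and the " *" in front of the pattern is an addition of mine that also removes the blank left in front of the 8th field; drop it if you want exactly the rule above.

```shell
# Feed two made-up lines through sub(); only the first contains the marker.
cleaned=$(printf '%s\n' 'a b c d e f g tartimis:offset 42 more' 'an ordinary line' |
  awk '{ sub(/ *tartimis:offset.*/, ""); print }')
printf '%s\n' "$cleaned"
# a b c d e f g
# an ordinary line
```

Lines that do not contain the string pass through unchanged, because sub() simply finds nothing to replace.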

Quote:
>Ideally I want to launch the process and have it tidy up the files "in
>situ", without creating a second copy of them all.

The question of "in situ" editing comes up frequently in the shell groups
and the short summary is: There is no such thing.  Perl emulates it (but
behind the scenes, it is just doing the "output to a temp file and rename
it to the original [deleting the original]" routine).

"ed" (the standard Unix editor) does it, but it is really using some kind
of temp file (or loading it all in memory) as well.  The jury is out on
whether ed blows up if it can't load it all into memory or if it pages to
disk.

Now, you can do "in situ" in AWK (and I have done so), by loading all the
lines into an array, and then writing the array out to the file (using the
FILENAME variable) in the END block.  It works as long as your file isn't
in the gigabyte range and as long as you understand the risk.
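
For the curious, the array trick might look roughly like this. It is a sketch for a single small file only: the sample file is made up, the whole file is held in memory, and if awk is interrupted in the END block the file is left half-written.

```shell
# One made-up sample file in a scratch directory.
work=$(mktemp -d)
printf 'a b tartimis:offset c\nlast line\n' > "$work/one.txt"

awk '
  { sub(/ *tartimis:offset.*/, ""); line[NR] = $0 }   # hold every cleaned line in memory
  END {
    # Input is fully read by now, so writing to FILENAME truncates and refills it.
    for (i = 1; i <= NR; i++) print line[i] > FILENAME
    close(FILENAME)
  }
' "$work/one.txt"

cat "$work/one.txt"
# a b
# last line
```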

Quote:
>By the way, I'm actually using a Win2k box, but it's got Cygwin on it, hence
>the availability of find, awk, et al.

That's fine.  Close enough to Unix for our purposes.  In particular, it
means you have a shell, so the (ba)sh examples are relevant.

Quote:
>Maybe awk isn't the best tool for the job... I'd be grateful for help in
>single syllable words, or for a link to a good tutorial to sort me out.

AWK is a fine tool.  I think you are on the right track.

Quote:
>Thanks again.

Glad to have helped.


Sat, 21 May 2005 23:49:25 GMT  

Quote:

>This is usually done like this:
>--- Cut Here ---
>flag || (flag = /todays_date: 20021203/) {next}
>--- Cut Here ---

Oops!  That does the opposite (deletes from that line forward)

I leave fixing it as an exercise for the reader.
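
For anyone who wants to check their answer, one possible fix is sketched here. The variable name found and the sample input are made up, and the FNR == 1 reset is an assumption that several files are processed in one run.

```shell
# Made-up input: two junk lines, the marker line, then the data to keep.
cleaned=$(printf '%s\n' 'junk 1' 'junk 2' 'todays_date: 20021203' 'data 1' 'data 2' |
  awk '
    FNR == 1 { found = 0 }                   # start over for every input file
    found { print; next }                    # past the marker: keep the line
    /todays_date: 20021203/ { found = 1 }    # the marker line itself is dropped
  ')
printf '%s\n' "$cleaned"
# data 1
# data 2
```

The order of the rules matters: the marker line reaches the last rule with found still false, so it sets the variable without being printed.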



Sat, 21 May 2005 23:59:21 GMT  
Thanks very much for your suggestions. I would like to understand this
"flag" suggestion though, and I'm having trouble so far.

As I understand it, the line quoted below means "if 'flag' is true OR if
'flag' matches the regexp /todays_date:20021203/ then skip this line". I
don't know what 'flag' is though; a special internal variable? [Don't think
so.] An ordinary variable? In which case... no I'm sorry, I'm stuck. If you
could throw another ten minutes my way, I'd be grateful.

Thanks.

Tim



Quote:


> >This is usually done like this:
> >--- Cut Here ---
> >flag || (flag = /todays_date: 20021203/) {next}
> >--- Cut Here ---

> Oops!  That does the opposite (deletes from that line forward)

> I leave fixing it as an exercise for the reader.



Tue, 24 May 2005 23:27:23 GMT  
Hello,

Quote:

> Thanks very much for your suggestions. I would like to understand this
> "flag" suggestion though, and I'm having trouble so far.
> As I understand it, the line quoted below means "if 'flag' is true OR if
> 'flag' matches the regexp /todays_date:20021203/ then skip this line".

that would be written as "flag || (flag ~ /todays_date:20021203/)"

In awk, when you write a regular expression on its own, it means "match it
against the input line".  So we write rules this way:

/todays_date: 20021203/ { ... }

and the rule is executed iff the input line matches, i.e. as if you wrote

$0 ~ /todays_date: 20021203/ { ... }

Similarly, you can write

        if (/todays_date: 20021203/) {...}

inside any rule or function and it's the same as

        if ($0 ~ /todays_date: 20021203/) {...}

You can even write

        var = /todays_date: 20021203/

instead of

        var = ($0 ~ /todays_date: 20021203/)

And here we come to the end; the cryptic program

        flag || (flag = /todays_date: 20021203/) {next}
        {print}

which is equivalent to

        flag || (flag = ($0 ~ /todays_date: 20021203/)) {next}
        {print}

At the beginning, flag is uninitialized (false), so each input line is
matched against the regex and the result is saved in flag.

If the input line doesn't match, flag stays 0 and the program continues
with the next rule, which prints the line.

If it matches, flag becomes 1, the line is skipped, and all the following
lines are skipped as well, because flag is now true and the regex is no
longer tested.
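
You can watch this happen with a three-line made-up input; only the line before the match survives:

```shell
# Run the two-rule program from above on three sample lines.
kept=$(printf '%s\n' 'junk 1' 'todays_date: 20021203' 'data 1' |
  awk 'flag || (flag = /todays_date: 20021203/) { next } { print }')
printf '%s\n' "$kept"
# junk 1
```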

Hope this helps,
        Stepan Kasal



Fri, 27 May 2005 14:03:03 GMT  
[cc since I think my news server is not propagating]



[...]
% files up a bit before using them, what I want to do is:
%   (1) remove everything up to and including the line which ends with the
% string "todays_date: 20021203"; this is currently the first thirteen lines,
% hence the "NR>13" in my script, but if it can be done by detecting the
% string, this would be better

This sounds like a case where awk's range operator will do what you
want (which it hardly ever does).

  FNR == 1, /todays_date:/ { next }

this ignores every line of every file from the first until an occurrence
of `todays_date:'.

%   (2) in any line which contains "tartimis:offset" (which always appears in
% the 8th space-separated field), delete this string and the rest of the line

you can get rid of this using sub:

    { sub(/tartimis:offset.*/, "")
      print > file
    }

for more complex situations, you can use match to match the first
seven fields, then print a substring.

% Ideally I want to launch the process and have it tidy up the files "in
% situ", without creating a second copy of them all.

To write `in place', you can write to a temporary file, then delete
the original and rename the temporary file to the original name. On a
Unix system, this can be done while the original file is being read
(i.e., you can open it, then delete it and start writing to the new file).

  FNR == 1 { if (file != "") {
                close(file)
                system("mv -f " file " " origfile)
              }
              origfile = FILENAME
              file = FILENAME ".new"
  }

  FNR == 1, /todays_date:/ { next }

  { sub(/tartimis:offset.*/, "")
    print > file
  }

  # the last file is still pending when the input runs out
  END { if (file != "") {
          close(file)
          system("mv -f " file " " origfile)
        }
  }

even if writing to a new directory, I'd recommend putting the
new file name in a variable.
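
Here is a sketch of how the whole thing could be tried out safely in a scratch directory. The sample files, the helper function name finish and the extra quoting around the names passed to mv are my own assumptions for the test run, not part of the script above.

```shell
# Two identical made-up sample files in a scratch directory.
work=$(mktemp -d)
cd "$work" || exit 1
for f in t1.txt t2.txt; do
  printf 'header\ntodays_date: 20021203\n1 2 3 4 5 6 7 tartimis:offset 9\nplain\n' > "$f"
done

awk '
  # Close the finished output and move it over the file it came from.
  function finish() {
    if (file != "") { close(file); system("mv -f \"" file "\" \"" origfile "\"") }
  }
  FNR == 1 { finish(); origfile = FILENAME; file = FILENAME ".new" }
  FNR == 1, /todays_date:/ { next }
  { sub(/tartimis:offset.*/, ""); print > file }
  END { finish() }   # the last file has no following FNR == 1 to trigger the move
' t1.txt t2.txt

cat t2.txt
```

After the run each file should hold two lines, the trimmed data line and "plain", and no ".new" files should be left behind.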
--

Patrick TJ McPhee
East York  Canada



Sun, 22 May 2005 14:25:31 GMT  
[cc due to news propagation problems]


% As I understand it, the line quoted below means "if 'flag' is true OR if
% 'flag' matches the regexp /todays_date:20021203/ then skip this line". I

No. It means if flag is true, then go to the next record and start testing
patterns again. Otherwise, test the regexp /todays_date: 20021203/ against
$0 and assign the result to flag, then if it's true, go to the next record.

As Kenny points out, this is the opposite of what you want.

% don't know what 'flag' is though; a special internal variable? [Don't think

It's just an ordinary variable. It will start off with the value "", which
means false in a test.

 flag || (flag = /re/) { next }

is equivalent to, but more efficient than

  flag { next }
  !flag { if ($0 ~ /re/) flag = 1
          else flag = 0
        }
  flag { next }
--

Patrick TJ McPhee
East York  Canada



Wed, 25 May 2005 00:40:45 GMT  
Just a note to say thanks to all who helped me out with this problem, in
particular Kenny and Patrick. It works!

Cheers.

Tim






Mon, 06 Jun 2005 22:58:05 GMT  
 