newbie question 

Hi,

At work I need to format a lot of text files. One of my co-workers
pointed me to the awk utility. I'm totally new to this. Could someone
help me with a really simple script?

Description of the task:

1. If line contains n spaces at beginning, then replace it by TAB
(\t).
2. If line contains in the middle more than one continuous spaces,
then replace all spaces by one space char.
3. If next line does not start from n spaces (or TAB after step 1),
then replace CR (\n) by space.
4. If line contains only CR (\n), then skip it.

I wrote this to do steps 1 & 2:

{ sub(/^     /, "\t") } # n = 5 spaces
{ sub(/ +/, " ") }
{ print }

However, the second line has no effect. :( Also, I have no idea how to
look at the next line (step 3).

Thanks in advance
Alex



Sat, 13 Dec 2003 17:50:47 GMT  

Quote:
> 1. If line contains n spaces at beginning, then replace it by TAB
> (\t).
> 2. If line contains in the middle more than one continuous spaces,
> then replace all spaces by one space char.
> 3. If next line does not start from n spaces (or TAB after step 1),
> then replace CR (\n) by space.
> 4. If line contains only CR (\n), then skip it.

> I wrote this to do steps 1 & 2:

> { sub(/^     /, "\t") } # n = 5 spaces
> { sub(/ +/, " ") }
> { print }

> However, second line has no effect. :( Also, I have no any idea how to
> define next line (step 3).

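Yes, it does, but the sub function makes only one replacement, on the
first match. Since / +/ also matches a single space, that first match is
often a lone space that gets replaced by itself, so nothing seems to
change. Use the gsub function to replace every occurrence of the regular
expression in the string.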
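
To see the difference side by side, here is a small sketch. The function
names squeeze_first and squeeze_all are made up for this example, and the
pattern uses two spaces before the + so that a lone space is left alone:

```sh
# Hypothetical helpers, for illustration only.
# sub() replaces only the first run of two or more spaces.
squeeze_first() { awk '{ sub(/  +/, " "); print }'; }
# gsub() replaces every such run on the line.
squeeze_all() { awk '{ gsub(/  +/, " "); print }'; }

printf 'one   two    three\n' | squeeze_first
printf 'one   two    three\n' | squeeze_all
```

The first call only squeezes the gap after "one"; the second squeezes both
gaps.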

To code step 3, you need to use records which span several lines. I
would suggest you use '\n' as the field separator and '\n\n+' as the
record separator (a regular expression in RS is understood by gawk and
mawk, but not by every awk). This means that what you called a line will
become a field (residing in $1, $2, $3,...) and that you won't have
anything to do to apply step 4, this will be automagically done :)

This does, however, yield a more complicated program:

BEGIN { FS="\n"; RS="\n\n+" }
{
  # first, a loop over all the lines (i.e. fields) of the record
  # for step 1; i<=NF so that the last line is included too
  for (i=1; i<=NF; i++) {
    sub(/^     /, "\t", $i)
  }

  # step 2
  gsub(/ +/, " ")

  # step 3 (this joins every line of the record, indented or not)
  gsub(/\n/, " ")

  # nothing to do for step 4

  # final printing
  print
}
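
If you want to check how lines end up as fields, here is a minimal sketch.
It assumes the POSIX "paragraph mode" (RS set to the empty string) rather
than a regular expression in RS, since that form should work in any awk;
the function name show_fields is made up for the example:

```sh
# Hypothetical helper: label every line with its record and field number.
show_fields() {
  awk 'BEGIN { RS = ""; FS = "\n" }   # records are paragraphs, fields are lines
       { for (i = 1; i <= NF; i++) print NR "." i ": " $i }'
}

printf 'a\nb\n\n\nc\n' | show_fields
```

This prints "1.1: a", "1.2: b" and "2.1: c": the run of empty lines ends
the first record and never shows up as a record of its own.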

Now, there is a way to combine the two gsub calls for steps 2 and 3,
which I suppose is faster:

gsub(/\n|  +/, " ")
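
One thing to watch with a combined pattern: a line that ends in a single
space followed by the newline would come out with two spaces, because the
lone space and the newline are matched separately. A possible variant,
sketched here under the assumption that any mix of spaces and newlines
should collapse to one space (join_para is a made-up name, and paragraph
mode is used to get multi-line records):

```sh
# Hypothetical helper: collapse every run of spaces and/or newlines
# inside a paragraph into a single space.
join_para() {
  awk 'BEGIN { RS = "" }
       { gsub(/( |\n)+/, " "); print }'
}

printf 'first line \nsecond   line\n\nnext para\n' | join_para
```

This prints "first line second line" and then "next para".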

--
BBP



Sat, 13 Dec 2003 17:59:07 GMT  

Quote:

> Hi,
> At my work I need to format a lot of text files. One of my co-workers
> pointed me to awk utility. I'm totally new to this. Could someone help
> me with really simple script?
> Description of the task:
> 1. If line contains n spaces at beginning, then replace it by TAB
> (\t).

 /^        / { sub(/^[ ]*/, "\t") }  # "n spaces" are between ^ and /
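
        A quick way to try this rule on its own, as a sketch that
        assumes n is 4 (tabify is a made-up name). The pattern in
        front of the action is the guard, and the sub then eats
        all the leading spaces, not just the first n:

```sh
# Hypothetical helper; n is assumed to be 4 here.
tabify() {
  awk '/^    / { sub(/^ */, "\t") }   # guard: at least 4 leading spaces
       { print }'
}

printf '  two\n      six\n' | tabify
```

        The line with two leading spaces passes through unchanged;
        the line with six comes out as a tab followed by "six".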

Quote:
> 2. If line contains in the middle more than one continuous spaces,
> then replace all spaces by one space char.

        { gsub(/  */," ") }  # 2 spaces between / and *

Quote:
> 3. If next line does not start from n spaces (or TAB after step 1),
> then replace CR (\n) by space.

        This one is hard to understand.  Do you mean to join
        lines where the next line is *not* indented?

        Restated:
          If next line starts with anything other than a [Tab]
          (since we've already changed n-spaces with that) then
          join it to the current line.  (???)

        If we need such a read ahead in awk we save the current
        line and defer output until later (when the next line
        is the current line):

        l=="" { l=$0; next }  # nothing saved yet: remember this
                # line and wait for the next one.

        /^\t/ { print l; l=$0; next }  # this line DOES start with
                # tab so print last line; set new line; and skip
                # rest of the awk processing for this line.

        { l=l " " $0 }  # this line must start with something other
                # than a tab (since we did a next in the last case)
                # so we join the two lines with a space

        END { print l }  # at end, print the last line (possibly
                # a concatenation of many lines)
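
        The deferral idea in isolation, as a minimal sketch (the
        names pairs and prev are made up for this example): each
        line is held back until the next one arrives, so a rule
        can look at both of them.

```sh
# Hypothetical helper: print every line together with the line after it.
pairs() {
  awk 'NR > 1 { print prev " -> " $0 }   # prev is the line before this one
       { prev = $0 }'                    # remember the current line for later
}

printf 'a\nb\nc\n' | pairs
```

        For the three input lines a, b and c this prints "a -> b"
        and "b -> c".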

Quote:
> 4. If line contains only CR (\n), then skip it.

        Insert the following at the top of the script:

        /^$/ { next }

Quote:
> I wrote this to do steps 1 & 2:
> { sub(/^     /, "\t") } # n = 5 spaces
> { sub(/ +/, " ") }
> { print }
> However, second line has no effect. :( Also, I have no any idea how to
> define next line (step 3).
> Thanks in advance
> Alex

        Bringing all of my steps together (and re-writing
        the comments) I get:

        # skip blank lines:
                /^$/ { next }
        # leading n-spaces replaced with a single tab:
                /^        / { sub(/^[ ]*/, "\t") }
        # remaining sequences of spaces squeezed to one space each:
                { gsub(/  */," ") }
        # look ahead to join lines that don't start with tabs:
           # first line we just remember (and skip to next):
                l=="" { l=$0; next }
           # if this (non-first) line starts with a tab,
           # don't join it to the last line, just print that one;
           # set next "lastline" and skip to next:
                /^\t/ { print l; l=$0; next }
           # getting past the "next" means this line starts
           # with something other than a tab so join last and this:
                { l=l " " $0 }

        # implicitly we're done with this line;

        # after all lines; print the last line
        # (possibly a concatenation of many joined lines);
        # the action has to start on the same line as END:
                END { print l }

        ... which could probably be shortened and optimized,
        and which is probably slightly different than your
        requirements (due to ambiguity).

        However that's probably a pretty good start.  Here's
        the short form (no comments) (seven lines).  The first
        three lines each knock out one of your stated
        requirements; two of the other four handle the two
        conditions (lines that start with a tab or not); one
        sets our state for the look ahead; and one cleans up
        after us.

                /^$/ { next }
                /^        / { sub(/^[ ]*/, "\t") }
                { gsub(/  */," ") }
                l=="" { l=$0; next }
                /^\t/ { print l; l=$0; next }
                { l=l " " $0 }
                END { print l }
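
        To try the whole thing on a small sample, here is a hedged
        end-to-end sketch of the same idea. The wrapper name reflow
        is made up, and so is the choice to pass the indent width n
        as an argument and to skip the final print on empty input;
        adjust to taste.

```sh
# Hypothetical wrapper; n (the indent width) defaults to 8.
reflow() {
  awk -v n="${1:-8}" '
    BEGIN { indent = sprintf("^%" n "s *", "") }  # "^", n spaces, then any extra spaces
    /^$/ { next }                                 # step 4: skip empty lines
    { sub(indent, "\t") }                         # step 1: leading indent becomes one tab
    { gsub(/  +/, " ") }                          # step 2: squeeze runs of spaces
    l == "" { l = $0; next }                      # nothing saved yet: remember this line
    /^\t/ { print l; l = $0; next }               # indented line: flush the saved one
    { l = l " " $0 }                              # step 3: join to the saved line
    END { if (l != "") print l }                  # flush whatever is still saved
  '
}

printf '    First paragraph starts here\ncontinues   on this line\n\n    Second one\nends here\n' | reflow 4
```

        With n set to 4 this gives two output lines, each starting
        with a tab: "First paragraph starts here continues on this
        line" and "Second one ends here".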



Wed, 17 Dec 2003 15:31:27 GMT  
 