SED Script 

Hi,

I wrote a script that is doing what I want...kind of.  It could be more
efficient I'm sure of it, though don't know how.

Here's my script...
    : more
    $!N
    s/\n\t/&/2000;
    t enough
    $!b more

    : enough
    s/\n\t/\t/g

My source data looks like this....
    Data on line1
        Data on line2
        Data on line3
    Data on line4
        data on line5
        data on line6
        data on line7

Each line is different, but what I'm trying to accomplish is to remove the
<CR-LF> from <CR-LF><TAB>
Using a text editor I can do it, though I want to do it programmatically.
My end result would look like this....
    Data on line1    Data on line2    Data on line3
    Data on line4    data on line5    data on line6    data on line7

I'm relatively new to regexp's and sed, but am a quick study.  Many thanks

replace the * with .

Trevor



Mon, 08 Dec 2003 16:15:54 GMT  
Why not use awk?

    awk '{
            if ($0 ~ /^     /) printf("\t%s", $0);
            else printf("\n%s", $0);
         } END {print}' filetoparse




Tue, 09 Dec 2003 02:21:41 GMT  

Quote:

> I wrote a script that is doing what I want...kind of.  It could be more
> efficient I'm sure of it, though don't know how.

> Here's my script...
>     : more
>     $!N
>     s/\n\t/&/2000;
>     t enough
>     $!b more

>     : enough
>     s/\n\t/\t/g

This is a good example of why I generally restrict my use of sed
to simple substitutions 'n' such. Scripts like this one make me
dizzy.

Quote:
> My source data looks like this....
>     Data on line1
>         Data on line2
>         Data on line3
>     Data on line4
>         data on line5
>         data on line6
>         data on line7

> Each line is different, but what I'm trying to accomplish is to remove the
> <CR-LF> from <CR-LF><TAB>
> Using a text editor I can do it, though I want to do it programmatically.
> My end result would look like this....
>     Data on line1    Data on line2    Data on line3
>     Data on line4    data on line5    data on line6    data on line7

> I'm relatively new to regexp's and sed, but am a quick study.

It's easier in Perl:

$ cat data
Data on line1
        Data on line2
        Data on line3
Data on line4
        Data on line5
        Data on line6
        Data on line7
$ perl -0777 -pe 's/\n(?=\t)//g' data
Data on line1   Data on line2   Data on line3
Data on line4   Data on line5   Data on line6   Data on line7
$

This won't work for huge files, but it's the just-right solution
(easy to write, easy to understand) for non-huge files.
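
For inputs too large to slurp, the same join can be done one line at a time. A minimal sketch, assuming a POSIX shell and awk; the function name join_tab_lines is made up for illustration and does not come from any post in this thread:

```shell
# join_tab_lines: hypothetical helper, not part of the original thread.
# Reads stdin (or the files given as arguments) and glues every line
# that starts with a tab onto the end of the previous line.
join_tab_lines() {
    awk '
        # A line that does not start with a tab begins a new output
        # line, so finish the previous one first (except before line 1).
        NR > 1 && !/^\t/ { printf "\n" }
        # Emit the line itself without a trailing newline.
        { printf "%s", $0 }
        # Terminate the last output line, but only if there was input.
        END { if (NR) printf "\n" }
    ' "$@"
}
```

Called as `join_tab_lines data`, it only ever holds the current line in memory, so file size should not matter.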

Quote:
> Many thanks ...

You're welcome.

Quote:

> replace the * with .

Nope. Sorry.

--
Jim Monty

Tempe, Arizona USA



Tue, 09 Dec 2003 02:39:47 GMT  
...

Quote:
>My source data looks like this....
>    Data on line1
>        Data on line2
>        Data on line3
>    Data on line4
>        data on line5
>        data on line6
>        data on line7

>Each line is different, but what I'm trying to accomplish is to remove the
><CR-LF> from <CR-LF><TAB>
>Using a text editor I can do it, though I want to do it programmatically.
>My end result would look like this....
>    Data on line1    Data on line2    Data on line3
>    Data on line4    data on line5    data on line6    data on line7

This would be trivial for gawk.

gawk -v RS="\n\t" -v ORS="\t" '{ print }' inputfile

Other awks require only a bit more work.

awk 'BEGIN { ORS = ""; getline; print }
!/^\t/ { print "\n" }
{ print }
END { print "\n" }' inputfile

...

Ask here, read here.



Tue, 09 Dec 2003 03:00:52 GMT  

An alternate solution using sed (it will work as long as the maximum line
length does not exceed the "hold" space's capacity).
sed -f test.sed <input file>
test.sed - command file
<input file> - input data file


[ test.sed < 1K ]
#n
:begin
h
n
:redo
${
b finish
}

/^      /{
H
n
b redo
}

:finish
x
s/\n    / /g
p
x
b begin


Tue, 09 Dec 2003 03:10:18 GMT  

Quote:


> > My end result would look like this....
> >     Data on line1    Data on line2    Data on line3
> >     Data on line4    data on line5    data on line6    data on line7

> This would be trivial for gawk.

> gawk -v RS="\n\t" -v ORS="\t" '{ print }' inputfile

This prints a straggling tab:

$ cat data
Data on line1
        Data on line2
        Data on line3
Data on line4
        Data on line5
        Data on line6
        Data on line7
$ gawk -v RS="\n\t" -v ORS="\t" '{ print }' data
Data on line1   Data on line2   Data on line3
Data on line4   Data on line5   Data on line6   Data on line7
        $
^^^^^^^^

$ gawk -v RS='\n\t' -v ORS='\t' '{ print }' data | od -c
0000000    D   a   t   a       o   n       l   i   n   e   1  \t   D   a
0000020    t   a       o   n       l   i   n   e   2  \t   D   a   t   a
0000040        o   n       l   i   n   e   3  \n   D   a   t   a       o
0000060    n       l   i   n   e   4  \t   D   a   t   a       o   n
0000100    l   i   n   e   5  \t   D   a   t   a       o   n       l   i
0000120    n   e   6  \t   D   a   t   a       o   n       l   i   n   e
0000140    7  \n  \t
0000143
$                 ^^

--
Jim Monty

Tempe, Arizona USA



Tue, 09 Dec 2003 05:50:29 GMT  

Quote:


> > My end result would look like this....
> >     Data on line1    Data on line2    Data on line3
> >     Data on line4    data on line5    data on line6    data on line7

> Other awks require only a bit more work.

> awk 'BEGIN { ORS = ""; getline; print }
> !/^\t/ { print "\n" }
> { print }
> END { print "\n" }' inputfile

This prints a newline when fed nothing for input:

$ cat grove.awk
BEGIN { ORS = ""; getline; print }
!/^\t/ { print "\n" }
{ print }
END { print "\n" }
$ awk -f grove.awk /dev/null

$ perl -0777 -pe 's/\n(?=\t)//g' /dev/null
$
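
The two corner cases raised in this thread (empty input, and a stray byte after the final newline) can be checked mechanically. A rough sketch, assuming a POSIX shell with od, tail, wc and tr; the name check_joiner is invented here, and the command under test is passed as its first argument:

```shell
# check_joiner CMD: hypothetical helper, not from the original thread.
# Returns 0 if CMD (a filter reading stdin) passes both corner cases.
check_joiner() {
    # 1. Empty input must produce zero bytes of output.
    bytes=$("$1" </dev/null | wc -c | tr -d ' ')
    [ "$bytes" = "0" ] || return 1
    # 2. The last output byte must be a newline (octal 012), not a tab.
    last=$(printf 'a\n\tb\n' | "$1" | tail -c 1 | od -An -to1 | tr -d ' \n')
    [ "$last" = "012" ] || return 1
}
```

Note that `check_joiner cat` passes trivially: the check covers only these two corner cases, not whether the lines were actually joined.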

--
Jim Monty

Tempe, Arizona USA



Tue, 09 Dec 2003 06:12:59 GMT  
Quote:


...
>>gawk -v RS="\n\t" -v ORS="\t" '{ print }' inputfile

> This prints a straggling tab:

...

True. Definitely didn't think of that. Easy enough to handle.

gawk -v RS="\0" -v ORS="" '{gsub(/\n\t/, "\t"); print}' inputfile

<and in addition>

Quote:


...
>>awk 'BEGIN { ORS = ""; getline; print }
>>!/^\t/ { print "\n" }
>>{ print }
>>END { print "\n" }' inputfile

> This prints a newline when fed nothing for input:

More careless bounds checking by me. Sheesh, you'd think I'd get it right
given what I'm paid for this. Also easily fixed (fascinating that some
people prefer posting perl code to comp.lang.awk rather than fixing awk
scripts themselves).

awk 'BEGIN { ORS = ""; if ((getline) > 0) print }
!/^\t/ { print "\n" }
{ print }
END { if (NR) print "\n" }' inputfile



Wed, 10 Dec 2003 01:15:53 GMT  

Quote:



> > > gawk -v RS="\n\t" -v ORS="\t" '{ print }' inputfile

> > This prints a straggling tab:

> True. Definitely didn't think of that. Easy enough to handle.

> gawk -v RS="\0" -v ORS="" '{gsub(/\n\t/, "\t"); print}' inputfile

This incorrectly removes NUL (\0) characters:

$ od -c data
0000000   \0   D   a   t   a       o   n       l   i   n   e   1  \n  \0
0000020   \t   D   a   t   a       o   n       l   i   n   e   2  \n  \0
0000040   \t   D   a   t   a       o   n       l   i   n   e   3  \n  \0
0000060    D   a   t   a       o   n       l   i   n   e   4  \n  \0  \t
0000100    D   a   t   a       o   n       l   i   n   e   5  \n  \0  \t
0000120    D   a   t   a       o   n       l   i   n   e   6  \n  \0  \t
0000140    D   a   t   a       o   n       l   i   n   e   7  \n
0000156
$ gawk -v RS='\0' -v ORS='' '{ gsub(/\n\t/, "\t"); print }' data | od -c
0000000    D   a   t   a       o   n       l   i   n   e   1  \n  \t   D
0000020    a   t   a       o   n       l   i   n   e   2  \n  \t   D   a
0000040    t   a       o   n       l   i   n   e   3  \n   D   a   t   a
0000060        o   n       l   i   n   e   4  \n  \t   D   a   t   a
0000100    o   n       l   i   n   e   5  \n  \t   D   a   t   a       o
0000120    n       l   i   n   e   6  \n  \t   D   a   t   a       o   n
0000140        l   i   n   e   7  \n
0000147
$ perl -0777 -pe 's/\n(?=\t)//g' data | od -c
0000000   \0   D   a   t   a       o   n       l   i   n   e   1  \n  \0
0000020   \t   D   a   t   a       o   n       l   i   n   e   2  \n  \0
0000040   \t   D   a   t   a       o   n       l   i   n   e   3  \n  \0
0000060    D   a   t   a       o   n       l   i   n   e   4  \n  \0  \t
0000100    D   a   t   a       o   n       l   i   n   e   5  \n  \0  \t
0000120    D   a   t   a       o   n       l   i   n   e   6  \n  \0  \t
0000140    D   a   t   a       o   n       l   i   n   e   7  \n
0000156
$

The revised gawk command you posted is a port of the Perl command
I posted and has the same input size limitation: it's not a practical
solution for huge files. In addition to the aforementioned bug,
your GNU awk port differs slightly from my Perl script in that the
tab it matches in the regular expression pattern is restored in
the replacement string. My Perl regular expression pattern asserts
the presence of the tab but does not consume the tab in the match,
thereby eliminating the need to replace the tab with itself. Gawk
does not have zero-width lookahead assertions.

The OP posted a "kind of" working sed script, then vaguely asked
if there was a better way "to remove the <CR-LF> from <CR-LF><TAB>."
The Perl script I posted ('s/\n(?=\t)//g') did exactly that--no
more, no less--and did it in a way that is immediately obvious and
understandable to anyone who knows Perl, even just a bit. That was
the point of posting it.

The first response to the OP's inquiry was an awk script that didn't
even work. Another respondent posted an even more complicated sed
script than the OP's own sed script. You responded with two
awk scripts, both of which had subtle flaws.

Quote:
> > > awk 'BEGIN { ORS = ""; getline; print }
> > > !/^\t/ { print "\n" }
> > > { print }
> > > END { print "\n" }' inputfile

> > This prints a newline when fed nothing for input:

> More careless bounds checking by me. Sheesh, you'd think I'd get it right
> given what I'm paid for this.

Despite your sarcasm, I find your newfound circumspection both
encouraging and refreshing. Does this mean you've retired from your
self-appointed role as comp.lang.awk's resident ballbuster and
newsgroup wrecker?

Quote:
> (fascinating that some
> people prefer posting perl code to comp.lang.awk rather than fixing awk
> scripts themselves).

I've fixed plenty. (See Google.) I've fixed yours. (Ibid.) I don't
"prefer" posting Perl code; I prefer posting code that works and
is responsive to the question or problem posed by the OP.

--
Jim Monty

Tempe, Arizona USA



Wed, 10 Dec 2003 03:34:46 GMT  
Thank you everyone, esp Mr. Monty.  I will try these suggestions and post my
results.


Sat, 13 Dec 2003 00:08:27 GMT  
This worked like a champ!!!  Thank you!

 -even though I don't quite understand it yet.

Quote:

>An alternate solution using sed (will work as long as the max line
>length does not increase beyond "hold" space's capacity).
>sed -f test.sed <input file>
>test.sed - command file
><input file> - input data file
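
For readers in the same position, here is a shorter sed loop with every command explained. It is a sketch of the common N/P/D idiom, not the test.sed script referred to in this post; the function name join_tab_lines_sed and the tab variable are made up for illustration, and a POSIX shell and sed are assumed:

```shell
# join_tab_lines_sed: hypothetical helper, not the test.sed from this thread.
# Commands, in order:
#   :a    label marking the top of the loop
#   $!N   unless this is the last line, append the next input line to
#         the pattern space, separated by an embedded newline
#   s///  replace an embedded "newline + tab" with just the tab
#   ta    if that substitution happened, loop back and try the next line
#   P     otherwise print the pattern space up to the first newline
#   D     delete what was printed and restart with the remainder
join_tab_lines_sed() {
    # Keep a literal tab in a variable; not every sed understands \t.
    tab=$(printf '\t')
    sed -e ':a' -e '$!N' -e "s/\n${tab}/${tab}/" -e 'ta' -e 'P' -e 'D' "$@"
}
```

Only the line being assembled and one look-ahead line sit in the pattern space at any time, and the hold space is not used at all.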



Sat, 13 Dec 2003 09:52:35 GMT  
...
Quote:
> This incorrectly removes NUL (\0) characters:

...

Yes, gotta watch for all those NUL's in text files. So (sh/ksh/bash
command)

gawk -v RS="\n(\t?)" '{ printf("%s%s", rt, $0); rt = substr(RT, length(RT)) }
END { printf rt }' inputfile

which is sufficiently inelegant that the generic awk solution is
probably better.

Quote:
>your GNU awk port differs slightly from my Perl script in that the
>tab it matches in the regular expression pattern is restored in
>the replacement string. My Perl regular expression pattern asserts
>the presence of the tab but does not consume the tab in the match,
>thereby eliminating the need to replace the tab with itself. Gawk
>does not have zero-width lookahead assertions.

So what? You're assuming perl is doing less work (you're implicitly
considering HOW perl uses the look ahead assertion). Both your perl s
operation and my gawk gsub operation are replacing a matched substring
with a constant string. The extra work done by perl's necessarily
greater overhead dealing with the look ahead assertion offsets the
greater work done by gawk rewriting the tab.

Quote:
>The OP posted a "kind of" working sed script, then vaguely asked
>if there was a better way "to remove the <CR-LF> from <CR-LF><TAB>."
>The Perl script I posted ('s/\n(?=\t)//g') did exactly that--no
>more, no less--and did it in a way that is immediately obvious and
>understandable to anyone who knows Perl, even just a bit. That was
>the point of posting it.

...

#include <stdio.h>
int main() {
char c, n;
while ((c = getline()) != EOF) {
if (c == '\n') {
n = '\n';
} else if (c == '\t') {
putchar(n = c);
} else {
if (n == '\n') putchar(n);
putchar(n = c);
}
}
if (n == '\n') putchar(n);
return 0;
}

This is C. It does exactly what the OP wants (strips all newlines
immediately preceding tabs). It's almost certainly faster than any
scripting language at this particular task. It's also OT in this
newsgroup.

Is your perl solution more efficient in terms of coding? Yes. Perl is
generally a better replacement for sed than [g]awk is.



Sat, 13 Dec 2003 22:01:12 GMT  

Quote:


> > your GNU awk port differs slightly from my Perl script in that the
> > tab it matches in the regular expression pattern is restored in
> > the replacement string. My Perl regular expression pattern asserts
> > the presence of the tab but does not consume the tab in the match,
> > thereby eliminating the need to replace the tab with itself. Gawk
> > does not have zero-width lookahead assertions.

> So what? You're assuming perl is doing less work (you're implicitly
> considering HOW perl uses the look ahead assertion).

I made no assumptions, nor any assertions, about the amount of work
being done by Perl. *You're* the one who's obsessed with pointless
attempts to micro-optimize trivial awk scripts, not I. I was simply
pointing out an operative difference between my original Perl script
and your GNU awk port of my code--which, by the way, you were not
obliged to attribute to me by virtue of its utter trivialness.

But now that you mention it, this

  s/\n(?=\t)//g

is better than this

  s/\n\t/\t/g

simply because it more closely reflects the real intent of the
operation. It precisely expresses the notion of replacing all (g)
the newlines (\n) that are immediately followed by a tab (?=\t)
with nothing (//). The beauty of lookaround assertions is the
greater expressiveness they lend to patterns and to pattern matching
operations, especially substitutions.

Quote:
> Both your perl s operation and my gawk gsub operation are replacing
> a matched substring with a constant string. The extra work done
> by perl's necessarily greater overhead dealing with the look
> ahead assertion offsets the greater work done by gawk rewriting
> the tab.

What overhead? Even if there *were* perceptible overhead, it would
occur only once, when the regular expression is compiled. The
substitution of the tab occurs every time the pattern matches.
Besides, you got your arguments screwed up. In the first sentence
you assert that the replacement of the matched text with a constant
string is the same whether that constant string has a length of 0
or 1. But then in the second sentence you state that the "greater
work done by gawk" in replacing the matched text with a tab (vis-a-vis
an empty string in the Perl script) is offset by the greater overhead
of the lookahead assertion in the Perl script. Which is it?

I do agree with you on this point: "So what?"

Quote:
> > The OP posted a "kind of" working sed script, then vaguely asked
> > if there was a better way "to remove the <CR-LF> from <CR-LF><TAB>."
> > The Perl script I posted ('s/\n(?=\t)//g') did exactly that--no
> > more, no less--and did it in a way that is immediately obvious and
> > understandable to anyone who knows Perl, even just a bit. That was
> > the point of posting it.

> #include <stdio.h>
> int main() {
> char c, n;
> while ((c = getline()) != EOF) {
> if (c == '\n') {
> n = '\n';
> } else if (c == '\t') {
> putchar(n = c);
> } else {
> if (n == '\n') putchar(n);
> putchar(n = c);
> }
> }
> if (n == '\n') putchar(n);
> return 0;
> }

> This is C.

Yeah. "C" for "crap."

Quote:
> It does exactly what the OP wants (strips all newlines
> immediately preceding tabs). It's almost certainly faster than any
> scripting language at this particular task.

It doesn't even compile!

  $ cat crap.c
  #include <stdio.h>
  int main() {
  char c, n;
  while ((c = getline()) != EOF) {
  if (c == '\n') {
  n = '\n';
  } else if (c == '\t') {
  putchar(n = c);
  } else {
  if (n == '\n') putchar(n);
  putchar(n = c);
  }
  }
  if (n == '\n') putchar(n);
  return 0;
  }
  $ gcc crap.c
  /var/tmp/cc0216901.o: Undefined symbol _getline referenced from text segment
  $

It's so poorly formatted it's unreadable. But worse, the value of
n is needlessly re-initialized at every byte of input! Perhaps you
were so busy scouring the program for niggling ways to shave a
nanosecond here or there off the execution speed by optimizing some
primitive operation that you didn't notice you had senselessly
increased the number of operations performed by the whole program
by having made an elementary programming error.

So now I have to fix your C code, too?

  $ cat pivot.c
  /* pivot.c */

  #include <stdio.h>
  #define BUF_SIZE 32766

  int main(void) {
      register int chr, buf = 0;  /* buf is read before any newline is seen */

      setvbuf(stdin, NULL, _IOFBF, BUF_SIZE);

      while ((chr = getchar()) != EOF) {
          if (chr == '\n') {
              buf = '\n';
          } else if (chr == '\t') {
              putchar(chr);
              buf = 0;
          } else {
              if (buf == '\n') {
                  putchar(buf);
                  buf = 0;
              }
              putchar(chr);
          }
      }

      if (buf == '\n')
          putchar(buf);

      return 0;
  }
  $ gcc pivot.c -o pivot
  $ chmod +x pivot
  $ ./pivot
  $ cat data
  Data on line1
        Data on line2
        Data on line3
  Data on line4
        Data on line5
        Data on line6
        Data on line7
  $ ./pivot <data
  Data on line1   Data on line2   Data on line3
  Data on line4   Data on line5   Data on line6   Data on line7
  $

I've included the obvious optimizations in addition to the bug
fixes. But if I had written the program myself, I would not have
used your same convoluted logic. It's still *your* program, not
mine.

Quote:
> It's also OT in this newsgroup.

But, as usual, that didn't stop you from posting it. It seems the
issue of topicality only bothers you in certain specific circumstances:
usually, when the "offender" is me and I haven't littered the text
of my article as you do with duplicitous warnings about how what
I'm about to write is off topic in this newsgroup.

Quote:
> Is your perl solution more efficient in terms of coding? Yes. Perl is
> generally a better replacement for sed than [g]awk is.

Throwing Perl a bone, eh? It's funny to hear *you* say Perl is
"generally a better replacement for sed" than awk. You once criticized
the common Perl idiom of placing non-trivial code in the eval'd
replacement string of a substitution operation as being too
complicated. Wrapping whole programs within an outer construct like
this

  #!/usr/bin/perl -n
  s{
    <arbitrarily complex regular expression pattern> ...
  }{
    <arbitrarily complex program that returns a replacement string> ...
  }gex;

is neither more nor less complicated than programs written like
this

  #!/bin/awk -f
  /<match a pattern>/ { <perform an action> }
  ...

but it's far more powerful.

Perl is the state of the art in text processing and regular expression
pattern matching. When you write that Perl is better at emulating
sed than awk is, I laugh at your nescience.

Jnaan fgbc lrg?

--
Jim Monty

Tempe, Arizona USA



Tue, 16 Dec 2003 16:05:27 GMT  
...
Quote:
>But now that you mention it, this

>  s/\n(?=\t)//g

>is better than this

>  s/\n\t/\t/g

>simply because it more closely reflects the real intent of the
>operation. It precisely expresses the notion of replacing all (g)
>the newlines (\n) that are immediately followed by a tab (?=\t)
>with nothing (//). The beauty of lookaround assertions is the
>greater expressiveness they lend to patterns and to pattern matching
>operations, especially substitutions.

...

Look ahead assertions are more elegant, but in this particular case
unnecessary. The intent of the operation can be expressed equivalently as
'delete all newlines immediately followed by a tab' and 'replace all
newline-tab sequences with tabs' (which I believe is how the OP expressed
it).

Quote:
>Perl is the state of the art in text processing and regular expression
>pattern matching. When you write that Perl is better at emulating
>sed than awk is, I laugh at your nescience.

Perl is tops with regard to regular expressions, but that's NOT all there is
to text processing. There are various things awk can do more concisely than
perl. See the comments in a2p.

But this is an _awk_ newsgroup. A few months ago there was a long thread
about inverting matrices in awk. Child's play in Matlab, Scilab, R, fortran,
APL, J (yada yada yada), so what would have been served by giving the 3
characters of APL to accomplish this (assuming a 1 character variable name)?

Awk is a language with every bit as much practicality as ancient Greek. And
just like comments about the state of the art nature of modern English or
German in ancient Greek newsgroups, comments about perl in awk newsgroups
are by definition OT and would seem to serve little purpose other than to
boost the ego of the person posting them.

If you really had the courage of your convictions, you'd crosspost perl
answers to comp.lang.perl.misc so other knowledgeable perl users would have
the chance to comment on your perl code. However, it seems you're much
happier being the resident perl expert in an awk newsgroup.



Wed, 17 Dec 2003 03:44:42 GMT  
 