Newbie awk (sed??) question, regular expressions 

Hi All,

I have spent a day on this and got nowhere so I thought it was time to ask the
experts.

I have a CSV file in this format

ABA,960701,5.331,5.39,5.331,5.36,509500
ABAWCE,960701,1.76,1.76,1.76,1.76,0
ABRO,960701,0.065,0.065,0.065,0.065,0
....many more lines.......

The second field is a 6 digit date and I would like to change it to an 8 digit
format like this

ABA,19960701,5.331,5.39,5.331,5.36,509500
ABAWCE,19960701,1.76,1.76,1.76,1.76,0
ABRO,19960701,0.065,0.065,0.065,0.065,0

I tried something along the lines of sed -e
's/[A-Z][A-Z][A-Z],96/[A-Z][A-Z][A-Z],1996/' test.csv

As you can see (now that you have stopped ROFL) I am a newbie.

I would like to work it out myself if you can provide a suitable hint.

I have The AWK Programming Language (Aho, Kernighan, Weinberger)
and sed & awk (Dougherty, Robbins; O'Reilly)

Thanks for any help,

Alice



Sun, 23 Oct 2005 07:41:21 GMT  

Best is:

sed -e 's/\([^,]*,\)96/\11996/' test.csv

but this would probably work:

sed -e 's/,96/,1996/' test.csv

man sed

Chuck Demas

--
  Eat Healthy        |   _ _   | Nothing would be done at all,

  Die Anyway         |    v    | That no one could find fault with it.



Sun, 23 Oct 2005 08:11:24 GMT  
Hi  Chuck,

Thanks for the instant response. :)

I am pretty sure this one

sed -e 's/,96/,1996/' test.csv

would mess up my data because I may have   ,96   in another field.

If you have a minute could you quickly describe what the first one does?

sed -e 's/\([^,]*,\)96/\1996/' test.csv

particularly this part

\([^,]*,\)

Specifically, what I need is something that will look for   96    but it must
be preceded by at least 3 letters and a comma, and then change that 96 to
1996.

Thanks again,

Alice






Sun, 23 Oct 2005 08:29:22 GMT  

Alice,
If all you want to do is add a "19" to the second field, in awk you can just
use:

BEGIN {FS = "," ; OFS = ","}
{
$2 = "19" $2
print $0
}

Quote:

> I tried something along the lines of sed -e
> 's/[A-Z][A-Z][A-Z],96/[A-Z][A-Z][A-Z],1996/' test.csv

> As you can see (now that you have stopped ROFL) I am a newbie.

Don't feel bad at all, I am about 2 days ahead of you.

Quote:

> I would like to work it out myself if you can provide a suitable hint.

I assume you will need to add different digits (i.e. "20") for dates in the
21st century, if you want your script to be Y2K compliant. As a hint, you
could use:

substr($2,1,1)

to test if the first digit of the second field is a "9" and add the
appropriate extra digits.
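Tug's hint could be fleshed out roughly as follows. This is only a sketch: the function name add_century is made up for illustration, the second sample line (a year-2000 date) is invented, and it assumes that a leading "9" always means 19xx and anything else means 20xx.

```shell
# Hypothetical helper: widen the YYMMDD second field to YYYYMMDD.
# Assumption: years 90-99 are 19xx; every other year is 20xx.
add_century() {
    awk 'BEGIN { FS = ","; OFS = "," }
    {
        # substr($2, 1, 1) is the first character of the date field
        century = (substr($2, 1, 1) == "9") ? "19" : "20"
        $2 = century $2
        print
    }'
}

# The second input line is an invented example of a post-1999 date.
printf '%s\n' 'ABA,960701,5.331,5.39,5.331,5.36,509500' \
              'ABA,000703,5.331,5.39,5.331,5.36,509500' | add_century
# ABA,19960701,5.331,5.39,5.331,5.36,509500
# ABA,20000703,5.331,5.39,5.331,5.36,509500
```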

Good luck!
Tug




Sun, 23 Oct 2005 08:39:47 GMT  

Try this:

sed -e 's/^\(.[^,]*,\)\([1-9]\)/\119\2/' \
    -e 's/^\(.[^,]*,\)\([0]\)/\120\2/'   \
    test.csv

The first regular expression looks for the 2nd field to start with a
digit from 1 through 9 and prepends 19 to the date.  The
2nd RE looks for dates starting with 0 and prepends 20 to the date.

The first \( ... \) remembers what its RE matched (the first field, including
the first comma).  The second \( ... \) remembers the first character of the
date.  On the substitution side of the s/// expression, the \1
substitutes the string found by the first \( ... \), then comes a 19 or a 20
depending on which RE matched the date, followed by \2, which
inserts the first digit of the original date remembered by the 2nd \( ... \).
Since we did not match any of the rest of the input line, it is left
unchanged, and there is no need to include it in the substitution.
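To see the two expressions at work, here is a small sketch that feeds two lines through the same pair of substitutions. The wrapper name widen_dates is invented, and the second input line (a date starting with 0) is a made-up example rather than data from the original file.

```shell
# Hypothetical wrapper around the two substitutions described above.
widen_dates() {
    sed -e 's/^\(.[^,]*,\)\([1-9]\)/\119\2/' \
        -e 's/^\(.[^,]*,\)\([0]\)/\120\2/'
}

# First line: date starts with 9, so only the first expression applies.
# Second line (invented): date starts with 0, so only the second applies.
printf '%s\n' 'ABA,960701,5.331' 'ABRO,000703,0.065' | widen_dates
# ABA,19960701,5.331
# ABRO,20000703,0.065
```

Because the first expression turns the date into one starting with 1, the second expression (which wants a 0) cannot fire on a line the first one has already changed.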

Give this a try.

And if you want to try awk (because you got the awk/sed book :-)

awk '
    BEGIN{FS=","; OFS=","}
    $2 >= 100000 {$2 = 19 $2; print; next}   # handles 1910 to 1999
    $2 <  100000 {$2 = 20 $2; print; next}   # handles 2000 to 2009
' test.csv

                                        Bob Harris



Sun, 23 Oct 2005 09:17:53 GMT  
Thanks Everyone,

Bob's sed solution was the one that worked first go.  The whole exercise has
been very educational.

And Bob, thanks for the mini tutorial, now I have a concept that will help me
in the future.

Regards,

Alice




Sun, 23 Oct 2005 09:23:47 GMT  
A quick follow up

Worked perfectly.

I have 5.5 million lines of data (213MB).

The sed solution took 2.5 minutes and the awk solution 3.2 minutes (Celeron
450/384MB).

One funny thing, the awk solution appended a ^M (carriage return?) to the end
of each line.

We live in an amazing period!!




Sun, 23 Oct 2005 09:54:09 GMT  

% If you have a minute could you quickly describe what the first one does?
%
% sed -e 's/\([^,]*,\)96/\1996/' test.csv

There are a couple of mistakes in this (the pattern isn't anchored with ^,
and \1996 should be \11996). I've seen your final follow-up,
but for interest's sake, this should be something like

  sed -e 's/^\([^,]*,\)96/\11996/' test.csv

What it does is find, on each line, the match for this regular expression

  ^[^,]*,96

and replace the 96 with 1996. In the regular expression

  ^    matches the start of an input line
  [^,] matches any character except a comma
  *    repeats the preceding item, so [^,]* matches zero or more
       characters that are not commas
  ,    matches a comma
  9    matches 9
  6    matches 6

  \(   starts a group or sub-expression. This doesn't affect the match
       but it allows grouping for the purpose of Kleene closure (the * thing)
       and it allows the thing matched by the sub-expression to be referred
       to later
  \)   ends a group or sub-expression

In the replacement string

  \1   is replaced by the thing that matched the 1st sub-expression
  1996 plays itself
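A quick way to see why the ^ matters is to run both forms on a line whose date does not start with 96 but which has a later field that does. The sketch below does that; the function names and the second sample line are invented for illustration.

```shell
# Hypothetical wrappers: the same substitution without and with the ^ anchor.
fix_unanchored() { sed -e 's/\([^,]*,\)96/\11996/'; }
fix_anchored()   { sed -e 's/^\([^,]*,\)96/\11996/'; }

# Normal case: the second field starts with 96.
printf '%s\n' 'ABA,960701,5.331' | fix_anchored
# ABA,19960701,5.331

# Invented case: the date starts with 97, but a later field starts with 96.
printf '%s\n' 'ABA,970701,96.5' | fix_unanchored
# ABA,970701,1996.5   (wrong field changed)
printf '%s\n' 'ABA,970701,96.5' | fix_anchored
# ABA,970701,96.5     (line left alone)
```

Without the anchor, sed is free to start the match at the second field, so "970701," becomes the remembered group and the 96 in the third field gets rewritten.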

In general:

  the match and replacement strings don't have to be delimited by /,
   although that's a common convention. You can use any convenient character:
   s,/some/path,/some/other/path,

  sed questions are off-topic in this newsgroup. Also sed answers, mea
   culpa

  awk regular expressions are different from sed ones. \1 is not available
   in replace strings (though I don't understand why), and subexpression
   matching and groups are delimited by () rather than \(\). There are other
   differences not relevant to this example.

  a good reference for things like this is the Single UNIX Specification.
  It's available on-line from http://www.opengroup.org (you have to join
  to get at it, I believe, but there's no charge, just the occasional
  e-mail). Regular expressions are described in the definitions volume.
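On the awk point above: although sub() has no \1, an & in the replacement string stands for the whole matched text, which is enough for this particular job. A minimal sketch, assuming every date belongs in the 1900s (the function name add_19_awk is invented):

```shell
# Hypothetical helper: in awk's sub(), "&" in the replacement is the matched text,
# so "&19" re-emits the first field and its comma, then appends 19.
add_19_awk() {
    awk '{ sub(/^[^,]*,/, "&19"); print }'
}

printf '%s\n' 'ABA,960701,5.331,5.39,5.331,5.36,509500' | add_19_awk
# ABA,19960701,5.331,5.39,5.331,5.36,509500
```

Note that the awk pattern is written between slashes as an extended regular expression, so grouping (had it been needed) would use plain ( ) rather than \( \).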
--

Patrick TJ McPhee
East York  Canada



Sun, 23 Oct 2005 10:58:12 GMT  
Patrick,

This too is fantastic input.

I have just discovered that every 10000 lines (or so :-) I have a carriage
return missing  eg

ZYL,19961105,1.300,1.300,1.300,1.300,6000AAAWMA,961106,0.265,0.285,0.265,0.285,85000
AAA,19961106,2.400,2.430,2.350,2.400,515065

which should be

ZYL,19961105,1.300,1.300,1.300,1.300,6000
AAAWMA,961106,0.265,0.285,0.265,0.285,85000
AAA,19961106,2.400,2.430,2.350,2.400,515065

Based on the extra info you have supplied I believe I can work out how to fix
that one by myself.  I will search for a digit followed by a letter and add a
<CR>.
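One possible shape for that fix, offered only as a sketch: it assumes that a digit immediately followed by a capital letter never occurs inside a genuine record, and that the missing character is a plain newline. The function name split_joined is invented.

```shell
# Hypothetical helper: break a line wherever a digit is directly followed
# by an uppercase letter. A backslash followed by a real newline in the
# replacement is the portable way to make sed emit a newline.
split_joined() {
    sed -e 's/\([0-9]\)\([A-Z]\)/\1\
\2/g'
}

printf '%s\n' 'ZYL,19961105,1.300,1.300,1.300,1.300,6000AAAWMA,961106,0.265,0.285,0.265,0.285,85000' |
    split_joined
# ZYL,19961105,1.300,1.300,1.300,1.300,6000
# AAAWMA,961106,0.265,0.285,0.265,0.285,85000
```

The second half of the split line still carries a 6 digit date, so this step would need to run before the date conversion.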

Humble apologies for the off topic sed post in the awk conference.

Alice





Sun, 23 Oct 2005 11:21:57 GMT  

Quote:

> Hi  Chuck,

> Thanks for the instant response. :)

> I am pretty sure this one

> sed -e 's/,96/,1996/' test.csv

> would mess up my data because I may have   ,96   in another field

No it wouldn't. Without a 'g' flag at the end of the s command, sed replaces only
the first match on each line. This version:

    sed -e 's/,96/,1996/g' test.csv

would mess up your data. You could even just use this to add a 19 after the first
comma:

    sed -e 's/,/,19/' test.csv

and it'd work if all your dates are in the 19* range.
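The difference the g flag makes can be checked on a single line. In this sketch the function names are invented and the sample line, with a later field that happens to start with 96, is made up.

```shell
# Hypothetical wrappers: the same substitution without and with the g flag.
first_only()  { sed -e 's/,96/,1996/'; }
every_match() { sed -e 's/,96/,1996/g'; }

line='ABA,960701,5.331,96.5'   # invented sample; last field starts with 96

printf '%s\n' "$line" | first_only
# ABA,19960701,5.331,96.5
printf '%s\n' "$line" | every_match
# ABA,19960701,5.331,1996.5
```

The version without g is only safe as long as the date itself starts with 96; on a line whose date starts with something else, the first ,96 it finds would be in a later field.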

    Ed.



Mon, 24 Oct 2005 02:32:54 GMT  

Quote:

> A quick follow up

> Worked perfectly.

> I have 5.5 million lines of data (213MB).

> The sed solution took 2.5 minutes and the awk solution 3.2 minutes (Celeron
> 450/384MB).

If you feel like trying it, I'd be interested in hearing how long it takes with the
simpler sed script:

    sed -e 's/,0/,200/' -e 's/,\([1-9]\)/,19\1/' test.csv

Regards,

    Ed.



Mon, 24 Oct 2005 02:59:50 GMT  


<snip>

Quote:
> If you feel like trying it, I'd be interested in hearing how long it takes with the
> simpler sed script:

>     sed -e 's/,0/,200/' -e 's/,\([1-9]\)/,19\1/' test.csv

Never mind - that script wouldn't quite work as the second pattern would
match the ,200 substituted for the first one and I'm getting kinda OT
anyway. Sorry 'bout that...
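The problem Ed describes is easy to reproduce. The sketch below runs the quoted script on one invented line with a year-2000 date and on one line shaped like the original data; the function name unanchored_pair is made up.

```shell
# Hypothetical wrapper around the two-expression script quoted above.
unanchored_pair() {
    sed -e 's/,0/,200/' -e 's/,\([1-9]\)/,19\1/'
}

# Invented year-2000 line: the first expression yields ,200..., and the
# second then matches that ,2 and inserts a 19 in front of it.
printf '%s\n' 'ABA,000703,5.331' | unanchored_pair
# ABA,1920000703,5.331

# A 1996 line with a price starting with 0: the first expression hits the
# price field instead of the date.
printf '%s\n' 'ABRO,960701,0.065' | unanchored_pair
# ABRO,19960701,200.065
```

Both failures come from the patterns not being tied to the second field, which is what the ^\(.[^,]*,\) prefix in the earlier anchored script takes care of.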

    Ed



Mon, 24 Oct 2005 07:30:15 GMT  
 