awk or sed: basic? question 
 awk or sed: basic? question

I am trying to get started with awk and sed.  As an exercise with an
unstructured file (i.e., fields not clearly defined, target sometimes in the
third field sometimes in the second, etc) I was trying to extract all email
addresses from a file (this is definitely not with the intention of
[...] how
do I get either sed or awk to output the result of the matching pattern
only and not the entire line.  At first I thought this would be a simple
task, but I have been going at this for hours now and am beginning to wonder
if you have to know the exact field ($n) in order to extract this type of
info with sed or awk.  I am not very strong in programming so this has
become quite a task.  Any comments would be most appreciated.  

Best Regards,
Ed Westin



Sat, 16 Mar 2002 03:00:00 GMT  
 awk or sed: basic? question


Quote:
> I am trying to get started with awk and sed.  As an exercise with an
> unstructured file (i.e., fields not clearly defined, target sometimes in the
> third field sometimes in the second, etc) I was trying to extract all email
> addresses from a file (this is definitely not with the intention of
> [...] how
> do I get either sed or awk to output the result of the matching pattern
> only and not the entire line.  At first I thought this would be a simple
> task, but I have been going at this for hours now and am beginning to wonder
> if you have to know the exact field ($n) in order to extract this type of
> info with sed or awk.  I am not very strong in programming so this has
> become quite a task.  Any comments would be most appreciated.

man awk
Look for substr(), match(), RSTART and RLENGTH.
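
A minimal sketch of that hint, assuming a deliberately loose address pattern (letters, digits, underscore, hyphen and period on both sides of an at sign) rather than anything RFC-complete. The function name extract_emails and the sample input are illustrative only. match() sets RSTART and RLENGTH, and substr() then pulls out just the matched text, so no field number is needed:

```shell
# Print only the text that matches, not the whole line.
# The pattern is an assumption for illustration, not a full address grammar.
extract_emails() {
    awk '{
        line = $0
        # match() returns nonzero on success and sets RSTART and RLENGTH
        while (match(line, /[A-Za-z0-9._-]+@[A-Za-z0-9._-]+/)) {
            print substr(line, RSTART, RLENGTH)
            # keep searching in whatever follows this match
            line = substr(line, RSTART + RLENGTH)
        }
    }' "$@"
}

# Example use on one line of text read from standard input
printf 'contact: joe@example.com or (sue_b@example.org) today\n' | extract_emails
```

The loop is what lets a single line yield more than one address; a bare match() call would report only the first one.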

If you don't run on Unix maybe you can start here:
http://www.cl.cam.ac.uk/texinfodoc/gawk_11.html#SEC110

Regards,
/Peter
--
-= Spam safe(?) e-mail address: pez68 at netscape.net =-




Sun, 17 Mar 2002 03:00:00 GMT  
 awk or sed: basic? question

Quote:

>I am trying to get started with awk and sed.  As an exercise with an
>unstructured file (i.e., fields not clearly defined, target sometimes in the
>third field sometimes in the second, etc) I was trying to extract all email
>addresses from a file (this is definitely not with the intention of

>do I get either sed or awk to output the result of the matching pattern
>only and not the entire line.  At first I thought this would be a simple
>task, but I have been going at this for hours now and am beginning to wonder
>if you have to know the exact field ($n) in order to extract this type of
>info with sed or awk.  I am not very strong in programming so this has
>become quite a task.  Any comments would be most appreciated.  

This is a much more complicated task than it may at first seem.

The first thing you must determine is what form an email address
may take.  This is defined in one of the RFC's, but IIRC, an
address _MAY_ include spaces inside double quotes.

After you've found the forms that a valid address may take, then
you can eliminate all characters that may not be in a valid address.

If one were to assume that a proper email address could contain
only alphanumeric characters, underscore, hyphen, at sign, and period,
but not space, and must contain an at sign, then it is more
straightforward, something like:



tr ' ' '\012' |
sed -e 's/\.*$//' |

the functions of the lines are:

1 get only lines containing the at sign
2 change all unusable characters to spaces
3 change all spaces to newlines
4 remove periods at the end of lines
5 print only lines containing at signs
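
Some of the command lines above did not survive in this archive, so what follows is only a sketch of a pipeline that matches the five numbered steps. The first, second and last stages are assumptions built from the description, not the original commands; the function name emails_pipeline is hypothetical, and the character set already includes the underscore:

```shell
# Five stages, one per numbered step in the description.
# Stages 1, 2 and 5 are assumed from the description only.
emails_pipeline() {
    # 1: keep only lines containing an at sign
    grep '@' "$@" |
    # 2: turn every character outside the allowed set (and newline) into a space
    tr -c 'A-Za-z0-9_@.\012-' ' ' |
    # 3: turn spaces into newlines, giving one candidate per line
    tr ' ' '\012' |
    # 4: strip periods at the end of each line
    sed -e 's/\.*$//' |
    # 5: keep only candidates that still contain an at sign
    grep '@'
}

# Example use on text read from standard input
printf 'Mail joe@example.com, or <sue@example.org>.\nno address here\n' | emails_pipeline
```

Because stage 2 replaces everything outside the allowed set, tabs are converted along with punctuation.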

As I said, this is not a complete solution, but it shows a general
starting approach.

It could be simplified, BTW, I just didn't bother.  :-)

Chuck Demas
Needham, Mass.

--
  Eat Healthy    |   _ _   | Nothing would be done at all,

  Die Anyway     |    v    | That no one could find fault with it.



Sun, 17 Mar 2002 03:00:00 GMT  
 awk or sed: basic? question

Quote:



>>I am trying to get started with awk and sed.  As an exercise with an
>>unstructured file (i.e., fields not clearly defined, target sometimes in the
>>third field sometimes in the second, etc) I was trying to extract all email
>>addresses from a file (this is definitely not with the intention of

>>do I get either sed or awk to output the result of the matching pattern
>>only and not the entire line.  At first I thought this would be a simple
>>task, but I have been going at this for hours now and am beginning to wonder
>>if you have to know the exact field ($n) in order to extract this type of
>>info with sed or awk.  I am not very strong in programming so this has
>>become quite a task.  Any comments would be most appreciated.  

>This is a much more complicated task than it may at first seem.

>The first thing you must determine is what form an email address
>may take.  This is defined in one of the RFC's, but IIRC, an
>address _MAY_ include spaces inside double quotes.

>After you've found the forms that a valid address may take, then
>you can eliminate all characters that may not be in a valid address.

>If one were to assume that a proper email address could contain
>only alphanumeric characters, underscore, hyphen, at sign, and period,
>but not space, and must contain an at sign, then it is more
>straightforward, something like:



>tr ' ' '\012' |
>sed -e 's/\.*$//' |

>the functions of the lines are:

>1 get only lines containing the at sign
>2 change all unusable characters to spaces
>3 change all spaces to newlines
>4 remove periods at the end of lines
>5 print only lines containing at signs

>As I said, this is not a complete solution, but it shows a general
>starting approach.

>It could be simplified, BTW, I just didn't bother.  :-)

I left the underscore out of line 2 above.  It should be:



tr ' ' '\012' |
sed -e 's/\.*$//' |

Chuck Demas
Needham, Mass.




Sun, 17 Mar 2002 03:00:00 GMT  
 awk or sed: basic? question

Quote:
Demas) writes:

...

Quote:
>If one were to assume that a proper email address could contain
>only alphanumeric characters, underscore, hyphen, at sign, and period,
>but not space, and must contain an at sign, then it is more
>straightforward, something like:



>tr ' ' '\012' |
>sed -e 's/\.*$//' |

>the functions of the lines are:

>1 get only lines containing the at sign
>2 change all unusable characters to spaces
>3 change all spaces to newlines
>4 remove periods at the end of lines
>5 print only lines containing at signs

>As I said, this is not a complete solution, but it shows a general
>starting approach.

>It could be simplified, BTW, I just didn't bother.  :-)

What about tabs?


I think (i.e., untested) this could be simplified to



Quote:
}' infile

which has the advantage of being just a single process. And if the original
poster needs to handle double-quoted names with embedded spaces, sed won't
suffice.


Sun, 17 Mar 2002 03:00:00 GMT  
 awk or sed: basic? question

Quote:


>Demas) writes:
>...
>>If one were to assume that a proper email address could contain
>>only alphanumeric characters, underscore, hyphen, at sign, and period,
>>but not space, and must contain an at sign, then it is more
>>straightforward, something like:



>>tr ' ' '\012' |
>>sed -e 's/\.*$//' |

>>the functions of the lines are:

>>1 get only lines containing the at sign
>>2 change all unusable characters to spaces
>>3 change all spaces to newlines
>>4 remove periods at the end of lines
>>5 print only lines containing at signs

>>As I said, this is not a complete solution, but it shows a general
>>starting approach.

>>It could be simplified, BTW, I just didn't bother.  :-)

>What about tabs?


>I think (i.e., untested) this could be simplified to



>}' infile

>which has the advantage of being just a single process. And if the original
>poster needs to handle double-quoted names with embedded spaces, sed won't
>suffice.

But, you didn't eliminate all the characters that aren't allowed in the
email address in your script.  

It won't print anything for this line:


Because you looked for a field starting with only certain characters.
Adding a gsub will solve that.  :-)


           for (f = 1; f <= NF; ++f){

                   sub(/[.]$/, "", $f); print $f }}}' infile

though this might be simpler:


           for (f = 1; f <= NF; ++f){

                   sub(/[.]$/, "", $f);
                   print $f }}}' infile
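
The opening lines of both scripts above are missing from this archive. As a rough sketch of the single-process idea, assuming the same allowed character set as before, a gsub() on the whole record can blank out the disallowed characters before the field loop runs. The gsub() pattern, the if test and the function name emails_awk are assumptions, not the original text:

```shell
# One awk process: clean the record, then print each field with an at sign.
# The gsub() pattern and the if test are assumed, not recovered from the post.
emails_awk() {
    awk '/@/ {
        # replace every character outside the assumed address set with a space
        gsub(/[^A-Za-z0-9_@.-]/, " ")
        # changing $0 makes awk re-split it, so NF and $f reflect the cleanup
        for (f = 1; f <= NF; ++f) {
            if ($f ~ /@/) {
                sub(/[.]+$/, "", $f)    # drop trailing periods
                print $f
            }
        }
    }' "$@"
}

# Example use on text read from standard input
printf 'Write to <joe@example.com>, or sue@example.org.\tthanks\n' | emails_awk
```

Since the default field splitting treats tabs and spaces alike, and gsub() turns other separators into spaces first, an address wrapped in brackets or followed by a comma still ends up as its own field.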

Chuck Demas
Needham, Mass.




Sun, 17 Mar 2002 03:00:00 GMT  
 awk or sed: basic? question
Charles, Harlan, and Pete:

Thank you very much for your help. You have given me excellent ideas for
dealing with this and similar tasks in sed and awk.  Instinct, and the
extremely powerful feel of these tools, sort of tells me that it is time
well spent to get at least a basic mastery of them before continuing with
Perl, which I had been using for about two months.

Much Obliged,
Ed Westin



Sun, 17 Mar 2002 03:00:00 GMT  
 awk or sed: basic? question

Quote:
Demas) writes:

<snip>

Quote:
>But, you didn't eliminate all the characters that aren't allowed in the
>email address in your script.  

>It won't print anything for this line:


>Because you looked for a field starting with only certain characters.
>Adding a gsub will solve that.  :-)


>       for (f = 1; f <= NF; ++f){

>               sub(/[.]$/, "", $f); print $f }}}' infile

>though this might be simpler:


>       for (f = 1; f <= NF; ++f){

>               sub(/[.]$/, "", $f);
>               print $f }}}' infile

Actually it would be easier still to make all characters that can't occur in an


position (or can they?), then make the RECORD separator RS =


Now for an academic question: can perl's $\ be a regexp? The camel book says it
can be multicharacter, but it doesn't say it can be a regexp.



Mon, 18 Mar 2002 03:00:00 GMT  
 awk or sed: basic? question

Quote:


> Demas) writes:
> <snip>
> Now for an academic question: can perl's $\ be a regexp? The camel book says it
> can be multicharacter, but it doesn't say it can be a regexp.

...I think you have your slash backwards...

perl's $/ (which is aliased to $RS if use()ing English) can be multi-
character, but is not a regexp.

perl's $\ ($ORS) is multicharacter, but, being just output, is also not
a regex.

from perldoc perlvar:

Remember: the value of $/ is a string, not a regexp.  AWK has to be
better for something :-)
                                        --Larry Wall

Colin DeVilbiss



Mon, 18 Mar 2002 03:00:00 GMT  
 awk or sed: basic? question

Quote:


>Demas) writes:

><snip>

>>But, you didn't eliminate all the characters that aren't allowed in the
>>email address in your script.  

>>It won't print anything for this line:


>>Because you looked for a field starting with only certain characters.
>>Adding a gsub will solve that.  :-)


>>           for (f = 1; f <= NF; ++f){

>>                   sub(/[.]$/, "", $f); print $f }}}' infile

>>though this might be simpler:


>>           for (f = 1; f <= NF; ++f){

>>                   sub(/[.]$/, "", $f);
>>                   print $f }}}' infile

>Actually it would be easier still to make all characters that can't occur in an


>position (or can they?), then make the RECORD separator RS =



For the record, I tried the above one-liner on a sample file, and it
didn't work.  I didn't bother to try to figure out why.

Chuck Demas
Needham, Mass.




Tue, 19 Mar 2002 03:00:00 GMT  
 awk or sed: basic? question

Quote:
Demas) writes:

>For the record, I tried the above one-liner on a sample file, and it
>didn't work.  I didn't bother to try to figure out why.

And further for the record, I tried it on the e-mail cc Chuck Demas sent me,
and it pulled all the 'e-mail addresses', which, given the definition above,
included parts of the message IDs.

It almost seems Chuck doesn't want this to work.

This _might_ be a functionality difference between Win32 and unix versions of
gawk. I ran this under Windows95 on DOS text files (CR-LF line termination
sequence in the file itself). If this is a portability issue, it might be nice
to confirm.
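
One way to take line endings out of the comparison, offered only as a sketch: strip a trailing carriage return from each line before running the extraction on either platform. The function name strip_cr is hypothetical, and whether CR-LF is actually behind the difference reported here is not established:

```shell
# Remove a trailing carriage return from every line, so DOS (CR-LF)
# and Unix (LF) text files look the same to later stages.
strip_cr() {
    awk '{ sub(/\r$/, ""); print }' "$@"
}

# Example use: a DOS-style line comes out without the carriage return
printf 'joe@example.com\r\n' | strip_cr
```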



Tue, 19 Mar 2002 03:00:00 GMT  
 awk or sed: basic? question


Quote:
% Demas) writes:

%

% >
% >For the record, I tried the above one-liner on a sample file, and it
% >didn't work.  I didn't bother to try to figure out why.
%
% And further for the record, I tried it on the e-mail cc Chuck Demas sent me,
% and it pulled all the 'e-mail addresses', which, given the definition above,
% included parts of the message IDs.

This worked OK on my inbox using gawk and mawk, but not using nawk. nawk
doesn't seem to like [] in RS, although it does accept some REs.
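
The one-liner itself is missing from this archive. As a sketch of the general idea only, assuming the same allowed character set used earlier in the thread, the record separator can be set to a regular expression that matches any run of characters that cannot appear in an address, so each candidate arrives as its own record. The function name emails_rs is hypothetical. A regular-expression RS works in gawk and mawk, but POSIX leaves a multi-character RS unspecified, so other awks may treat it differently:

```shell
# Treat every run of disallowed characters (newlines included) as the
# record separator, then print the records that contain an at sign.
# Assumes an awk that accepts a regular expression in RS (gawk, mawk).
emails_rs() {
    awk 'BEGIN { RS = "[^A-Za-z0-9_@.-]+" }
         /@/   { sub(/[.]+$/, ""); print }' "$@"
}

# Example use on text read from standard input
printf 'Write to <joe@example.com>, or sue@example.org.\n' | emails_rs
```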

--

Patrick TJ McPhee
East York  Canada



Tue, 19 Mar 2002 03:00:00 GMT  
 awk or sed: basic? question


Quote:



>% Demas) writes:
>%

>% >
>% >For the record, I tried the above one-liner on a sample file, and it
>% >didn't work.  I didn't bother to try to figure out why.
>%
>% And further for the record, I tried it on the e-mail cc Chuck Demas sent me,
>% and it pulled all the 'e-mail addresses', which, given the definition above,
>% included parts of the message IDs.

>This worked OK on my inbox using gawk and mawk, but not using nawk. nawk
>doesn't seem to like [] in RS, although it does accept some REs.

My usual shell account (at TIAC.NET) has an older version
of gawk installed:

Gnu Awk (gawk) 2.15, patchlevel 6

The script in question does not work as desired on that version.

On another shell account, at a different ISP, I have access to a
later version of gawk, version 3.0.3, and the script works just fine
on that version.

Unfortunately, I have no real control over what is available or
installed at TIAC.NET, so I cannot just "upgrade it" to
the latest version.  :~(

Chuck Demas
Needham, Mass.




Tue, 19 Mar 2002 03:00:00 GMT  
 