How to remove duplicate lines by first column?

Vallo> I have a newsgroups file that I want to correct using sed. Actually,
Vallo> it's already done using sed, awk, uniq and ed, but the result is not
Vallo> perfect :)
Vallo> The file format is simple: the newsgroup name, followed by at least one
Vallo> space or tab (or both), then the newsgroup topic, which is free-form
Vallo> text until the end of the line. There are lots of duplicate lines,
Vallo> duplicate in the sense of newsgroup name, not topic. I want to remove
Vallo> the duplicate lines; sorting and formatting afterwards is simple. E.g.:

Vallo> comp.os.linux.announce  Announcements blabla (Moderated)
Vallo> comp.os.linux.announce  Announcements blabla (Moderated) (Moderated)
Vallo> comp.os.linux.x               blabla.
Vallo> comp.os.linux.x               blabla

Vallo> I've got to the stage where the output contains unique newsgroup
Vallo> lines without the topic tail, but I can't get my mind around holding,
Vallo> restoring and printing the original line if the line is unique.
Vallo> What is better for such manipulation: sed, awk, perl, whatever?

I don't know about "better", but one line in Perl is gonna be hard to beat:

  perl -ane 'print unless $seen{$F[0]}++' <input >output

That'll print the first one.  If you want to print the last one instead,
and you don't mind sorting the output by newsgroup name:

  perl -ane '$item{$F[0]} = $_; END { print $item{$_} for sort keys %item }' \
    <input >output

If you want to keep the surviving lines in their original file order, still
printing the last one of each:

  perl -ane '$item{$F[0]} = $_; $line{$F[0]} = $.;' \
    -e 'END {print $item{$_} for sort {$line{$a} <=> $line{$b}} keys %item}' \
    <input >output

I bet you can do all three of these in awk.  The first one will
probably take fewer keystrokes in awk, but the last two most certainly
will take more.  sed won't have enough state memory to do this
conveniently.
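
To make the awk bet concrete, here is one possible sketch of the three variants, assuming a POSIX awk and sort(1). The function and array names are invented for illustration, and this is only one way to do it, not a claim about the shortest awk.

```shell
# First line per name: the pattern is true only the first time $1 is seen.
awk_first() {
  awk '!seen[$1]++'
}

# Last line per name, sorted by name. POSIX awk has no built-in sort,
# so the output is piped through sort(1) on the first field.
awk_last_sorted() {
  awk '{ item[$1] = $0 } END { for (k in item) print item[k] }' |
    LC_ALL=C sort -k1,1
}

# Last line per name, in input order. Lines are stored by line number;
# when a name shows up again, its earlier line is dropped.
awk_last_ordered() {
  awk '{ if ($1 in last) delete keep[last[$1]]; last[$1] = NR; keep[NR] = $0 }
       END { for (i = 1; i <= NR; i++) if (i in keep) print keep[i] }'
}
```

Each function reads stdin and writes stdout, so usage mirrors the Perl versions, e.g. "awk_first <input >output".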

print "Just another Perl hacker,"

--
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095

Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!



Wed, 18 Jun 1902 08:00:00 GMT  
 