!x[$0]++ 
 !x[$0]++

Just double checking, for my files, did we all agree that this was the best
solution to the problem?

To review:
        1) The problem: Uniquify a file - print only the first occurrence
           of each line, dropping any later duplicates.
        2) The solution: !x[$0]++
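
(For reference, a quick demonstration on made-up input: x[$0] is zero
the first time a line is seen, so !x[$0] is true and awk's default
action prints the line; the ++ then marks it as seen.)

    $ printf 'a\nb\na\nc\nb\n' | awk '!x[$0]++'
    a
    b
    c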

Have I got it all right?



Fri, 31 Jan 2003 03:00:00 GMT  
 !x[$0]++

Quote:

>Just double checking, for my files, did we all agree that this was the best
>solution to the problem?
>To review:
>        1) The problem: Uniquify a file - print only the first occurrence
>           of each line, dropping any later duplicates.
>        2) The solution: !x[$0]++
>Have I got it all right?

I think I was the one who posted this.
I'm fairly sure this is the shortest way, yes.
It may not be the fastest, though.
IIRC, optimisations to avoid the increment on lines already seen add a
little, but on big files, on some awks, hashing could speed
things up a lot, especially if the array hits swap.

For anything with more than 5K lines, you may want to check that it's OK.
For more than 50K lines, only use it once you are sure you can live with
the CPU and RAM usage.
Over 500K lines, it may start swapping on some boxes, as well as taking
time.

(This depends largely on the number of unique lines.)

(Figures are for gawk 3.0.4 and 128M RAM.)
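
(A sketch of one such optimisation, assuming the aim is to avoid the
increment on lines already seen; 'seen' is just an illustrative name.
On a duplicate line this only tests array membership instead of
incrementing a counter:)

    awk '!($0 in seen) { print; seen[$0] = 1 }' file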

--

---------------------------+-------------------------+--------------------------
Q: What do you call a train that doesn't stop at stations?
A: Thomas the Bastard.                                                -- Ben



Tue, 29 Apr 2003 09:22:04 GMT  
 !x[$0]++
Of course (on *nix), sort -u will give unique lines on the
selected keys, and will work for any size file (up
to available disk space, of course), but will re-order
the file.
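
(A quick illustration of the re-ordering, on made-up input:)

    $ printf 'b\na\nb\n' | sort -u
    a
    b
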
Just a thought.
Jennifer


Tue, 29 Apr 2003 09:32:21 GMT  
 !x[$0]++

Quote:

>Of course (on *nix), sort -u will give unique lines on the
>selected keys, and will work for any size file (up
>to available disk space, of course), but will re-order
>the file.

And if you don't want to have it reordered, you can use the 'uniq'
command. Good for log analysis, especially with the '-c' flag which
prepends each unique line with a count of how many times it appears.
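
(For example, on a made-up log fragment; note the counts are per run
of adjacent identical lines:)

    $ printf 'GET /a\nGET /a\nGET /b\nGET /a\n' | uniq -c
          2 GET /a
          1 GET /b
          1 GET /a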

/Viktor...



Tue, 29 Apr 2003 03:00:00 GMT  
 !x[$0]++


<snip>

Quote:
>And if you don't want to have it reordered, you can use the 'uniq'
>command. Good for log analysis, especially with the '-c' flag which
>prepends each unique line with a count of how many times it appears.

What do you mean by 'reordered'? My uniq man page includes the
following near the top of the description:

"uniq prints the unique lines in a sorted file"

So to get anything useful from a general text file passed through uniq,
it's necessary to sort it (thus reorder it) first. Maybe you have a
nonstandard uniq?




Tue, 29 Apr 2003 03:00:00 GMT  
 !x[$0]++

Quote:

>>And if you don't want to have it reordered, you can use the 'uniq'
>>command. Good for log analysis, especially with the '-c' flag which
>>prepends each unique line with a count of how many times it appears.
>What do you mean by 'reordered'? My uniq man page includes the
>following near the top of the description:
>"uniq prints the unique lines in a sorted file"
>So to get anything useful from a general text file passed through uniq,
>it's necessary to sort it (thus reorder it) first. Maybe you have a
>nonstandard uniq?

My posting was in reply to another post that suggested using 'sort
-u', and said that the reordering thus made might be unwanted. My
suggestion about using uniq was a way to avoid that reordering
(i.e. sorting) of the file. This is, as you write, because uniq will
not sort the file in any way.

This might be very useful if you want to count runs of identical
rows that follow one another, rather than the number of unique
rows in the entire file.

I was, however, perhaps unclear in that I was giving a solution to a
different problem than the one originally stated, which I hope has not
caused serious confusion among readers.

/Viktor...



Wed, 30 Apr 2003 03:00:00 GMT  
 !x[$0]++
Yes, you are right, uniq doesn't sort the file - however,
if the file isn't sorted, it doesn't drop out the duplicates.
Harlan quoted the man page on this, but you should check it
yourself.
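
(A quick check on made-up input - the non-adjacent duplicate survives:)

    $ printf 'a\nb\na\n' | uniq
    a
    b
    a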

Jennifer
--

Quote:


> >>And if you don't want to have it reordered, you can use the 'uniq'
> >>command. Good for log analysis, especially with the '-c' flag which
> >>prepends each unique line with a count of how many times it appears.

> >What do you mean by 'reordered'? My uniq man page includes the
> >following near the top of the description:

> >"uniq prints the unique lines in a sorted file"

> >So to get anything useful from a general text file passed through uniq,
> >it's necessary to sort it (thus reorder it) first. Maybe you have a
> >nonstandard uniq?

> My posting was in reply to another post that suggested using 'sort
> -u', and said that the reordering thus made might be unwanted. My
> suggestion about using uniq was a way to avoid that reordering
> (i.e. sorting) of the file. This is, as you write, because uniq will
> not sort the file in any way.

> This might be very useful if you want to count runs of identical
> rows that follow one another, rather than the number of unique
> rows in the entire file.

> I was, however, perhaps unclear in that I was giving a solution to a
> different problem than the one originally stated, which I hope has not
> caused serious confusion among readers.

> /Viktor...



Fri, 02 May 2003 09:52:26 GMT  
 !x[$0]++

Quote:




> > > > And if you don't want to have it reordered, you can use the 'uniq'
> > > > command. Good for log analysis, especially with the '-c' flag which
> > > > prepends each unique line with a count of how many times it appears.

> > > What do you mean by 'reordered'? My uniq man page includes the
> > > following near the top of the description:

> > > "uniq prints the unique lines in a sorted file"

> > > So to get anything useful from a general text file passed through uniq,
> > > it's necessary to sort it (thus reorder it) first. Maybe you have a
> > > nonstandard uniq?

> > My posting was in reply to another post that suggested using 'sort
> > -u', and said that the reordering thus made might be unwanted. My
> > suggestion about using uniq was a way to avoid that reordering
> > (i.e. sorting) of the file. This is, as you write, because uniq will
> > not sort the file in any way.

> > This might be very useful if you want to count runs of identical
> > rows that follow one another, rather than the number of unique
> > rows in the entire file.

> > I was, however, perhaps unclear in that I was giving a solution to a
> > different problem than the one originally stated, which I hope has not
> > caused serious confusion among readers.

> Yes, you are right, uniq doesn't sort the file - however,
> if the file isn't sorted, it doesn't drop out the duplicates.
> Harlan quoted the man page on this, but you should check it
> yourself.

The uniq command operates exactly the same on sorted and unsorted
input: it removes repeated adjacent lines (or, if invoked with the
-d option, reports which adjacent lines are repeated). There is no
requirement that the input be ordered for uniq to function correctly
and usefully.

The BSD uniq(1) man page is more precise than the one quoted above:

  NAME
       uniq - report or filter out repeated lines in a file

  DESCRIPTION
       The uniq utility reads the standard input comparing adjacent
       lines, and writes a copy of each unique input line to the
       standard output. The second and succeeding copies of identical
       adjacent input lines are not written. Repeated lines in the
       input will not be detected if they are not adjacent, so it
       may be necessary to sort the files first.

Using uniq to "[print] the unique lines in a sorted file" is just
one application of the command's more general functionality. In
his original contribution to this thread (quoted above), Viktor
gave an example of an application of uniq for which the input is
generally not sorted; namely, log file analysis. I often use uniq
to collapse redundant lines in large amounts of input before piping
it to another command (typically awk or Perl) to perform some kind
of analysis on the fewer, unsorted lines.

Indeed, because the need for this

    sort | uniq

is precluded by this

    sort -u

one could argue that the uniq command is more often useful for, as
Viktor suggested, using

    uniq -c

to count how many times each repeated line occurs, or using

    sort | uniq -c

to generate a frequency list, or for operating on unsorted input
to remove repeated adjacent lines.
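
(For instance, on made-up input, with a further sort -rn to order the
frequency list by count:)

    $ printf 'b\na\nb\n' | sort | uniq -c | sort -rn
          2 b
          1 a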

--
Jim Monty

Tempe, Arizona USA



Fri, 02 May 2003 14:38:55 GMT  
 