cat file | sort | uniq | wc -l has different results than my perl grep test 
 cat file | sort | uniq | wc -l has different results than my perl grep test

I need to extract all unique lines from a file in order to populate a
database. The file has many duplicate lines in it (a lot of headers,
and actual data). I have been using (on Red Hat Linux) a bash script
to do the sorting and t{*filter*}:
#cat original | sort | uniq > newfile

And then using a perl script to parse the newfile and populate my
database. I would like to have Perl do the whole operation, and my
research leads me to believe that the grep function will handle this.
_______________________
#!/usr/bin/perl -w
my $counts;

open(LOG,"/tmp/UPLOAD/borrow.report") || die "Cant open access log:
$!\n";



        grep {
                $_ =~ /.*/ and (++$counts{$_} < 2) ;


        print "$_\n" if $_;
}

_________________

However I get a different, smaller count (from wc -l) if I use Perl
than if I use sort | uniq. I can't really afford to lose any data, and
the results are still 4000+ lines, which makes it hard for me to manually
find the discrepancy. Does anyone see anything blatantly wrong about my
script?



Thu, 12 Aug 2004 23:33:52 GMT  
 cat file | sort | uniq | wc -l has different results than my perl grep test

Quote:

> I need to extract all unique lines from a file in order to populate a
> database. The file has many duplicate lines in it (a lot of headers,
> and actual data). I have been using (on Red Hat Linux) a bash script
> to do the sorting and t{*filter*}:
> #cat original | sort | uniq > newfile

> And then using a perl script to parse the newfile and populate my
> database. I would like to have Perl do the whole operation, and my
> research leads me to believe that the grep function will handle this.
> _______________________
> #!/usr/bin/perl -w
> my $counts;

Gives warning -- should be

    my %counts;

Quote:

> open(LOG,"/tmp/UPLOAD/borrow.report") || die "Cant open access log:
> $!\n";



>         grep {
>                 $_ =~ /.*/ and (++$counts{$_} < 2) ;

                  ^^^^^^^^^^^^^^
That part of the expression does nothing and is useless.

Quote:


>         print "$_\n" if $_;
> }
> _________________

> However I get a different, smaller count (from wc -l) if I use Perl
> than if I use sort | uniq. I can't really afford to lose any data, and
> the results are still 4000+ lines, which makes it hard for me to manually
> find the discrepancy. Does anyone see anything blatantly wrong about my
> script?

Other than the minor comments above, it looks like it ought to do the
job, as long as none of your lines are null or zero.  It should be easy
to compare the results of your Linux shell method with the Perl method
by using the Linux diff command (sorting the output from the Perl
program first so the two outputs are in the same order).  Perhaps with
that result it will be clear where the problem lies.  A short test
script indicates that your Perl program functions as it should.

There is only one potential difference I see:  If a line is null or
consists of just the digit 0, it will not be output by your program.
You might consider removing the "if $_" from the end of the print
statement if you wish to avoid that problem.
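
For reference, the empty string and the string "0" are the only strings
Perl treats as false, which is why "print ... if $_" skips exactly those
lines.  A quick illustration:

        # neither of these statements prints anything: "" and "0" are both false
        print "empty\n" if "";
        print "zero\n"  if "0";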

You might be able to make your shell script more efficient by using sort
-u and getting rid of the call to uniq.
--
Bob Walton



Fri, 13 Aug 2004 00:36:17 GMT  
 cat file | sort | uniq | wc -l has different results than my perl grep test

Quote:

> I need to extract all unique lines from a file in order to populate a
> database. The file has many duplicate lines in it (a lot of headers,
> and actual data). I have been using (on Red Hat Linux) a bash script
> to do the sorting and t{*filter*}:
> #cat original | sort | uniq > newfile

ready for the "UUoCA"? (Useless Use of Cat Award)
=)
sort -u testdatei | wc -l
(ok, some versions may not have the -u option, but you can
leave out the cat because sort can take a file or STDIN)

Quote:
> And then using a perl script to parse the newfile and populate my
> database. I would like to have Perl do the whole operation, and my
> research leads me to believe that the grep function will handle this.
> _______________________
> #!/usr/bin/perl -w
> my $counts;

you're declaring this but never using it -- -w is warning you
about that.
i recommend "use strict".
then you'd have to say:

my %counts;

Quote:
> open(LOG,"/tmp/UPLOAD/borrow.report") || die "Cant open access log:
> $!\n";



>         grep {
>                 $_ =~ /.*/ and (++$counts{$_} < 2) ;


>         print "$_\n" if $_;
> }

you're leaving out blank lines in the print. i'd check for
defined $_. (you're also leaving out lines with only a single zero in them)

i'd rewrite it as the following:

use strict;
my %counts;
open LOG, "file" or die "Cant open access log: $!\n";
while (<LOG>) {
    chomp;
    $counts{$_}++;
}
close LOG;
$, = "\n";
print keys %counts;
print "\n";

note that the order of the lines gets lost here, so you might be better
off with your array version if you need the order.
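
if you do need the order, here's a sketch that keeps only the first
occurrence of each line while reading (same placeholder filename "file"
as above):

use strict;
my %seen;
open LOG, "file" or die "Cant open access log: $!\n";
while (<LOG>) {
    print unless $seen{$_}++;   # print a line only the first time it is seen
}
close LOG;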

Quote:
> However I get a different, smaller count (from wc -l) if I use Perl
> than if I use sort | uniq. I can't really afford to lose any data, and
> the results are still 4000+ lines, which makes it hard for me to manually
> find the discrepancy. Does anyone see anything blatantly wrong about my
> script?

if it contains blank lines, that's the error.
hth, tina




Fri, 13 Aug 2004 00:49:01 GMT  
 cat file | sort | uniq | wc -l has different results than my perl grep test

Quote:

>        grep {
>                $_ =~ /.*/ and (++$counts{$_} < 2) ;

>the results are still 4000+ lines, which makes it hard for me to manually
>find the discrepancy,

So don't do it manually then  :-)

'diff' can compare files. Put the 2 outputs into files and diff them.

Knowing the differences is likely essential to identifying the problem.

Quote:
>does anyone see anything blatently wrong about my
>script?

Yes. Your pattern matches every possible string.

grep() will select the same things even if you leave that
match out altogether...

Also, I am wondering why you are not starting with the code
given in the answer to your Frequently Asked Question:

   perldoc -q duplicate

      "How can I remove duplicate elements from a list or array?"

--
    Tad McClellan                          SGML consulting

    Fort Worth, Texas



Fri, 13 Aug 2004 01:00:54 GMT  
 cat file | sort | uniq | wc -l has different results than my perl grep test

Quote:
> Also, I am wondering why you are not starting with the code
> given in the answer to your Frequently Asked Question:

>    perldoc -q duplicate

>       "How can I remove duplicate elements from a list or array?"

First of all, thanks everyone.
Tad, this is a modified version of the only example that "almost" made
sense to me (from RayCosoft's page).
Apparently I didn't understand that he/I was using a hash (%count). The
diff command wasn't working for me, and I suspect it was because I chomped
my array ...and that may have been throwing off wc also.

So here is what you guys helped me come up with; I appreciate your help:

------------------------------------------------------------------------------------------------
#!/usr/bin/perl -w
use strict;
my %counts;

open(LOG,"/tmp/UPLOAD/borrow.report") || die "Cant open access log:
$!\n";

open(OUT,">/tmp/results") || die "Cannot write to /tmp/results: $!\n";


_________________________________________________________________________
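
A minimal body for that script, along the lines suggested above (order
preserved, one copy of each line written to /tmp/results), would be
something like:

        while (<LOG>) {
            print OUT $_ unless $counts{$_}++;
        }
        close LOG;
        close OUT;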



Fri, 13 Aug 2004 01:35:30 GMT  
 cat file | sort | uniq | wc -l has different results than my perl grep test

Quote:
> I need to extract all unique lines from a file in order to populate a
> database. The file has many duplicate lines in it (a lot of headers,
> and actual data). I have been using (on Red Hat Linux) a bash script
> to do the sorting and t{*filter*}:

How about "perldoc -q duplicate"
      How can I remove duplicate elements from a list or array?

jue



Fri, 13 Aug 2004 19:14:29 GMT  
 cat file | sort | uniq | wc -l has different results than my perl grep test

Quote:

> I have been using (on Red Hat Linux) a bash script to do the sorting
> and t{*filter*}:
> #cat original | sort | uniq > newfile

You can trim this down considerably:
        sort -u <original >newfile

Or, if you meant what you wrote in the subject line:
        sort -u <original | wc -l

That's not a perl solution. However, if it works for you then why spend
time reinventing the wheel using perl? Hubris is a virtue.

Chris



Sat, 14 Aug 2004 11:46:07 GMT  
 cat file | sort | uniq | wc -l has different results than my perl grep test

Quote:


> > I have been using ( On redHat Linux ) a bash script to do the sorting
> > and t{*filter*}:
> > #cat original | sort | uniq > newfile

> You can trim this down considerably:
>         sort -u <original >newfile

> Or, if you meant what you wrote in the subject line:
>         sort -u <original | wc -l

> That's not a perl solution. However, if it works for you then why spend
> time reinventing the wheel using perl? Hubris is a virtue.

> Chris

Well, if you had a huge data file to apply this to, a typical Perl
program can be much more efficient by not sorting the data, and not
storing it all in memory or temporary disk files, as sort must do.  This
reduction in memory usage will happen only if, as the OP stated, there
are lots of duplicate lines.  One could perhaps reduce memory usage
further at a small probability of treating non-duplicate lines as
duplicates by using the MD5 digest of each line rather than the lines
themselves as the hash keys.  Further, the Perl program can preserve the
original order of the records, while sort will lose that (unless they
are already sorted).
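
A sketch of that digest variant, assuming the standard Digest::MD5 module
(the report file name is the one from the original post):

#!/usr/bin/perl -w
use strict;
use Digest::MD5 qw(md5);

my %seen;
open LOG, "/tmp/UPLOAD/borrow.report" or die "Can't open access log: $!\n";
while (my $line = <LOG>) {
    # key on the fixed-size 16-byte digest rather than the whole line
    print $line unless $seen{ md5($line) }++;
}
close LOG;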
--
Bob Walton


Sun, 15 Aug 2004 03:43:00 GMT  
 cat file | sort | uniq | wc -l has different results than my perl grep test
I suggested,

Quote:
>         sort -u <original | wc -l


Quote:
> [...] a typical Perl program can be much more efficient by not sorting
> the data [etc...].

Agreed in this case, since, as you pointed out:

Quote:
> [...] the OP stated, there are lots of duplicate lines.
> Further, the Perl program can preserve the original order of the
> records, while sort will lose that [...]

Also true, but in the OP's case this (presumably) isn't a requirement.

On my platform it appears that sort -u is more efficient than sort |
uniq, so I stand by my comment. Also, consider which alternative is
easier to write and type [assuming equal knowledge].

$ time sh -c 'for F in 0 1 2 3 4 5 6 7 8 9; do sort biglog|uniq|wc -l; done'
        17547
        ...
        39.42user 4.64system 0:45.06elapsed 97%CPU

$ time sh -c 'for F in 0 1 2 3 4 5 6 7 8 9; do sort -u biglog|wc -l; done'
        17547
        ...
        13.97user 3.43system 0:18.95elapsed 91%CPU

(I repeated both samples once more to ensure that both had a fair chance
at the disk cache. Similar results ensued.)

Regards,
Chris



Sun, 15 Aug 2004 10:32:10 GMT  
 


 

 