faster way? 
 faster way?

howdy folks,
    I've got a file nearly 700 megs in size; it is six columns of data
that I need to convert to three, i.e.

a b c d e f
g h i j k l

to

a b c
d e f
g h i
j k l

i'm doing so with a very simple script:

awk '{print $1, $2, $3
print $4, $5, $6}'  input > output

problem: it's taking over 40 minutes to process the file, and I'll soon
be working with datasets closer to 7 gigs in size, which, at this rate,
could take all day to process.

question: how can I speed it up?  I'm noticing that the processors
(dual-processor SGI Onyx II) are only running at ~10%, so I could
disassemble the file, process the components simultaneously, then
re-assemble 'em, I suppose...  Other suggestions, in Awk, C, or
whatever, would be helpful.
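That disassemble/re-assemble idea could be sketched roughly like this (chunk size and file names are placeholders, and this is untested on IRIX):

```shell
# Split into line-based chunks (each input line is converted
# independently, so any line boundary is a safe split point),
# convert each chunk in the background, then glue the results
# back together in order.
split -l 1000000 input chunk.
for f in chunk.??; do
    awk '{print $1, $2, $3; print $4, $5, $6}' "$f" > "$f.out" &
done
wait
cat chunk.??.out > output
rm chunk.?? chunk.??.out
```

Whether this actually helps depends on whether the job is CPU-bound; if the disk is the bottleneck, two readers may just fight over it.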

in advance, thanks,
john



Mon, 11 Nov 2002 03:00:00 GMT  
 faster way?


Quote:
> awk '{print $1, $2, $3
> print $4, $5, $6}'  input > output
> problem- it's taking over 40 minutes to process the file, and I'll
> soon be working with datasets that are closer to 7gigs in size
> which, at this rate, could take all day to process.  question- how
> can I speed it up?  I'm now noticing that the processors (dual
> processor SGI Onyx II) are only running at ~10%,

I'm not sure there's much you can do at the Awk script level; you
could try a different Awk, such as Gawk, or raise the awk process's
priority with the Unix "nice"/"renice" commands, or try Perl
(which might or might not be any faster...)

--



Mon, 11 Nov 2002 03:00:00 GMT  
 faster way?


Quote:
>howdy  folks,
>    i've got a file nearly 700 megs in size;  it is six columns of data
>that i need to convert to three, i.e.

>a b c d e f
>g h i j k l

>to

>a b c
>d e f
>g h i
>j k l

>i'm doing so with a very simple script:

>awk '{print $1, $2, $3
>print $4, $5, $6}'  input > output

>problem-  it's taking over 40 minutes to process the file, and I'll soon
>be working with datasets that are closer to 7gigs in size which, at this
>rate, could take all day to process.

It is unlikely that anything related to processing speed (either the
script itself or the number and speed of your CPUs) is the bottleneck.

More likely it is just raw disk I/O, and there's not much you can do
about that.  I doubt that splitting the job into pieces would help
either.

Assuming this is Unix, you can use the 'time' command to tell you what
proportion of the elapsed time is CPU time; my hunch is that you'll find
it is a very small percentage.
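For example (using the original script; the numbers time reports will of course vary):

```shell
# 'real' is elapsed wall-clock time; 'user' + 'sys' is CPU time.
# If real is much larger than user + sys, the job is I/O-bound.
time awk '{print $1, $2, $3; print $4, $5, $6}' input > output
```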



Mon, 11 Nov 2002 03:00:00 GMT  
 faster way?

Quote:

>     i've got a file nearly 700 megs in size;  it is six columns of data
> that i need to convert to three, i.e.

> a b c d e f
> g h i j k l

> to

> a b c
> d e f
> g h i
> j k l

> i'm doing so with a very simple script:

> awk '{print $1, $2, $3
> print $4, $5, $6}'  input > output

maybe 'fold' is what you're looking for. Try: man fold
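Note that fold breaks lines at a byte count rather than at field boundaries, so it only works here if the records are fixed-width. For the single-character fields in the example above, the width would be 6 (that number is specific to that layout):

```shell
# Break each 11-character line after byte 6: "a b c " / "d e f"
# (the separator survives as a trailing space on the first half).
fold -w 6 input > output
```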

PETE

--
Peter Hanssen, BGS Murchison House, West Mains Road, Edinburgh, EH9 3LA, UK

www: http://i.am/pete



Mon, 11 Nov 2002 03:00:00 GMT  
 faster way?

   >howdy  folks,
   >i've got a file nearly 700 megs in size;  it is six columns of data
   >that i need to convert to three, i.e.
   >a b c d e f
   >g h i j k l
   >to
   >a b c
   >d e f
   >g h i
   >j k l
   >i'm doing so with a very simple script:
   >awk '{print $1, $2, $3
   >print $4, $5, $6}'  input > output
   >problem-  it's taking over 40 minutes to process the file, and I'll
   >soon be working with datasets that are closer to 7gigs in size
   >which, at this rate, could take all day to process.
I don't know if it would be faster, but you could try
awk '{printf "%s %s %s\n%s %s %s\n",$1,$2,$3,$4,$5,$6}' input > output
or
sed 's/^\([^ ]* [^ ]* [^ ]* \)/\1
/' input > output

Net-Tamer V 1.08X - Test Drive



Mon, 11 Nov 2002 03:00:00 GMT  
 faster way?
Hi John...
I suspect a lot of the time is spent parsing the fields out of
the record.  I looked into this deeply at one point for gawk,
so you might want to give this a try.
One: in your current script, put "a=NF" before the print.
This forces gawk to fully parse the line (by default it uses an
incremental, parse-on-demand routine).  You might see some performance
improvement by avoiding re-entry to the parser.
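With that change the script would look like this (the variable name "a" is arbitrary, and any gain is a guess until measured):

```shell
# "a=NF" touches NF, forcing gawk to parse the whole record at once
# instead of re-entering the incremental field parser for each $n.
gawk '{a = NF
       print $1, $2, $3
       print $4, $5, $6}' input > output
```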

Definitely use gawk rather than awk.  You might also want to get
the sources rather than the pre-compiled binary off the SGI freeware
site, so you can compile it with the -mips4 flag (yes, I use SGI too :)

You haven't told us much about the record, which limits the
other performance options.  Are the fields fixed-width?
If so, you can avoid parsing fields entirely, or use fold
as another poster suggested.  A dedicated C program for this
would be quite simple (and fast) if you are really planning to
do this operation a lot.
Also, if your CPU load is as low as you mentioned,
the bottleneck could just be the disk rate.  If your files
are gzip'ed, that improves the overall speed off the disk and
can make better use of the dual CPUs.  (I'm running an 8-CPU
Challenge L, and my files are gzip'ed on striped disk arrays.)
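A sketch of that pipeline, assuming the data is kept compressed (the file names are illustrative):

```shell
# gzip -dc decompresses on one CPU while awk runs on another,
# and far fewer bytes cross the disk in each direction.
gzip -dc input.gz |
    awk '{print $1, $2, $3; print $4, $5, $6}' |
    gzip > output.gz
```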

My email is anti-spammed, but feel free to write me directly if
you want to chase any of the non-awk options.

Jennifer
--

Quote:

> howdy  folks,
>     i've got a file nearly 700 megs in size;  it is six columns of data
> that i need to convert to three, i.e.

> a b c d e f
> g h i j k l

> to

> a b c
> d e f
> g h i
> j k l

> i'm doing so with a very simple script:

> awk '{print $1, $2, $3
> print $4, $5, $6}'  input > output

> problem-  it's taking over 40 minutes to process the file, and I'll soon
> be working with datasets that are closer to 7gigs in size which, at this
> rate, could take all day to process.

> question-  how can I speed it up?  I'm now noticing that the processors
> (dual processor SGI Onyx II) are only running at ~10%, so i could,
> disassemble the file, process the components simultaneously, then
> re-assemble 'em, i suppose...  Other suggestions?   ...in Awk, C, or
> whatever would be helpful.

> in advance, thanks,
> john



Tue, 12 Nov 2002 03:00:00 GMT  
 faster way?

Quote:


>   >howdy  folks,
>   >i've got a file nearly 700 megs in size;  it is six columns of data
>   >that i need to convert to three, i.e.
>   >a b c d e f
>   >g h i j k l
>   >to
>   >a b c
>   >d e f
>   >g h i
>   >j k l
>   >i'm doing so with a very simple script:
>   >awk '{print $1, $2, $3
>   >print $4, $5, $6}'  input > output
>   >problem-  it's taking over 40 minutes to process the file, and I'll
>   >soon be working with datasets that are closer to 7gigs in size
>   >which, at this rate, could take all day to process.
>I don't know if it would be faster, but you could try
>awk '{printf "%s %s %s\n%s %s %s\n",$1,$2,$3,$4,$5,$6}' input > output
>or
>sed 's/^\([^ ]* [^ ]* [^ ]* \)/\1
>/' input > output

or

awk '{print $1, $2, $3 "\n" $4, $5, $6}' infile > outfile

which has only one print statement rather than two.

you might also try:

awk '{$4 = "\n" $4; print}' infile > outfile

and

awk '$4 = "\n" $4' infile > outfile

I tried timing some of these, and for me the only thing that was clear
was that the sed solution was much slower.

It also seemed that the original

awk '{print $1, $2, $3
      print $4, $5, $6}' infile > outfile

was slower than:

awk '{print $1, $2, $3; print $4, $5, $6}' infile > outfile

which was fastest.

I tested this using a 120,000 line infile.
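A rough version of that kind of test (the generator below fabricates a uniform 120,000-line file; real data and absolute times will differ):

```shell
# Fabricate a six-column test file.
awk 'BEGIN { for (i = 0; i < 120000; i++) print "a b c d e f" }' > infile

# Time the candidates.  Note: the third variant rebuilds $0 and
# leaves a trailing space before the inserted newline, so its
# output is not byte-identical to the others.
time awk '{print $1, $2, $3; print $4, $5, $6}' infile > out1
time awk '{print $1, $2, $3 "\n" $4, $5, $6}'  infile > out2
time awk '$4 = "\n" $4'                        infile > out3
```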

Chuck Demas
Needham, Mass.

--
  Eat Healthy    |   _ _   | Nothing would be done at all,

  Die Anyway     |    v    | That no one could find fault with it.



Tue, 12 Nov 2002 03:00:00 GMT  
 
 [ 7 posts ] 

