awk/gawk stop after processing 40mill lines 
 awk/gawk stop after processing 40mill lines

I've a file with approx. 100 million lines of data; in processing the
file, awk and gawk both simply quit after about 40 million lines.

example input:  234438.12    3456234.59     1020.28 35    234434.12
3456237.59     1021.28    35

example output:  102028,234438.12,3456234.59

with the following script: {print $3*100,$1,$2} (simplified version; in
the final program there are some conditional statements, but I've tried
even this bare-bones script and get the same result)
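
For reference, a minimal standalone version of that one-liner (assuming
OFS is set to "," in the full program, since the sample output above is
comma-separated) would be:

  # print column 3 scaled by 100, then columns 1 and 2, comma-separated
  awk 'BEGIN { OFS = "," } { print $3*100, $1, $2 }' input_file > output_file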

I've even tried redirecting the output to 3 separate files, i.e.

if(NR <= 30000000) print > "temp1.txt"
if(NR >30000000 && NR <=60000000) print > "temp2.txt"

etc...
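
Spelled out as a complete awk program (with a hypothetical temp3.txt
catching the remainder), that attempt looks like:

  # route each ~30-million-line chunk of output to its own file
  {
      if      (NR <= 30000000) print > "temp1.txt"
      else if (NR <= 60000000) print > "temp2.txt"
      else                     print > "temp3.txt"
  }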

disk space isn't the issue: the input file is 5.5 gigs, and the output
file should be ~2.5 gigs (though it's only ~1 gig, since it's processing
only about 40% of the file), and I've just over 8 gigs free on the disk;
anyway, "df -k" after awk quits running reveals that I've well over a gig
free on the relevant disk. I'm running the program on SunOS 5.7; swap
space is on another disk/partition, though I'm not sure how big it is.
BUT: "man largefile" tells me that awk is in this family, so input file
size/swap should be irrelevant, right?
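
(For what it's worth, one filesystem sanity check, assuming getconf on
this Solaris release knows the name, and with /path/to/data_dir standing
in for the real data directory, is:

  # 64 here means the filesystem itself allows files over 2 GB
  getconf FILESIZEBITS /path/to/data_dir

though that only tests the filesystem, not awk itself.)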

confused,
john



Sat, 14 Dec 2002 03:00:00 GMT  
 awk/gawk stop after processing 40mill lines


Quote:
>I've a file with approx. 100 million lines of data; in processing the
>file, awk and gawk both simply quit after about 40 million lines.

>example input:  234438.12    3456234.59     1020.28 35    234434.12
>3456237.59     1021.28    35

>example output:  102028,234438.12,3456234.59

>with the following script: {print $3*100,$1,$2} (simplified version; in
>the final program there are some conditional statements, but I've tried
>even this bare-bones script and get the same result)

I strongly suspect a data-related error, e.g. a stray binary EOF marker
(a Ctrl-Z, 0x1A, is the classic culprit). Try to find out where it is
and then extract the lines around it.
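
One way to hunt for such a marker (a sketch, assuming awk can still read
up to the point where it quits) is to dump the raw bytes of the lines
right around the stopping point, here roughly line 40,000,000:

  # show ~10 lines around the suspect spot with control characters
  # made visible (a Ctrl-Z would appear as \032 in od's output)
  awk 'NR > 39999995' input_file | head -10 | od -c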

Mark
---
Mark Katz
Mark-it, London. Delivering MR-IT/Internet solutions
Tel: (44) 20-8731 7516, Fax: (44) 20-8458 9554
For latest information about ISPC/ITE - see http://www.e-tabs.com



Sat, 14 Dec 2002 03:00:00 GMT  
 awk/gawk stop after processing 40mill lines

Quote:

> I've a file with approx. 100 million lines of data; in processing the
> file, awk and gawk both simply quit after about 40 million lines.

<snip>

Quote:
>disk space isn't the issue: the input file is 5.5 gigs, and the output
>file should be ~2.5 gigs (though it's only ~1 gig, since it's processing
>only about 40% of the file), and I've just over 8 gigs free on the disk;
>anyway, "df -k" after awk quits running reveals that I've well over a gig
>free on the relevant disk. I'm running the program on SunOS 5.7; swap
>space is on another disk/partition, though I'm not sure how big it is.
>BUT: "man largefile" tells me that awk is in this family, so input file
>size/swap should be irrelevant, right?

For the heck of it, add the pattern/action

NR % 1e7 == 0 { system("df -k >> check_df") }

to trace disk free space during script execution. I suspect you are
running out of disk space, and this'd be one way to check.
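
Combined with your real one-liner, that might be run as (check_df is
just a scratch file name from the snippet above):

  gawk 'NR % 1e7 == 0 { system("df -k >> check_df") }
        { print $3*100, $1, $2 }' input_file > output_file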

Are you deleting the output file from the previous attempt before each
new attempt? I'll admit my inode ignorance, but if a large file 'output'
exists and a script like 'do_something > output' is run, do Unix-like
OSes immediately free up all the disk storage used by 'output'?




Sat, 14 Dec 2002 03:00:00 GMT  
 awk/gawk stop after processing 40mill lines
Some further thoughts....


Quote:
>I've a file with approx. 100 million lines of data; in processing the
>file, awk and gawk both simply quit after about 40 million lines.
>example input:  234438.12    3456234.59     1020.28 35    234434.12
>3456237.59     1021.28    35
>example output:  102028,234438.12,3456234.59

>with the following script: {print $3*100,$1,$2} (simplified version; in
>the final program there are some conditional statements, but I've tried
>even this bare-bones script and get the same result)

How "bare bones"
have you tried '{print}' which will take up space, or
 'END{print NR}'

How about a wc filename

Or 'split filename' to Detect any file/system related problems
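
Spelled out, the two cheapest checks might look like this (with
input_file standing in for the real file name):

  # how many lines does awk itself manage to see?
  awk 'END { print NR }' input_file

  # does the rest of the system agree on the count?
  wc -l input_file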

Quote:

>I've even tried redirecting the output to 3 separate files, i.e.

>if(NR <= 30000000) print > "temp1.txt"
>if(NR >30000000 && NR <=60000000) print > "temp2.txt"

Better to use split (man split), just in case awk *is* playing up.
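
For example (a sketch; check split(1) for the exact options on your
system), something like

  # carve the input into 30-million-line pieces named xaa, xab, ...
  split -l 30000000 input_file

would do the splitting outside awk entirely.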
Mark
---
Mark Katz
Mark-it, London. Delivering MR-IT/Internet solutions
Tel: (44) 20-8731 7516, Fax: (44) 20-8458 9554
For latest information about ISPC/ITE - see http://www.e-tabs.com


Sun, 15 Dec 2002 03:00:00 GMT  
 awk/gawk stop after processing 40mill lines

Hi John...
Here's something to try...
rather than:
  gawk -f program input_file > output_file
please try:
  cat input_file | gawk -f program - > output_file

I have this sneaking suspicion you are hitting a 2 GB
file size limit on the INPUT file! (At 5.5 gigs for 100 million lines,
that's roughly 55 bytes per line, and 40 million such lines lands right
around the 2 GB, i.e. 2^31-byte, mark.)
Reading on stdin should avoid this.
(Large-file (>2 GB) support is still not the default in the C libraries.)
You might also want to try gawk 3.0.5 (just announced in another thread),
as that also has some discussion of large-file support being
added/fixed for one of the platforms.
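
A quick, direct test of that theory (just a sketch) is to compare the
line count gawk sees when reading the file by name versus through a pipe:

  # if direct reads hit a 2 GB wall, the first count should stop near
  # 40 million (or gawk may die without printing anything at all),
  # while the second should reach the full 100 million
  gawk 'END { print NR }' input_file
  cat input_file | gawk 'END { print NR }'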

Jennifer
--

Quote:

> I've a file with approx. 100 million lines of data; in processing the
> file, awk and gawk both simply quit after about 40 million lines.

<snip>



Mon, 16 Dec 2002 03:00:00 GMT  
 