Need to format a variable-length delimited file to fixed length: is AWK the best choice?
 Need to format a variable-length delimited file to fixed length: is AWK the best choice?

We need to convert a file from a variable-length delimited format to a
fixed-length format.

We have a 24 GB file in 12 pieces.  Currently the file is variable length,
delimited by | with fields enclosed in ^.
Sample:
^717764002^|^71776401.^|^2000-09-11-19.23.00.000000^|^2000-05-25^
^300102^|^30011.^|^2000-06-28-19.57.29.670634^|^2000-05-30^

We need to convert this to a fixed-length format, with each field occupying
a certain predefined length.
Sample:
  717764002   71776401.    2000-09-11-19.23.00.000000           2000-05-25
       300102         30011.   2000-06-28-19.57.29.670634           2000-05-30

I am sure this can be done in AWK. A couple of questions:

1 - Would this involve just reading each record and then using print to
format it? How do we handle the | delimiters and the ^ characters? Would
every single character need to be processed and examined? Is there a way
to avoid that?

2 - Is AWK able to do this quickly? I am looking at around 20-25 minutes
per 2 GB file.



Sun, 08 Feb 2004 02:41:13 GMT  
 Need to format a variable-length delimited file to fixed length: is AWK the best choice?


Quote:
>We need to convert a file from a variable-length delimited format to a
>fixed-length format.

>We have a 24 GB file in 12 pieces.  Currently the file is variable length,
>delimited by | with fields enclosed in ^.
>Sample:
>^717764002^|^71776401.^|^2000-09-11-19.23.00.000000^|^2000-05-25^
>^300102^|^30011.^|^2000-06-28-19.57.29.670634^|^2000-05-30^

>We need to convert this to a fixed-length format, with each field
>occupying a certain predefined length.
>Sample:
>  717764002   71776401.    2000-09-11-19.23.00.000000           2000-05-25
>       300102         30011.   2000-06-28-19.57.29.670634           2000-05-30

>I am sure this can be done in AWK. A couple of questions:

>1 - Would this involve just reading each record and then using print to
>format it? How do we handle the | delimiters and the ^ characters? Would
>every single character need to be processed and examined? Is there a way
>to avoid that?

>2 - Is AWK able to do this quickly? I am looking at around 20-25 minutes
>per 2 GB file.

You might like to try this. Put the following script in a file called
reform.awk.

#!gawk -f
# (You may need the full path to gawk in the line above, e.g. /usr/bin/gawk.)

BEGIN {
  # The three-character sequence ^|^ separates the fields.
  FS="\\^\\|\\^"
}

{
  # Strip the lone ^ left at the end and at the start of each record.
  sub(/\^$/,"",$0)
  sub(/^\^/,"",$0)
  # Right-justify the four fields into fixed-width columns.
  printf "%10s%10s%27s%11s\n", $1,$2,$3,$4
}

I am not sure what size fields you want so you may need to adjust them.

To run this
chmod +x reform.awk
./reform.awk inputfile > outputfile

or

gawk -f reform.awk inputfile > outputfile

The output should look like this
 717764002 71776401. 2000-09-11-19.23.00.000000 2000-05-25
    300102    30011. 2000-06-28-19.57.29.670634 2000-05-30

I recommend gawk if you are using unix because it tends to be faster
than alternatives. On a PC mawk is said to be faster. I am unable to
predict how long it might take on your system.
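
Since your data is already in 12 pieces, you can simply run the script over
each piece in turn. A rough sketch, assuming the pieces are named something
like piece01.dat through piece12.dat (the names are placeholders; adjust the
glob to match your real files):

# Hypothetical file names; substitute your own.
for f in piece??.dat
do
    gawk -f reform.awk "$f" > "$f.fixed"
done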

You're not using a network drive or NFS mounted drive are you? I would
think any program working with such big files will go a lot faster if
the drive or partition is local. The bottlenecks are likely to be
network, disk drive and CPU/program in descending order of importance.
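
If you want a quick read on where the time is going, compare how long it
takes merely to read one piece against how long the full conversion takes.
A rough sketch (piece01.dat is just a placeholder name):

time cat piece01.dat > /dev/null
time gawk -f reform.awk piece01.dat > /dev/null

If the plain read is nearly as slow as the conversion, the disk or network
is the bottleneck and tuning the awk script will not buy you much.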
--
Alan Linton



Sun, 08 Feb 2004 04:35:50 GMT  
 Need to format a variable-length delimited file to fixed length: is AWK the best choice?
Alan Linton wrote at Tuesday 21 August 2001 22:35 like only he can:

Quote:


>>We need to convert a file from a variable-length delimited format to
>>a fixed-length format.

>>We have a 24 GB file in 12 pieces.  Currently the file is variable
>>length, delimited by | with fields enclosed in ^.
>>Sample:
>>^717764002^|^71776401.^|^2000-09-11-19.23.00.000000^|^2000-05-25^
>>^300102^|^30011.^|^2000-06-28-19.57.29.670634^|^2000-05-30^

>>We need to convert this to a fixed-length format, with each field
>>occupying a certain predefined length.
>>Sample:
>>  717764002   71776401.    2000-09-11-19.23.00.000000           2000-05-25
>>       300102         30011.   2000-06-28-19.57.29.670634           2000-05-30

>>I am sure this can be done in AWK. A couple of questions:

>>1 - Would this involve just reading each record and then using print
>>to format it? How do we handle the | delimiters and the ^ characters?
>>Would every single character need to be processed and examined? Is
>>there a way to avoid that?

>>2 - Is AWK able to do this quickly? I am looking at around
>>20-25 minutes per 2 GB file.

> You might like to try this. Put the following script in a file
> called reform.awk.

> #!gawk -f

> BEGIN {
>   FS="\\^\\|\\^"
> }

> {
>   sub(/\^$/,"",$0)
>   sub(/^\^/,"",$0)
>   printf "%10s%10s%27s%11s\n", $1,$2,$3,$4
> }

> I am not sure what size fields you want so you may need to adjust
> them.

> To run this
> chmod +x reform.awk
> ./reform.awk inputfile > outputfile

> or

> gawk -f reform.awk inputfile > outputfile

> The output should look like this
>  717764002 71776401. 2000-09-11-19.23.00.000000 2000-05-25
>     300102    30011. 2000-06-28-19.57.29.670634 2000-05-30

> I recommend gawk if you are using unix because it tends to be faster
> than alternatives. On a PC mawk is said to be faster. I am unable to
> predict how long it might take on your system.

> You're not using a network drive or NFS mounted drive are you? I
> would think any program working with such big files will go a lot
> faster if the drive or partition is local. The bottlenecks are
> likely to be network, disk drive and CPU/program in descending order
> of importance.

Using gawk, wouldn't something like this deliver better performance?
I'm not sure...;-)

awk -F\| 'BEGIN{OFS=" "}{gsub("\\^","");printf "%10s%10s%27s%11s\n",
$1,$2,$3,$4}' infile > outfile

Michael Heiming



Sun, 08 Feb 2004 06:16:03 GMT  
 Need to format a variable-length delimited file to fixed length: is AWK the best choice?

Quote:


> > #!gawk -f

> > BEGIN {
> >   FS="\\^\\|\\^"
> > }

> > {
> >   sub(/\^$/,"",$0)
> >   sub(/^\^/,"",$0)
> >   printf "%10s%10s%27s%11s\n", $1,$2,$3,$4
> > }

> Using gawk, wouldn't something like this deliver better performance?
> I'm not sure...;-)

> awk -F\| 'BEGIN{OFS=" "}{gsub("\\^","");printf "%10s%10s%27s%11s\n",
> $1,$2,$3,$4}' infile > outfile

[Assumption: "[B]etter performance" here means "in less time," not
"in less space."]

Why would this be faster? At a glance, it seems it would be slower,
due to the gsub(). Are you thinking that one gsub() might be faster
than two sub()s? In general, I would not expect that to be true.

There's only one way to know for sure which is fastest (performs
better).
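
If you really want to know, time both versions on one of the 2 GB pieces,
throwing the output away so the disk write does not skew the comparison.
A rough sketch, with piece01.dat standing in for one of the files:

time gawk -f reform.awk piece01.dat > /dev/null
time gawk -F\| '{gsub("\\^","");printf "%10s%10s%27s%11s\n", $1,$2,$3,$4}' piece01.dat > /dev/null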

In any case, Alan's script is as well-tuned as it needs to be for
an awk script. Any further micro-optimizations to it won't make an
appreciable enough improvement to justify the time spent by the OP
tweaking and testing--time better spent looking for other ways to
make the job really run faster (several of which Alan alluded to
in his post).

--
Jim Monty

Tempe, Arizona USA



Sun, 08 Feb 2004 07:05:51 GMT  
 Need to format a variable-length delimited file to fixed length: is AWK the best choice?

Quote:



> > > #!gawk -f

> > > BEGIN {
> > >   FS="\\^\\|\\^"
> > > }

> > > {
> > >   sub(/\^$/,"",$0)
> > >   sub(/^\^/,"",$0)
> > >   printf "%10s%10s%27s%11s\n", $1,$2,$3,$4
> > > }

> > Using gawk, wouldn't something like this deliver better performance?
> > I'm not sure...;-)

> > awk -F\| 'BEGIN{OFS=" "}{gsub("\\^","");printf "%10s%10s%27s%11s\n",
> > $1,$2,$3,$4}' infile > outfile

> [Assumption: "[B]etter performance" here means "in less time," not
> "in less space."]

> Why would this be faster? At a glance, it seems it would be slower,
> due to the gsub(). Are you thinking that one gsub() might be faster
> than two sub()s? In general, I would not expect that to be true.

> There's only one way to know for sure which is fastest (performs
> better).

> In any case, Alan's script is as well-tuned as it needs to be for
> an awk script. Any further micro-optimizations to it won't make an
> appreciable enough improvement to justify the time spent by the OP

Over 24 GB, a slight improvement might equal a lot of time. Shaving even 10%
off a 20-25 minute run per 2 GB piece would save roughly half an hour across
the twelve pieces.
regards,
Ben


Sun, 08 Feb 2004 21:42:15 GMT  
 Need to format a variable-length delimited file to fixed length: is AWK the best choice?


Quote:
>Alan Linton wrote at Tuesday 21 August 2001 22:35 like only he can:



>>>We need to convert a file from a variable-length delimited format to
>>>a fixed-length format.

>>>We have a 24 GB file in 12 pieces.  Currently the file is variable
>>>length, delimited by | with fields enclosed in ^.
>>>Sample:
>>>^717764002^|^71776401.^|^2000-09-11-19.23.00.000000^|^2000-05-25^
>>>^300102^|^30011.^|^2000-06-28-19.57.29.670634^|^2000-05-30^

>>>We need to convert this to a fixed-length format, with each field
>>>occupying a certain predefined length.
>>>Sample:
>>>  717764002   71776401.    2000-09-11-19.23.00.000000           2000-05-25
>>>       300102         30011.   2000-06-28-19.57.29.670634           2000-05-30

>>>I am sure this can be done in AWK. A couple of questions:

>>>1 - Would this involve just reading each record and then using print
>>>to format it? How do we handle the | delimiters and the ^ characters?
>>>Would every single character need to be processed and examined? Is
>>>there a way to avoid that?

>>>2 - Is AWK able to do this quickly? I am looking at around
>>>20-25 minutes per 2 GB file.

>> You might like to try this. Put the following script in a file
>> called reform.awk.

>> #!gawk -f

>> BEGIN {
>>   FS="\\^\\|\\^"
>> }

>> {
>>   sub(/\^$/,"",$0)
>>   sub(/^\^/,"",$0)
>>   printf "%10s%10s%27s%11s\n", $1,$2,$3,$4
>> }

>> I am not sure what size fields you want so you may need to adjust
>> them.

>> To run this
>> chmod +x reform.awk
>> ./reform.awk inputfile > outputfile

>> or

>> gawk -f reform.awk inputfile > outputfile

>> The output should look like this
>>  717764002 71776401. 2000-09-11-19.23.00.000000 2000-05-25
>>     300102    30011. 2000-06-28-19.57.29.670634 2000-05-30

>> I recommend gawk if you are using unix because it tends to be faster
>> than alternatives. On a PC mawk is said to be faster. I am unable to
>> predict how long it might take on your system.

>> You're not using a network drive or NFS mounted drive are you? I
>> would think any program working with such big files will go a lot
>> faster if the drive or partition is local. The bottlenecks are
>> likely to be network, disk drive and CPU/program in descending order
>> of importance.

>Using gawk, wouldn't something like this deliver better performance?
>I'm not sure...;-)

>awk -F\| 'BEGIN{OFS=" "}{gsub("\\^","");printf "%10s%10s%27s%11s\n",
>$1,$2,$3,$4}' infile > outfile

This might be faster still:

tr -d '^' < infile |
awk -F\| '{printf "%10s%10s%27s%11s\n", $1,$2,$3,$4}' > outfile
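
Along the same lines, you could let tr turn both delimiter characters into
blanks and rely on awk's default whitespace splitting. Whether that is
actually any faster is only a guess, and it assumes (as in the samples) that
the fields themselves never contain spaces:

tr '^|' '  ' < infile |
awk '{printf "%10s%10s%27s%11s\n", $1,$2,$3,$4}' > outfile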

Chuck Demas

--
  Eat Healthy    |   _ _   | Nothing would be done at all,

  Die Anyway     |    v    | That no one could find fault with it.



Mon, 09 Feb 2004 08:37:49 GMT  
 
 