splitting a large file based on the change of values in a key field 



Quote:
>Greetings all,

>I have a large comma delimited text file (call it fileA)that I wish to split
>into smaller parts for further processing.  The original file is sorted on
>the key field (the field I want to split on).  I have a second file (call it
>fileB) that contains all of the unique values of this key field if this would
>help.

<snip>
>FileA

>A,1,2,3,4,5,6
>A,2,3,4,5,6,7
>A,3,4,5,6,7,8
>B,1,2,3,4,5,6
>B,6,5,4,3,2,1
>C,9,8,7,6,5,4
>C,8,6,8,5,8,4

>fileB (if necessary)
>A
>B
>C

>Desired Results

>A.txt
>A,1,2,3,4,5,6
>A,2,3,4,5,6,7
>A,3,4,5,6,7,8

>B.txt
>B,1,2,3,4,5,6
>B,6,5,4,3,2,1

>C.txt
>C,9,8,7,6,5,4
>C,8,6,8,5,8,4

nawk -F, '{print $0 > $1.txt}' FileA

This will work so long as there are not too many different $1 values
(basically up to the limit on the number of open files that your
OS/AWK implementation allows).

If you need more files, then consider this:
Since the file is sorted, you know that all the 'A's come first.  In
this case, you know you can close the file when the first field
changes.
Like this:
lastfile != $1 { close(lastfile ".txt"); lastfile = $1 }
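
Put together as a complete program, the close-on-change idea might look something like the sketch below. It assumes FileA is sorted on the first field, as the original post says. The sample data is recreated with a here-document only so the snippet runs on its own, and `lastkey` is just an illustrative variable name.

```shell
# Recreate the sample FileA from the original post (sorted on field 1).
cat > FileA <<'EOF'
A,1,2,3,4,5,6
A,2,3,4,5,6,7
A,3,4,5,6,7,8
B,1,2,3,4,5,6
B,6,5,4,3,2,1
C,9,8,7,6,5,4
C,8,6,8,5,8,4
EOF

awk -F, '
# Key changed: close the previous output file (if any), then switch.
$1 != lastkey {
    if (NR > 1) close(lastkey ".txt")
    lastkey = $1
}
# Write every record to the file named after the current key.
{ print > (lastkey ".txt") }
' FileA
```

Only one output file is open at any time, so the open-file limit no longer matters. Note that `>` truncates a file when it is reopened, so this form is only safe when all records for a key are adjacent.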

--
Steve



Wed, 12 Sep 2001 03:00:00 GMT  

Quote:

>Greetings all,

>I have a large comma delimited text file (call it fileA)that I wish to split
>into smaller parts for further processing.  The original file is sorted on
>the key field (the field I want to split on).  I have a second file (call it
>fileB) that contains all of the unique values of this key field if this would
>help.
>Here's an example of the data I have and the results I'm looking for.

>FileA

>A,1,2,3,4,5,6
>A,2,3,4,5,6,7
>A,3,4,5,6,7,8
>B,1,2,3,4,5,6
>B,6,5,4,3,2,1
>C,9,8,7,6,5,4
>C,8,6,8,5,8,4

awk -F, '{print >$1}' FileA

You may have to look into using nawk and/or the 'close()' function
if there are too many output files to open. Old awk has a limit of
about ten; new awk can have more open (a hundred?) simultaneously.
More than that and you need new awk's 'close', or perl...
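
To judge whether the open-file limit is likely to matter, it may help to count the distinct key values first, since each one becomes an output file. A small sketch, using the sample FileA from the question (recreated here so the snippet is self-contained); `nkeys` and `seen` are illustrative names:

```shell
# Sample FileA from the original post.
cat > FileA <<'EOF'
A,1,2,3,4,5,6
A,2,3,4,5,6,7
A,3,4,5,6,7,8
B,1,2,3,4,5,6
B,6,5,4,3,2,1
C,9,8,7,6,5,4
C,8,6,8,5,8,4
EOF

# Count the distinct values of field 1, i.e. the number of output files.
nkeys=$(awk -F, '!($1 in seen) { seen[$1] = 1; n++ } END { print n + 0 }' FileA)
echo "distinct keys: $nkeys"
```

If the count is comfortably below whatever limit your awk imposes, the one-liner is probably enough; otherwise close() is the safer route.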

Hope that helps,
Douglas Wilson



Wed, 12 Sep 2001 03:00:00 GMT  


Quote:


>nawk -F, '{print $0 > $1.txt}' FileA

To be pedantic, this should read

   nawk -F, '{print $0 > $1".txt"}' FileA

Also the $0 is unnecessary

As you say many awks get unhappy if they have
more than 10 files open concurrently...

FileA need not be pre-sorted...
Mark
--
Mark Katz
ISPC, London - Innovation in data-delivery tools
Tel: (44) 181-455 4665, Fax (44) 181-458 9554
** See our website at http://www.efiche.com **



Wed, 12 Sep 2001 03:00:00 GMT  

Quote:

> To be pedantic, this should read

>    nawk -F, '{print $0 > $1".txt"}' FileA

> Also the $0 is unnecessary

Also, the -F, is MORE unnecessary. ;-)

--
Jim Monty

Tempe, Arizona USA



Thu, 13 Sep 2001 03:00:00 GMT  

% I am going to add a new level of complexity.  I never mentioned this in my
% first post, but my $1 field doesn't contain simple data like A,B,C; instead it
% has values like the following:
%
% 'abc def',1,2,3,4,'aaaaaa'
% 'abc df',2,3,4,5,'bbbb'
% 'cde/df',9,8,7,6,'qqq'
% 'a.bb',3,4,5,6,'sw'
%
% I don't mind if the output files are called something like A.txt, B.txt so
% long as I can associate them with the correct value for $1.  Unfortunately

So, you've got fileA as above, and fileB, which is like this:
 'abc def',A.txt
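
If fileB does not exist in that form yet, one way to produce it might be the sketch below. It assumes the keys contain no embedded commas, and the part1.txt, part2.txt, ... naming is a made-up convention (any file-system-safe names would do). The sample fileA is the data from the follow-up post.

```shell
# Sample fileA with the awkward key values from the follow-up post.
cat > fileA <<'EOF'
'abc def',1,2,3,4,'aaaaaa'
'abc df',2,3,4,5,'bbbb'
'cde/df',9,8,7,6,'qqq'
'a.bb',3,4,5,6,'sw'
EOF

# Emit one "key,filename" line per distinct key, in order of first appearance.
# The partN.txt names are hypothetical; they only need to be unique and safe.
awk -F, '!($1 in seen) { seen[$1] = 1; print $1 "," "part" (++n) ".txt" }' fileA > fileB

cat fileB
```

Keeping the mapping in a file means you can still tell which output file belongs to which key value afterwards.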

Your awk script could be like this:
 BEGIN {
  FS=","
  lastfile = ""

  # load up the output file names
  while ((getline < "fileB") > 0)
    fnames[$1] = $2
 }
 fnames[$1] != lastfile {
    close(lastfile)
    lastfile = fnames[$1]
 }
 { print >> lastfile }

This will work fine provided the input file is sorted; however, it will
continuously open and close files if the input file is not sorted, so
you might want to use some kind of least-recently-used algorithm to
keep track of all the open files, like this:

 BEGIN {
  FS=","
  opencount = 0
  openlimit = 10  # we'll limit ourselves to 10 open output files
  lastfile = ""
  lru = 0

  # load up the output file names
  while ((getline < "fileB") > 0)
    fnames[$1] = $2
 }

 fnames[$1] != lastfile {
    lastfile = fnames[$1]
    # only files that are not already open count against the limit
    if (!(lastfile in openfiles)) {
      if (opencount >= openlimit) {
        # find the least recently used file and bop it
        i = lru + 1
        for (a in openfiles) {
          if (openfiles[a] < i) {
             i = openfiles[a]
             ftc = a
          }
        }
        close(ftc)
        delete openfiles[ftc]
        opencount--
      }
      opencount++
    }
 }

 { openfiles[lastfile] = ++lru; print >> lastfile }
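
If reordering the input is acceptable, another option (not what the script above does) is to sort first and then close on each key change, so only one output file is ever open. A rough sketch, assuming plain keys with no embedded commas and that the order of records within a group does not matter; `unsorted.csv` is a made-up file name:

```shell
# Made-up unsorted sample input.
cat > unsorted.csv <<'EOF'
B,1,2,3,4,5,6
A,1,2,3,4,5,6
C,9,8,7,6,5,4
A,2,3,4,5,6,7
B,6,5,4,3,2,1
EOF

# LC_ALL=C makes sort compare bytes, so equal keys are guaranteed adjacent.
LC_ALL=C sort -t, -k1,1 unsorted.csv |
awk -F, '
# Key changed: close the previous output file, then switch to the new one.
$1 != lastkey { if (NR > 1) close(lastkey ".txt"); lastkey = $1 }
{ print > (lastkey ".txt") }
'
```

This trades the LRU bookkeeping for the cost of a sort, which may or may not be worth it on a very large file.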
--

Patrick TJ McPhee
East York  Canada



Thu, 13 Sep 2001 03:00:00 GMT  
 