Suggestions for script? 

Hi,

I am processing a single huge text file (~120,000 lines) with the
following format:

>gi[header1]
[text1]
[text2]
[text3]
>gi[header2]
[text4]
[text5]
>gi[headern]
.
.
[textn]

What I want to do is parse the file and save each ">gi" line, along
with the text below it up to the next ">gi" line, to a separate file,
then do the same for the next ">gi" line, so I would end up with
something like the following (resulting in many files):

File 1:                   File 2:                File n:
>gi[header1]              >gi[header2]           >gi[headern]
[text1]                   [text4]                .
[text2]                   [text5]                .
[text3]                                          [textn]

The trick is that the number of lines between the ">gi[header]" lines
is not constant.  In other words, it can vary between 3 and 100, in no
predictable fashion.  All header lines contain ">gi"; the rest of the
header is completely variable.

Please excuse the messiness of my explanation.  Any pointers on how
to implement an awk script to do this would be most appreciated.
Thank you in advance.

                                       yugal



Wed, 03 Dec 2003 08:20:54 GMT  



gawk 'BEGIN{pre="File"; cnt=0; filename=pre "" cnt}
      /^>gi/ {close(filename); filename=pre "" ++cnt}
      {print >> filename}' infile

Old awk isn't suitable for more than about 10 files.
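
For readers who want to try this on a small sample first, here is a
commented variant of the one-liner above.  It is a sketch, not the posted
script: the split_demo directory and the sample records are made up, it
calls plain awk rather than gawk, it writes with > instead of >> (so a
rerun overwrites rather than appends), and it skips any lines before the
first header instead of sending them to File0.

```shell
# Tiny sample in the shape described in the question (the contents are made up).
mkdir -p split_demo
printf '%s\n' '>gi[header1]' '[text1]' '[text2]' '[text3]' \
              '>gi[header2]' '[text4]' '[text5]' > split_demo/infile

awk -v pre="split_demo/File" '
    # Header line: close the previous output file, then choose the next name.
    /^>gi/ { if (out != "") close(out); cnt++; out = pre cnt }
    # Copy every line to the current file; lines before the first header are skipped.
    out != "" { print > out }
' split_demo/infile

ls split_demo
```

On the real data, the sample input and the pre prefix would be replaced by
the actual input file and whatever output naming is wanted.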

Chuck Demas

--
  Eat Healthy    |   _ _   | Nothing would be done at all,

  Die Anyway     |    v    | That no one could find fault with it.



Wed, 03 Dec 2003 09:35:28 GMT  

Quote:

> Old awk isn't suitable for more than about 10 files.



Actually, since the file is closed, old awk is fine.  The limitation in
old awk and most nawks is a limit of 10 open files or handles.  I
believe that most other awks (mawk, gawk, etc.) have much more
liberal limits.  I have not encountered any limits in mawk.
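
To illustrate the point about close(): the sketch below generates 50
made-up records, more than the 10-open-file limit mentioned above, and
splits them while holding only one output file open at a time.  The
close_demo directory and the record contents are assumptions for the
demonstration; whether a particular old awk handles close() correctly is
not something this example can show.

```shell
mkdir -p close_demo
# Generate 50 two-line records, well past a 10-open-file limit (made-up data).
awk 'BEGIN { for (i = 1; i <= 50; i++) printf ">gi[header%d]\n[text%d]\n", i, i }' > close_demo/infile

awk -v pre="close_demo/File" '
    /^>gi/ {
        if (out != "") close(out)   # release the previous file before opening another
        cnt++
        out = pre cnt
    }
    out != "" { print > out }       # at most one output file is open at any time
    END { print cnt " files written" }
' close_demo/infile
```

Without the close() call, the same script would keep every output file
open until awk exits, which is where a low open-file limit would bite.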

--
Michael  Witkowski



Wed, 03 Dec 2003 19:05:58 GMT  


Quote:

>> Old awk isn't suitable for more than about 10 files.

>Actually, since the file is closed, old awk is fine.  The limitation in
>old awk and most nawks is a limit of 10 open files or handles.  I
>believe that most other awks (mawk, gawk, etc.) have much more
>liberal limits.  I have not encountered any limits in mawk.

IIRC, the close function is broken in old awk.  :-)

I'm not going to bother checking; someone else will.  After all,
this is Usenet.  :-)

Chuck Demas

--
  Eat Healthy    |   _ _   | Nothing would be done at all,

  Die Anyway     |    v    | That no one could find fault with it.



Thu, 04 Dec 2003 02:45:15 GMT  
 