HTML parser in Turbo Pascal anyone? 
 HTML parser in Turbo Pascal anyone?

Has anyone tried this? I am interested in getting the source, as I am not
entirely familiar with HTML. My goal would be a simple, lightweight
browser that would not require a PPP/SLIP link. I want to do it
in Pascal as an example for one of my instructors, who said it cannot be
done efficiently. If you get the notion to do a little programming, post
it here or in alt.sources, or e-mail it to me.
                                Thank you
                                        Shawn

--
-------------------------------------------------------------------------------
phishin                                  

-------------------------------------------------------------------------------



Wed, 18 Jun 1902 08:00:00 GMT  
 HTML parser in Turbo Pascal anyone?

Quote:
>has anyone tried this? I am interested in getting the source as I am not
>entirely familiar with HTML. my goal would be a lightweight
>browser(simple)  that would not require a ppp/slip link.. I want to do it
>in pascal as an example to one of my instructors who said it cannot be
>done efficiently.. if you get the notion to do a little programming post
>it here or alt.sources or e-mail it to me

Possibly something like a command-line parser?
Load a file as FILE OF CHAR (not text: TP is limited to 127 chars per
line, and some editors (some people are weird enough to use editors..) put
the whole thing in one long line) and read it. When you encounter a <, save
all text until you hit a > (or, for a "debug" HTML system, if you hit
another < before a >, write out the stuff up to there), then parse it.
Parse it to get the main command. Then set up each parameter, and the data
for them, if any.
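
A rough sketch of that scanning idea, for illustration only. It works on a string rather than a file to stay short, and the names Scan and HandleTag are made up for this example. It splits each tag at the first space into a command and the remaining parameter text, and does not try to split individual parameters.

```pascal
program TagScan;
{ Sketch only: finds <...> tags in a string and splits each one into
  a command name and the remaining parameter text. }

procedure HandleTag(Tag: string);
var
  P: Integer;
  Cmd, Params: string;
begin
  P := Pos(' ', Tag);
  if P = 0 then
  begin
    Cmd := Tag;
    Params := '';
  end
  else
  begin
    Cmd := Copy(Tag, 1, P - 1);
    Params := Copy(Tag, P + 1, Length(Tag) - P);
  end;
  WriteLn('command: ', Cmd, '  parameters: ', Params);
end;

procedure Scan(Src: string);
var
  I: Integer;
  InTag: Boolean;
  Tag: string;
begin
  InTag := False;
  Tag := '';
  for I := 1 to Length(Src) do
    if InTag then
    begin
      if Src[I] = '>' then
      begin
        HandleTag(Tag);
        InTag := False;
      end
      else if Src[I] = '<' then
      begin
        { "debug" behaviour: a second < arrived before any > }
        WriteLn('unterminated tag text: ', Tag);
        Tag := '';
      end
      else
        Tag := Tag + Src[I];  { silently truncated at 255 chars }
    end
    else if Src[I] = '<' then
    begin
      InTag := True;
      Tag := '';
    end;
end;

begin
  Scan('<h1>Hi</h1><a href="x.htm">link</a>');
end.
```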

I was actually looking at doing such a thing myself a couple weeks ago,
but decided instead to get to work on my BBS software and updating my doors.
--
Of all the things I've lost, I miss my mind the most.

Procrastination is the sin of self destruction.



Wed, 18 Jun 1902 08:00:00 GMT  
 HTML parser in Turbo Pascal anyone?

Quote:

>Possibly something like a command-line parser?
>Load a file as FILE OF CHAR (not text. TP is limited to 127 chars per
>line and some editors (some people are weird enough to use editors..) put
>the whole thing in one long line), read it.

I believe you may be thinking of the length of a source line
rather than the length of a TP string.  And if you think a 255
character TP string may be too short, then you can use zero-based
character arrays and/or pChar with TP 6.0/newer version of TP.

If you still prefer to process a file character at a time, then
consider using an untyped file which will allow reading more than
one character per system call, or simply reading characters from a
text file.  
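
A minimal sketch of the untyped-file approach, assuming a 4K block size and a made-up file name (sample.htm). The program writes a small test file first so that it can run on its own, then counts the < characters block by block.

```pascal
program BlockScan;
{ Sketch: read an untyped file in 4K blocks and count '<' characters. }
const
  FileName = 'sample.htm';   { hypothetical file name }
  BufSize  = 4096;           { arbitrary block size }
var
  T: Text;
  F: file;
  Buf: array[1..BufSize] of Char;
  Got, I: Word;
  Tags: Longint;
begin
  { Create a small test file with no line breaks at all. }
  Assign(T, FileName);
  Rewrite(T);
  Write(T, '<html><body><p>no line breaks here</p></body></html>');
  Close(T);

  Assign(F, FileName);
  Reset(F, 1);               { record size 1, so counts are in bytes }
  Tags := 0;
  repeat
    BlockRead(F, Buf, BufSize, Got);   { Got = bytes actually read }
    for I := 1 to Got do
      if Buf[I] = '<' then Inc(Tags);
  until Got < BufSize;       { a short read means end of file }
  Close(F);
  WriteLn('tags opened: ', Tags);
end.
```

Each BlockRead is one system call for up to 4096 characters, and the length of a "line" never matters because nothing here looks at line ends.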

When reading a text file a character at a time, you can read
_everything_, including CR-LF's, logical EOF (^Z), etc.  The
advantage over a file of byte is that by default the text file
does file i/o through a 128 byte buffer that can be made larger
with SetTextBuf.  Reading/writing only one character per system
call is one way to ensure poor program performance.
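
A small sketch of the text-file variant, assuming an 8K buffer and a made-up file name. It reads one character at a time, but the actual disk reads happen through the larger buffer installed with SetTextBuf.

```pascal
program TextBufScan;
{ Sketch: character-at-a-time reads from a Text file, with a larger
  I/O buffer installed through SetTextBuf. }
const
  FileName = 'sample.htm';   { hypothetical file name }
var
  F: Text;
  Buf: array[1..8192] of Char;   { replaces the default 128 byte buffer }
  Ch: Char;
  Count: Longint;
begin
  { Create a small test file so the sketch runs on its own. }
  Assign(F, FileName);
  Rewrite(F);
  WriteLn(F, '<p>first line</p>');
  WriteLn(F, '<p>second line</p>');
  Close(F);

  Assign(F, FileName);
  SetTextBuf(F, Buf);        { after Assign, before any I/O on F }
  Reset(F);
  Count := 0;
  while not Eof(F) do
  begin
    Read(F, Ch);             { line-end characters are returned too }
    Inc(Count);
  end;
  Close(F);
  WriteLn('characters read: ', Count);
end.
```

The count includes the line-end characters, so it depends on what the platform writes for WriteLn.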

    ...red



Wed, 18 Jun 1902 08:00:00 GMT  
 HTML parser in Turbo Pascal anyone?

Quote:
>>Possibly something like a command-line parser?
>>Load a file as FILE OF CHAR (not text. TP is limited to 127 chars per
>>line and some editors (some people are weird enough to use editors..) put
>>the whole thing in one long line), read it.

>I believe you may be thinking of the length of a source line
>rather than the length of a TP string.  And if you think a 255
>character TP string may be too short, then you can use zero-based
>character arrays and/or pChar with TP 6.0/newer version of TP.

I believe you may have been tired or drunk when you wrote that.
A TEXT file line in TP is limited to 127 chars. Check "TextRec". The
buffer is 127 bytes.

And what good would a full 65k array of char be when a 100k HTML file is
written with NO CR/LFs at all? You'd need a 100k array of char.
Better to just read the file. It really wouldn't be too hard anyway. Read
the whole thing into EMS and read it byte by byte instead of reading it
string by string from file, and byte by byte in string. Same difference,
but faster and supports huge anti-crlf files created by some editors.

Quote:
>If you still prefer to process a file character at a time, then
>consider using an untyped file which will allow reading more than
>one character per system call, or simply reading characters from a
>text file.  

Yeah, that's what I meant. Didn't mean to put OF CHAR in my original post,
but what's done is done...

Quote:
>When reading a text file a character at a time, you can read
>_everything_, including CR-LF's, logical EOF (^Z), etc.  The
>advantage over a file of byte is that by default the text file
>does file i/o through a 128 byte buffer that can be made larger
>with SetTextBuf.  Reading/Writing only one character per system
>call is one way to insure poor program performance.

--
Of all the things I've lost, I miss my mind the most.

Procrastination is the sin of self destruction.



Wed, 18 Jun 1902 08:00:00 GMT  
 HTML parser in Turbo Pascal anyone?


A> >I believe you may be thinking of the length of a source line
A> >rather than the length of a TP string.  And if you think a 255
A> >character TP string may be too short, then you can use zero-based
A> >character arrays and/or pChar with TP 6.0/newer version of TP.
A>
A> I believe you may have been tired or drunk when you wrote that.
A> A TEXT file line in TP is limited to 127 chars. Check "TextRec". The
A> buffer is 127 bytes.

The size of TextRec.Buffer is irrelevant, except for speed purposes...you can
read a full 255 characters per line in TP, more if using PChars.

--
| Jeff Teunissen -=- President, Dusk To Dawn Computing       Team OS/2 Member
| Disclaimer: I am my employer, so anything I say goes for me as well.     :)



Wed, 18 Jun 1902 08:00:00 GMT  
 HTML parser in Turbo Pascal anyone?

Quote:

>>>Possibly something like a command-line parser?
>>>Load a file as FILE OF CHAR (not text. TP is limited to 127 chars per
>>>line and some editors (some people are weird enough to use editors..) put
>>>the whole thing in one long line), read it.
>Roger E. Doanis replied:
>>I believe you may be thinking of the length of a source line
>>rather than the length of a TP string.  And if you think a 255
>>character TP string may be too short, then you can use zero-based
>>character arrays and/or pChar with TP 6.0/newer version of TP.


Quote:
>I believe you may have been tired or drunk when you wrote that.
>A TEXT file line in TP is limited to 127 chars. Check "TextRec". The
>buffer is 127 bytes.

Well Eric, it's obvious that you have a major misunderstanding of
TEXT files.  Please note that TextBuf = array[0..127] of Char
defines a 128 byte buffer.  I know that the concept of zero being
something other than empty and/or a nonexistent value may be a
hard concept to grasp, but nonetheless, zero _does_ have value.
Maybe if you think of an array[-1..+1] and then realize it is an
array of three elements (-1, 0, +1) you might begin to understand.
Of course 7 is next to the 8 and you could simply have made a
typo.
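
For anyone who wants to check the arithmetic, a tiny sketch. The type names Buf128 and Three are made up here; Buf128 simply uses the same bounds as the TextBuf declaration quoted above.

```pascal
program ArrayBounds;
{ Sketch: the element count of an array is High - Low + 1. }
type
  Buf128 = array[0..127] of Char;   { same bounds as the quoted TextBuf }
  Three  = array[-1..1] of Char;
begin
  WriteLn(SizeOf(Buf128));   { 128 }
  WriteLn(SizeOf(Three));    { 3 }
end.
```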

I'd have believed typo and not bothered to reply if it wasn't for
your insistence that a text file can only read 127 byte long
strings.  Your post could be confusing to some of the students that
lurk here trying to glean a few crumbs of wisdom.

The file buffer has nothing to do with the maximum size variable
that can be read from a text file.  It only controls the amount of
data that will be transferred by each system call.  

If you enable extended syntax {$X+}, you can define a zero-based
array of character, like array[0..$BFFF] of Char, and would be
able to use readln/writeln to read and write it from a text file.
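
A sketch of what that could look like, assuming a compiler with extended syntax and the Strings unit (StrLen). The file name and the 1000-character test line are made up for the example.

```pascal
program BigLine;
{$X+}
{ Sketch: ReadLn into a zero-based character array, which is filled
  as a null-terminated string when extended syntax is enabled. }
uses Strings;
const
  FileName = 'longline.txt';   { hypothetical file name }
var
  F: Text;
  Line: array[0..$BFFF] of Char;
  I: Integer;
begin
  { Write one line that is far longer than a 255-character string. }
  Assign(F, FileName);
  Rewrite(F);
  for I := 1 to 1000 do Write(F, 'x');
  WriteLn(F);
  Close(F);

  Assign(F, FileName);
  Reset(F);
  ReadLn(F, Line);             { reads the line up to the array size }
  Close(F);
  WriteLn('line length: ', StrLen(Line));
end.
```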

Let's consider writing a string since this is where you think I
might be hallucinating.  The WriteStr routine from the SYSTEM Unit
would be given a pointer to the filevar (a VAR parameter is passed
as a pointer) and a pointer to the string variable, literal, or
constant.  It then moves as much of the string as it can to the
file buffer.  If the file buffer becomes full, it calls the file's
InOutFunc using the appropriate pointer in the filevar to have the
buffer written to disk, it then moves as much of the remaining
portion of the string as will fit in the file buffer and if the
buffer is again full, once more calls the file's InOutFunc.  The
process continues until the entire variable has been transferred.

A similar process occurs when reading a string.  In addition to
the pointer to the filevar and to the string variable, the ReadStr
routine in the SYSTEM Unit is also passed the maximum length of
the string variable.  This routine transfers characters from the
file buffer and invokes the file's InOutFunc to refill the buffer
when it is empty and repeats until the string is full, or until an
EOL character (#13), logical EOF (^Z) or physical EOF is reached.

What you need to notice is that the size of the file buffer does
not restrict the size of the variable that can be read/written,
but simply controls the number of characters that are read/written
on each call to the operating system.

Now, before you jump to conclusions and claim that there would
never be an instance where either routine would have to fill the
buffer more than three times to transfer a 255 byte string,
consider what would happen if we use SetTextBuf to define a new
file buffer of one character or of 49,152 bytes.
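
A small sketch of that experiment, assuming a deliberately tiny 16 byte buffer and a made-up file name. The 200-character line still arrives whole; the buffer just gets refilled more often.

```pascal
program SmallBuf;
{ Sketch: a 200-character line read through a 16 byte text buffer. }
const
  FileName = 'longline.txt';   { hypothetical file name }
var
  F: Text;
  Tiny: array[1..16] of Char;  { far smaller than the line itself }
  S: string;
  I: Integer;
begin
  Assign(F, FileName);
  Rewrite(F);
  for I := 1 to 200 do Write(F, 'x');
  WriteLn(F);
  Close(F);

  Assign(F, FileName);
  SetTextBuf(F, Tiny);         { after Assign, before any I/O on F }
  Reset(F);
  ReadLn(F, S);
  Close(F);
  WriteLn(Length(S));          { 200, not 16 }
end.
```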

Quote:

>And what good would a full 65k array of char be when a 100k HTML file is
>written with NO CR/LFs at all? You'd need a 100k array of char.
>Better to just read the file. It really wouldn't be too hard anyway. Read
>the whole thing into EMS and read it byte by byte instead of reading it
>string by string from file, and byte by byte in string. Same difference,
>but faster and supports huge anti-crlf files created by some editors.

Tell me Eric, are you one of those individuals that is unable to
process a file unless the entire thing can be placed in memory
first?  How would you go about processing a 450 meg file?  

Also I'd like to know the basis upon which you claim "faster".
And, considering that you acknowledge the next paragraph about
methods that allow processing a file a character at a time, why do you
attempt to make it appear that the only solution I offered was to
process TP strings a byte at a time?

Quote:

>>If you still prefer to process a file character at a time, then
>>consider using an untyped file which will allow reading more than
>>one character per system call, or simply reading characters from a
>>text file.  

>Yeah thats what I meant. Didn't mean to put OF CHAR in my original post,
>but whats done is done...

Even if you didn't mean to scream "FILE OF CHAR", I can't help but
chuckle that instead of being satisfied that you put your foot in
your mouth, you decided to place it there yet again. :-D

Quote:

>>When reading a text file a character at a time, you can read
>>_everything_, including CR-LF's, logical EOF (^Z), etc.  The
>>advantage over a file of byte is that by default the text file
>>does file i/o through a 128 byte buffer that can be made larger
>>with SetTextBuf.  Reading/Writing only one character per system
>>call is one way to insure poor program performance.
>--
>Of all the things I've lost, I miss my mind the most.

I guess this is more true than we first thought.   ;-)

    ...red



Wed, 18 Jun 1902 08:00:00 GMT  
 HTML parser in Turbo Pascal anyone?

Quote:
>The size of TextRec.Buffer is irrelevant, except for speed purposes...you can
>read a full 255 characters per line in TP, more if using PChars.

Like I said. Useless if it's one big 100k long line.
--
Of all the things I've lost, I miss my mind the most.

Procrastination is the sin of self destruction.



Wed, 18 Jun 1902 08:00:00 GMT  
 HTML parser in Turbo Pascal anyone?

hi.

Quote:
> >The size of TextRec.Buffer is irrelevant, except for speed purposes...you can
> >read a full 255 characters per line in TP, more if using PChars.
> Like I said. Useless if its one big 100k long line.

don't see the need. you can (and should) read html files char by char. i
strongly suggest a tBufStream for fast access (and unlimited ungetChar).
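
a rough sketch of the stream idea, assuming the TBufStream object from the Objects unit, a 2K buffer and a made-up file name. seeking back one byte with Seek(GetPos - 1) would be one way to get an ungetChar; that part is only an assumption and is not shown.

```pascal
program StreamScan;
{ Sketch: character-at-a-time reads through a buffered stream. }
uses Objects;
const
  FileName = 'sample.htm';   { hypothetical file name }
var
  T: Text;
  S: TBufStream;
  Ch: Char;
  Tags: Longint;
begin
  { Create a small test file so the sketch runs on its own. }
  Assign(T, FileName);
  Rewrite(T);
  Write(T, '<p>one <b>long</b> line</p>');
  Close(T);

  S.Init(FileName, stOpenRead, 2048);   { 2048 = buffer size in bytes }
  Tags := 0;
  while (S.Status = stOk) and (S.GetPos < S.GetSize) do
  begin
    S.Read(Ch, 1);           { served from the stream's buffer }
    if Ch = '<' then Inc(Tags);
  end;
  S.Done;
  WriteLn('tags opened: ', Tags);
end.
```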

to get some kind of line counting simply check each char read:
  - #10, skip all following #13
  - #13, skip all following #10
this pretty much catches all possible combinations: #10, #10#13, #13#10 and #13.
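
a minimal sketch of that check, working on a string to keep it short. CountBreaks is a made-up name. note one deliberate difference from the rule above: it skips at most one partner character after a break, so that repeated breaks (blank lines) are still counted separately.

```pascal
program LineCount;
{ Sketch: count line breaks, treating #13#10 and #10#13 as one break. }

function CountBreaks(const S: string): Integer;
var
  I, N: Integer;
begin
  N := 0;
  I := 1;
  while I <= Length(S) do
  begin
    if (S[I] = #10) or (S[I] = #13) then
    begin
      Inc(N);
      { swallow the partner of a two-character break }
      if (I < Length(S))
         and ((S[I + 1] = #10) or (S[I + 1] = #13))
         and (S[I + 1] <> S[I]) then
        Inc(I);
    end;
    Inc(I);
  end;
  CountBreaks := N;
end;

begin
  { one break of each kind: #13#10, #10, #13, #10#13 }
  WriteLn(CountBreaks('a'#13#10'b'#10'c'#13'd'#10#13'e'));   { 4 }
end.
```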

##

some time ago i made a  tInput  object (not *that* cool, but works), that
supports multifile input from streams, memory and stack. this might be of
basic interest to catch includes like <script src="some.js">.

if someone's interested i could post it to this group.

bye

Thomas.  \/    

      ~~ \/ ~~ WWW : http://www.stud.fernuni-hagen.de/q4307909/welcome.htm


