HTML header parsing 

If you're downloading web pages, this code will parse the HTTP response
header (the "HTML header") into a UDT, which makes it easier to query:

TYPE HTML_HEADER
  ErrorMsg             AS ASCIIZ * 100
  Current_Date         AS ASCIIZ * 36
  Server_Type          AS ASCIIZ * 100
  LastModdate          AS ASCIIZ * 36
  Etag                 AS ASCIIZ * 24
  AcceptRanges         AS ASCIIZ * 12
  Content_Type         AS ASCIIZ * 24
  Content_Length       AS LONG
  Content_Location     AS ASCIIZ * 128
END TYPE

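' Returns the value of the header line that starts with "find"
' (e.g. "Server:"), or an empty string if no such line exists.
FUNCTION RegEx(BYVAL find AS STRING, buffer AS STRING) AS STRING

  LOCAL p AS LONG   ' position of the match
  LOCAL l AS LONG   ' length of the match (0 if nothing matched)

  ' Anchor at the start of a line and match through to the end of it
  find = "^" & find & ".*$"
  REGEXPR find IN buffer TO p, l
  IF l THEN
    ' Keep only the text after the first ": " separator
    FUNCTION = REMAIN$(MID$(buffer, p, l), ": ")
  END IF

END FUNCTION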

FUNCTION ParseHeader(html AS STRING) AS STRING

  LOCAL udt AS HTML_HEADER
  LOCAL head AS STRING

  head = EXTRACT$(html, $CRLF & $CRLF)

  udt.ErrorMsg         = PARSE$(head, $CRLF, 1)
  udt.Server_Type      = RegEx("Server:", head)
  udt.Content_Type     = RegEx("Content-Type:", head)
  udt.Current_Date     = RegEx("Date:", head)
  udt.LastModdate      = RegEx("Last-Modified:", head)
  udt.Etag             = RegEx("ETag:", head)
  udt.AcceptRanges     = RegEx("Accept-Ranges:", head)
  udt.Content_Length   = VAL(RegEx("Content-Length:", head))
  udt.Content_Location = RegEx("Content-Location:", head)

  FUNCTION = udt

END FUNCTION

FUNCTION PbMain()

  LOCAL html AS STRING
  LOCAL htmlHead AS HTML_HEADER

  html = "HTTP/1.1 200 OK" & $CRLF & _
         "Server: Microsoft-IIS/5.0" & $CRLF & _
         "Content-Location: http://www.microsoft.com/Default.htm" & $CRLF & _
         "Date: Fri, 01 Dec 2000 05:43:38 GMT" & $CRLF & _
         "Content-Type: text/html" & $CRLF & _
         "Accept-Ranges: bytes" & $CRLF & _
         "Last-Modified: Wed, 29 Nov 2000 20:47:27 GMT" & $CRLF & _
         "ETag: ""c8d3c195455ac01:1511""" & $CRLF & _
         "Content-Length: 21436" & $CRLF & $CRLF & _
         "web content goes here"

  LSET htmlHead = ParseHeader(html)

  PRINT "   Error Message: "; htmlHead.ErrorMsg
  PRINT "     Server Type: "; htmlHead.Server_Type
  PRINT "    Content Type: "; htmlHead.Content_Type
  PRINT "    Current Date: "; htmlHead.Current_Date
  PRINT "   Last Modified: "; htmlHead.LastModdate
  PRINT "            ETag: "; htmlHead.Etag
  PRINT "   Accept Ranges: "; htmlHead.AcceptRanges
  PRINT "  Content Length:";  htmlHead.Content_Length
  PRINT "Content Location: "; htmlHead.Content_Location

END FUNCTION
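
The ErrorMsg field above holds the whole status line ("HTTP/1.1 200 OK"), not just an error text. If only the numeric status is needed, something along these lines might do. This is a minimal sketch: StatusCode is a hypothetical helper that was not part of the original post, and it assumes the status line is space-delimited as in the sample data.

```
' Hypothetical helper (not from the original post): returns the numeric
' status code from a status line such as "HTTP/1.1 200 OK".
FUNCTION StatusCode(BYVAL statusLine AS STRING) AS LONG
  ' Assumes the line is space-delimited: version, code, reason phrase.
  ' VAL yields 0 if the second field is missing or not numeric.
  FUNCTION = VAL(PARSE$(statusLine, " ", 2))
END FUNCTION

FUNCTION PbMain()

  ' Two sample status lines; the codes should come back as 200 and 404
  PRINT StatusCode("HTTP/1.1 200 OK")
  PRINT StatusCode("HTTP/1.1 404 Not Found")

END FUNCTION
```

With the parser above, it could be called as StatusCode(htmlHead.ErrorMsg) right after the LSET line, for example to skip pages that did not come back with 200.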



Sun, 01 Feb 2004 01:33:45 GMT  
All good stuff.

Thanks

Derek

Quote:

> If you're downloading web pages, this code will parse an HTML header into a
> UDT which makes it easier for querying:

<SNIPPED useful stuff which you can see in a previous posting>


Sun, 01 Feb 2004 08:18:05 GMT  
Dave, why not post this on the PB site? It will be stored better there.
I don't need it now, but I might later.
If I store it myself I might lose it too :)

Thanks,

Quote:

>If you're downloading web pages, this code will parse an HTML header into a
>UDT which makes it easier for querying:




Tue, 03 Feb 2004 03:55:48 GMT  

Quote:
> Dave, why not post this on the PB site, it will be stored better.
> I don't need it now but maybe later you see.
> If i store it i might lose it too :)

I've posted all of this stuff on the PB site, but some people don't visit
often.  And like I said in a previous post, it had been a while since code
was posted in the forum.

Besides, now it's searchable at Google Groups. <grin>

--Dave



Wed, 04 Feb 2004 07:06:25 GMT  
Google groups?

I'm going to check this out.
I do have some problems with newsgroup access through my provider.
Dejanews or Google Groups might be a way around that.

Thanks,




Wed, 04 Feb 2004 16:35:16 GMT  
 