HTML scraping (WSDL HTML pattern matching) 

As an exercise, I wanted to convert an old HTML scraping application I had
to .NET. In particular, I wanted to try a custom WSDL with HTML pattern
matching and use wsdl.exe to generate a proxy class that does the HTML
scrape.

I've only been able to get this to work when retrieving the first match (as
the example in the ASP.NET QuickStart tutorial demonstrates).

Does anyone know the syntax for pulling out all occurrences of a pattern
using this method?

I'm not sure if this is a RegExp problem or my limited knowledge of using
WSDL for screen scraping...

Example:

Say I want to pull out all email addresses from a web page.

I created a .wsdl file which has a pattern
<tm:match name="Email" type="" pattern="mailto:(.*?)&lt;" ignoreCase="true"/>

I then used wsdl.exe to generate a proxy. The proxy can successfully pull
out the first instance of the matching pattern, but how do I get it to pull
out all occurrences of the pattern and return a collection?

Any help or pointers to references greatly appreciated!
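
To separate the RegExp question from the WSDL question, the pattern can be tried on its own outside wsdl.exe. The following is only a sketch in Python, not .NET code, and the sample page text is made up for illustration. It suggests that the expression itself can match every occurrence, so the "first match only" behaviour would come from how the match is configured rather than from the pattern.

```python
import re

# Made-up page fragment for illustration; not taken from a real site.
html = "<li>mailto:alice@example.com</li><li>MAILTO:bob@example.com</li>"

# Same expression as in the <tm:match> element, with ignoreCase="true"
# mapped to re.IGNORECASE.
pattern = re.compile(r"mailto:(.*?)<", re.IGNORECASE)

# Equivalent of "first match only": one capture of group 1.
first = pattern.search(html).group(1)

# Equivalent of "all occurrences": every capture of group 1, as a list.
all_matches = pattern.findall(html)

print(first)
print(all_matches)
```

Note that the lazy group `(.*?)` captures everything up to the next `<`, so it only yields a clean address when the address is followed directly by a tag.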



Sat, 27 Nov 2004 22:57:11 GMT  
FYI, in case anyone is interested.

I found some documentation on MSDN outlining the attributes possible for the
<match> element in a WSDL document, which is exactly what I was looking for.
By specifying a repeat value, you can get it to return an array of matches.

[The documentation says to use -1, but I only got it to work with a value of
0]




Mon, 29 Nov 2004 06:27:26 GMT  
Let me clarify that last statement: "The documentation says to use -1, but I
only got it to work with a value of 0".

The wsdl.exe tool does not seem to like a <match> element with a repeats
attribute of -1:   <match ... repeats="-1">
The error message I get is: "The value for 'repeats' cannot be negative."

The only way I got it to compile was to set repeats="0" in the WSDL.
I then used wsdl.exe to generate a proxy class.
The proxy correctly generated arrays where required, HOWEVER, I had to set
MaxRepeats=-1 for the MatchAttribute in the proxy.

Seems like a bug to me, but the generated proxy class is easy to
understand and modify... so no drama.
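
For anyone trying to picture what the repeat limit does, here is a rough sketch in Python. It is not the .NET implementation: the function `extract` and its `max_repeats` parameter are hypothetical names, and the convention that -1 means "no limit" is assumed from the MaxRepeats=-1 setting described above.

```python
import re
from typing import List


def extract(text: str, pattern: str, max_repeats: int = 1) -> List[str]:
    """Return up to max_repeats captures of group 1.

    Assumption for illustration: -1 means "return every match".
    """
    if max_repeats < -1:
        raise ValueError("max_repeats must be -1 or non-negative")
    captures = [m.group(1) for m in re.finditer(pattern, text, re.IGNORECASE)]
    return captures if max_repeats == -1 else captures[:max_repeats]


# Made-up page fragment for illustration.
page = "<li>mailto:alice@example.com</li><li>mailto:bob@example.com</li>"

first_only = extract(page, r"mailto:(.*?)<")                  # default limit of 1
everything = extract(page, r"mailto:(.*?)<", max_repeats=-1)  # no limit

print(first_only)
print(everything)
```

With a limit of 1 the result holds a single address, which mirrors the "first match only" behaviour from the original question; with -1 it holds all of them.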




Mon, 29 Nov 2004 07:07:07 GMT  
 