A few speed questions... 

Hello

I am trying to optimise a 'web crawler' PHP script on our server as much
as possible, and I have a few questions about the speed of functions,
MySQL queries, and so on:

--- 1. ---
Is it faster to file_get_contents($url) a URL (for those who don't use
PHP/4.3.0, see http://www.*-*-*.com/ ) or to use some other method? Are
the fopen wrappers the most efficient way to fetch a URL?
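
For reference, here's roughly what I'm comparing - the one-liner against
a cURL-based fetch (fetch_url is just an illustrative name, and the cURL
extension has to be compiled in for the second version to work):

    <?php
    // The one-liner, using the fopen wrappers (needs allow_url_fopen = On):
    $html = file_get_contents('http://example.com/');

    // Alternative using the cURL extension, assuming it is available.
    function fetch_url($url)
    {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the body as a string
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); // follow HTTP redirects
        curl_setopt($ch, CURLOPT_TIMEOUT, 30);       // don't hang on slow servers
        $body = curl_exec($ch);
        curl_close($ch);
        return $body;
    }

    $html = fetch_url('http://example.com/');
    ?>

Both return the page body as a string; the cURL version mainly buys
explicit control over timeouts and redirects.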

--- 2. ---
As the crawler retrieves pages, it rips all the URLs out of them and
adds them to the database. The function I'm currently using to get the
URLs is:

function gl($url, $input)
{
    // Pull the domain out of the page's own URL so relative links
    // can be resolved against it.
    $match_domain = '_[hH][tT][tT][pP]:\/\/(.*?)(/|$)_';
    if (!preg_match($match_domain, $url, $res)) return false;
    $domain = $res[1];
    if (!$domain) return false;

    $lookfor = '/<[aA]\s.*?[hH][rR][eE][fF]=[ "\']{0,}([-.,\%_\(\)|=~;+:\?\&\/a-zA-Z0-9]+)[ "\'>]/';
    preg_match_all($lookfor, $input, $data);

    $links = array();   // initialise, so an empty page returns an array
    while (list($k, $v) = each($data[1])) {
        if (stristr($v, 'javascript:')) {
            // skip javascript: pseudo-links
        } elseif (stristr($v, '//') == $v) {
            // protocol-relative link: add the scheme
            $v = 'http:' . $v;
            $links[] = $v;
        } elseif (stristr($v, 'http://') != $v) {
            // relative link: resolve against the page's domain
            if (stristr($v, '/') != $v) $sep = '/';
            else $sep = '';
            $v = 'http://' . $domain . $sep . $v;
            $links[] = $v;
        } else {
            $links[] = $v;
        }
    }
    return $links;
}

I pass the URL of the page so it can resolve relative links, and the
page contents as a string. Are there any major (or even minor) efficiency
problems in this function, and if so, are there any suggested solutions?
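
One idea I have considered but not tested: replace the
[hH][tT][tT][pP]-style character classes with the /i (case-insensitive)
modifier, and use parse_url() instead of a regex to get the domain. A
rough sketch (gl2 is a made-up name):

    <?php
    // Same logic as gl(), with the /i modifier, parse_url() for the
    // domain, and $links initialised so the function always returns
    // an array when it succeeds.
    function gl2($url, $input)
    {
        $parts = parse_url($url);
        if (empty($parts['host'])) return false;
        $domain = $parts['host'];

        $links = array();
        preg_match_all('/<a\s[^>]*href=["\']?([^"\'> ]+)/i', $input, $data);
        foreach ($data[1] as $v) {
            if (stristr($v, 'javascript:')) continue;          // skip script links
            if (substr($v, 0, 2) == '//') {
                $v = 'http:' . $v;                             // protocol-relative
            } elseif (strtolower(substr($v, 0, 7)) != 'http://') {
                $sep = (substr($v, 0, 1) == '/') ? '' : '/';   // relative link
                $v = 'http://' . $domain . $sep . $v;
            }
            $links[] = $v;
        }
        return $links;
    }
    ?>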

--- 3. ---
This one isn't really to do with speed, but: As each URL gets crawled,
the entire contents of the page are saved into a database field. This
works fine when it is HTML, but if it's a DOC or PDF file, which we also
index and can convert to HTML, I was wondering if using addslashes() on
the content of the page, which I do to get it to save into the DB, will
damage the DOC or PDF data? If so, how could I save it to the DB without
damaging the data?
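
For reference, this is roughly how the saving currently works - the
table and column names are simplified for the example:

    <?php
    // Made-up schema: CREATE TABLE pages (url VARCHAR(255), content LONGBLOB);
    mysql_connect('localhost', 'user', 'pass');    // placeholder credentials
    mysql_select_db('crawler');                    // made-up database name

    $url = 'http://example.com/files/document.pdf';
    $raw = file_get_contents($url);                // raw binary bytes

    // addslashes() escapes quotes, backslashes and NUL bytes so the
    // query string parses; MySQL un-escapes them again, so the BLOB
    // ends up holding the original bytes. mysql_real_escape_string()
    // (PHP >= 4.3.0) does the same job, aware of the connection.
    $sql = sprintf("INSERT INTO pages (url, content) VALUES ('%s', '%s')",
                   addslashes($url), addslashes($raw));
    mysql_query($sql);
    ?>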

Any help on any of these questions will be greatly appreciated.

Regards,

Jasper Bryant-Greene
http://www.*-*-*.com/



Sun, 26 Jun 2005 04:33:18 GMT  
 A few speed questions...


Quote:
> Hello
> <snip>
> --- 3. ---
> This one isn't really to do with speed, but: As each URL gets crawled,
> the entire contents of the page are saved into a database field. This
> works fine when it is HTML, but if it's a DOC or PDF file, which we also
> index and can convert to HTML, I was wondering if using addslashes() on
> the content of the page, which I do to get it to save into the DB, will
> damage the DOC or PDF data? If so, how could I save it to the DB without
> damaging the data?

> Any help on any of these questions will be greatly appreciated.

> Regards,

> Jasper Bryant-Greene
> http://fatal.kiwisparks.co.nz/
> </snip>

As far as Question 3 is concerned, addslashes() is the right way to do
it as far as I know - it works for me. Use stripslashes() when you
retrieve from the database and all will be well.

I assume you are using a BLOB field to store the binary files (DOC/PDF etc.).
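
For example, something like this on the way back out (same made-up
pages table as the sketch in the question; note the comment about when
stripslashes() actually applies):

    <?php
    // Retrieval side, reusing the connection opened for the INSERT.
    $url = 'http://example.com/files/document.pdf';
    $res = mysql_query("SELECT content FROM pages WHERE url = '"
                     . addslashes($url) . "'");
    $row = mysql_fetch_assoc($res);

    // MySQL removed the escaping when it parsed the INSERT, so the
    // column already holds the original bytes. stripslashes() is only
    // needed if the data was escaped twice on the way in (for example
    // with magic_quotes_runtime turned on).
    $raw = $row['content'];
    header('Content-Type: application/pdf');       // assuming it was a PDF
    echo $raw;
    ?>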

Regards

Ron



Sun, 26 Jun 2005 09:20:58 GMT  
 