Extracting top level domain from remote host 

I have a script on my web site that generates a log for each visit.  It
stores the date and time, host name of the visitor and his browser type.
Normally I use this for troubleshooting purposes but I found another use for
this.  My web host does not provide a distribution of the countries that
generate hits on my site (unlike a host I had in the past).

I figured that from this very same log I have been keeping all along I can
get this information, so I decided to make a script to look up the top level domain
for all hits in the log.  Easy enough, I said.  And it's true except for the
regexp needed to do the job.  I do not yet fully understand the regular
expression beast and so far I have only worked to some extent (and with help
in some cases) with pattern matching and substitution.  What I want to do
now is to extract the top level domain to make some arrays and hashes and
such (I know how to handle that part) to generate the statistical
information I want.

I am using a loop like this:


 {
  ($dummy,$host,$dummy) = split (/,/, $thisentry);
  $host =~ /\.\w*$/i;
  print "$host ";
 }


The second comma separated field is the one with the host name (taken from the
REMOTE_HOST environment variable).  The split part works properly.  The next
line, with the regular expression, is the one that doesn't work.  $host comes
out blank.  It is printing for testing purposes, but when this finally works
the loop will do its intended job.

I want to extract everything after the last period which will be from two
(for country tlds) to four characters (yes, there is an .arpa tld; I wonder
what it is for).  The length is not important.  I just want to make a list
of all tlds found in the log file to look at it and generate the country
statistical information I want.

What is the proper regexp to get this information?

Luis E. Rodriguez



Wed, 18 Jun 1902 08:00:00 GMT  
 Extracting top level domain from remote host

Quote:

>This small modification should do what you need..


> ($dummy,$host,$dummy) = split (/,/, $thisentry);
> if ($host =~ /\.(\w+)$/){
>  $host = $1;
>  print "$host\n";
> }

That did it.  Thank you very much.  I see I was close.  I didn't know
that the results were stored in $1.

Luis E. Rodriguez



Wed, 18 Jun 1902 08:00:00 GMT  
 Extracting top level domain from remote host

Quote:
>I have a script on my web site that generates a log for each visit.  It
>stores the date and time, host name of the visitor and his browser type.
>Normally I use this for troubleshooting purposes but I found another use for
>this.  My web host does not provide a distribution of the countries that
>generate hits on my site (unlike a host I had in the past).

   The host you had in the past was feeding you bogus info.

   TLDs are just a convention.

   There are hosts outside of the U.S. whose names end in .com

   There are hosts inside of the U.S. whose names end in .to

   Firewalls, gateways, international corporate networks, etc
   muddy the waters even further...

Quote:
>I figured that from this very same log I have been keeping all along I can
>get this information

   Realizing that "this information" is not at all reliable for
   determining what country a browser is in, I hope.

Quote:
> so I decided to make a script to look top level domain
>for all hits in the log.  Easy enough, I said.  And it's true except for the
>regexp needed to do the job.  I do not yet fully understand the regular
>expression beast and so far I have only worked to some extent (and with help
>in some cases) with pattern matching and substitution.  What I want to do
>now is to extract the top level domain to make some arrays and hashes and
>such (I know how to handle that part) to generate the statistical
>information I want.

>I am using a loop like this:


> {
>  ($dummy,$host,$dummy) = split (/,/, $thisentry);

   You don't need dummy variables with a modern perl.

   You also don't need placeholders "at the end" as any extra
   values will be discarded by perl anyway.

      (undef,$host) = split (/,/, $thisentry);

   Or even:

      $host = (split (/,/, $thisentry))[1];
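
   As a minimal sketch, the three forms can be compared side by side.
   The sample log line and its date,host,browser layout are assumptions
   made for illustration, not taken from the real log:

```perl
use strict;
use warnings;

# Hypothetical log line; the date,host,browser layout is assumed.
my $thisentry = '2000-06-18 08:00,dialup42.example.net,Mozilla/4.7';

# 1. Original style: named placeholders before and after the host.
my ($dummy1, $host_a, $dummy2) = split /,/, $thisentry;

# 2. undef as the leading placeholder; trailing fields are discarded.
my (undef, $host_b) = split /,/, $thisentry;

# 3. List slice: keep only the second field (index 1).
my $host_c = (split /,/, $thisentry)[1];

# All three print the same host name.
print "$host_a\n$host_b\n$host_c\n";
```

   The list slice is the most compact form when only one field is wanted.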

Quote:
>  $host =~ /\.\w*$/i;

   A "bare match" like that is useless.

   It cannot possibly have any effect on any variable in your program.
   (it does not "change" anything)

   A "bare substitution" is sometimes what is wanted, but a bare
   match is _never_ what is wanted.

   A match with some capturing parentheses in the pattern is
   _not_ useless, but the match should still not be "bare".

   It should be in a conditional somewhere if you plan to make
   use of capturing parens.

   The 'i' option is also useless and can have no effect.

   The 'i' option affects the case of letters in your pattern,
   but you do not _have_ any letters in your pattern.

   So what you have might be "useful" if used like:

      if ( $host =~ /\.(\w*)$/ ) {
         # do something with $1 here
         my $tld = $1;
         print "the TLD of '$host' is '$tld'\n";
      }
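
   A small sketch of the difference between the bare match and a
   capturing match in a conditional. The host name is made up, and \w+
   is used here instead of \w* so that a trailing dot is not accepted:

```perl
use strict;
use warnings;

my $host = 'dialup42.example.net';   # hypothetical host name

# A bare match only tests the string; $host is left untouched.
$host =~ /\.\w*$/;
print "after bare match: $host\n";   # after bare match: dialup42.example.net

# Capture inside a conditional so $1 is only used after a successful match.
my $tld;
if ( $host =~ /\.(\w+)$/ ) {
    $tld = $1;
    print "the TLD of '$host' is '$tld'\n";
}
else {
    print "no TLD found in '$host'\n";
}
```

   The conditional matters because a failed match does not reset $1;
   it still holds whatever the last successful match captured.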

Quote:
>  print "$host ";
> }


>comma separated field is the one with the host name (gotten from the
>REMOTE_HOST environment variable).  The split part works properly.  Next
>line with the regular expression is the one that doesn't work.  $host comes
>out blank.

   If it is "coming out" blank, then it "went in" blank,
   as the code you have does not modify the contents of $host
   anywhere following the split().

   Try putting a print after the split() and before the match.

Quote:
> It is printing for testing purposes but when this finally works
>then the loop will do its intended job.

   Eh?

Quote:
>I want to extract everything after the last period

   That would be:

      /\.([^.]+)$/;  # a dot then one or more non-dot chars at end of string

   What you have does not get "everything" after the dot.

   It will only match letters, digits, and underscores after the dot.
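
   To make the difference between the two patterns concrete, here is a
   small sketch. The helper names tld_word and tld_nondot and both host
   names are made up for illustration; xn--p1ai is used only as an
   example of a final label that contains hyphens:

```perl
use strict;
use warnings;

# Last label using \w+ (letters, digits, underscore only), or undef.
sub tld_word {
    my ($host) = @_;
    return $host =~ /\.(\w+)$/ ? $1 : undef;
}

# Last label using [^.]+ (anything except a dot), or undef.
sub tld_nondot {
    my ($host) = @_;
    return $host =~ /\.([^.]+)$/ ? $1 : undef;
}

for my $host ('dialup42.example.net', 'www.example.xn--p1ai') {
    my $w = tld_word($host);
    my $n = tld_nondot($host);
    printf "%-22s \\w+: %-12s [^.]+: %s\n", $host,
        defined $w ? $w : '(no match)',
        defined $n ? $n : '(no match)';
}
```

   For the hyphenated label the \w+ pattern fails to match at all,
   while [^.]+ returns the whole label.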

Quote:
>I just want to make a list
>of all tlds found in the log file to look at it and generate the country
>statistical information I want.

   As long as you keep in mind that your stats are only WAG-quality
   (Wild Ass Guess).

Quote:
>What is the proper regexp to get this information?

   See above.

--
    Tad McClellan                          SGML Consulting

    Fort Worth, Texas



Wed, 18 Jun 1902 08:00:00 GMT  
 Extracting top level domain from remote host


Quote:


>  {
>   ($dummy,$host,$dummy) = split (/,/, $thisentry);
>   $host =~ /\.\w*$/i;
>   print "$host ";
>  }

This small modification should do what you need..


 ($dummy,$host,$dummy) = split (/,/, $thisentry);
 if ($host =~ /\.(\w+)$/){
  $host = $1;
  print "$host\n";
 }


Wyzelli


Wed, 18 Jun 1902 08:00:00 GMT  
 Extracting top level domain from remote host

<snip>

Quote:
> I am using a loop like this:


>  {
>   ($dummy,$host,$dummy) = split (/,/, $thisentry);
>   $host =~ /\.\w*$/i;
>   print "$host ";
>  }


> comma separated field is the one with the host name (gotten from the
> REMOTE_HOST environment variable).  The split part works properly.  Next
> line with the regular expression is the one that doesn't work.  $host comes
> out blank.  It is printing for testing purposes but when this finally works
> then the loop will do its intended job.

$host shouldn't be blank; it should be unchanged.

Quote:
> I want to extract everything after the last period which will be from two
> (for country tlds) to four characters (yes, there is an .arpa tld - wonder
> what is it for).  The length is not important.  I just want to make a list
> of all tlds found in the log file to look at it and generate the country
> statistical information I want.

> What is the proper regexp to get this information?

You need to capture the required part of the string:

   print $1 if( $host =~ /\.(\w+)$/ );

Of course you will want to do something else with $1 if you want to do
some analysis on this stuff.
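
One possible shape for that analysis is a hash keyed by TLD. This is only
a sketch: the sub name count_tlds, the sample entries, the assumed
date,host,browser layout and the '(ip address)' and '(unknown)' bucket
names are all made up for illustration.

```perl
use strict;
use warnings;

# Count TLDs over a list of log entries and return a hash reference.
# Hosts that are unresolved IP addresses or have no dot get their own buckets.
sub count_tlds {
    my @entries = @_;
    my %count;
    for my $thisentry (@entries) {
        my $host = (split /,/, $thisentry)[1];
        my $key  = '(unknown)';
        if ( defined $host && $host =~ /\.([^.]+)$/ ) {
            my $tld = lc $1;    # host names are case-insensitive
            $key = $tld =~ /^\d+$/ ? '(ip address)' : $tld;
        }
        $count{$key}++;
    }
    return \%count;
}

# Hypothetical entries in the assumed date,host,browser layout.
my @log = (
    '2000-06-18 08:00,dialup42.example.net,Mozilla/4.7',
    '2000-06-18 08:05,proxy.example.NET,Mozilla/4.0',
    '2000-06-18 08:09,gw.example.com,Lynx/2.8',
    '2000-06-18 08:12,192.0.2.17,Mozilla/4.7',
);

my $count = count_tlds(@log);

# Most frequent first; ties broken alphabetically.
for my $tld ( sort { $count->{$b} <=> $count->{$a} || $a cmp $b } keys %$count ) {
    printf "%-14s %d\n", $tld, $count->{$tld};
}
```

Lower-casing the captured label keeps .NET and .net in the same bucket, and
separating numeric labels stops the last octet of an IP address from being
counted as a TLD.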

You will also want to read the perlre manpage for further information
on regular expressions.

/J\
--

<http://www.gellyfish.com>
Hastings: <URL:http://dmoz.org/Regional/UK/England/East_Sussex/Hastings>



Wed, 18 Jun 1902 08:00:00 GMT  
 Extracting top level domain from remote host

Quote:

>   The host you had in the past was feeding you bogus info.
>   TLDs are just a convention.
>   There are hosts outside of the U.S. whose names end in .com

I know that, and it is not a problem for my application.  It's for
internal use only, to get an idea of where visitors come from.  It is not
going to be used for decision making purposes.

The information is not bogus.  It's based on whatever the domain name says
it is.  The fact that .com .net and .org can be anywhere in the world
doesn't make the information gathered less useful.

I did run a first draft of my script using the regexp suggested in another
post and I found it interesting that .com is not at the top of the list.
The top tld is .net because there are far more ISPs around under the .net
tld than under .com.  Second on the list was a country specific domain name
and .com came out third.  The majority of the .com and .net are in the US
anyway.

Quote:
>   There are hosts inside of the U.S. whose names end in .to

I know that too, but those are used for redirects of web pages, not by ISPs,
so it is unlikely that any .to hit is not coming from Tonga.  Most web
surfers are connected to the Internet either by corporate networks (which
normally use .com or their own country-specific tld), academic networks
(under .edu in the US, under their own country tld elsewhere) or via a
commercial or non-commercial ISP (most of which use .net or their
country-specific tld).

The period covered by the log I am using includes some 2000 hits and there
is only one hit from .org and no hits from .to.  If it is worth mentioning,
there are actually five hits from the .us tld.



Wed, 18 Jun 1902 08:00:00 GMT  
 Extracting top level domain from remote host
Luis,

You have made a common error.  A plain match with the =~ operator (the
binding operator) does NOT modify anything; it only tests whether the
pattern matches.  (See Camel book, p. 81.)

You will be better served by using split, maybe something like


        @members = split(/\./, $host);   # split the host name on dots
        $host = $members[-1];            # the last element is the TLD
        print $host, "\n";

Good luck,

Marty Brownfield
Federal Express



