FUDforum
Spiders and Bots [message #165940] Sun, 28 August 2011 22:32
The Witcher (United States)
I'm interested in the new Spider Manager, but there are a few things I don't quite understand (surprise, surprise).

"Useragent:
Spider's useragent string (partial matches are accepted).
"????

I found this reference for user agent strings, so I assume these would be the user agent strings for:
Bing: Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
Google: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
NerdByNature: Mozilla/5.0 (compatible; NerdByNature.Bot; http://www.nerdbynature.net/bot)
Etc.

I copied those out of my server's stats/requests page, so are these examples of the strings that need to be input? I'm not seeing anything associated with spiders or bots anywhere else except in these! Is this correct?

"IP Addresses:
Comma separated list of IP Addresses used by the spider.
"????

As for IP addresses, I understand those well enough. What I don't see is how to associate an IP with the bots actually crawling the site, other than checking them one at a time.
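
One way to avoid checking addresses one at a time is to group the IPs in the server's access log by the bot name found in each request's user agent. The following is only a rough sketch in Python: it assumes the log is in Apache "combined" format (the user agent is the last quoted field), the function name `ips_by_bot` is made up, and the sample lines are invented for illustration using documentation-only IP addresses.

```python
import re
from collections import defaultdict

# Assumption: Apache "combined" log format, where the line starts with the
# client IP and ends with the quoted user agent string.
LINE_RE = re.compile(r'^(\S+) .*"([^"]*)"$')

def ips_by_bot(log_lines, tokens):
    """Collect the client IPs whose user agent contains each bot token."""
    found = defaultdict(set)
    for line in log_lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip lines that do not look like combined-format entries
        ip, agent = match.groups()
        for token in tokens:
            if token.lower() in agent.lower():
                found[token].add(ip)
    return {token: sorted(ips) for token, ips in found.items()}

# Invented sample lines (not taken from any real log).
sample_lines = [
    '192.0.2.10 - - [28/Aug/2011:22:00:01 +0000] "GET /index.php HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.11 - - [28/Aug/2011:22:00:05 +0000] "GET /index.php?t=1 HTTP/1.1" 200 734 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '198.51.100.7 - - [28/Aug/2011:22:01:12 +0000] "GET /index.php HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"',
    '203.0.113.5 - - [28/Aug/2011:22:02:40 +0000] "GET /index.php HTTP/1.1" 200 512 "-" "Mozilla/5.0 (Windows NT 6.1; rv:6.0) Gecko/20100101 Firefox/6.0"',
]

for bot, ips in ips_by_bot(sample_lines, ["Googlebot", "bingbot"]).items():
    print(bot, ", ".join(ips))
```

The comma-separated output per bot is in the same shape the "IP Addresses" field asks for. A user agent can be faked, so the result is a starting list rather than proof that an address belongs to a given spider.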

In the past I've always used a "robots.txt" file, so this is a new development for me. However, in the last few weeks one particular IP has shown up repeatedly (hundreds of times) as replying, browsing, or as errors in my log, apparently requesting access to files or functions that do not exist or are not enabled.

ISP Information lists this as "JPNIC", using a range of IPs from 119.63.192.0 - 119.63.199.255. So far I have copied perhaps 30 or so specific IPs from within this range.

So obviously I can input those 30 IP addresses separated by commas, but is there a way to input the entire range used by this or any other bot/spider without entering a hundred or more individual IPs?
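
For reference, the range quoted above collapses into a single CIDR block, which can be checked with Python's standard `ipaddress` module. This is only a sketch of the arithmetic: the helper name `in_range` is made up, and nothing in this thread says whether the Spider Manager or the IP Filter accepts CIDR notation.

```python
import ipaddress

# The range reported for "JPNIC" in the post above.
first = ipaddress.ip_address("119.63.192.0")
last = ipaddress.ip_address("119.63.199.255")

# summarize_address_range yields the smallest set of networks covering the range.
blocks = list(ipaddress.summarize_address_range(first, last))
print([str(block) for block in blocks])  # ['119.63.192.0/21']
print(blocks[0].num_addresses)           # 2048

def in_range(ip):
    """Return True if ip falls inside the summarized range (hypothetical helper)."""
    return any(ipaddress.ip_address(ip) in block for block in blocks)

print(in_range("119.63.195.7"))  # True
print(in_range("119.63.200.1"))  # False
```

So the 30 or so copied addresses, and the other 2,000-odd in the range, are all described by 119.63.192.0/21.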


"I'm a Witcher, I solve human problems; not always using a sword!"
Re: Spiders and Bots [message #165961 is a reply to message #165940] Fri, 02 September 2011 15:11
naudefj (South Africa)
Administrator
Core Developer
I guess it would be possible to block users with the Spider Manager. However, that isn't really its purpose. It would be much easier to use the IP Filter ACP.
Re: Spiders and Bots [message #165988 is a reply to message #165961] Fri, 02 September 2011 20:34
The Witcher (United States)
I guess you missed my point: I wasn't sure what a user agent string is, and from what I found I was uncertain which portion of it to use! Nor was I certain how to identify spiders and bots beyond individual searches or online lists.

As for "JPNIC", I could not identify them as a bot or spider, or explain all the reply and page-not-found FUD errors they were generating. I didn't want to ban a whole range of IPs without knowing, since the forum's subject has an international user base.

But I think this HERE explains it.

As for "Spiders and Bots"
I defined Google as a test on a backup. Today Google is the newest registered user, and on logging out I see:

Warning: preg_match() [function.preg-match]: Unknown modifier '5' in /home/user/domain.name/index.php on line 309
309	if (preg_match('/^'. $spider['useragent'] .'/i', $_SERVER['HTTP_USER_AGENT'])) {
					if (empty($spider['bot_ip'])) {

Quote:
Defined spider:
Google: GoogLe:Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) : xx.xxx.xx.xxx


I deleted "/5.0" from the spider definition and checked again:
Warning: preg_match() [function.preg-match]: Unknown modifier '2' in /home/user/domain.name/index.php on line 309
I then deleted "(compatible; Googlebot/2.1; +http://www.google.com/bot.html)" from the spider definition.

No further warnings yet! So it appears that you just need the user name and browser type.
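
The two warnings are consistent with how the quoted line 309 builds its pattern: the useragent is placed between "/" delimiters, so a "/" inside the useragent ends the pattern early and the next character is read as a modifier. The sketch below imitates that scan in Python. It is a simplified illustration, not PHP's actual parser and not FUDforum code; `stray_modifier` and `PHP_MODIFIERS` are made-up names.

```python
import re

# Common PCRE modifier letters accepted by PHP (assumed list, for illustration).
PHP_MODIFIERS = set("imsxuADSUXJ")

def stray_modifier(useragent, delimiter="/"):
    """Return the character PHP would likely report as an unknown modifier."""
    # Same shape as the quoted line 309: '/^' . useragent . '/i'
    pattern = delimiter + "^" + useragent + delimiter + "i"
    i = 1
    # Find the first unescaped delimiter after the opening one.
    while i < len(pattern) and pattern[i] != delimiter:
        i += 2 if pattern[i] == "\\" else 1
    # Everything after it is treated as modifiers.
    for ch in pattern[i + 1:]:
        if ch not in PHP_MODIFIERS:
            return ch
    return None

full = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
trimmed = "Mozilla (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

print(stray_modifier(full))         # 5
print(stray_modifier(trimmed))      # 2
print(stray_modifier("Googlebot"))  # None

# Escaping the string first lets a fragment containing "/" be matched literally.
print(bool(re.search(re.escape("Googlebot/2.1"), full, re.IGNORECASE)))  # True
```

In PHP the usual counterpart to `re.escape` is `preg_quote()` with the delimiter passed as its second argument. Whether later FUDforum releases escape the useragent this way is not shown in this thread, so a short token without "/" is the safe choice.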



"I'm a Witcher, I solve human problems; not always using a sword!"
Re: Spiders and Bots [message #166014 is a reply to message #165961] Sat, 03 September 2011 18:16
The Witcher (United States)
As usual I completely missed the obvious and was reading too much into the simple instructions!
On a fresh 3.0.3RC2 install, three bots/spiders are already listed, providing a clear example of the form the name and user agent string need to take. The only thing missing is the IP address, which is easy enough to find online.

Bot Name    Useragent
Bing        msnbot
Google      Googlebot
Yahoo!      Slurp

So obviously the browser type isn't required either which makes sense, seeing as there are so many different browsers in use.
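
The "partial matches are accepted" behaviour can be pictured as a case-insensitive substring test against the full user agent string. The sketch below is Python written for illustration only, not FUDforum's PHP; `matches_spider` is a made-up name, and the user agent strings are the ones quoted in the first post.

```python
def matches_spider(token, user_agent):
    """Case-insensitive partial match of a short token against a user agent."""
    return token.lower() in user_agent.lower()

google_ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
bing_ua = "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"

print(matches_spider("Googlebot", google_ua))  # True
print(matches_spider("googlebot", google_ua))  # True
print(matches_spider("msnbot", bing_ua))       # False
print(matches_spider("bingbot", bing_ua))      # True
```

Note that the default "msnbot" token does not occur in the bingbot string quoted in the first post, so a separate "bingbot" entry may be needed to catch that crawler.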



"I'm a Witcher, I solve human problems; not always using a sword!"