How to effectively configure your search engine robot detection module?
Your store provides a robot detection module (the BOT module) that scans all expired sessions and uses an effective heuristic to determine whether a visit came from a search engine robot or from a regular browser. Traffic coming from robots can seriously distort your statistics if it goes undetected, which is why this module is important for keeping your store statistics as accurate as possible.
How does it work? Your store comes with a database of several thousand useragent strings. The BOT module runs once a day and scans all expired sessions. If a useragent is unknown (not already in your useragent database), the module analyzes it further and establishes a suspicious ratio. When the investigation is done, it updates the suspicious BOT database and also sends you an email with its latest findings, inviting you to classify the newly found useragent as a BOT or as a regular browser.
If you set the useragent as a robot, all further traffic from that useragent will be counted as 'visits from robots'; all other traffic will be counted as regular traffic. Traffic coming from an unknown useragent is still counted as regular traffic until you make a decision about that useragent.
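To illustrate the flow just described, here is a minimal sketch in Python. The data structures and names are purely illustrative (this article does not show the module's actual code); the point is simply that known useragents are counted as robot or regular traffic, while unknown ones stay regular until you classify them and are reported back to you.

# Minimal sketch of the daily BOT-module pass over expired sessions.
# All names and data here are illustrative, not the module's real code.

known_useragents = {
    # useragent string -> 'robot' or 'browser', as decided by the store admin
    "Googlebot/2.1 (+http://www.google.com/bot.html)": "robot",
    "Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US) Firefox/2.0.0.20": "browser",
}

def classify_expired_sessions(expired_sessions):
    """Marks each session as robot or regular traffic; unknown useragents
    stay 'browser' until the admin decides, but are returned for review."""
    unknown = []
    for session in expired_sessions:
        ua = session["useragent"]
        if ua in known_useragents:
            session["traffic_type"] = known_useragents[ua]
        else:
            session["traffic_type"] = "browser"   # counted as regular for now
            unknown.append(ua)
    return unknown   # these are the useragents the warning email asks you about

sessions = [
    {"useragent": "Googlebot/2.1 (+http://www.google.com/bot.html)"},
    {"useragent": "SomeNewCrawler/1.0"},
]
print(classify_expired_sessions(sessions))   # -> ['SomeNewCrawler/1.0']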
As we said above, the BOT module uses a heuristic rather than an exact algorithm, for the simple reason that no such algorithm exists: there is no method that can guarantee the detection of robots with 100% success. The BOT module therefore analyzes each visit against several criteria, and every time a criterion is met the suspicious rate is increased by a 'weight' whose value can be configured separately for each criterion.
We have already set up what we think is an optimal configuration for these criteria; however, you may need to change it if the traffic your site receives is very different from that of a standard web site.
Configuring the list of robot detection criteria
To access the robot criteria configuration panel, select the 'Browsers List' option from the 'OTHER SETTINGS' menu section:
[Screenshot: the 'Browsers List' option in the 'OTHER SETTINGS' menu]
The browser list page will display. Click on the configuration link at the top of the page:
[Screenshot: the configuration link at the top of the Browser List page]
The following interface will display:
[Screenshot: the robot detection criteria configuration panel]
To establish a suspicious rate, the BOT module adds up the 'Weight' of every criterion found 'true' and then divides the total by the sum of all weights.
Here is an explanation of each criterion listed above; a small sketch of the calculation follows the list:
"USERAGENT" NOT in Database:
This criteria will be flagged as 'true' if the useragent of the browser analyzed is not in your useragent database. Since browsers often receive upgrades and have their useragent string updated, this criteria weight was set to a low 5%.
No Cookies:
This criterion is flagged as 'true' if the browser being analyzed does not use cookies. Typical search engine robots do not use cookies, so the weight is set to a high 70%.
No "Referrer" for the first page:
This criteria will be flagged as 'true' if the useragent of the browser analyzed did not have a referrer value for the first page. It is not uncommon for this to happen. If you directly type the URL of a site on the address location bar then the referrer page will be empty. This is why the weight was set to a very low 2%.
The same "Referrer" for X pages:
This criteria will be flagged as 'true' if the useragent of the browser analyzed had the exact same referrer for X pages. The default value for that X parameter is actually set to 10. For this criteria the weight was set very high because it would mean that at least X web pages address were typed in the address bar for each page visited. Human beings never really do that. However search engine robots never click within a page to go from one page to another. They read a page and then scan it to detect all links in that page and then send a new request for each of the link found. This is why the weight was set to a high 70%.
"robots.txt" file request:
This criteria will be flagged as 'true' if the useragent of the browser analyzed has requested the file "robots.txt" file. No human being really request that file, this file is mainly read by search engines robots. However, since only 'well educated' robots request that file, we set the weight to 50%.
More than X pages per minute:
This criterion is flagged as 'true' if the browser being analyzed viewed at least X pages within one minute. The default value of X is 15, with a medium weight of 40%, although we feel it should be higher: no human really browses 15 or more pages of a site without pausing at least 10 to 20 seconds on each page, so viewing 15 pages in less than a minute is suspicious.
More than X pages per visit:
This criterion is flagged as 'true' if the browser being analyzed viewed at least X pages during its visit. If your typical visitors read that many pages or more on your site, you should increase the value of X, whose default is 35.
DEBUG Values in warning email?:
When that check box is selected, the email sent to the admin includes the list of criteria, with the value '1' assigned to each criterion found 'true'.
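To make the rate calculation concrete, here is a minimal sketch in Python. The weights are the defaults quoted above; the default weight for the 'pages per visit' criterion is not given in this article, so the value used for it below is purely illustrative, and the criterion names simply mirror the DEBUG email shown further down.

# Minimal sketch of the suspicious-rate calculation described above.
# Weights are the defaults quoted in this article, expressed as percentages.

WEIGHTS = {
    "ua_not_in_db": 5,    # "USERAGENT" NOT in Database
    "no_cookies":   70,   # No Cookies
    "no_ref":       2,    # No "Referrer" for the first page
    "same_ref":     70,   # The same "Referrer" for X pages
    "robots_txt":   50,   # "robots.txt" file request
    "x_ppm":        40,   # More than X pages per minute
    "x_ppv":        40,   # More than X pages per visit (illustrative weight only)
}

def suspicious_rate(flags):
    """flags maps each criterion to 1 (found 'true') or 0, as in the DEBUG email."""
    matched = sum(WEIGHTS[name] for name, hit in flags.items() if hit)
    return round(100.0 * matched / sum(WEIGHTS.values()), 1)

# Example: only 'useragent not in database' and 'no referrer' are true,
# as in the sample DEBUG email shown later in this article.
flags = {"ua_not_in_db": 1, "no_ref": 1, "same_ref": 0,
         "robots_txt": 0, "x_ppm": 0, "x_ppv": 0, "no_cookies": 0}
print(suspicious_rate(flags))   # -> 2.5, a very low (browser-like) rate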
Below is an example of a portion of an email which includes the DEBUG values:
1/ We feel that the following useragent has a percentage of 3% of being a Robot
....................................
Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.8.1.20) Gecko/20081217 Firefox/2.0.0.20
DEBUG: criterias ua_not_in_db - 1 no_ref - 1 same_ref - 0 robots_txt - 0 x_ppm - 0 x_ppv - 0 no_cookies - 0
....................................
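For reference, here is a small sketch of how such a DEBUG line could be assembled from the criteria flags; the formatting simply mirrors the sample above and is not taken from the module's source.

# Builds a DEBUG line in the same style as the sample email above (illustrative only).
def debug_line(flags):
    order = ["ua_not_in_db", "no_ref", "same_ref",
             "robots_txt", "x_ppm", "x_ppv", "no_cookies"]
    parts = ["%s - %d" % (name, flags.get(name, 0)) for name in order]
    return "DEBUG: criterias " + " ".join(parts)

print(debug_line({"ua_not_in_db": 1, "no_ref": 1}))
# DEBUG: criterias ua_not_in_db - 1 no_ref - 1 same_ref - 0 robots_txt - 0 x_ppm - 0 x_ppv - 0 no_cookies - 0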
Below is an example of a portion of an email without the DEBUG values:
1/ We feel that the following useragent has a percentage of 3% of being a Robot
....................................
Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.8.1.20) Gecko/20081217 Firefox/2.0.0.20
....................................
Managing the suspicious browser list
Managing the suspicious browser list is relatively easy. When the BOT module finds a suspicious browser, it sends you an email with the information about that browser, and the 'Browser List' page adds a list of all suspicious browsers found at the top of the page. You then need to go to that page and decide whether to set each suspicious browser as a robot or as a regular browser.
If you are notified about suspicious browser activity, go to the 'Browser List' page. The page will have an additional section at the top, called the 'Suspected Browsers' section, which will look like the picture below:
[Screenshot: the 'Suspected Browsers' section of the Browser List page]
All you need to do is set the Yes/No radio buttons, select the useragent checkbox on the left, and then click the [Update selected] button at the bottom of the 'Suspected Browsers' section.
Be careful when deciding about a suspicious browser. Although the rates shown above are 28% and 50%, keep in mind that typical everyday browsers such as Firefox, IE or Safari usually generate a rate of no more than 5 to 10%. This means that if a suspicious browser has a rate exceeding 50%, you may want to investigate the corresponding useragent more closely, and perhaps even search for it on major search engines such as Google or Yahoo.
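As a rough rule of thumb, that guidance could be summarized as follows; the 10% and 50% thresholds are the ones quoted above, while the middle band and the helper itself are only an illustration, not part of the module.

# Rough rule of thumb for reviewing a suspected browser, based on the guidance above.
def review_hint(rate_percent):
    if rate_percent <= 10:
        return "typical of everyday browsers (Firefox, IE, Safari)"
    if rate_percent > 50:
        return "investigate the useragent closely (e.g. search Google or Yahoo)"
    return "in between - review the useragent before deciding"   # inferred band

for rate in (7, 28, 55):
    print(rate, "->", review_hint(rate))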