Website analytics from scratch
Last updated: 2020-08-25
Introduction
From my point of view, machine logs are among the cream of the crop as a data source, because they are verbose, consistent over long time periods, and rarely contain erroneous or missing data. A very frequent complaint among data scientists is data quality: people will surprise you with what they enter into forms, and sensors can fail or record physically impossible values. Working with machine logs, by contrast, usually shortens the data cleaning process drastically. Given that, it’s amazing how little is written about them. As an example of leveraging machine logs for insights, in this blog post we’ll analyze web server access logs.
When hearing about website analytics, the first thing that comes to mind is a Cookie, “a small piece of data stored on the user’s computer by the web browser while browsing a website” [Wikipedia, accessed 2020-08-19]. You may have noticed the missing Cookie disclaimer on this blog. As with every tool, you should consider whether using it is worthwhile; in the case of Cookies, whether the annoyance some people feel is offset by the benefit you get out of the analysis. Also, consider that a lot of people will opt out anyway, as this interesting blog post investigated. On the other hand, although “tracking” has gotten a bad reputation because of cases of blatant overuse or misuse, it generally fulfills an important role: feedback to the content provider. Seeing how people use your website can help answer questions such as:
- Which are the most common referrers, i.e. sites containing a link by which visitors come to your website?
  → Help the right people find your content.
- What are the most common paths that visitors take on your web site? Do they concur with the usage you would expect?
  → Maybe your site structure could be improved.
- Which pieces of content are viewed together?
  → Offer convenient recommendations for relevant content.
- Where are visitors to your website from?
  → Identify main regions of impact.
If you do decide to implement a Cookie-based solution, a technical guide to building website analytics can be found in this blog post series. It provides a walk-through of the necessary server configuration as well as hints for building a Kibana dashboard on top.
In this blog post we will focus on an alternative to using tracking Cookies: analyzing website traffic using only what most web servers provide by default, namely access logs.
Analysis
Parsing and filtering
We assume the Combined Log Format. To capture the individual components of each log line, we’ll use a regular expression whose eight capture groups are listed below it:
regex = r'([\d.]+) - ([a-zA-Z0-9\-@.]+) \[(.*?)\] "(.*?)" (\d+) (\d+) "(.*?)" "(.*?)"'
- ip address
- user name (if the client is logged in)
- datetime
- request
- status code of the request
- size of the object sent back to the client
- referrer
- user agent
For example, parsing the following line
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"
yields
- 127.0.0.1
- frank
- 10/Oct/2000:13:55:36 -0700
- GET /apache_pb.gif HTTP/1.0
- 200
- 2326
- http://www.example.com/start.html
- Mozilla/4.08 [en] (Win98; I ;Nav)
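The parsing step above can be sketched in a few lines of Python. This is a minimal sketch, not necessarily the code used for this post: it uses the pattern above as a raw string, and the names `LOG_PATTERN`, `FIELDS` and `parse_line` are assumptions introduced for illustration. It also converts the timestamp, status code and size into native types, which is convenient for the later steps.

```python
import re
from datetime import datetime

# The pattern from above, written as a raw string.
LOG_PATTERN = re.compile(
    r'([\d.]+) - ([a-zA-Z0-9\-@.]+) \[(.*?)\] "(.*?)" (\d+) (\d+) "(.*?)" "(.*?)"'
)
# Hypothetical field names, one per capture group, in order.
FIELDS = ("ip", "user", "datetime", "request",
          "status", "size", "referrer", "user_agent")


def parse_line(line):
    """Return a dict with the eight fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = dict(zip(FIELDS, match.groups()))
    # Timestamps look like "10/Oct/2000:13:55:36 -0700".
    entry["datetime"] = datetime.strptime(entry["datetime"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    entry["size"] = int(entry["size"])
    return entry


example = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
           '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
           '"http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"')
entry = parse_line(example)
print(entry["ip"], entry["request"], entry["status"])
```

Lines that do not match the pattern (for example, because the size field is not a number) return `None` here, so the caller can decide whether to skip or inspect them.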
We’ll be using open access data [1] from an Iranian online shopping store for illustration.
The data contains around 10 million rows and spans 4 days (2019-01-22 to 2019-01-26). Importantly, the access log contains everything, i.e. accessing one page can entail multiple requests - the browser has to get the HTML of the site, the media, the style sheets. For now, we are interested in the site navigation, so we’ll discard anything not related to page navigation.
filter_tags = []
# filter static requests like image and CSS loading
filter_tags += ['GET /static/', 'GET /image/', 'favicon.ico',
'settings/logo', 'manifest.json', 'HEAD /image/']
# Website specific: eNAMAD is a certificate needed by Iranian web shops
filter_tags += ['/site/enamad']
# Website specific: probably mobile phone app related
filter_tags += ['amp-helper-frame']
# Website specific: not very interesting but occurs often
filter_tags += ['/site/alexaGooleAnalitic', 'filter',
'site/searchAutoComplete', '/variationGroup/',
'/OneSignalSDKWorker']
# filter out bots
filter_tags += ['Bot', 'bot']
# filter out POST requests
filter_tags += ['POST /']
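def keep_line(line):
    # keep a log line only if it contains none of the filter tags
    # (keep_line is a helper name chosen here for illustration)
    return not any(tag in line for tag in filter_tags)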
After filtering, we’re left with fewer than half a million requests.
Referrals
The simplest thing first: counting how much traffic each referrer, i.e. each website containing a link to yours, is responsible for.
As we can see, the three most common referrers are https://www.google.com/, http://api.torob.com/ and https://emalls.ir/.
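The counting itself can be sketched with the standard library. This is an illustrative sketch under assumptions: the function names, the sample URLs and the `own_host` parameter (used to drop links from within your own site) are not from the original analysis. Referrers are reduced to scheme and host so that different pages of the same site are counted together.

```python
from collections import Counter
from urllib.parse import urlsplit


def referrer_site(referrer):
    """Reduce a referrer URL to scheme and host; None for an empty ("-") referrer."""
    parts = urlsplit(referrer)
    if not parts.scheme or not parts.netloc:
        return None
    return f"{parts.scheme}://{parts.netloc}/"


def count_referrers(referrers, own_host=None):
    """Count referring sites, optionally ignoring links from `own_host`."""
    counts = Counter()
    for referrer in referrers:
        site = referrer_site(referrer)
        if site is None:
            continue
        if own_host is not None and urlsplit(site).netloc == own_host:
            continue  # internal navigation, not an external referral
        counts[site] += 1
    return counts


# Made-up sample values; "shop.example" stands in for your own host.
sample = [
    "https://www.google.com/search?q=shop",
    "https://www.google.com/",
    "-",
    "https://shop.example/browse/home-appliances",
    "https://emalls.ir/some/page",
]
counts = count_referrers(sample, own_host="shop.example")
print(counts.most_common(3))
```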
Sessionization
In order to perform more advanced analysis, we need to assign session ids to the entries in our access log. They’re our stand-in for Cookies. Our assumptions for a single session are:
- the same IP address
- page requests need to be less than 30 minutes apart (this can of course be adjusted)
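The two assumptions above translate into a short routine. This is a sketch rather than the exact implementation behind this post: `assign_session_ids` and `SESSION_TIMEOUT` are names made up for illustration, and the input is assumed to be a list of `(ip, timestamp)` tuples taken from the parsed log. The IP addresses in the sample are reserved documentation addresses.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # adjustable, as noted above


def assign_session_ids(entries, timeout=SESSION_TIMEOUT):
    """Return one session id per entry, in the same order as `entries`.

    `entries` is a list of (ip, timestamp) tuples.
    """
    session_ids = [None] * len(entries)
    last_seen = {}  # ip -> (timestamp of previous request, its session id)
    next_id = 0
    # Walk through the requests chronologically, remembering original positions.
    order = sorted(range(len(entries)), key=lambda i: entries[i][1])
    for i in order:
        ip, timestamp = entries[i]
        previous = last_seen.get(ip)
        if previous is None or timestamp - previous[0] >= timeout:
            # First request from this IP, or the gap is too long: new session.
            session_id = next_id
            next_id += 1
        else:
            session_id = previous[1]
        last_seen[ip] = (timestamp, session_id)
        session_ids[i] = session_id
    return session_ids


t0 = datetime(2019, 1, 22, 10, 0)
entries = [
    ("192.0.2.1", t0),
    ("192.0.2.2", t0 + timedelta(minutes=1)),
    ("192.0.2.1", t0 + timedelta(minutes=10)),
    ("192.0.2.1", t0 + timedelta(minutes=45)),
]
session_ids = assign_session_ids(entries)
print(session_ids)
```

Note that the timeout is measured from the previous request of the same IP, not from the start of the session, so a long but steadily active visit stays one session.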
As an example, here is one session identified with the above assumptions.
With the session feature generated, let’s move on to common navigation patterns.
Navigation
We’re using a Sankey diagram to visualize common navigation paths on the website.
Sessions are indicated by the color of the edges and nodes are the pages visited. Popular links from the landing page are “/browse/home-appliances” and “/browse/digital-supplies”. From home appliances the majority of sessions lead on to “/browse/big-kitchen-appliances” and a few to “/browse/audio-and-video-equipment”.
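A Sankey diagram needs a list of links between pages. One way to derive them from sessionized requests is to count page-to-page transitions, sketched below. The function name and the `(session_id, timestamp, page)` input layout are assumptions for illustration. The sketch aggregates over sessions, so it yields link weights and does not reproduce the per-session edge coloring described above.

```python
from collections import Counter, defaultdict


def count_transitions(entries):
    """Count (source page, target page) pairs over consecutive requests in a session.

    `entries` is an iterable of (session_id, timestamp, page) tuples.
    """
    pages_by_session = defaultdict(list)
    # Sort by session, then by time, so each session's pages are in visit order.
    for session_id, _timestamp, page in sorted(entries, key=lambda e: (e[0], e[1])):
        pages_by_session[session_id].append(page)

    transitions = Counter()
    for pages in pages_by_session.values():
        for source, target in zip(pages, pages[1:]):
            if source != target:  # ignore page reloads
                transitions[(source, target)] += 1
    return transitions


# Made-up sample with integer timestamps for brevity.
sample = [
    (0, 1, "/"),
    (0, 2, "/browse/home-appliances"),
    (0, 3, "/browse/big-kitchen-appliances"),
    (1, 1, "/"),
    (1, 2, "/browse/home-appliances"),
    (2, 1, "/"),
    (2, 2, "/browse/digital-supplies"),
]
transitions = count_transitions(sample)
for (source, target), weight in transitions.most_common():
    print(source, "->", target, weight)
```

The resulting `(source, target, weight)` triples are the usual input format for Sankey plotting libraries.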
Geo location
IP addresses contain information about the location of the user, unless they use some kind of proxy. And if proxies are widely used, you can usually identify the favorite proxy location :)
Only a small fraction of IP addresses could be located with the service used here (http://api.hostip.info/get_html.php). There are plenty of other services to try, but comparing them is beyond the scope of this blog post. What can already be seen is that all of the identified addresses lie outside of Iran, with two clusters located in Germany and China.
TL;DR
Based only on web server logs, it’s possible to generate several insights into website usage. This is an alternative to using Cookies.
The complete code for this blog post can be found here.