
Improving a site's recommendation system based on user behavior

Many news sites and blogs have a recommendation system: its job is to show the reader of an article other articles they may find interesting.

How well the recommendations are selected determines how much time people spend on the site browsing further articles, and this factor matters a great deal for the site's position in search engines.

In this article, I will share one way to improve the accuracy of the recommendation system on a news site or blog.

Previously, yourdomain.com used a tag-based system to select recommendations: for the article being viewed, other articles with the largest number of matching tags were found and shown as recommendations.

The problem is that this method does not always produce recommendations the user actually finds interesting.

So I decided to take a different approach: show, first of all, the articles that users most often navigate to from the page of the current article.

Google Analytics and Yandex Metrica both offer tools such as a map of user transitions between pages of the site. Unfortunately, at the time of publication, neither system lets you export the visit graph in a format that can simply be parsed with a script.

Both systems can only export a graphical representation of the visit map to a PDF file. Some googling revealed that the paid version of Google Analytics can export the visit graph in a machine-readable format, but I do not have a paid GA account and had no desire to buy a subscription for such a trifle.

Since the site's logs accumulate more than half a gigabyte of data per month, I wrote a simple Python parser that reads the web server log and writes the visit data to the database for later analysis.

The parser opens the file, reads it line by line, parses each line into a separate instance of the LogItem class, and appends that instance to the insert_pool list.

When the list reaches two thousand records, the data is inserted into the database with a single INSERT query. The list is then cleared and parsing continues until all of the log data has been transferred to the database.
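
As a minimal sketch of this pattern (LogItem and insert_many() here are simplified placeholders; the actual Django-based implementation is shown further below):

INSERT_POOL_SIZE = 2000  # flush the pool after this many parsed records

def import_log(path):
    insert_pool = []
    with open(path) as log_file:
        for line in log_file:
            insert_pool.append(LogItem(line))   # placeholder: parse one log line
            if len(insert_pool) >= INSERT_POOL_SIZE:
                insert_many(insert_pool)        # placeholder: one batched INSERT
                insert_pool = []
    if insert_pool:
        insert_many(insert_pool)                # flush whatever is left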

import datetime

# Format for representing dates in logs
LOG_DATE_FORMAT = '%d/%b/%Y:%H:%M:%S'
# The maximum number of records that are inserted into the database in one request
INSERT_POOL_SIZE = 2000

class Token:
    """Token for parsing logs"""

    SPECIAL_CHAR = 1
    STRING = 2

    type = None
    content = ""

    def __init__(self, type=None, content=""):
        self.type = type
        self.content = content

def tokenize_log_item(string):
    """Split the log entry into tokens"""

    tokens = []
    tok = Token(None, "")
    for char in string:
        if char in '"[] ':
            # A delimiter ends the current string token, if one is being built
            if tok.type == Token.STRING:
                tokens.append(tok)
            tok = Token(Token.SPECIAL_CHAR, char)
            tokens.append(tok)
            continue
        if tok.type != Token.STRING:
            # Start accumulating a new string token
            tok = Token(Token.STRING, "")
        tok.content += char
    return tuple(tokens)

def parse_log_item(string):
    """Parse the log entry and return a dictionary with fields"""

    result = {}
    tokens = tokenize_log_item(string)
    # Indices below count the delimiter tokens (spaces, quotes, brackets) as well
    datestring = tokens[7].content
    date = datetime.datetime.strptime(datestring, LOG_DATE_FORMAT)
    result['ip'] = tokens[0].content
    result['date'] = date
    result['method'] = tokens[13].content
    result['uri'] = tokens[15].content[:255]
    result['referrer'] = tokens[25].content[:255]
    try:
        result['status'] = int(tokens[20].content)
    except ValueError:
        result['status'] = 200
    return result

Note that your web server's log format may differ from mine. Below is an example of the kind of log entry the parser was written for.

xxx.xxx.xxx.xxx - - [18/Apr/2015:05:03:14 +0300] "GET /post/cpp-hello-world/comments/feed/ HTTP/1.1" 200 4221 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36" "3.45"
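
For this sample entry, parse_log_item() returns roughly the following dictionary (here line holds the log line above; in this particular request the referrer is simply "-"):

>>> parse_log_item(line)
{'ip': 'xxx.xxx.xxx.xxx',
 'date': datetime.datetime(2015, 4, 18, 5, 3, 14),
 'method': 'GET',
 'uri': '/post/cpp-hello-world/comments/feed/',
 'referrer': '-',
 'status': 200}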

The site is written in Django, so I used its built-in management commands and models to run the log parsing and work with the database.

I created an analytics application in which I defined models for storing site visit data. Inside the application's directory I added a management command that cron calls once a week to update the visits table from the new logs.
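
The models themselves are not listed here; judging by the fields used below, analytics/models.py looks roughly like this (the field types and lengths are my assumptions):

from django.db import models

class PageView(models.Model):
    """A single request taken from the web server log."""
    ip = models.GenericIPAddressField()
    date = models.DateTimeField(db_index=True)
    method = models.CharField(max_length=16)
    uri = models.CharField(max_length=255)
    referrer = models.CharField(max_length=255)
    status = models.IntegerField()

class Recommendation(models.Model):
    """A transition between two articles and its weight (number of visits)."""
    source = models.CharField(max_length=255, db_index=True)
    target = models.CharField(max_length=255)
    weight = models.IntegerField()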

management/commands/collect_page_views.py

from django.core.management.base import BaseCommand
from django.db import transaction

from analytics.models import PageView
# parse_log_item() and INSERT_POOL_SIZE live in the parser module shown above;
# the exact import path (analytics.logparser) is an assumption
from analytics.logparser import parse_log_item, INSERT_POOL_SIZE

import datetime
import logging

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger(__file__)

class Command(BaseCommand):
    help = u'''Parses the server logs and saves data about visited \
pages to a table'''
    args = '<access_log_path>'

    def handle(self, log_path, *args, **options):
        insert_pool = []
        pool_size = 0
        count = 0

        last_page_view = PageView.objects.filter().\
            order_by('-date')[:1]
        last_view_date = None
        if last_page_view.exists():
            last_view_date = last_page_view[0].date

        sid = transaction.savepoint()

        try:
            with open(log_path) as f:
                while True:
                    line = f.readline()
                    if not line:
                        break
                    data = parse_log_item(line)
                    if last_view_date and data['date'] <= last_view_date:
                        # Old logs are not saved again
                        continue
                    page_view = PageView(**data)
                    insert_pool.append(page_view)
                    pool_size += 1
                    count += 1
                    logger.info('Processing log item #%d' % count)

                    if pool_size >= INSERT_POOL_SIZE:
                        logger.warn('Bulk creating models...')
                        PageView.objects.bulk_create(insert_pool)
                        pool_size = 0
                        insert_pool = []

                if pool_size > 0:
                    logger.warn('Bulk creating models...')
                    PageView.objects.bulk_create(insert_pool)

        except Exception as ex:
            transaction.savepoint_rollback(sid)
            raise ex

        transaction.savepoint_commit(sid)
        logger.warn('Total rows created: %d' % count)

Here's an example of running the command that loads the visit data into the database:

python manage.py collect_page_views /home/www/yourdomain.com/logs/access.log

The first run of the script added more than two million records to the database and took about twenty minutes.

On subsequent runs, the script skips log entries that have already been added to the database.
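
To run the import once a week, as mentioned above, a crontab entry along these lines can be used (the paths and schedule here are just an example for my setup):

# Every Monday at 04:00, load the new log entries into the database
0 4 * * 1 cd /home/www/yourdomain.com && python manage.py collect_page_views /home/www/yourdomain.com/logs/access.log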

Based on this data, we will build a map of user transitions between articles. We limit the query to the uri and referrer fields and group by them, counting the rows in each group. This gives us the number of times users moved from the page of one article to the pages of others.
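
Each row of the resulting grouped queryset is a dictionary of the following form (the URLs and the count are purely illustrative):

{'uri': '/post/some-article/',
 'referrer': 'https://yourdomain.com/post/another-article/',
 'weight': 42}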

Using this data, we can already show improved recommendations for any article on the site.

Below is the code of the command that updates the article-to-article transition data based on the collected visits.

management/commands/update_recommendation.py

# encoding:utf-8

import logging

from django.core.management.base import BaseCommand
from django.db import transaction
from django.db.models import Count

from analytics.models import PageView, Recommendation

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger(__file__)

SITE_URL = 'https://yourdomain.com'
INSERT_POOL_MAX_SIZE = 2000

class Command(BaseCommand):
    help = u'Update article suggestions'

    def handle(self, *args, **options):
        page_views = PageView.objects.filter(
            method='GET',
            status=200,
            uri__startswith='/post/',
            referrer__startswith=SITE_URL + '/post/'
        ).values('uri', 'referrer')\
         .annotate(weight=Count('referrer'))

        referrer_offset = len(SITE_URL)
        count = 0
        insert_pool = []
        insert_pool_size = 0

        sid = transaction.savepoint()
        Recommendation.objects.all().delete()

        for view in page_views:
            source = view['referrer'][referrer_offset:]
            target = view['uri']
            weight = int(view['weight'])
            if source == target:
                continue
            recommendation = Recommendation(source=source,
                                            target=target,
                                            weight=weight)
            insert_pool.append(recommendation)
            insert_pool_size += 1
            logger.info('Processing recommendation #%d' % count)

            if insert_pool_size >= INSERT_POOL_MAX_SIZE:
                insert_pool_size = 0
                logger.warn('Insert pool size is exceeded')
                logger.warn('Bulk creating models...')
                Recommendation.objects.bulk_create(insert_pool)
                insert_pool = []
            count += 1

        if insert_pool_size > 0:
            logger.warn('Bulk creating models...')
            Recommendation.objects.bulk_create(insert_pool)

        transaction.savepoint_commit(sid)
        logger.warn('Total rows created: %d' % count)
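
The command is run the same way as the previous one, for example once a week right after the log import:

python manage.py update_recommendation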

Recommendations are loaded via Ajax when an article is opened, so as not to slow down page loading. Below is the Django view that returns recommendations for an article.

For new articles, for which navigation data has not yet been collected, the old tag-based method of selecting recommendations is used.

from django import http
from django.core.urlresolvers import reverse, resolve
from django.shortcuts import get_object_or_404
from django.views.generic import View

from analytics.models import Recommendation
# JsonViewMixin, get_related_articles and the blog's models module are
# project-specific; these import paths are assumptions
from blog import models
from blog.helpers import get_related_articles
from blog.mixins import JsonViewMixin


class RecommendedArticlesView(View, JsonViewMixin):
    "Recommended Articles"

    MAX_COUNT = 10 # Maximum number of recommendations

    def _bigdata_failback(self, **kwargs):
        "Recommendation based on tags"

        context = {}
        if 'post_id' in self.kwargs and self.kwargs['post_id'] is not None:
            article = get_object_or_404(models.Article,
                                        id=self.kwargs['post_id'])
            context['title'] = 'Read related articles'
            posts = get_related_articles(article, self.MAX_COUNT)
        else:
            context['title'] = 'Recommended Articles'
            posts = models.Article.public.filter().\
                values('title', 'slug').order_by('-id')[:self.MAX_COUNT]
        context['posts'] = []
        for post in posts:
            context['posts'].append({
                'url': reverse('blog-article', args=(post['slug'],)),
                'title': post['title'],
            })
        return context

    def _bigdata_recommendations(self, **kwargs):
        "Selection of recommendations based on previous visits"

        article = None
        if 'post_id' in self.kwargs and self.kwargs['post_id'] is not None:
            article = get_object_or_404(models.Article,
                                        id=self.kwargs['post_id'])
        if not article:
            return []

        article_url = reverse('blog-article', args=(article.slug,))
        recommendations_urls = Recommendation.objects.filter(
            source=article_url).order_by('-weight').\
            values_list('target', flat=True)[:self.MAX_COUNT]
        if len(recommendations_urls) == 0:
            return []

        slugs = []
        for url in recommendations_urls:
            try:
                resolve_match = resolve(url)
            except http.Http404:
                continue
            if resolve_match.url_name == 'blog-article':
                slugs.append(resolve_match.kwargs['slug'])

        result = []
        for slug in slugs:
            try:
                post = models.Article.public.filter(slug=slug).\
                    values('slug', 'title')[:1][0]
            except IndexError:
                continue
            result.append({
                'url': reverse('blog-article', args=(post['slug'],)),
                'title': post['title'],
            })
        return result

    def get_context_data(self, **kwargs):
        context = {
            'title': 'Recommended Articles',
            'posts': self._bigdata_recommendations(**kwargs)
        }
        if len(context['posts']) == 0:
            context = self._bigdata_failback(**kwargs)
        return context

    def get(self, *args, **kwargs):
        return self.render_to_response(self.get_context_data())
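
The get_related_articles() helper used in the tag-based fallback above is not listed in the article. A rough sketch of it, assuming Article has a many-to-many tags field and the same modules as in the view, could look like this:

from django.db.models import Count

def get_related_articles(article, max_count):
    """Tag-based fallback: articles sharing the most tags with the given one.

    Illustrative sketch only; the real helper and the tags relation
    may be organised differently.
    """
    return models.Article.public\
        .filter(tags__in=article.tags.all())\
        .exclude(id=article.id)\
        .annotate(same_tags=Count('tags'))\
        .order_by('-same_tags')\
        .values('title', 'slug')[:max_count]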

The recommendations selected in this new way have definitely become more relevant.