Import large server log files in Piwik and set a cron job to do it automatically

You want to get rid of Google Analytics, don’t you? Piwik is a great open source alternative, and today we’re going to see how to import your old webserver access logs and how to set an automatic script to do it programmatically.

I assume you have Piwik and Python installed. If you don’t, go do it first. Easy as pie.

Here’s the command to fetch an access log file and import it into your Piwik site (be sure to set the correct --idsite):

python /path-to-piwik/misc/log-analytics/import_logs.py --url=http://your-piwik-public-url/ /var/www/logs/access.log.gz --idsite=X --enable-http-redirects --enable-http-errors --enable-bots --enable-static --recorder-max-payload-size=300

If you have very large access log files it’s likely that Piwik will hang. To split them, use this command:

zcat file.log.gz | split --lines=50000 -d - file.log.

What this does is decompress the .gz log file and split the stream read from stdin every 50,000 lines. Splitting on line boundaries guarantees no log entry gets truncated mid-line, and using a plain file.log. prefix avoids naming uncompressed chunks with a misleading .gz extension. 50,000 lines is a reasonable chunk size and corresponds to roughly 10 MB, depending on how your logs are formatted.
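With the chunks in place you can loop over them and feed each one to import_logs.py. Here is a minimal self-contained sketch: it builds a sample log, splits it, and echoes the importer invocation instead of executing it (the Piwik path, URL and idsite are placeholders to adjust):

```shell
#!/bin/sh
set -e

# Work in a scratch directory with a generated sample log,
# so this sketch can run anywhere.
workdir=$(mktemp -d)
cd "$workdir"
seq 1 120000 | sed 's/^/127.0.0.1 - - request /' | gzip > access.log.gz

# Split into 50000-line chunks named chunk.00, chunk.01, chunk.02.
zcat access.log.gz | split --lines=50000 -d - chunk.

# Hand every chunk to the importer (echoed here; drop the echo for real use).
for f in chunk.*; do
    echo python /path-to-piwik/misc/log-analytics/import_logs.py \
        --url=http://your-piwik-public-url/ --idsite=X "$f"
done
```

Importing chunk by chunk keeps each run small enough that Piwik won’t hang, and a failed chunk can simply be retried.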

Now that you have your logs imported, you have to consolidate your Piwik database (Piwik calls this “archiving”) with this PHP command:

php /path-to-piwik/console core:archive --url=http://your-piwik-public-url/ > /var/log/piwik/`date +%Y-%m-%d`-archive.log

Make sure the /var/log/piwik/ directory exists.
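If it isn’t there yet, create it before the first archive run. The snippet below uses a scratch prefix so it can run anywhere; on a real server drop the prefix and create the directory as root:

```shell
# Create the directory the archiver log is written to.
prefix=$(mktemp -d)   # scratch prefix for this sketch; use "" on a real box
mkdir -p "$prefix/var/log/piwik"
```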

Once you have the “old” access logs imported, you can set up a cron job to automatically rotate your webserver logs and import them into Piwik every day.

find /var/www/ -ipath "*/logs/access.log.gz" -execdir mv "{}" "old/`date +%Y-%m-%d`-access.log.gz" \;
nginx -s reopen
python /var/www/default/piwik/misc/log-analytics/import_logs.py --url=http://your-piwik-public-url/ /var/www/vhost/logs/old/`date +%Y-%m-%d`-access.log.gz --idsite=X --enable-http-redirects --enable-http-errors --enable-bots --enable-static --recorder-max-payload-size=300
php /var/www/default/piwik/console core:archive --url=http://your-piwik-public-url/ > /var/log/piwik/`date +%Y-%m-%d`-archive.log

The magic explained: the find command rotates every access.log.gz found under /var/www/ into an old/ subdirectory (so if you have multiple vhosts, they are all covered; each logs directory needs an existing old/ subfolder). nginx -s reopen makes nginx reopen its log files so new accesses keep being recorded (use the equivalent Apache command if that’s your server). The import command feeds the newly rotated, date-prefixed file to Piwik, and the final core:archive command consolidates the database.

Now save this snippet to a .sh file and add a line like this to /etc/crontab (note the user field):

0 0		* * *	root	bash /path/to/yourscript.sh > /dev/null 2>&1

If you want to get rid of some meaningless lines in your logs (for example the recurring wp-cron.php calls WordPress makes, or the wp-login.php page), you can modify the import command in the previous snippet like this:

zgrep -Ev "(wp-login\.php|wp-cron\.php|wp-admin|xmlrpc\.php|\?custom-css|/feed/)" /var/www/vhost/logs/old/`date +%Y-%m-%d`-access.log.gz | python /path-to-piwik/misc/log-analytics/import_logs.py --url=http://your-piwik-public-url/ - --idsite=X --enable-http-redirects --enable-http-errors --enable-bots --enable-static --recorder-max-payload-size=300

Let’s see what changed: zgrep searches the compressed input file for lines containing a match, in this case wp-login.php, wp-cron.php, and so on. The -v parameter inverts the sense of matching, so only the non-matching lines survive, and they are written to stdout. The Python import script (note the - in place of a filename) then reads them from stdin and sends everything to Piwik.
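Before wiring the filter into the nightly script, you can sanity-check the pattern against a few handmade lines, using plain grep since the input here isn’t compressed (the regex is an escaped version of the one above):

```shell
# Two WordPress noise requests and one genuine page view.
kept=$(printf '%s\n' \
    'GET /wp-cron.php?doing_wp_cron HTTP/1.1' \
    'GET /wp-login.php HTTP/1.1' \
    'GET /about/ HTTP/1.1' \
  | grep -Ev "(wp-login\.php|wp-cron\.php|wp-admin|xmlrpc\.php|\?custom-css|/feed/)")
echo "$kept"   # only the /about/ request survives
```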
