mapzen/wikipedia-counter

Name: wikipedia-counter

Owner: Mapzen

Description: null

Created: 2015-10-20 22:06:59.0

Updated: 2016-08-27 17:30:47.0

Pushed: 2015-10-26 18:18:17.0

Homepage: null

Size: 160

Language: JavaScript

README

Wikipedia Pageview Counter

This repo contains code to take hourly page view count data from Wikipedia, and construct page counts for longer periods using PostgreSQL.

Note: This requires PostgreSQL 9.5 or newer. The import relies on the UPSERT feature (INSERT ... ON CONFLICT) introduced in 9.5; without it, aggregation is too slow to be practical.
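To illustrate why UPSERT matters here, the following is a minimal sketch of a batched upsert that adds each hour's counts to a running total. The table `pagecounts`, its columns `project`, `title`, and `views`, and the helper `buildUpsert` are hypothetical names chosen for this example; they are not taken from the repo. The sketch only builds the SQL text and parameter list, so it runs without a database.

```javascript
// Hypothetical schema (an assumption, not from the repo):
//   CREATE TABLE pagecounts (
//     project text, title text, views bigint,
//     PRIMARY KEY (project, title)
//   );

// Build one parameterized INSERT ... ON CONFLICT statement for a batch of rows.
// Each row is { project, title, views }.
function buildUpsert(rows) {
  const values = [];
  const tuples = rows.map((row, i) => {
    values.push(row.project, row.title, row.views);
    const base = i * 3;
    // Placeholders are numbered $1, $2, $3, ... across the whole batch.
    return `($${base + 1}, $${base + 2}, $${base + 3})`;
  });

  const text =
    'INSERT INTO pagecounts (project, title, views) VALUES ' +
    tuples.join(', ') +
    ' ON CONFLICT (project, title)' +
    // EXCLUDED refers to the row that was proposed for insertion.
    ' DO UPDATE SET views = pagecounts.views + EXCLUDED.views';

  return { text, values };
}

const query = buildUpsert([
  { project: 'en', title: 'Main_Page', views: 5 },
  { project: 'en', title: 'Node.js', views: 2 },
]);
console.log(query.text);
console.log(query.values);
```

The resulting `{ text, values }` pair is shaped so it could be handed to a PostgreSQL client that accepts numbered placeholders. Note that a single `ON CONFLICT DO UPDATE` statement cannot update the same row twice, so each batch should contain at most one row per `(project, title)` key.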

How it works

Hourly page view files are downloaded on demand (or you can download them yourself and read directly from the files). A Node script parses the files and imports the relevant data into Postgres. Finally, fun queries can be run against the resulting data.
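As a rough sketch of the parsing step, the function below handles one line of an hourly file. It assumes the space-separated layout `project title count bytes` used by the Wikimedia hourly pagecount dumps, with percent-encoded titles; the function names and the returned object shape are illustrative and not taken from the repo's script.

```javascript
// Parse one line such as "en Main_Page 242332 4737756101".
// Returns { project, title, views } or null if the line is malformed.
function parsePagecountLine(line) {
  const fields = line.trim().split(' ');
  if (fields.length !== 4) return null;

  const [project, rawTitle, countText] = fields;
  // Accept only plain non-negative integers for the view count.
  if (!/^\d+$/.test(countText)) return null;
  const views = Number.parseInt(countText, 10);

  // Titles are percent-encoded; fall back to the raw text if decoding fails.
  let title;
  try {
    title = decodeURIComponent(rawTitle);
  } catch (err) {
    title = rawTitle;
  }
  return { project, title, views };
}

// Sum views per "project title" key over any number of lines.
function aggregate(lines) {
  const totals = new Map();
  for (const line of lines) {
    const row = parsePagecountLine(line);
    if (row === null) continue; // skip malformed lines
    const key = `${row.project} ${row.title}`;
    totals.set(key, (totals.get(key) || 0) + row.views);
  }
  return totals;
}

const totals = aggregate([
  'en Main_Page 10 1000',
  'en Main_Page 5 500',
  'not a valid line',
]);
console.log(totals.get('en Main_Page'));
```

In a real import the lines would more likely be streamed from the gzipped files (for example with Node's `zlib` and `readline` modules) than held in an array, and "relevant data" might mean filtering to a single project before inserting.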

Requirements

So far this has been used only to aggregate a single month of Wikipedia logs.

Instructions
  1. Set up a PostgreSQL 9.5 beta1 or newer instance (with PostGIS)
  2. Download the Wikipedia world PostGIS dump from here
  3. Generate a list of pagecount files to download (currently a manual step; a sample list for September 2015 is included)
  4. Read and modify crunch.sh to suit your needs, run it, and wait
  5. View the output: the top viewed pages, with location data that is calculated automatically
  6. Run other interesting queries and report back!
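Step 3 above is described as manual. One way it might be automated is sketched below: the function produces one URL per hour for a given month. The base URL and the file name pattern `pagecounts-YYYYMMDD-HH0000.gz` are assumptions about the Wikimedia dump layout, and `pagecountUrls` is a hypothetical helper, not part of the repo. The seconds field in real file names may not always be `0000`, so the output should be checked against the actual directory listing before downloading.

```javascript
// Assumed base location of the hourly dumps (verify before use).
const BASE_URL = 'https://dumps.wikimedia.org/other/pagecounts-raw';

// Left-pad a number to two digits, e.g. 9 -> "09".
function pad2(n) {
  return String(n).padStart(2, '0');
}

// List one URL per hour for the given year and 1-based month.
function pagecountUrls(year, month) {
  // Day 0 of the following month is the last day of this month.
  const daysInMonth = new Date(Date.UTC(year, month, 0)).getUTCDate();
  const urls = [];
  for (let day = 1; day <= daysInMonth; day++) {
    for (let hour = 0; hour < 24; hour++) {
      const stamp = `${year}${pad2(month)}${pad2(day)}-${pad2(hour)}0000`;
      urls.push(
        `${BASE_URL}/${year}/${year}-${pad2(month)}/pagecounts-${stamp}.gz`
      );
    }
  }
  return urls;
}

const urls = pagecountUrls(2015, 9);
console.log(urls.length);
console.log(urls[0]);
```

The list could be written to a file, one URL per line, and then adapted to whatever input crunch.sh expects.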

This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.