This tutorial is dedicated to understanding how to use the linear regression algorithm with Wikidata to make predictions. For a detailed explanation of how this algorithm works, please read the Wikipedia article: linear regression. Python is used throughout this walkthrough.

Importing Modules/Packages

Before we start coding, install and import all of the following packages: NumPy, pandas, and scikit-learn.
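If any of these are missing from your environment, they can usually be installed with pip (the exact invocation may vary with your setup):

pip install numpy pandas scikit-learn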

# -*- coding: utf-8  -*-
import json
import numpy as np
import pandas as pd
import sklearn.model_selection # needed for train_test_split further down

from collections import defaultdict
from sklearn import linear_model

Loading in Our Data

Now it’s time for some data collection from Wikidata. For this example we are using the yearly (average) population stacked by country in a query (linked further down). This gives us a lot of interesting values, though unfortunately also some faulty ones. I have chosen to filter this query to only include values from 2005 and newer. How you import the query result into the script is up to you. One possibility is to run the SPARQL query programmatically from a downloaded .rq file; another is to simply download a JSON file of the result from the query.wikidata.org site. Once you have downloaded the data set and placed it in the main directory of your Python code, you first need to clean the data and then load it in using the pandas module.
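The loading code further down assumes that, after cleaning, the JSON file contains a flat list of objects, each carrying a countryLabel, a year and a population field. Once read with json.load, a (hypothetical) entry would look like this:

[
    {"countryLabel": "Afghanistan", "year": "2007", "population": "26349243"},
    {"countryLabel": "Afghanistan", "year": "2008", "population": "27032197"}
]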

Yearly Population stacked by country

# male/female population _must_ not be added unqualified as total population (!)
# this is an error and should be fixed at the item using P1540 and P1539 instead
# (wrong query result may be a manifestation of such)
SELECT ?year (AVG(?pop) AS ?population) ?countryLabel
WHERE
{
  ?country wdt:P31 wd:Q6256;
           p:P1082 ?popStatement .
  ?popStatement ps:P1082 ?pop;
                pq:P585 ?date .
  BIND(STR(YEAR(?date)) AS ?year)

  # IF multiple ?pop values per country per year exist, we prioritize by source
  #       census 1st, others 2nd, estimation(s) 3rd, unknown sources (none supplies P459) last
  # note: wikibase:rank won't help here: each year may have multiple statements for ?pop value
  #       rank:preferred is used for the best value (or values) of the latest or current year
  #       rank:normal may be justified for all of multiple ?pop values for a given year
  OPTIONAL { ?popStatement pq:P459 ?method. }
  OPTIONAL { ?country p:P1082 [ pq:P585 ?d; pq:P459 ?estimate ].
             FILTER(STR(YEAR(?d)) = ?year). FILTER(?estimate = wd:Q791801). }
  OPTIONAL { ?country p:P1082 [ pq:P585 ?e; pq:P459 ?census ].
             FILTER(STR(YEAR(?e)) = ?year). FILTER(?census = wd:Q39825). }
  OPTIONAL { ?country p:P1082 [ pq:P585 ?f; pq:P459 ?other ].
             FILTER(STR(YEAR(?f)) = ?year). FILTER(?other != wd:Q39825 && ?other != wd:Q791801). }
  BIND(COALESCE(
    IF(BOUND(?census), ?census, 1/0),
    IF(BOUND(?other), ?other, 1/0),
    IF(BOUND(?estimate), ?estimate, 1/0) ) AS ?pref_method).
  FILTER(IF(BOUND(?pref_method),?method = ?pref_method,true))
  # .. still need to group if multiple values per country per year exist and
  # - none is qualified with P459
  # - multiple ?estimate or multiple ?census (>1 value from same source)
  # - ?other yields more than one source (>1 values are better than optionally
  #                         supplied estimate, but no census source available)

  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" }
  FILTER(?year >= "2005")
}
GROUP BY ?year ?countryLabel
ORDER BY ?year ?countryLabel

Query found on Wikidata:SPARQL query service/queries/examples/advanced (shout-out to the person who made it; it saved me a lot of time).

Visualisation of the SPARQL query
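If you would rather fetch the results programmatically than download a file by hand, a minimal sketch using the requests library could look like this (it assumes the query above has been saved as query.rq, and uses the public Wikidata Query Service endpoint):

import requests

with open('query.rq', 'r') as f:
    query = f.read()

r = requests.get('https://query.wikidata.org/sparql',
                 params={'query': query, 'format': 'json'})
bindings = r.json()['results']['bindings']
# Flatten the SPARQL result bindings into the simple list of
# {countryLabel, year, population} objects used below.
rows = [{k: v['value'] for k, v in b.items()} for b in bindings]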

Now that we have cleaned the data and selected the interesting parts of the query result (country, year and population), we need to import the data into pandas. In this example we also need to transpose the table (swap rows and columns).

YEARS = ["2007", "2008", "2009" ,"2010", "2011", "2012", "2013"] # Years we are interested in


def getList(d): # Returns the keys of a dict as a list.
    return list(d.keys())

with open('query.json', 'r') as f: # Downloaded query in a JSON file.
    distros_dict = json.load(f)

allEntries = defaultdict(dict) # saves all the countries in the query with its data

for entry in distros_dict:
    allEntries[entry['countryLabel']].update({entry['year']: entry['population']})

selectedEnt = defaultdict(dict) # saves the countries in the query with its data that has all the values in the YEARS list

for country in allEntries:
    if all(elem in getList(allEntries[country]) for elem in YEARS):
        selectedEnt.update({country: allEntries[country]})

df = pd.DataFrame.from_dict(selectedEnt) # loads it into a pandas DataFrame
data = df.transpose() # transposes the table (countries become rows)

The data should now look something like this: print(data)

                 2007      2008      2009  \
Afghanistan  26349243  27032197  27708187
Algeria      35097043  35591377  36383302
...               ...       ...       ...
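Depending on how the query result was exported, the population values may still be strings at this point. If so, convert them to numbers before handing the data to sklearn, for example as below. (If you do this, note that the country-matching comparison near the end of the tutorial must then also compare numbers rather than strings.)

data = data.apply(pd.to_numeric) # Convert every column to a numeric type.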

Next it’s time to select only the data we want to use as input and remove the solution; in other words, to split the data. In this example I have chosen to use population values from 2007-2012 (for the countries that have data for all of those years), and sklearn will predict the value for 2013 (we also keep the real value for this year, for comparison).

data = data[YEARS]

predict = "2013"

Now that we’ve trimmed our data set down, we need to separate it into four arrays. However, before we can do that we need to define the attribute we are trying to predict. This attribute is known as the label. The other attributes, which will determine our label, are known as features. Once we’ve done this we will use numpy to create two arrays: one that contains all of our features and one that contains our labels.

X = np.array(data.drop(columns=[predict])) # Features
y = np.array(data[predict]) # Labels

After this we need to split our data into a training set and a testing set. We will use 90% of our data to train and the remaining 10% to test. We do this so that we do not test our model on data it has already seen.

x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size = 0.1)
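Note that train_test_split shuffles the data randomly, so each run trains and tests on a different selection of countries. For reproducible runs you can pass a fixed seed, for example:

x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size = 0.1, random_state = 42)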

Next, it’s time to implement the linear regression algorithm.

Implementing the Algorithm

We will start by defining the model which we will be using.

linear = linear_model.LinearRegression()

Next we will train and score our model using the arrays.

linear.fit(x_train, y_train)
acc = linear.score(x_test, y_test) # acc = the model's score (R²) on the test data

To see how well our algorithm performed on our test data we can print out the score.

print(acc)

For this specific data set a score above 0.80 is fairly good. This example reaches about 0.99.
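Strictly speaking, the value returned by linear.score is the coefficient of determination (R²), not a percentage of correct guesses. If you would like an error measure expressed in actual population counts, one option (a small addition, not part of the original code) is the root mean squared error on the test set:

from sklearn.metrics import mean_squared_error

rmse = np.sqrt(mean_squared_error(y_test, linear.predict(x_test)))
print(rmse) # Typical size of the prediction error, in people.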

Viewing The Constants

If we want to see the constants used to generate the line we can type the following.

print('Coefficient: \n', linear.coef_) # The slope for each feature (one per year)
print('Intercept: \n', linear.intercept_) # The intercept of the fitted line
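These constants fully determine the model: a prediction is simply the intercept plus the dot product of the coefficients with a row of feature values. As a quick sanity check, we can verify this by hand on the first test row:

manual = linear.intercept_ + np.dot(linear.coef_, x_test[0])
print(manual, linear.predict(x_test[:1])[0]) # The two values should match.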

Predicting the population in 2013

Seeing a numeric score is nice, but we would also like to see how well the algorithm works for a specific country. To do this we are going to print out all of our test data, and beside this data we will print the actual population in 2013 as well as the population value our model predicted.

predictions = linear.predict(x_test) # Gets a list of all predictions

print("Country - sklearn guessed value for 2013, the Wikidata values (2007-2012), The Wikidata value (2013)")
for x in range(len(predictions)):
    for country in selectedEnt:
        if x_test[x][0] == selectedEnt[country][YEARS[0]] and x_test[x][1] == selectedEnt[country][YEARS[1]]: # Match the 2007 and 2008 values to identify the country behind this test row
            print(country, " - ", predictions[x], x_test[x], y_test[x])
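Matching test rows back to countries by comparing feature values works here, but it would break if two countries ever shared the same 2007 and 2008 populations. A more robust alternative (a sketch, not part of the original code) is to split the DataFrame itself, so the country names travel along as the index:

train_df, test_df = sklearn.model_selection.train_test_split(data, test_size = 0.1)
model = linear_model.LinearRegression()
model.fit(train_df.drop(columns=[predict]), train_df[predict])
for country, pred in zip(test_df.index, model.predict(test_df.drop(columns=[predict]))):
    print(country, " - ", pred, test_df.loc[country, predict])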

Test result

0.999650607098148
Coefficient:
[ 0.41969474 -1.01050159 -0.20560013  0.0411049   1.3388236   0.41479332]
Intercept:
36691.20709852874
Country      sklearn guess (2013)   2007       2008       2009       2010       2011       2012       Wikidata (2013)
Bhutan       791284.69              679365     692159     704542     716939     729429     741822     753947
Palau        57549.30               20118      20228      20344      20470      20606      20754      20918
Venezuela    30466443.18            27655937   28120312   28583040   29043283   29500625   29954782   30405207
Romania      19986225.90            20882982   20537875   20367487   20246871   20147528   20058035   19981358
Uruguay      3439645.73             3338384    3348898    3360431    3371982    3383486    3395253    3407062
...

Full code

# -*- coding: utf-8  -*-
import json
import numpy as np
import pandas as pd
import sklearn.model_selection

from collections import defaultdict
from sklearn import linear_model

YEARS = ["2007", "2008", "2009", "2010", "2011", "2012", "2013"]


def getList(d): # Returns the keys of a dict as a list.
    return list(d.keys())

with open('query.json', 'r') as f:
    distros_dict = json.load(f)

allEntries = defaultdict(dict)

for entry in distros_dict:
    allEntries[entry['countryLabel']].update({entry['year']: entry['population']})

selectedEnt = defaultdict(dict)

for country in allEntries:
    if all(elem in getList(allEntries[country]) for elem in YEARS):
        selectedEnt.update({country: allEntries[country]})

df = pd.DataFrame.from_dict(selectedEnt)
data = df.transpose()

data = data[YEARS]

predict = "2013"

X = np.array(data.drop(columns=[predict])) # Features
y = np.array(data[predict]) # Labels

x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size = 0.1)

linear = linear_model.LinearRegression()

linear.fit(x_train, y_train)
acc = linear.score(x_test, y_test)
print(acc)

print('Coefficient: \n', linear.coef_)
print('Intercept: \n', linear.intercept_)

predictions = linear.predict(x_test)

print("Country - sklearn guessed value for 2013, the Wikidata values (2007-2012), The Wikidata value (2013)")
for x in range(len(predictions)):
    for country in selectedEnt:
        if x_test[x][0] == selectedEnt[country][YEARS[0]] and x_test[x][1] == selectedEnt[country][YEARS[1]]:
            print(country, " - ", predictions[x], x_test[x], y_test[x])

Jupyter page (PAWS)