results for UNESCO query

Part 2: Intro

This is part two of a three part series.

In part 1 of this series I wrote about using Elixir to parse Wikivoyage data dumps and push them into Neo4j. If you want to follow along at home but you have no data locally, then you'll want to go back to that section before reading further. If you've already seen part 1 or if you're only interested in the resulting data structure and its interrogation, then you can read on from here.

Exploring the Data

To review from the first part of this series, what we have at this point is a simple but reasonably large graph, where nodes are WikiVoyage pages. Most nodes1 have properties for body (i.e. unparsed page content), a title (i.e. the article title), and a page_id. As far as edges, there are at this point only two kinds of relationships:

  1. the links_to relationship means our graph models how the webpage connects to itself
  2. the contains relationship gives some extra semantic information about geopolitical regions

In this part of the writeup, we'll interrogate and improve the data model. What kind of questions can be asked and answered using it? First a brief aside about graph visualization and queries..

The web UI for Neo4j is at port 7474 by default or more specifically http://10.0.2.2:7474 in the (unlikely) event you have exactly the same dual-guest vagrant setup as I do. Point your webbrowser there and play around. You can also see some vital statistics about the growth rate of nodes and relationships over time here at the /webadmin endpoint (for instance http://10.0.2.2:7474/webadmin/).

There are several CLI clients available, but I recommend using cycli or ipython-cypher.

The Cypher Query Language is of course used to interrogate the graph, regardless of whether that is happening from Elixir code, from the WebUI, or from the CLI console. There are lots of introductions to cypher out there but getting deep into that is outside of the scope of this write up and I'm still learning myself. CQL is fairly readable though so the reader's intuition, plus this quick reference, plus brief commentary will hopefully suffice.

Curating the Graph

So all preliminaries aside, this is the main event of this part of the series. We've now reached some kind of critical mass with our model, and it's possible to use information in the graph to begin to improve the graph itself. I'll be using Cypher exclusively for this but keep in mind that the queries here are intended to be simple and illustrative rather than performant.

Finding Continents

Can we discover which nodes (that is, wikipages) correspond to continents? Yes! This is actually kind of a neat trick, because it's not like pages are labeled as continents, we don't have access to that much wikimedia ontology explicitly. Instead, one has to recognize: continents are those nodes which "contain" other nodes but are not themselves contained. A first guess at a naive query might look something like this

// On the right track, but not quite correct
MATCH (continent:WikiPage)-[:contains]->(m:WikiPage)
WHERE NOT ()-[:contains]->(continent:WikiPage
RETURN DISTINCT continent limit 30

The simple query doesn't quite work because of a few warts in the datastructure. We'll fix most of this stuff later in the curation section of this write-up, but for now just to list the short-comings:

  1. Pages without page_id's are linked to from somewhere but do not actual exist (or have not been parsed yet because mix load was interrupted). These have to be excluded by making sure that the page_id property is set.
  2. Redirect pages are somewhat like continents in that they are "root nodes" that "contain" no other entities on the graph. These have to be excluded using a regex on the page body. Similarly for disambiguation pages.

With all this in mind you can puzzle out a more correct cypher query.

// A better query that takes into account warts of the graph
MATCH (continent:WikiPage)-[:contains]->(:WikiPage)
WHERE NOT (
  ()-[:contains]->(continent:WikiPage) OR
  continent.page_id IS NULL OR
  continent.body =~ '.*#REDIRECT.*' OR
  continent.body STARTS WITH "{{pagebanner|Disambiguation" )
RETURN DISTINCT continent LIMIT 30

In the screenshot below you can see that amongst the nodes returned as continents there are also nodes representing the pages for "Eurasia" and "Other Destinations". This amount of error seems negligible though, so it's not worth adding special cases to exclude them

continents query

Continent Labels

Having found continents, we should label them to avoid such an awkward query in the future. While we're at it, let's make use of the various subparts of this query to improve other graph semantics by adding new labels for :Redirect and :Disambiguation.

// Query to label redirect pages as :Redirect
MATCH (page:WikiPage) WHERE page.body =~ '.*#REDIRECT.*'
SET page :Redirect
RETURN count(page)

// Query to label disambiguation pages as :disambiguation
MATCH (page:WikiPage)
WHERE page.body STARTS WITH "{{pagebanner|Disambiguation"
SET page :Disambiguation
RETURN count(page)

Now we can simplify the previous "find continents" query while we're adding the :Continent label

MATCH (continent:WikiPage) -[:contains]-> (:WikiPage)
WHERE NOT (
  ()-[:contains]->(continent:WikiPage) OR
  continent.page_id IS NULL OR
  continent:Redirect OR
  continent:Disambiguation )
SET continent :Continent
RETURN DISTINCT continent LIMIT 30

Regional Hierarchies

What about other geopolitical or administrative hierarchies? Can we discover countries, cities, etc? There are a few obvious things we can do here, like taking advantage of patterns in page titles to find counties, provinces, and regions. Check out the relevant cypher below.

// Query to label province pages as :Province
MATCH (page:WikiPage)
WHERE page.title ENDS WITH "(province)"
SET page :Province
RETURN COUNT(page)

// Query to label county pages as :County
MATCH (page:WikiPage)
WHERE page.title ENDS WITH "(county)"
SET page :County
RETURN COUNT(page)

// Query to label region pages as :Region
MATCH (page:WikiPage)
WHERE page.title ENDS WITH "(region)"
SET page :Region
RETURN COUNT(page)

That's as much milage as we're likely to get out of title data for pages, but there's parts of page body that can also give us some insight.

It's clear at this point that the Wikivoyage :contains relationship hierarchy is somewhat complicated. As an example, since it is the case that Europe (a continent) contains Iberia (a region) which contains Spain (a country), there's no way that a naive query like "which nodes are exactly 1 :contains edge away from a continent" can be trusted to return only countries.

Looking now to page body contents to eke out more structure, certain domain-specific WikiMedia knowledge can be useful, namely article status classification templates, for instance {{UsableCity}}.

It's important to note that up until now we've only been using Cypher regexes on title properties, but to find {{UsableCity}} macros inside body properties we'll essentially be repeatedly grepping like half a gigabyte of text.

I'll continue to use regexes and string functions in Cypher, but in production you should probably be using the full-text lucene search capabilities 2.

// Label pages invoking the {{UsableCity}} macro as :City
MATCH (n:WikiPage)
  WITH n, lower(n.body) AS body
  WHERE
    body CONTAINS "{{usablecity}}" OR
    body CONTAINS "{{guidecity}}" OR
    body CONTAINS "{{starcity}}"
SET n :City
RETURN count(n)

For other labels we can do likewise and as a result our graph data structure will model the WikiVoyage ontology more and more closely. Surprisingly, the {{UsableCountry}} template macro is not very useful so we skip that in the cypher code below.

// Label pages invoking the {{UsableAirport}} macro as :Airport
MATCH (n:WikiPage)
  WITH n, lower(n.body) AS body
  WHERE
  body CONTAINS "{{usableairport}}" OR
  body CONTAINS "{{guideairport}}" OR
  body CONTAINS "{{starairport}}"
SET n :Airport
RETURN count(n)

// Label pages invoking the {{UsableRegion}} macro as :Region
MATCH (n:WikiPage)
  WITH n, lower(n.body) AS body
  WHERE
  body CONTAINS "{{usableregion}}" OR
  body CONTAINS "{{guideregion}}" OR
  body CONTAINS "{{starregion}}"
SET n :Region
RETURN count(n)

Finding Countries

Having correct and fairly complete :Country labels for the graph would be really nice. But based on this page it seems that identifying countries with the {{UsableCountry}} macro in page bodies is not painting a very complete picture.

To improve the model for our toy world (and to demonstrate some other neat things you can do with cypher) we'll have to cheat a bit by using a external (non WikiVoyage) data source. I opted to use this country list, provided via the OpenTravelData project. I massaged the SQL from that data source into a CSV file, then loaded it with Cypher.

For completeness there's a footnote3 describing how I normalized the data, but you can also just use my normalized version here.

So the strategy is that the massaged data inside country-list.txt is a comma-free text file with newlines, thus it is a kind of CSV file, and Cypher happens to do CSV parsing. To keep Neo4j happy, you must have the CSV file somehow available to the Neo4j server. I wanted to avoid copying a file to my virtual machine, so the code below uses a HTTP link into the rapidmind repository.

// Query to label country nodes as :Country
WITH 'https://raw.githubusercontent.com/mattvonrocketstein/rapidmind/master/country_list.txt'
AS csv_url
LOAD CSV FROM csv_url AS csv_line
WITH csv_line[0] AS country_name
MATCH (n:WikiPage) WHERE n.title=country_name
SET n:Country
RETURN count(n)


Since the total number of countries at any given time already depends on who you ask.. it's not that important if the model is actually 100% correct. However, a quick query in the web UI suffices to show that the approach has been more or less successful.



Redirect Edges

How about incorporating :redirects_to edges to the model? Luckily MediaWiki style redirection is pretty easy to detect. It's tempting also to copy attributes and labels "across" the redirection edge, but I'm trying to resist this until there's a specific use-case in mind (probably it would improve certain queries but be to the detriment of others).

MATCH (source:Redirect)
    WITH
      source.title as title,
      split(split(split(source.body,'[[')[1],']]')[0],'#')[0] as target_title
MATCH (target:WikiPage {title: target_title})
CREATE UNIQUE (source)-[:redirects_to]->(target)
RETURN count(*)

Querying the Graph

Pathfinding

As we've seen previously in the curating section, regional hierarchies like [continent :arrow_right: country :arrow_right: city] are already defined and accessible by the :contains edges. How can we query for such paths? Given a specific start node, for instance the page for Bangkok, we can write a query that finds the path going upwards toward the continent like so:

MATCH
  (start:WikiPage {title: "Bangkok"}),
  (end:Continent), p = shortestPath((end)-[:contains*]->(start))
RETURN p


Following the blue edges, the result shows that Bangkok is contained by Chao Phraya Basin is contained by Central Thailand is contained by...

bangkok shortest path

Innate Ambiguity

How frequently does ambiguity occur in place names? This is a tricky question to answer exactly with the data but by counting the redirects 4 and disambiguation pages we can pretty easily find information that gets us pointed in the right direction. The basic approach is

  1. Find all pages labeled as redirects. The page body will look like #REDIRECT [[RedirectTarget]]
  2. Find the page each page redirects to, de-duplicate the resulting set and count the elements

No doubt there's a better way to do it, but the cypher I came up with looks like what you see below. The split() calls you see there are taking apart the `[[link]] markdown and the reference to "#" is handling a special-case where some pages redirect to subsections of other pages.

// Counts unique redirection targets
MATCH (page:Redirect)
WITH
  page.title as title,
  split(split(split(page.body,'[[')[1],']]')[0],'#')[0] as redirect_to
RETURN count(distinct redirect_to)

// Counts disambiguation pages
MATCH (page:Disambiguation) RETURN count(distinct page)
RETURN count(distinct page)

Popular Topics

What can we say about nodes (pages) with large numbers of out-going links? Apriori, there are a a few obvious guesses: these pages might represent meta-pages, very large land-masses, or very well-documented (and thus popular?) travel destinations. You can see several such nodes in the output of the cypher query below:

// Query for finding the top-ten nodes by link-count
MATCH (n)-[r:links_to]-()
RETURN n, count(r) as rel_count
ORDER BY rel_count desc
LIMIT 10

Taking for instance the meta-page for the UNESCO World Heritage List here are some interesting queries:

// Query to count the out-going links of a particular node
MATCH (page)-[rel]->()
WHERE page.title='UNESCO World Heritage List'
RETURN count(rel)

// Query to graph part of the neighborhood in the WebUI
MATCH (page:WikiPage)-[rel]->()
WHERE page.title='UNESCO World Heritage List'
RETURN rel LIMIT 10
  1. That is, all nodes with the exception of pages that are linked to from somewhere but do not actually exist
  2. which I've written about here
  3. wget -O country_list.sql https://raw.githubusercontent.com/opentraveldata/opentraveldata/master/data/countries/27org/27org_iso_country_list_2007_12.sql && cat country_list.sql | grep INSERT | awk -F "(" '{print $2}'|awk -F ",'" '{print substr($3,0,length($3))}'|awk -F "'," '{print $1}'cat country_list.sql | grep INSERT | awk -F "(" '{print $2}'|awk -F ",'" '{print substr($3,0,length($3))}'|awk -F "'," '{print $1}'
  4. We already labeled redirects here
Part 2 of 3 in "Exploring WikiVoyage data with Neo4j and Cypher: Part 1 (Loading Data)"   next ⇒
Previous article(s)
Next article(s)