Terminology is constantly evolving and often confusing. This is especially true in tech, with the constant redefining, renaming, rebranding and reimplementing churn that companies undertake to make their products stand out. A few months ago some questions appeared in a thread about the use cases for knowledge graphs. Knowledge graphs and graph databases have had a resurgence in popularity over the last few years and, as I have a background in both, I felt mildly qualified to opine. That thread was filled with terminology. Some people were talking about knowledge graphs, others about the semantic web (OWL, RDF) and graph databases. Some were using terms incorrectly. I’d like to clarify these terms and how they are related in a short-and-sweet history lesson.
The Semantic Web
Back when the internet was an infant, Tim Berners-Lee helped publish an article that defines the Semantic Web. He has described it as:
[The Semantic Web] is about making links, so that a person or machine can explore the web of data.
It was an effort to represent all knowledge in a single connected structure. Imagine a family tree, where you have people linked with their relatives. You can follow a link called ‘mother’ to find first a mother, then a grandmother, then a great-grandmother and so on. The goal of the semantic web is to scale this out to represent all information. Data points are all connected with the metadata about how they are connected right there.
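As a rough illustration of following one kind of link again and again, here is a minimal sketch in plain Python. The family members and the `follow` helper are invented for this example and are not part of any semantic web standard.

```python
# Hypothetical family tree: each person maps link names to the linked person.
family = {
    "Alice": {"mother": "Beth"},
    "Beth": {"mother": "Carol"},
    "Carol": {"mother": "Dana"},
    "Dana": {},
}


def follow(graph, start, link):
    """Follow the same named link from `start` until it runs out."""
    path = []
    seen = {start}  # guard against accidental cycles in the data
    current = start
    while link in graph.get(current, {}):
        current = graph[current][link]
        if current in seen:
            break
        seen.add(current)
        path.append(current)
    return path


# Mother, grandmother, great-grandmother, in that order.
ancestors = follow(family, "Alice", "mother")
print(ancestors)  # ['Beth', 'Carol', 'Dana']
```

The link name travels with the data, so the same helper would work for any other relationship stored in the same structure.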
Making the semantic web a reality required defining a machine-readable structure. Many frameworks were used, the two most popular being RDF (‘Resource Description Framework’) and OWL (‘Web Ontology Language’). RDF predates the semantic web, but OWL was created as a consequence. Both remain popular and both are W3C specifications.
RDF is a data structure used to represent the web itself and to organise knowledge into statements. Statements look like logical triples, consisting of a subject, a predicate and an object. For example (taken from Wikipedia) the statement ‘The sky has the colour blue’ can be broken down into:
1. Subject: ‘the sky’
2. Predicate: ‘has the colour’
3. Object: ‘blue’
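The breakdown above can be sketched with plain Python tuples. This is only an illustration of the subject, predicate, object shape and does not use a real RDF library; the `Triple` type, the `match` helper and the second statement about grass are assumptions added for the example.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


store = [
    Triple("the sky", "has the colour", "blue"),
    Triple("grass", "has the colour", "green"),  # hypothetical extra statement
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that match every given part; None is a wildcard."""
    return [
        t
        for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# "Which things have the colour blue?"
blue_things = [t.subject for t in match(store, predicate="has the colour", obj="blue")]
print(blue_things)  # ['the sky']
```

Leaving any of the three positions open turns a statement into a question, which is the basic idea behind querying a store of triples.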
Translated into a graph, the predicates become relationships and the subjects and objects become nodes. OWL represents the schema (ontology) of the graph. It is used to model your domain and is meant to coexist with your data in the RDF store. The OWL specification complies with the RDF framework: the ontologies are also written using logical triples. Continuing the example, an OWL ontology would define the concepts of ‘geographical feature’ and ‘colour’ to represent ‘the sky’ and ‘blue’, respectively. These concepts would be linked using another predicate, for example ‘is of type’, to the objects themselves, embedding the metadata with the data.
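To make the translation concrete, here is a minimal sketch that turns triples into nodes and labelled edges, with the type statements stored right next to the data. It uses plain tuples rather than real OWL or RDF syntax, and the `types_of` helper is an assumption made for this example.

```python
# Data and schema statements share one list of (subject, predicate, object).
triples = [
    ("the sky", "has the colour", "blue"),
    ("the sky", "is of type", "geographical feature"),
    ("blue", "is of type", "colour"),
]

# Subjects and objects become nodes; predicates become labelled edges.
nodes = set()
edges = {}
for subject, predicate, obj in triples:
    nodes.update([subject, obj])
    edges.setdefault(subject, []).append((predicate, obj))


def types_of(node):
    """Return the concepts a node is linked to through 'is of type'."""
    return [obj for predicate, obj in edges.get(node, []) if predicate == "is of type"]


print(sorted(nodes))        # ['blue', 'colour', 'geographical feature', 'the sky']
print(types_of("the sky"))  # ['geographical feature']
print(types_of("blue"))     # ['colour']
```

Note that the concepts ‘geographical feature’ and ‘colour’ end up as ordinary nodes in the same graph as ‘the sky’ and ‘blue’.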
Despite being powerful modelling tools, semantic web technologies were never widely adopted outside of research circles. They can be difficult to use and understand: rooted in predicate logic and commonly serialised as XML, they do not lend themselves to easy adoption.
Graph databases
Graph databases provide a place to store data in a graph structure, much like Postgres is a storage location for tabular data. There are many popular graph databases out there, such as Neo4j and the open source JanusGraph (a fork of Titan). Graph databases are not the same thing as knowledge graphs. Knowledge graphs add services on top of some graph-based structure. You can have a graph database with no structure at all. Create nodes called ‘sky’ and ‘banana’, stick an edge between them, and that could be valid in a graph database. There is not necessarily any structure imposed.
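The ‘sky’ and ‘banana’ case can be sketched with a toy in-memory graph. This is not the API of Neo4j, JanusGraph or any other real product; the `SchemalessGraph` class and the ‘related to’ edge label are hypothetical and exist only to show that nothing checks whether an edge makes sense.

```python
class SchemalessGraph:
    """Toy in-memory stand-in for a graph store with no schema."""

    def __init__(self):
        self.nodes = set()
        self.edges = []

    def add_node(self, name):
        self.nodes.add(name)

    def add_edge(self, source, label, target):
        # No checks on which kinds of nodes may be connected.
        self.nodes.update([source, target])
        self.edges.append((source, label, target))


graph = SchemalessGraph()
graph.add_node("sky")
graph.add_node("banana")
graph.add_edge("sky", "related to", "banana")  # meaningless, but accepted

print(graph.edges)  # [('sky', 'related to', 'banana')]
```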
Knowledge graph: definition
There is no agreed-upon definition of a knowledge graph. Wikipedia points to the Google product on its knowledge graph page. Sometimes the term refers to any labelled, directed graph structure. Many researchers have taken a stab at defining the term, and one of the definitions is beginning to stick. The Semantic Web Journal defines a knowledge graph with the following characteristics:
- A Knowledge Graph is a semantic graph-based manner in which to represent data.
- Graphs are relationship-first.
- The meaning of the graph is encoded in its structure.
Knowledge graphs impose a schema on top of any graph-based structure. When you enforce an OWL ontology on top of an RDF data store, that is a knowledge graph. If you place a schema on top of a graph database and enforce it, that is a knowledge graph. The actual data storage doesn’t matter too much (though it still needs to be a graph). What is important is the ontology layer that encapsulates the domain structure on top of the data.
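A minimal sketch of what enforcing a schema on top of a graph could look like, assuming the schema is a set of allowed (subject type, predicate, object type) patterns. The `KnowledgeGraph` class, `SchemaError` and the ‘fruit’ type are hypothetical names for this example, not the API of any real product.

```python
class SchemaError(ValueError):
    """Raised when an edge does not fit any pattern in the schema."""


class KnowledgeGraph:
    def __init__(self, schema):
        # schema: set of allowed (subject type, predicate, object type) patterns
        self.schema = schema
        self.node_types = {}
        self.edges = []

    def add_node(self, name, node_type):
        self.node_types[name] = node_type

    def add_edge(self, source, predicate, target):
        pattern = (
            self.node_types.get(source),
            predicate,
            self.node_types.get(target),
        )
        if pattern not in self.schema:
            raise SchemaError(f"{pattern} is not allowed by the schema")
        self.edges.append((source, predicate, target))


schema = {("geographical feature", "has the colour", "colour")}
kg = KnowledgeGraph(schema)
kg.add_node("sky", "geographical feature")
kg.add_node("blue", "colour")
kg.add_node("banana", "fruit")

kg.add_edge("sky", "has the colour", "blue")  # fits the schema

try:
    kg.add_edge("sky", "has the colour", "banana")  # a fruit is not a colour
    rejected = False
except SchemaError:
    rejected = True

print(rejected)  # True
```

The storage underneath is still just nodes and edges; the only difference from an unconstrained graph is the check performed before an edge is accepted.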
The third point states that the ‘meaning of the graph is encoded in its structure’, which essentially means that the metadata coexists with the data itself. It’s the ability to connect that ‘Sky’ node to the ‘Geographical feature’ one. This is important because it allows the schema itself to be flexible. If the schema itself is data, it can be linked, removed and updated in the future. You can see that OWL and RDF fulfil this definition.
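As a small sketch of why schema-as-data is flexible, the example below keeps type information in the same set of triples as the data and changes the schema by adding one more triple. The ‘is a kind of’ predicate, the ‘thing’ and ‘natural feature’ concepts and the `all_types` helper are assumptions made for illustration.

```python
# Data and schema statements live together in one set of triples.
triples = {
    ("sky", "is of type", "geographical feature"),
    ("geographical feature", "is a kind of", "thing"),
}


def all_types(node, store):
    """Direct types of a node plus everything reachable via 'is a kind of'."""
    found = set()
    frontier = [o for s, p, o in store if s == node and p == "is of type"]
    while frontier:
        current = frontier.pop()
        if current in found:
            continue
        found.add(current)
        frontier.extend(
            o for s, p, o in store if s == current and p == "is a kind of"
        )
    return sorted(found)


before = all_types("sky", triples)

# Updating the schema is the same operation as updating the data.
triples.add(("geographical feature", "is a kind of", "natural feature"))
after = all_types("sky", triples)

print(before)  # ['geographical feature', 'thing']
print(after)   # ['geographical feature', 'natural feature', 'thing']
```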
The Google Knowledge Graph
Google’s Knowledge Graph is probably the best existing manifestation of the semantic web. One point to clarify: the Knowledge Graph is an internal Google product, and we use it every time we do a Google search. It is not the only knowledge graph in existence, nor did Google invent the concept. It was born out of the acquisition of Freebase, a huge graph of all knowledge that had been under development since the mid-2000s.
Today the Google Knowledge Graph powers part of every search that we do. When you google ‘Tim Berners-Lee’ and get the rectangle on the right showing you the age, picture and basic info about Tim Berners-Lee, that is the Knowledge Graph at work. It is what allows Google to show us actual entities, and it demonstrates the power of properly structured data.
Knowledge graphs can seem difficult, but they are becoming more and more accessible. Organisations like Babylon Health, Benevolent.AI and NASA are using knowledge graphs in production. There are also plenty of platforms that will help you get started, each supporting a different standard or representation framework.
Terminology is not only confusing, it is exclusionary. Communicating clearly is one of the primary skills every software engineer should develop.