Chapter 15 - Knowledge Graphs
Authors: John Erol Evangelista
Maintainers: John Erol Evangelista
Version: 0.1
License: CC-BY-NC-SA 4.0
In this chapter we’ll take a look at knowledge graphs, a popular tool for visualizing and querying the interconnections between multiple entities, which allows us to discover hidden connections between them.
Introduction to Databases
When thinking about the purpose of a database, the first thing that comes to mind is a table of items with a bunch of properties. For example, the items in a grocery store could be stored in a table where each record is an item that is up for sale, with properties such as price, expiration date, quantity, and supplier. What if you also want a table that lists all the suppliers? Such a table may hold each supplier’s name, website, e-mail, and phone number. Each supplier may also have its own “items” table listing all the items that supplier sells. If all these tables are organized into a database, we can query it with questions such as “which supplier sells the cheapest oranges?”, or execute a command such as “if the count of garlic is below 10, order 50 containers from supplier x”. Databases make it easy to keep track of data stored in different related tables and to draw information from several of those tables in a single query.

Table 1. Items in a grocery store

| Item | Price | Expiration Date | Quantity | Supplier |
|---|---|---|---|---|
| Garlic | 1.79 | 01-23 | 100 | Veggie Inc |
| Onion | 0.99 | 01-23 | 65 | Veggie Inc |
| Cabbage | 1.59 | 12-22 | 50 | Cabbage Inc |
| Orange | 0.99 | 12-22 | 47 | Cabbage Inc |

The most commonly used databases are relational databases (RDBs). Relational databases store information in tables and provide a query language, namely SQL, to create tables, populate these tables with data, update the data in such tables, and query the data for answers to questions such as “which supplier sells the cheapest oranges?”. This standardized language makes it relatively easy to create, maintain, and manipulate databases. Popular free, open-source RDB systems include PostgreSQL, MySQL, MariaDB (a fork of MySQL), and SQLite, a lightweight RDB that stores each database in a single file.
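The four SQL operations mentioned above (create, populate, update, query) can be sketched with Python’s built-in `sqlite3` module. This is a minimal illustration rather than a recommended schema: the table name `items`, the column names, and the sale of 95 containers of garlic are assumptions made up for the example, while the rows come from the grocery table.

```python
import sqlite3

# In-memory SQLite database; nothing is written to disk.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# CREATE: a table mirroring the grocery example (column names are assumptions).
cur.execute(
    """
    CREATE TABLE items (
        item       TEXT,
        price      REAL,
        expiration TEXT,
        quantity   INTEGER,
        supplier   TEXT
    )
    """
)

# POPULATE: the four rows from the grocery table.
rows = [
    ("Garlic", 1.79, "01-23", 100, "Veggie Inc"),
    ("Onion", 0.99, "01-23", 65, "Veggie Inc"),
    ("Cabbage", 1.59, "12-22", 50, "Cabbage Inc"),
    ("Orange", 0.99, "12-22", 47, "Cabbage Inc"),
]
cur.executemany("INSERT INTO items VALUES (?, ?, ?, ?, ?)", rows)

# UPDATE: a hypothetical sale of 95 containers of garlic.
cur.execute("UPDATE items SET quantity = quantity - 95 WHERE item = 'Garlic'")

# QUERY: which supplier sells the cheapest oranges?
cur.execute(
    "SELECT supplier, price FROM items "
    "WHERE item = 'Orange' ORDER BY price LIMIT 1"
)
cheapest_orange = cur.fetchone()

# QUERY: which items have fewer than 10 units left and should be reordered?
cur.execute("SELECT item, supplier FROM items WHERE quantity < 10")
to_reorder = cur.fetchall()

conn.commit()
conn.close()

print(cheapest_orange)
print(to_reorder)
```

With only one orange row in the table the “cheapest oranges” query is trivial, but the same statement works unchanged once several suppliers list oranges at different prices.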
The tables below show an example of a relational database that catalogs authors of publications and their collaborations. To store the information in a database, we define three tables: authors (with the author ID, name, and institution) (Table 2); publications (with the paper ID, title, and journal) (Table 3); and collaborations (which links two authors and a paper via their respective IDs) (Table 4).

Table 2. Authors

| ID | Name | Institution |
|---|---|---|
| 1 | Ryu | JBU |
| 2 | Lana | JBU |
| 3 | Lei | ZMU |
| 4 | Jessie | YC |
| 5 | Jim | DHU |

Table 3. Publications

| ID | Paper | Journal |
|---|---|---|
| 1 | Paper 1 | Nature |
| 2 | Paper 2 | NAR |
| 3 | Paper 3 | NAR |

Table 4. Collaborations

| Author 1 | Author 2 | Paper ID |
|---|---|---|
| 1 | 2 | 2 |
| 1 | 3 | 2 |
| 2 | 3 | 2 |
| 2 | 4 | 1 |
| 2 | 5 | 1 |
| 4 | 5 | 1 |
| 3 | 5 | 3 |

We could instead combine the three tables above into one big table, as in Table 5, but doing so introduces data redundancy: an author’s institution, for example, is repeated in every row in which that author appears. Tables 2-4 are normalized tables. Normalization reduces data redundancy in relational databases and thereby improves data integrity. In general, we want to avoid many-to-many relationships because these make querying the database problematic. While normalizing a database solves some issues, it may lead to complex queries that join multiple tables to fetch search results.

Table 5. Authors, publications, and collaborations combined into one table

| Author 1 | Author 1 Institution | Author 2 | Author 2 Institution | Paper | Journal |
|---|---|---|---|---|---|
| Ryu | JBU | Lana | JBU | Paper 2 | NAR |
| Ryu | JBU | Lei | ZMU | Paper 2 | NAR |
| Lana | JBU | Lei | ZMU | Paper 2 | NAR |
| Lana | JBU | Jessie | YC | Paper 1 | Nature |

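To make the trade-off concrete, the sketch below loads the three normalized tables into an in-memory SQLite database and rebuilds the wide, combined table with joins. The table and column names (`authors`, `publications`, `collaborations`, `author_1`, and so on) are assumptions chosen for the example; the data are the rows of Tables 2-4. Note that the query needs three joins, and that the combined result repeats an author’s details once per collaboration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized schema (names are assumptions made for this sketch).
cur.executescript(
    """
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT, institution TEXT);
    CREATE TABLE publications (id INTEGER PRIMARY KEY, paper TEXT, journal TEXT);
    CREATE TABLE collaborations (
        author_1 INTEGER REFERENCES authors(id),
        author_2 INTEGER REFERENCES authors(id),
        paper_id INTEGER REFERENCES publications(id)
    );
    """
)
cur.executemany("INSERT INTO authors VALUES (?, ?, ?)", [
    (1, "Ryu", "JBU"), (2, "Lana", "JBU"), (3, "Lei", "ZMU"),
    (4, "Jessie", "YC"), (5, "Jim", "DHU"),
])
cur.executemany("INSERT INTO publications VALUES (?, ?, ?)", [
    (1, "Paper 1", "Nature"), (2, "Paper 2", "NAR"), (3, "Paper 3", "NAR"),
])
cur.executemany("INSERT INTO collaborations VALUES (?, ?, ?)", [
    (1, 2, 2), (1, 3, 2), (2, 3, 2), (2, 4, 1),
    (2, 5, 1), (4, 5, 1), (3, 5, 3),
])

# Rebuild the wide table: authors is joined twice (once per collaborator)
# and publications once, so three joins are needed for a single result.
cur.execute(
    """
    SELECT a1.name, a1.institution, a2.name, a2.institution, p.paper, p.journal
    FROM collaborations AS c
    JOIN authors AS a1 ON a1.id = c.author_1
    JOIN authors AS a2 ON a2.id = c.author_2
    JOIN publications AS p ON p.id = c.paper_id
    ORDER BY c.rowid
    """
)
wide_rows = cur.fetchall()
conn.close()

# Redundancy check: how many wide rows carry a copy of Lana's details?
lana_copies = sum(1 for row in wide_rows if "Lana" in (row[0], row[2]))

for row in wide_rows:
    print(row)
print("Rows that repeat Lana's institution:", lana_copies)
```

In the normalized layout Lana’s institution is stored once, in the authors table; in the wide layout it is stored in every row she appears in, so a change of institution would have to be applied in several places.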
For example, suppose we want to find which authors have collaborated with both Jessie and Lei. Let’s do this step by step. First, we join the Authors and Collaborations tables on the author ID to find the collaboration rows that involve Jessie or Lei. Once we have those rows, we refer back to the Authors table to get the names of their collaborators. We see that both Lana and Jim have collaborated with each of them. Can you identify which papers they have written together? Note that joins can be expensive. This example is a simple demonstration, but you can imagine doing the same across multiple large tables, where both the computational cost and the complexity of the query can become a hurdle. To address this, we look at graph databases, which are designed to structure data around its connections.
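The step-by-step lookup above can be sketched in SQL, again using Python’s `sqlite3` module. The schema names and the helper view `pairs` are assumptions introduced for this example. Because a collaboration row stores each pair of authors only once, the view lists every pair in both directions so that an author can be matched regardless of which column they appear in.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Same normalized tables as in Tables 2-4 (names are assumptions).
cur.executescript(
    """
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT, institution TEXT);
    CREATE TABLE publications (id INTEGER PRIMARY KEY, paper TEXT, journal TEXT);
    CREATE TABLE collaborations (author_1 INTEGER, author_2 INTEGER, paper_id INTEGER);

    -- Hypothetical helper view: every collaboration listed in both directions.
    CREATE VIEW pairs AS
        SELECT author_1 AS a, author_2 AS b, paper_id FROM collaborations
        UNION ALL
        SELECT author_2 AS a, author_1 AS b, paper_id FROM collaborations;
    """
)
cur.executemany("INSERT INTO authors VALUES (?, ?, ?)", [
    (1, "Ryu", "JBU"), (2, "Lana", "JBU"), (3, "Lei", "ZMU"),
    (4, "Jessie", "YC"), (5, "Jim", "DHU"),
])
cur.executemany("INSERT INTO publications VALUES (?, ?, ?)", [
    (1, "Paper 1", "Nature"), (2, "Paper 2", "NAR"), (3, "Paper 3", "NAR"),
])
cur.executemany("INSERT INTO collaborations VALUES (?, ?, ?)", [
    (1, 2, 2), (1, 3, 2), (2, 3, 2), (2, 4, 1),
    (2, 5, 1), (4, 5, 1), (3, 5, 3),
])

# Step 1 and 2: join authors to the collaboration pairs, keep the rows that
# involve Jessie or Lei, and keep only partners linked to both of them.
cur.execute(
    """
    SELECT partner.name
    FROM pairs
    JOIN authors AS target  ON target.id  = pairs.a
    JOIN authors AS partner ON partner.id = pairs.b
    WHERE target.name IN ('Jessie', 'Lei')
    GROUP BY partner.id, partner.name
    HAVING COUNT(DISTINCT target.id) = 2
    ORDER BY partner.name
    """
)
common = [name for (name,) in cur.fetchall()]

# Follow-up: which papers connect Jessie and Lei to those partners?
placeholders = ", ".join("?" for _ in common)
cur.execute(
    f"""
    SELECT target.name, partner.name, p.paper
    FROM pairs
    JOIN authors AS target  ON target.id  = pairs.a
    JOIN authors AS partner ON partner.id = pairs.b
    JOIN publications AS p  ON p.id = pairs.paper_id
    WHERE target.name IN ('Jessie', 'Lei')
      AND partner.name IN ({placeholders})
    ORDER BY target.name, partner.name
    """,
    common,
)
shared_papers = cur.fetchall()
conn.close()

print(common)
print(shared_papers)
```

Even for five authors and three papers, the second query already joins three tables on top of a view, which hints at how quickly such queries grow on larger, more interconnected data.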