Summary of "A Review of Relational Machine Learning for Knowledge Graphs" (28 Sep 2015)

Also published on Medium (as part of the publication Towards Data Science)

## The key takeaway

Information can be structured in the form of a graph, with nodes representing entities and edges representing relationships between entities. A knowledge graph can be built manually or using automatic information extraction methods on some source text (for example Wikipedia). Given a knowledge graph, statistical models can be used to expand and complete it by inferring missing facts. As a trivial example: given the two facts "Chiara was born in Turin" and "Turin is located in Italy" we can infer the fact "Chiara was born in Italy".

## The low-down

This paper gives an overview of Knowledge Graphs and some of the methods used to build and expand them. I will summarise and explain some of the key points:

### The basics of Knowledge Graphs (KG) ^

Knowledge Graphs (KG) are a way of structuring information in graph form, by representing entities (eg: people, places, objects) as nodes, and relationships between entities (eg: being married to, being located in) as edges. Facts are typically represented as "SPO" triples: (Subject, Predicate, Object). Essentially, two nodes connected by a relationship form a fact. For example, a fact in the image above could be: "Spock is a character in Star Trek". This fact is formed by the two nodes Spock and Star Trek and the relation characterIn, forming the SPO triple (Spock, characterIn, Star Trek).

So, existing edges indicate known facts. What about missing edges?
There are two possibilities:

• Closed World Assumption (CWA): non-existing triples/edges indicate false relationships: "the fact that in Figure 1 there is no starredIn edge from Leonard Nimoy to Star Wars is interpreted to mean that Nimoy definitely did not star in this movie"
• Open World Assumption (OWA): non-existing triples/edges simply represent unknowns: since there is no starredIn edge from Leonard Nimoy to Star Wars, we don't know whether Nimoy starred in Star Wars or not

KG typically include various types of hierarchies ("Leonard Nimoy is an actor, which is a person, which is a living thing") and constraints ("a person can only marry another person, not a thing").

#### Ways of building Knowledge Graphs:

• Manually by experts or volunteers
• By automatically extracting them from semi-structured text (eg: Wikipedia infoboxes)
• By automatically extracting them from unstructured text (using natural language processing techniques)

#### Main tasks in Knowledge Graph curation:

• Link prediction: predicting missing edges in the graph (ie: missing facts)
• Entity resolution: finding different nodes and/or different edges that actually refer to the same thing. For example, a system may contain triples such as (Obama, bornIn, Hawaii) and (Barack Obama, placeOfBirth, Honolulu). We may want to merge the Obama and Barack Obama nodes as they are probably referring to the same entity.
• Link-based clustering: grouping entities based on the similarity of their links

### Statistical Relational Learning (SRL) for KG ^

Assumption: all entities and types of relationships in a graph are known (there are \text{N}_e entities and \text{N}_r types of relationships). However, triples are incomplete: that is, some of the nodes in the graph are connected, but there are also pairs of nodes that should be connected but aren't. This translates to: there are a certain number of true facts, but we only know a subset of them. There could also be duplicates of entities and relationships.

Let's call e_i the subject node (Spock), e_j the object node (Star Trek), and r_k the relationship type (characterIn). We can now model each possible triple x_{ijk} = (e_i,r_k,e_j) as a binary random variable y_{ijk} \in \{0,1\}. Here, y_{ijk} is 1 if the triple exists and 0 otherwise. In the closed world assumption a 0 indicates a false triple, while in open world it indicates an unknown. These random variables are correlated with each other, since the presence of certain triples can predict the presence/absence of other triples.

We can group all possible triples in a third-order tensor \textbf{\underline{Y}} \in \{ 0,1\}^{N_e \times N_e \times N_r}, see image below.

Each possible realisation of \textbf{\underline{Y}} is a possible "world", a certain combination of facts. We want to figure out which of these realisations is the most likely to be accurate given the limited number of triples that we know are true. To do so, we need to estimate the distribution \text{P}\textbf{(\underline{Y})} from a subset \mathcal{D} of \text{N}_d observed triples. \textbf{\underline{Y}}can be incredibly large, so this task can be very complex. For example, Freebase had "over 40 million entities and 35,000 relations", giving 10^{19} possible triples.

However, only a small subset of these triples will actually be feasible, due to certain constraints. For example, we know that the relation marriedTo can only link two nodes that refer to people, so we can already exclude all the triples (e_i,r_\text{marriedTo},e_j) where one or both of the entities are not people. Ideally, we'd like to find a way to easily identify and discard all these "impossible" triples.

#### Statistical properties of Knowledge Graphs

• As already seen, KGs usually follow a set of deterministic rules, such as:
• Type constraints: the relation marriedTo can only refer to a person
• Transitivity: if A is located in B and B located in C, then A is located in C
• They often also loosely follow a set of statistical patterns:
• Homophily (or "autocorrelation"): entities tend to be related to entities with similar characteristics
• Block structure: some entities can be grouped into "blocks" so that all members of one block have similar relationships to members of another block

#### Types of SRL models

The paper covers 3 main types of Statistical Relational Learning models:

1. Latent feature models: we assume all y_{ijk} are independent given some latent features and additional parameters
2. Graph feature models: we assume all y_{ijk} are independent given observed graph features and additional parameters
3. Markov Random Fields: we assume all y_{ijk} have local interactions [not covered in this summary]

Latent feature models and graph feature models predict the existence of a triple x_{ijk} using a scoring function f(x_{ijk};\Theta), where \Theta is some set of parameters.

### Latent Feature Models ^

In Latent Feature Models we explain triples through latent variables. For example we can explain the fact "Alec Guinness received the Academy Award" with the latent variable "Alec Guinness is a good actor".

Given an entity e_i we express its latent features with a vector \bm{e}_i \in \mathbb{R}^{H_e}. For example, let's say we have two latent features (H_e = 2): being a good actor, and being a prestigious award. We could express the latent feature vectors for the entities AlecGuinness and AcademyAward as follows:

\bm{e}_{\text{AlecGuinness}}=\begin{bmatrix} 0.9 \\ 0.2 \end{bmatrix} \bm{e}_{\text{AcademyAward}}=\begin{bmatrix} 0.2 \\ 0.8 \end{bmatrix}

where the first element of the latent vectors indicates "good actor" and the second indicates "prestigious award".

To predict triples, we somehow need to model the interactions between these latent variables. The paper reviews several methods that have been developed, I will summarise the main ones:

#### RESCAL

Perhaps the most intuitive, RESCAL "explains triples via pairwise interactions of latent features", so that the score of a triple is given by:

f_{ijk}^\text{RESCAL} := \textbf{e}_i^\intercal\textbf{W}_k\textbf{e}_j = \sum_{a=1}^{H_e} \sum_{b=1}^{H_e} w_{abk}e_{ia}e_{jb}

where "\textbf{W}_k \in \mathbb{R}^{H_e\times H_e} is a weight matrix whose entries w_{abk} specify how much the latent features a and b interact in the k-th relation". Following on the example of Alec Guinness winning the Academy Award, the k-th relation is receivedAward, and its weight matrix could be:

\textbf{W}_\text{receivedAward} = \begin{bmatrix} 0.1 & 0.9 \\ 0.1 & 0.1 \end{bmatrix}

where the top right element indicates the combination of "good actor" and "prestigious award" (using the structure of the latent vectors from before). This would model the latent fact that good actors receive prestigious awards.

Therefore, in RESCAL we have one latent vector for each entity and one weight matrix for each type of relation, so the parameters \Theta are:

\Theta=\{ \{\textbf{e}_i \}_{i=1}^{N_e}, \{\textbf{W}_k \}_{k=1}^{N_r} \}

Rewriting the scoring function
We can also write the scoring function f_{ijk}^\text{RESCAL} in a different way. As we'll see, this will be helpful later on to compare RESCAL to other methods.

f_{ijk}^\text{RESCAL} := \textbf{e}_i^\intercal\textbf{W}_k\textbf{e}_j = \sum_{a=1}^{H_e} \sum_{b=1}^{H_e} w_{abk}e_{ia}e_{jb}

We can see that we could rewrite this by first creating a vector containing all the combinations of the elements of the latent feature vectors (in the formula above, these are all the combinations e_{ia}e_{jb}). We can obtain this vector like so:

\phi_{ij}^\text{RESCAL} := \textbf{e}_i \otimes \textbf{e}_j

where \otimes is the Kronecker product of two vectors \textbf{a} \in \mathbb{R}^N and \textbf{b} \in \mathbb{R}^M, which gives a vector of dimensions NM:

\textbf{a} \otimes \textbf{b} = \begin{bmatrix} a_1\textbf{b} \\ \vdots \\ a_N\textbf{b}\end{bmatrix} = \begin{bmatrix} a_1b_1 \\ \vdots \\a_1b_M \\ \vdots \\a_Nb_1 \\ \vdots \\ a_Nb_M\end{bmatrix}

So, using the Kronecker product we can rewrite the scoring function:

f_{ijk}^\text{RESCAL} := \textbf{e}_i^\intercal\textbf{W}_k\textbf{e}_j = \sum_{a=1}^{H_e} \sum_{b=1}^{H_e} w_{abk}e_{ia}e_{jb} = \textbf{w}_k ^\intercal \phi_{ij}^\text{RESCAL}

where \textbf{w}_k = \text{vec} (\textbf{W}_k).

To summarise, we can rewrite the RESCAL model as follows:

f_{ijk}^\text{RESCAL} := \textbf{w}_k ^\intercal \phi_{ij}^\text{RESCAL}
\phi_{ij}^\text{RESCAL} := \textbf{e}_i \otimes \textbf{e}_j

Representing this visually (here, the size of the latent vectors is H_e = 3):

Parameters
So how many parameters does RESCAL require? We need to consider:

• The total number of latent features: we have N_e entities, each of which is represented by a latent feature vector of size H_e, so this gives us N_e H_e parameters
• The size of the vectors \textbf{w}_k: this has the same size as \textbf{e}_i \otimes \textbf{e}_j, so H_e^2
• The number of vectors \textbf{w}_k: this is simply the number of types of relations, so N_r.   So overall this gives N_r H_e^2 parameters for \textbf{w}_k

In total, this gives us N_e H_e + N_r H_e^2

#### Multi-Layer Perceptrons (MLP)

##### Entity Multi-Layer perceptrons (E-MLP)

We can also choose to model the interactions of latent features using a standard MLP with the latent vectors of the subject and object entities as input. So we'll have an input layer which takes the concatenation of the latent vectors:

\phi_{ij}^\text{E-MLP} :=[\textbf{e}_i; \textbf{e}_j]

We then have a hidden layer \textbf{h}_{ijk}^a which models the interactions between the latent features through a matrix \textbf{A}_k:

\textbf{h}_{ijk}^a:=\textbf{A}_k^\intercal \phi_{ij}^\text{E-MLP}

Notice how RESCAL always considers all possible interactions of the latent features through the product \textbf{e}_i \otimes \textbf{e}_j, while the E-MLP model learns these interactions through \textbf{A}_k. Finally, we output the score:

f_{ijk}^\text{E-MLP} :=\textbf{w}_k^\intercal \textbf{g}(\textbf{h}_{ijk}^a)

where \textbf{g}(\cdot) is some activation function.

We can represent this visually (here, the size of the latent vectors is H_e = 3 and the size of the hidden layer is H_a = 2):

Parameters
Just as we've done for RESCAL, we'd like to know how many parameters this model requires:

• Just like with RESCAL, the total number of parameters for the latent features is N_e H_e
• The matrix \textbf{A}_k takes the concatenation of two latent vectors and transforms it to the size of the hidden layer, so it has size H_a \times 2H_e
• \textbf{w}_k has size H_a
• We need one \textbf{A}_k and one \textbf{w}_k for each relation, so this gives us N_r(H_a \times 2H_e + H_a) parameters

In total, this gives N_e H_e + N_r(H_a \times 2H_e + H_a). Depending on our choice of H_a, this could reduce the number of parameters needed compared to RESCAL.

##### Entity-Relation Multi-Layer perceptrons (ER-MLP)

How can we further reduce the number of parameters needed?

We can embed the latent vector of the relation with a vector \textbf{r}_k \in \mathbb{R}^{H_r} and concatenate this alongside the subject and object vectors in the input:

\phi_{ij}^\text{ER-MLP} :=[\textbf{e}_i; \textbf{e}_j; \textbf{r}_k]

We then build our hidden layer:

\textbf{h}_{ijk}^c:=\textbf{C}^\intercal \phi_{ij}^\text{ER-MLP}

And our output layer:

f_{ijk}^\text{ER-MLP} :=\textbf{w}^\intercal \textbf{g}(\textbf{h}_{ijk}^c)

Notice how by embedding the relation as well, we now have a global matrix \textbf{C} instead of the k-dependent \textbf{A}_k, and a global vector \textbf{w} instead of \textbf{w}_k. This can drastically reduce the number of parameters needed.

We can represent this visually (here the latent vectors have size H_e = 3 and H_r = 3, and the hidden layer has size H_c = 3):

Parameters
Let's count the parameters:

• The total number of parameters for the latent features is N_e H_e + N_r H_r
• The number of parameters for \textbf{C} is H_c \times (2H_e+H_r)
• The number of parameters for \textbf{w} is H_c

In total, we have N_e H_e + N_r H_r +H_c \times (2H_e+H_r) + H_c

Comparing the structures of these three different models:

#### Latent Distance Models

In these models the probability of there being a relationship between two entities is given by the distance of their latent representations. The closer these representations are, the more likely the two entities are to be in a relationship. For uni-relational data (so when there is only one kind of relationship), this is fairly straightforward: we can model the probability of a pairwise relationship x_{ij} via a score function

f_{ij} = -\text{d}(\textbf{e}_i, \textbf{e}_j)

where \text{d}(\cdot, \cdot) is some distance measure (eg: the Euclidean distance).

How can we extend this to multi-relational data?
A possible solution is the Structured Embedding (SE) model, which uses the following scoring function for the triple x_{ijk}:

f_{ijk}^\text{SE} := - \|\textbf{A}_k^s \textbf{e}_i - \textbf{A}_k^o\textbf{e}_j \|_1

So the latent features are transformed through the matrices \textbf{A}_k^s and \textbf{A}_k^o and compared. These matrices are learned "in a way such that pairs of entities in existing relationships are closer to each other than entities in non-existing relationships"

### Graph Feature Models ^

Latent Feature Models use latent variables to predict new links in the graph. In contrast, Graph Feature Models make their predictions directly from the observed triples.

Some types of Graph Feature Models:

• Similarity measures for uni-relational data: the idea here is to use some measure of similarity to predict a link between two nodes, since similar entities are likely to be related. This similarity can be derived in various ways, for example by looking at the neighbours of the nodes (eg: "Do these 2 nodes have many common neighbours?")
• Rule Mining and Inductive Logic Programming: if we can extract some logical rules from the observed triples, we can use them to predict new triples in the graph. This can make the model highly interpretable as it is simply given by a set of rules. However, it is not easy learn all rules and patterns
• Path Ranking Algorithm, explored below

#### Path Ranking Algorithm (PRA)

The Path Ranking Algorithm essentially finds paths in the graph that are useful in predicting certain new edges. For example, let's say there are two nodes e_i and e_j, both of which receive incoming links of type bossOf from a third node. Then PRA might learn that there is a high probability of e_i and e_j being connected by the edge colleagueOf.

More generally, we're interested in building a model that can use longer paths between two nodes to predict a certain direct edge connecting them. Can the longer paths connecting ei and ej help us predict the existence of a direct relation rk between them?

We can use a logistic regression model to build a scoring function1I'll be using a slightly different notation from that used in the paper for the path features. I find this clearer and more consistent with how the Path Ranking Algorithm is used in other papers.

f_{ijk}^\text{PRA}:=\textbf{w}_k^\intercal \phi_{ij}^\text{PRA}

where \phi_{ij}^\text{PRA} are the features corresponding to all possible paths between e_i and e_j. So we now need to find some way of expressing these various paths in a numerical, quantifiable form that we can use as \phi_{ij}^\text{PRA}.

Let \pi=\langle r_1, r_2,…,r_L\rangle be a path type of length L, ie: a determined sequence of edge types . How can we express the path between e_i and e_j which follows path type \pi as a feature? We could use the probability that we'll end up in e_j if we start from e_i and follow a path adhering strictly to the edge types defined by \pi, assuming that at each junction we'll pick one of the possible outgoing links uniformly at random. We can express this probability as \text{P}(i \rightarrow j,\pi).

For example, let's define a path type \pi=\langle \text{friendOf, parentOf } \rangle. If we look at the graph above we can see that \text{P}(i \rightarrow j,\pi) = 0.5. In fact, starting from e_i there is only one link that satisfies the relation friendOf, so we'll get to the node e_b with a probability of 1. However, from e_b there are two possible parentOf links, so from here we'll get to e_j with a probability of 1/2.

If we call \Pi = \{ \pi_1, \pi_2, ..., \pi_n \} the set of all path types that we want to consider, we can define our feature vector as:

\phi_{ij}^\text{PRA}:=[\text{P}(i \rightarrow j,\pi): \pi \in \Pi]

and our scoring function will be:

f_{ijk}^\text{PRA}:=\textbf{w}_k^\intercal \phi_{ij}^\text{PRA} = \sum_{\pi \in \Pi} w_k^{\pi} \text{P}(i \rightarrow j,\pi)

An advantage of using this model is that it is easily interpretable. In fact, we can look at the path types that get the highest weights and see them as "rules" that the model has identified. The paper makes the example of the weights learned to predict triples (p, college, c), so predicting which college a person has attended. One of the highest weights was the one for the path type (draftedBy, school), meaning that if a person was drafted by a team belonging to a certain school (college), then they likely went to that college.

### Combining Latent and Graph Feature Models ^

Latent and Graph feature models have different strengths:

• Latent feature models work well for modelling global relational patterns and when only a small number of latent variables are sufficient to explain triples
• Graph feature models work well for modelling local relational patterns and when triples can be explained using the neighbourhood of entities or short paths in the graph

Therefore, it's useful to combine these two types of methods to leverage their respective strengths. A possible way to do this is through Additive Relational Effects (ARE) models. For example we can combine RESCAL and PRA:

f_{ijk}^\text{RESCAL+PRA}:=\textbf{w}_k^{(1)\intercal} \phi_{ij}^\text{RESCAL} + \textbf{w}_k^{(2)\intercal} \phi_{ij}^\text{PRA}

This way PRA can model the observable graph patterns and RESCAL can model the "residual errors" that cannot be modelled by PRA. This means that RESCAL will need a much smaller number of latent features than if it had to model everything by itself.

### Some more cool stuff ^

I will leave you with some potential topics to look up to expand on what I've covered in this post:

• Markov Random Fields: another type of Statistical Relational Learning model worth looking into. It's covered in the paper, though if you've never seen them before I would recommend finding some more beginner-friendly resources
• Higher-arity relations: this paper was focused on binary relations (and I believe this is the default for knowledge graphs), but it is possible to build graphs where the relationships involve more than two terms
• What about time? Some facts are only true at a certain moment in time or during a certain time interval, how can we model this?

 Nickel, Maximilian et al. A Review of Relational Machine Learning for Knowledge Graphs. (2016). Proceedings of the IEEE 104.1 (2016): 11–33.
 Gardner, M., Talukdar, P., Kisiel, B. and Mitchell, T. Improving learning and inference in a large knowledge-base using latent syntactic cues (2013). Proceedings of the 2013 conference on empirical methods in natural language processing (pp. 833-838).

1. • 