Summary of “CLUTRR: A Diagnostic Benchmark for Inductive Reasoning from Text” (4 Sep 2019)

The key takeaway

The authors introduce CLUTRR (Compositional Language Understanding and Text-based Relational Reasoning), a benchmark dataset generator to test a model’s ability to learn logical rules and identify entities and relationships in a text. CLUTRR generates short stories involving various family members (parents, grandparents, aunts and uncles, etc.) and the goal of the model is to infer the relationships that haven’t been mentioned explicitly. For example, given a story where it’s mentioned that X is the mother of Y, and Y is the father of Z, the model should be able to infer that X is the grandmother of Z.

Various models are tested on CLUTRR-generated datasets. The results show that natural language understanding models still struggle with systematic generalisation and logical reasoning, especially when compared to Graph Neural Networks, which operate on structured, symbolic input and perform much better than the text-based models.

The low-down

Dataset generation ^

The process of generating datasets follows four steps, explained below.

  1. Generate a random kinship graph satisfying a set of logical rules
    Note: here we call an atom a predicate applied to entities, such as [grandfatherOf,X,Y], which states that X is the grandfather of Y.
    A logical rule is given in the form Head ⊢ Body, for example [grandfatherOf,X,Y] ⊢ [[fatherOf,X,Z],[fatherOf,Z,Y]]

    We’ll call R the set of all rules that govern family relationships (as defined by the authors of the paper); the grandfatherOf rule above is one example.

    A kinship graph is generated by randomly sampling a set of entities and relationships to form a backbone graph, and then completing that graph by applying all the relevant rules in R. The result is a graph that defines a set of entities and all relationships between those entities.
  2. Sample a target fact to predict
  3. Apply backward chaining to sample a set of k facts from the graph which are enough to infer the target fact
  4. Convert the sampled facts into a natural language story
    First, Amazon Mechanical Turk (AMT) is used to get crowd-workers to turn the sets of sampled facts (with k = 1, 2, 3) into short stories.

    The entities in the stories obtained are then replaced with placeholders to create a set of story templates. These templates can be used and combined in order to obtain stories of varying lengths (even with k > 3).
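The graph-completion part of step 1 can be pictured as forward chaining: repeatedly apply every rule in R until no new facts can be derived. Below is a minimal sketch of that idea; the triple representation, the rule set, and the entity names are my own illustrative choices, not the authors' actual generator code.

```python
# Minimal sketch of kinship-graph completion by forward chaining.
# Facts, rule heads and rule bodies are (predicate, X, Y) triples;
# X, Y, Z in rules are variables. Illustration only, not the paper's code.

RULES = [
    # Head ⊢ Body: X is Y's grandfather if X is father of Z and Z is father of Y.
    (("grandfatherOf", "X", "Y"), [("fatherOf", "X", "Z"), ("fatherOf", "Z", "Y")]),
]

def complete(facts, rules):
    """Apply all rules until a fixed point is reached."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            # Materialise bindings first so we never mutate `facts` mid-iteration.
            for binding in list(match(body, facts, {})):
                new_fact = substitute(head, binding)
                if new_fact not in facts:
                    facts.add(new_fact)
                    changed = True
    return facts

def match(body, facts, binding):
    """Yield every variable binding that satisfies all atoms in the body."""
    if not body:
        yield binding
        return
    pred, a, b = body[0]
    for (p, x, y) in facts:
        if p != pred:
            continue
        new = dict(binding)
        if new.setdefault(a, x) == x and new.setdefault(b, y) == y:
            yield from match(body[1:], facts, new)

def substitute(atom, binding):
    pred, a, b = atom
    return (pred, binding.get(a, a), binding.get(b, b))

backbone = {("fatherOf", "Alan", "Bob"), ("fatherOf", "Bob", "Carl")}
closure = complete(backbone, RULES)
print(("grandfatherOf", "Alan", "Carl") in closure)  # True
```

Step 3 (backward chaining) then works in the opposite direction: starting from the target fact, it picks a rule whose head matches and recursively expands the body until k facts from the graph remain.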

Types of possible tests ^

The task is set up as a classification problem: given a story and two input entities, the model needs to classify the relationship between the two entities.
Variations of datasets can be generated in order to test a model under different conditions. Specifically, we can test it to check the performance on systematic generalisation and robust reasoning.

A) Systematic generalisation 

  1. Linguistic generalisation: hold out some of the story templates during training and use them at test time, to see if the model can generalise on previously unseen text.
  2. Logical generalisation: during training, show the model all the rules but not all possible combinations of rules.
  3. Length of required reasoning: train on stories with a certain number of facts, but test on stories with more facts (so more steps of reasoning)

B) Robust reasoning

We can test a model’s robustness to the addition of noise, i.e. facts that are not required to answer the query. Three different types of noise can be added:

  1. Supporting facts: facts which could also be used to answer the query, but require more steps
  2. Irrelevant facts: facts that relate to one of the entities in the story but are useless to answer the query
  3. Disconnected facts: facts that have nothing to do with the main story and entities
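A rough way to picture the three categories is to look at which entities a distractor fact shares with the clean story. The bucketing below is my own coarse approximation of the paper's definitions (in particular, "both endpoints in the story" only suggests a possible alternative path), not the actual generator logic:

```python
# Coarse sketch of the three noise categories, bucketed by the entities a
# distractor fact shares with the clean story. This is an approximation of
# the paper's definitions for illustration, not the CLUTRR generator itself.

def noise_type(fact, story_entities):
    """fact is a (predicate, entity1, entity2) triple."""
    ents = {fact[1], fact[2]}
    if ents <= story_entities:
        # Both endpoints already appear in the story, so the fact could lie
        # on an alternative (longer) path to the answer.
        return "supporting"
    if ents & story_entities:
        # Touches the story through one entity, but leads away from the query.
        return "irrelevant"
    # Shares no entity with the main story at all.
    return "disconnected"

story_entities = {"Ann", "Bob", "Carl"}
print(noise_type(("marriedTo", "Ann", "Carl"), story_entities))  # supporting
print(noise_type(("sisterOf", "Dana", "Bob"), story_entities))   # irrelevant
print(noise_type(("fatherOf", "Eve", "Finn"), story_entities))   # disconnected
```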

Experiments ^

The authors used CLUTRR to evaluate several different models on the different settings described earlier.

Models tested:

  • Bidirectional LSTMs (with and without attention)
  • Relational Networks
  • Compositional Attention Network (MAC)
  • BERT
  • BERT with a trainable LSTM encoder on top of the pre-trained embeddings
  • Graph Attention Network (GAT), which receives the actual graph representation, not the natural language story

The models are used to obtain an embedding for the story, which is concatenated with the embeddings of the two entities and fed through a two-layer feed-forward neural net to obtain the output class (the predicted relationship between the two entities).
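The shared prediction head can be sketched as follows. This is a dependency-free, pure-Python illustration of "concatenate, then two-layer feed-forward"; the dimensions, random weights, and number of classes are placeholder assumptions, not the paper's hyperparameters.

```python
# Sketch of the classification head: the story embedding is concatenated with
# the two entity embeddings and passed through a two-layer feed-forward net to
# score each candidate relationship. Dimensions/weights are illustrative only.
import random

def linear(vec, weights, bias):
    """Plain dense layer; weights is a list of rows (out_dim x in_dim)."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def relu(vec):
    return [max(0.0, v) for v in vec]

def classify(story_emb, ent1_emb, ent2_emb, params):
    x = story_emb + ent1_emb + ent2_emb  # concatenation of the three embeddings
    h = relu(linear(x, params["W1"], params["b1"]))
    scores = linear(h, params["W2"], params["b2"])
    return scores.index(max(scores))     # index of the predicted relationship

random.seed(0)
d_story, d_ent, d_hidden, n_classes = 8, 4, 16, 11  # placeholder sizes
d_in = d_story + 2 * d_ent
params = {
    "W1": [[random.gauss(0, 0.1) for _ in range(d_in)] for _ in range(d_hidden)],
    "b1": [0.0] * d_hidden,
    "W2": [[random.gauss(0, 0.1) for _ in range(d_hidden)] for _ in range(n_classes)],
    "b2": [0.0] * n_classes,
}
story = [random.gauss(0, 1) for _ in range(d_story)]
e1 = [random.gauss(0, 1) for _ in range(d_ent)]
e2 = [random.gauss(0, 1) for _ in range(d_ent)]
pred = classify(story, e1, e2, params)
print(0 <= pred < n_classes)
```

In practice the story embedding comes from the model under test (LSTM, BERT, GAT, …) and the head is trained jointly with it.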


The paper reports results separately for systematic generalisation and for robust reasoning. Things to note from these results:

  • Unsurprisingly, GAT performs clearly better than the text-based models on almost all tasks.
  • When evaluating robust reasoning, the text-based models actually tend to perform better when supporting or irrelevant facts are added.
  • GAT performs worse when it is trained on clean samples and supporting/irrelevant facts are then added at test time, possibly because this introduces cycles/branches in the graph that weren’t present during training.