Natural language inference

In [1]:
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2016 term"


Natural Language Inference (NLI) is the task of predicting the logical relationships between words, phrases, sentences, (paragraphs, documents, ...). Such relationships are crucial for all kinds of reasoning in natural language: arguing, debating, problem solving, summarization, and so forth.

Our NLI data will look like this:

  • (every dog danced, every puppy moved) $\Rightarrow$ entailment
  • (a puppy danced, no dog moved) $\Rightarrow$ contradiction
  • (a dog moved, no puppy danced) $\Rightarrow$ neutral

The first sentence is the premise and the second is the hypothesis (logicians call it the conclusion).

We looked at NLI briefly in our word-level entailment bake-off (the wordentail.ipynb notebook). The purpose of this codebook is to introduce the problem of NLI more fully in the context of the Stanford Natural Language Inference corpus (SNLI). We'll explore two general approaches:

  • Standard linear classifiers
  • Recurrent neural networks

This should be a good starting point for exploring richer models of NLI. It's also fun because it sets up a battle royale between models that require serious linguistic analysis (the linear ones) and models that are claimed by advocates to require no such analysis (deep learning).

In [2]:
import os
import re
import sys
import pickle
import numpy as np
import itertools
from collections import Counter
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
import utils
from nltk.tree import Tree
from nli_rnn import ClassifierRNN
In [3]:
%config InlineBackend.figure_formats=['svg']


  1. Make sure your environment includes all the requirements for the cs224u repository. It's okay if you couldn't get TensorFlow to work – it's not required for this notebook.

  2. Download the nli-data data distribution and put it in the same directory as this notebook (or update snli_sample_src just below).

  3. For the homework: make sure you've run nltk.download() to get the NLTK data. (In particular, you need to use NLTK's WordNet API.)

In [4]:
# Home for our SNLI sample. Because SNLI is very large, we'll work with a 
# small sample from the training set in class.
snli_sample_src = os.path.join('nli-data', 'snli_1.0_cs224u_sample.pickle')

# Load the dataset: a dict with keys `train`, `dev`, and `vocab`. The first
# two are lists of `dict`s sampled from the SNLI JSONL files. The third is
# the complete vocabulary of the leaves in the trees for `train` and `dev`.
snli_sample = pickle.load(open(snli_sample_src, 'rb'))

dict_keys(['dev', 'vocab', 'train'])

Working with SNLI

SNLI contains regular string representations of the data, unlabeled binary parses like the following:

( ( A child ) ( is ( playing ( in ( a yard ) ) ) ) )

and labeled binary parses like

(ROOT (S (NP (DT A) (NN child)) (VP (VBZ is) (VP (VBG playing) (PP (IN in) (NP (DT a) (NN yard))))) (. .)))
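When only the words are needed, the unlabeled binary parses can be tokenized without a tree library. The following is a minimal sketch; the helper name `binary_parse_leaves` is our own invention, and it assumes that every bracket is a separate whitespace-delimited token, which holds for the example above but should be checked against the full corpus.

```python
def binary_parse_leaves(s):
    """Return the words of an unlabeled binary parse such as
    '( ( A child ) ( is playing ) )' by discarding the brackets.
    Assumes brackets are always separate whitespace-delimited tokens."""
    return [tok for tok in s.split() if tok not in ('(', ')')]

parse = "( ( A child ) ( is ( playing ( in ( a yard ) ) ) ) )"
leaves = binary_parse_leaves(parse)
print(leaves)
```

The labeled parses carry category symbols as well, so for those it is safer to rely on a proper tree reader.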

Here are the class labels that we wish to learn to predict:

In [5]:
LABELS = ['contradiction', 'entailment', 'neutral']

The training set for SNLI contains 550,152 sentence pairs, with sentences varying in length from 2 to 62 words. This is too large for in-class experiments and assignments. This is why we're working with the sample in snli_sample:

In [6]:
In [7]:
In [8]:

Both train and dev are balanced across the three classes, with sentences varying in length from 3 to 6 words. These limitations will allow us to explore lots of different models in class. You're encouraged to try out your ideas on the full dataset outside of class (perhaps as part of your final project).
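Claims like these about class balance and sentence length are easy to check for any list of SNLI-style dicts. Here is a small sketch on invented toy examples; the field names `gold_label` and `sentence1` follow the SNLI JSONL format, while `toy_train`, `label_distribution`, and `length_range` are hypothetical names used only for illustration.

```python
from collections import Counter

# Toy stand-in for a list like `snli_sample['train']`. The examples
# are invented; only the field names follow the SNLI JSONL format.
toy_train = [
    {'gold_label': 'entailment', 'sentence1': 'A dog runs .'},
    {'gold_label': 'contradiction', 'sentence1': 'A child is playing .'},
    {'gold_label': 'neutral', 'sentence1': 'Two men talk .'},
]

def label_distribution(examples):
    """Count how many examples carry each gold label."""
    return Counter(d['gold_label'] for d in examples)

def length_range(examples, field='sentence1'):
    """Return (min, max) sentence length in whitespace tokens."""
    lengths = [len(d[field].split()) for d in examples]
    return min(lengths), max(lengths)

print(label_distribution(toy_train))
print(length_range(toy_train))
```

Whitespace tokenization is a simplification here; counting the leaves of the parsed trees would be the more faithful measure.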


The following function can be used to turn bracketed strings like the above into trees:

In [9]:
def str2tree(s):
    """Map str `s` to an `nltk.tree.Tree` instance. The assumption is that 
    `s` represents a standard Penn-style tree."""
    return Tree.fromstring(s)    
In [10]:
t = str2tree("""(ROOT
  (S
    (NP (DT A) (NN child))
    (VP (VBZ is)
      (VP (VBG playing)
        (PP (IN in)
          (NP (DT a) (NN yard)))))
    (. .)))""")


For baseline models, we often want just the words, also called terminal nodes or leaves. We can access them with the leaves method on nltk.tree.Tree instances:

In [11]:
t.leaves()
['A', 'child', 'is', 'playing', 'in', 'a', 'yard', '.']


To make it easy to run through the corpus, let's define general readers for the data. The general function for this yields triples consisting of the left tree and the right tree, as parsed by str2tree, and finally the label:

In [12]:
def snli_reader(sample):
    """Reader for SNLI data. `sample` just needs to be an iterator over
    the SNLI JSONL files. For this notebook, it will always be 
    `snli_sample`, but, for example, the following should work for the 
    corpus files:

    import json    
    def sample(src_filename):
        for line in open(src_filename):
            yield json.loads(line)

    Yields
    ------
    tuple
        (tree1, tree2, label), where the trees are from `str2tree` and
        label is in `LABELS` above.
    """
    for d in sample:
        # `sentence2_parse` and `gold_label` are the standard SNLI field
        # names that pair with `sentence1_parse`.
        yield (str2tree(d['sentence1_parse']), 
               str2tree(d['sentence2_parse']),
               d['gold_label'])

def train_reader():
    """Convenience function for reading just the training data."""
    return snli_reader(snli_sample['train'])

def dev_reader():
    """Convenience function for reading just the dev data."""
    return snli_reader(snli_sample['dev'])

Linear classifier approach

To start, we'll adopt an approach that is essentially identical to that of the supervisedsentiment.ipynb notebook: we'll train simple MaxEnt classifiers on representations of the data obtained from hand-built feature functions.

This notebook defines some common baseline features based on pairings of information in the premise and hypothesis. As usual, one can realize big performance gains quickly by improving on these baseline representations.

Baseline linear classifier features

The first baseline we define is the word overlap baseline. It simply uses as features the words that appear in both sentences.

In [13]:
def word_overlap_phi(t1, t2):    
    """Basis for features for the words in both the premise and hypothesis.
    This tends to produce very sparse representations.

    Parameters
    ----------
    t1, t2 : `nltk.tree.Tree`
        As given by `str2tree`.

    Returns
    -------
    collections.Counter
       Maps each word in both `t1` and `t2` to 1.
    """
    overlap = set([w1 for w1 in t1.leaves() if w1 in t2.leaves()])
    return Counter(overlap)

Another popular baseline is the full cross-product of words from both sentences:

In [14]:
def word_cross_product_phi(t1, t2):
    """Basis for cross-product features. This tends to produce pretty 
    dense representations.

    Parameters
    ----------
    t1, t2 : `nltk.tree.Tree`
        As given by `str2tree`.

    Returns
    -------
    collections.Counter
        Maps each (w1, w2) in the cross-product of `t1.leaves()` and 
        `t2.leaves()` to its count. This is a multi-set cross-product
        (repetitions matter).
    """
    return Counter([(w1, w2) for w1, w2 in itertools.product(t1.leaves(), t2.leaves())])

Both of these feature functions return count dictionaries mapping feature names to the number of times they occur in the data. This is the representation we'll work with throughout; sklearn will handle the further processing it needs to build linear classifiers.
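To make the two representations concrete, here is a small worked example on the pair (every dog danced, every puppy moved). It is only a sketch: it works on plain word lists rather than `nltk.tree.Tree` instances so that it runs on its own, and the variable names are ours.

```python
import itertools
from collections import Counter

premise = ['every', 'dog', 'danced']
hypothesis = ['every', 'puppy', 'moved']

# Word overlap: each word occurring in both sentences maps to 1.
overlap = Counter(set(premise) & set(hypothesis))

# Cross-product: every (premise word, hypothesis word) pair, with
# repeated pairs counted as many times as they occur.
cross = Counter(itertools.product(premise, hypothesis))

print(overlap)
print(len(cross), cross[('dog', 'puppy')])
```

The overlap representation has a single feature for this pair, whereas the cross-product has nine, including the informative pair ('dog', 'puppy').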

Naturally, you can do better than these feature functions! Both of these might be useful even in a more advanced model, though.

Building datasets for linear classifier experiments

As usual, the first step in training a classifier is using a feature function like the one above to turn the data into a list of training instances (feature representations and their associated labels):

In [15]:
def build_linear_classifier_dataset(reader, phi, vectorizer=None):
    """Create a dataset for training classifiers using `sklearn`.

    Parameters
    ----------
    reader
        An SNLI iterator like `snli_reader` above. Just needs to
        yield (tree, tree, label) triples.
    phi : feature function
        Maps trees to count dictionaries.
    vectorizer : `sklearn.feature_extraction.DictVectorizer`   
        If this is None, then a new `DictVectorizer` is created and
        used to turn the list of dicts created by `phi` into a 
        feature matrix. This happens when we are training.
        If this is not None, then it's assumed to be a `DictVectorizer` 
        and used to transform the list of dicts. This happens in 
        assessment, when we take in new instances and need to 
        featurize them as we did in training.

    Returns
    -------
    dict
        A dict with keys 'X' (the feature matrix), 'y' (the list of
        labels), 'vectorizer' (the `DictVectorizer`), and 
        'raw_examples' (the original tree pairs, for error analysis).
    """
    feat_dicts = []
    labels = []
    raw_examples = []
    for t1, t2, label in reader():
        d = phi(t1, t2)
        feat_dicts.append(d)
        labels.append(label)
        raw_examples.append((t1, t2))
    if vectorizer is None:
        # Training: fit a new vectorizer, which fixes the feature columns.
        vectorizer = DictVectorizer(sparse=True)
        feat_matrix = vectorizer.fit_transform(feat_dicts)
    else:
        # Assessment: reuse the columns established in training.
        feat_matrix = vectorizer.transform(feat_dicts)
    return {'X': feat_matrix, 
            'y': labels, 
            'vectorizer': vectorizer, 
            'raw_examples': raw_examples}
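The reason the training vectorizer is handed back in at assessment time is that the columns of the feature matrix are fixed when the vectorizer is fit. The sketch below imitates that behavior with a deliberately tiny pure-Python stand-in. `TinyVectorizer` is hypothetical and is not how `DictVectorizer` is implemented; it only mirrors the fit/transform distinction and the fact that features never seen in training are ignored.

```python
class TinyVectorizer:
    """Toy stand-in for a dict vectorizer: `fit_transform` fixes the
    column order, and `transform` reuses it, ignoring unseen features."""

    def fit_transform(self, feat_dicts):
        names = sorted({name for d in feat_dicts for name in d})
        self.index = {name: i for i, name in enumerate(names)}
        return self.transform(feat_dicts)

    def transform(self, feat_dicts):
        rows = []
        for d in feat_dicts:
            row = [0.0] * len(self.index)
            for name, value in d.items():
                if name in self.index:  # unseen features are dropped
                    row[self.index[name]] = float(value)
            rows.append(row)
        return rows

train_feats = [{'dog': 1, 'moved': 1}, {'puppy': 2}]
assess_feats = [{'dog': 1, 'cat': 1}]  # 'cat' never occurred in training

vec = TinyVectorizer()
X_train = vec.fit_transform(train_feats)
X_assess = vec.transform(assess_feats)
print(X_train)
print(X_assess)
```

If a fresh vectorizer were fit on the assessment data instead, its columns would no longer line up with the weights learned in training.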

Training linear classifiers

To keep this notebook relatively simple, we adopt a bare-bones training framework, using just a standard-issue MaxEnt classifier. The following function is from supervisedsentiment.ipynb:

In [16]:
def fit_maxent_classifier(X, y):    
    """Wrapper for `sklearn.linear_model.LogisticRegression`. This is also 
    called a Maximum Entropy (MaxEnt) Classifier, which is more fitting 
    for the multiclass case.

    Parameters
    ----------
    X : 2d np.array
        The matrix of features, one example per row.
    y : list
        The list of labels for rows in `X`.

    Returns
    -------
    sklearn.linear_model.LogisticRegression
        A trained `LogisticRegression` instance.
    """
    mod = LogisticRegression(fit_intercept=True)
    mod.fit(X, y)
    return mod

For a more robust and responsible approach, see the supervisedsentiment.ipynb notebook, especially the section on hyperparameter search.

Running linear classifier experiments

The linear_classifier_experiment function handles the book-keeping associated with running experiments. It essentially just combines all of the above pieces in a flexible way. If you decide to expand this codebase for real experiments, then you'll likely want to incorporate more of the functionality from the supervisedsentiment.ipynb notebook, especially its method for comparing different models statistically.

In [17]:
def linear_classifier_experiment(
        train_reader, 
        assess_reader, 
        phi=word_overlap_phi, 
        train_func=fit_maxent_classifier):
    """Runs experiments on our SNLI fragment.

    Parameters
    ----------
    train_reader, assess_reader
        SNLI iterators like `snli_reader` above. Just needs to
        yield (tree, tree, label) triples.
    phi : feature function (default: `word_overlap_phi`)
        Maps trees to count dictionaries.
    train_func : model wrapper (default: `fit_maxent_classifier`)
        Any function that takes a feature matrix and a label list
        as its values and returns a fitted model with a `predict`
        function that operates on feature matrices.

    Returns
    -------
    str
        A formatted `classification_report` from `sklearn`.
    """
    train = build_linear_classifier_dataset(train_reader, phi)    
    assess = build_linear_classifier_dataset(assess_reader, phi, vectorizer=train['vectorizer'])
    # Use the supplied wrapper so that other models can be swapped in:
    mod = train_func(train['X'], train['y'])
    predictions = mod.predict(assess['X'])
    return classification_report(assess['y'], predictions)
In [18]:
             precision    recall  f1-score   support

contradiction       0.41      0.58      0.48      1000
 entailment       0.45      0.34      0.39      1000
    neutral       0.35      0.29      0.32      1000

avg / total       0.40      0.40      0.40      3000

In [19]:
             precision    recall  f1-score   support

contradiction       0.63      0.58      0.60      1000
 entailment       0.55      0.63      0.59      1000
    neutral       0.56      0.53      0.54      1000

avg / total       0.58      0.58      0.58      3000
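The avg / total rows can be checked by hand. Because every class has the same support here (1000 examples), the support-weighted average coincides with the plain mean of the per-class scores. A quick sketch, with the F1 values copied from the two reports above (the variable and function names are ours):

```python
# Per-class F1 scores copied from the two reports above, in the order
# contradiction, entailment, neutral.
first_report_f1 = [0.48, 0.39, 0.32]
second_report_f1 = [0.60, 0.59, 0.54]

def macro_average(scores):
    """Unweighted mean of per-class scores."""
    return sum(scores) / len(scores)

print(round(macro_average(first_report_f1), 2))
print(round(macro_average(second_report_f1), 2))
```

With unbalanced classes the two kinds of average would come apart, so it is worth checking the support column before comparing such numbers across datasets.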

A few ideas for better classifier features

  • Cross product of synsets compatible with each word, as given by WordNet. (Here is a codebook on using WordNet from NLTK to do things like this.)

  • More fine-grained WordNet features — e.g., spotting pairs like puppy/dog across the two sentences.

  • Use of other WordNet relations (see Table 1 and Table 2 in this codelab for relations and their coverage).

  • Using the tree structure to define features that are sensitive to how negation scopes over constituents.

  • Features that are sensitive to differences in negation between the two sentences.

  • Sentiment features seeking to identify contrasting sentiment polarity.
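As one illustration of the negation idea in this list, here is a minimal sketch of a feature function that fires when exactly one of the two sentences contains a negation marker. Everything in it is an assumption made for illustration: the marker list is small and hand-picked, the feature names are invented, and the function takes plain word lists rather than trees so that it can run on its own.

```python
from collections import Counter

# A small, hand-picked set of negation markers. This list is an
# assumption for illustration, not an exhaustive lexicon.
NEGATION_WORDS = {'no', 'not', "n't", 'never', 'nobody', 'nothing'}

def negation_mismatch_phi(leaves1, leaves2):
    """Sketch of a feature function over word lists (not trees)."""
    neg1 = any(w.lower() in NEGATION_WORDS for w in leaves1)
    neg2 = any(w.lower() in NEGATION_WORDS for w in leaves2)
    feats = Counter()
    if neg1 != neg2:
        feats['negation_mismatch'] = 1
    elif neg1 and neg2:
        feats['both_negated'] = 1
    return feats

print(negation_mismatch_phi(['a', 'puppy', 'danced'], ['no', 'dog', 'moved']))
```

To use something like this with the functions above, one would call it on `t1.leaves()` and `t2.leaves()` and merge its output with other feature dictionaries.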

Recurrent neural network approach

Very recently, recurrent neural networks (RNNs) have become one of the dominant approaches to NLI, and there is a great deal of interest in the extent to which they can learn to simulate the powerful symbolic approaches that have long dominated work in NLI.

The goal of this section is to give you some hands-on experience with using RNNs to build NLI models. Because these models are demanding not only in terms of data but also in terms of training time, we'll get only a glimpse of what they can do, but I think even this glimpse clearly indicates their great potential.

Classifier RNN model definition

The model we'll be exploring is probably the simplest one that fits the NLI problem. It's depicted in the following diagram:

This model would actually work for any classification task. For instance, you could revisit the supervisedsentiment notebook and try it out on the Stanford Sentiment Treebank.

The dominant applications for RNNs to date have been for language modeling and machine translation. Those models have many more output vectors than ours. For a wonderful step-by-step introduction to such models, see Denny Britz's four-part tutorial (in the form of a notebook like this one). See also Andrej Karpathy's insightful, clear overview of different RNN architectures. (Both Denny and Andrej are Stanford researchers!)

The above diagram is a kind of schematic for the following model definition:

$$h_{t} = \tanh\left(x_{t}W_{xh} + h_{t-1}W_{hh}\right)$$

$$y = \text{softmax}\left(h_{n}W_{hy} + b\right)$$

where $n$ is the sequence length and $1 \leqslant t \leqslant n$. As indicated in the above diagram, the sequence of hidden states is padded with an initial state $h_{0}$. In our implementation, this is always an all $0$ vector, but it can be initialized in more sophisticated ways.

It's important to see that there is just one $W_{xh}$, just one $W_{hh}$, and just one $W_{hy}$.
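The parameter sharing is perhaps easiest to see in code. The following is a minimal NumPy sketch of the forward pass defined by the two equations above. It is not the course implementation: the function names, the dimensions, and the random initialization are all assumptions chosen for illustration.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def rnn_forward(X, W_xh, W_hh, W_hy, b):
    """Forward pass for a sequence `X` of shape (n, input_dim).
    Returns the hidden states (including h_0) and the prediction y."""
    n = X.shape[0]
    hidden_dim = W_hh.shape[0]
    h = np.zeros((n + 1, hidden_dim))  # h[0] is the all-0 initial state
    for t in range(1, n + 1):
        # The same W_xh and W_hh are reused at every time step:
        h[t] = np.tanh(X[t - 1].dot(W_xh) + h[t - 1].dot(W_hh))
    # Only the final hidden state feeds the classifier layer:
    y = softmax(h[-1].dot(W_hy) + b)
    return h, y

rng = np.random.RandomState(0)
input_dim, hidden_dim, output_dim, n = 4, 5, 3, 6
X = rng.uniform(-0.5, 0.5, size=(n, input_dim))
W_xh = rng.uniform(-0.5, 0.5, size=(input_dim, hidden_dim))
W_hh = rng.uniform(-0.5, 0.5, size=(hidden_dim, hidden_dim))
W_hy = rng.uniform(-0.5, 0.5, size=(hidden_dim, output_dim))
b = np.zeros(output_dim)

h, y = rnn_forward(X, W_xh, W_hh, W_hy, b)
print(h.shape, y)
```

Note that the number of parameters does not depend on the sequence length $n$; longer inputs just mean more applications of the same matrices.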

Our from-scratch implementation of the above model is in nli_rnn.py. As usual, the goal of this code is to illuminate the above concepts and clear up any lingering underspecification in descriptions like the above. The code also shows how backpropagation through time works in these models. You'll see that it is very similar to regular backpropagation as we used it in the simpler word-entailment bake-off (using the feed-forward networks from

Building datasets for classifier RNNs

The following function uses our snli_reader infrastructure to create datasets for training and assessing RNNs. The steps:

  • Concatenate the leaves of the premise and hypothesis trees into a sequence
  • Use the LABELS vector defined above to turn each string label into a one-hot vector.
In [20]:
def build_rnn_dataset(reader):
    """Build RNN datasets.

    Parameters
    ----------
    reader
        SNLI iterator like `snli_reader` above. Just needs to
        yield (tree, tree, label) triples.

    Returns
    -------
    list of tuples
        The first member of each tuple is a list of strings (the
        concatenated leaves) and the second is an np.array 
        (dimension 3) with a single 1 for the true class and 0s
        in the other two positions.
    """
    dataset = []
    for (t1, t2, label) in reader():
        seq = t1.leaves() + t2.leaves()
        y_ = np.zeros(3)
        y_[LABELS.index(label)] = 1.0
        dataset.append((seq, y_))
    return dataset

Running classifier RNN experiments

Next we define functions for the training and assessment steps. It's currently baked in that you want to train with train_reader and assess on dev_reader. If you start doing serious experiments, you'll want to move to a more flexible set-up like the one we established above for linear classifiers (and see supervisedsentiment.ipynb for even more ideas).

The important thing to see about this function is that it requires a vocab argument and an embedding argument:

  • vocab is a list of strings. It needs to contain every word we'll encounter in training or assessment.
  • embedding is a 2d matrix in which the ith row gives the input representation for the ith member of vocab.

This gives you flexibility in how you represent the inputs. In the experiment run below, the inputs are just random vectors, but the homework asks you to try out GloVe inputs.
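The relationship between vocab and embedding can be sketched in a few lines. This is an illustration only: the toy vocabulary, the dimensionality, the helper `embed_sequence`, and the use of NumPy's uniform sampler as a stand-in for the course's random-vector utility are all assumptions.

```python
import numpy as np

vocab = ['a', 'dog', 'moved', 'no', 'puppy']
embed_dim = 4

# One row per vocabulary item. Random uniform values are used here as
# a stand-in for whatever initialization the course utilities provide.
rng = np.random.RandomState(42)
embedding = rng.uniform(-0.5, 0.5, size=(len(vocab), embed_dim))

# Row i of `embedding` represents vocab[i]:
word2index = {w: i for i, w in enumerate(vocab)}

def embed_sequence(seq):
    """Map a list of words to a (len(seq), embed_dim) matrix."""
    return embedding[[word2index[w] for w in seq]]

X = embed_sequence(['no', 'dog', 'moved'])
print(X.shape)
```

Swapping in pretrained vectors only changes how the rows of `embedding` are filled; the lookup itself stays the same.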

In [21]:
def rnn_experiment(
        vocab, 
        embedding, 
        hidden_dim=10, 
        eta=0.05, 
        maxiter=10):
    """Classifier RNN experiments.

    Parameters
    ----------
    vocab : list of str
        Must contain every word we'll encounter in training or assessment.
    embedding : np.array
        Embedding matrix for `vocab`. The ith row gives the input 
        representation for the ith member of vocab. Thus, `embedding`
        must have the same row count as the length of vocab. Its
        columns can be any length. (That is, the input word 
        representations can be any length.)
    hidden_dim : int (default: 10)
        Dimensionality of the hidden representations. This is a
        parameter to `ClassifierRNN`.
    eta : float (default: 0.05)
        The learning rate. This is a parameter to `ClassifierRNN`.       
    maxiter : int (default: 10)
        Maximum number of training epochs. This is a parameter 
        to `ClassifierRNN`.       

    Returns
    -------
    str
        A formatted `sklearn` `classification_report`.
    """
    # Training:
    train = build_rnn_dataset(train_reader)       
    # Assumed `ClassifierRNN` interface (constructor arguments and 
    # `fit`), based on the parameter descriptions above; confirm 
    # against `nli_rnn.py`.
    mod = ClassifierRNN(
        vocab, 
        embedding, 
        hidden_dim=hidden_dim, 
        eta=eta,
        maxiter=maxiter)
    mod.fit(train)
    # Assessment:
    assess = build_rnn_dataset(dev_reader) 
    return rnn_model_evaluation(mod, assess)
In [22]:
def rnn_model_evaluation(mod, assess, labels=LABELS):
    """Assess a trained `ClassifierRNN`.

    Parameters
    ----------
    mod : `ClassifierRNN`
        Should be a model trained on data in the same format as
        `assess`.
    assess : list
        A list of (seq, label) pairs, where seq is a sequence of
        words and label is a one-hot vector giving the label.        
    labels : list of str (default: `LABELS`)
        The class names, in the order used for the one-hot vectors.

    Returns
    -------
    str
        A formatted `sklearn` `classification_report`.
    """
    # Assessment:
    gold = []
    predictions = []    
    for seq, y_ in assess:
        # The gold labels are vectors. Get the index of the single 1
        # and look up its string in `labels`:
        gold.append(labels[np.argmax(y_)])
        # `predict` returns the index of the highest score.
        p = mod.predict(seq) 
        predictions.append(labels[p])
    # Report:
    return classification_report(gold, predictions)

Here's an example run. All input and hidden dimensions are quite small, as is maxiter. This is just so you can run experiments quickly and see what happens. Nonetheless, the performance is close to that of the word-overlap linear classifier above (an average F1 of 0.38 versus 0.40), which is encouraging about this approach.

In [23]:
vocab = snli_sample['vocab']

# Random embeddings of dimension 10:
randvec_embedding = np.array([utils.randvec(10) for w in vocab])

# A small network, trained for just a few epochs to see how things look:
print(rnn_experiment(vocab, randvec_embedding, hidden_dim=10, eta=0.001, maxiter=10))
Finished epoch 10 of 10; error is 1.0777186145
             precision    recall  f1-score   support

contradiction       0.35      0.38      0.36      1000
 entailment       0.41      0.52      0.46      1000
    neutral       0.39      0.26      0.31      1000

avg / total       0.38      0.39      0.38      3000

Next steps for NLI deep learning models

As noted above, ClassifierRNN is just about the simplest model we could use for this task. Some thoughts on where to take it:

  • Additional hidden layers can be added. This is a relatively simple change to the code: one just needs to define a version of $W_{hh}$ for each layer, respecting the desired dimensions for the representations of the layers it connects. The backpropagation steps are also straightforward duplications of what happens between the current layers.

  • ClassifierRNN uses the most basic (non-linear) activation functions. In TensorFlow, it is easy to try more advanced designs, including Long Short-Term Memory (LSTM) cells and Gated Recurrent Unit (GRU) cells. The documentation for these is currently a bit hard to find, but here's the well-documented source code.

  • Our implementation uses the same parameter $W_{hh}$ for the premise and hypothesis. It is common to split this into two, with the final hidden state from the premise providing the initial hidden state of the hypothesis.

  • The SNLI leaderboard shows the value of adding attention layers. These are additional connections between premise and hypothesis. They can be made for each pair of words or just for the final hidden representation in the premise and hypothesis.

  • Our implementation currently has only a single learning rate parameter. A well-tested improvement on this is the AdaGrad method, which can straightforwardly be added to the ClassifierRNN implementation.

  • Our implementation is regularized only in the sense that the number of iterations acts to control the size of the learned weights. Within deep learning, an increasingly common regularization strategy is drop-out.

  • We haven't made good use of trees. Like many linguists, I believe trees are necessary for capturing the nuanced ways in which we reason in language, and this new paper offers empirical evidence that trees are important for SNLI. Tree-structured neural networks are by now well-understood extensions of feed-forward neural networks and so are well within reach for a final project. The Stanford Deep Learning course site is a great place to get started.
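To give a sense of how small the AdaGrad change mentioned in the list above is, here is a sketch of a single update step in NumPy. It is a generic illustration rather than a patch for ClassifierRNN: the function name, the toy parameter vector, and the placement of the small constant `eps` outside the square root (one common variant) are our choices.

```python
import numpy as np

def adagrad_update(w, grad, cache, eta=0.05, eps=1e-8):
    """One AdaGrad step. `cache` accumulates squared gradients, so
    parameters with a history of large gradients take smaller steps."""
    cache = cache + grad ** 2
    w = w - eta * grad / (np.sqrt(cache) + eps)
    return w, cache

w = np.array([1.0, 1.0])
cache = np.zeros_like(w)
grad = np.array([10.0, 0.1])  # very different gradient magnitudes

w, cache = adagrad_update(w, grad, cache)
print(w, cache)
```

On the first step, both parameters move by roughly `eta` despite the hundredfold difference in gradient size, which is the per-parameter scaling that a single global learning rate cannot provide.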

Additional NLI resources

Homework 4

1. WordNet-based entailment features [4 points]

Python NLTK has an excellent WordNet interface. As noted above, WordNet is a natural choice for defining useful features in the context of NLI.

Your task: write and submit a feature function, for use with build_linear_classifier_dataset and linear_classifier_experiment, that is just like word_cross_product_phi except that, given a sentence pair $(S_{1}, S_{2})$, it counts only pairs $(w_{1}, w_{2})$ such that $w_{1}$ entails $w_{2}$, for $w_{1} \in S_{1}$ and $w_{2} \in S_{2}$. For example, the sentence pair (the cat runs, the animal moves) would create the dictionary {(cat, animal): 1.0, (runs, moves): 1.0}.

There are many ways to do this. For the purposes of the question, we can limit attention to the WordNet hypernym relation. The following illustrates reasonable ways to go from a string $s$ to the set of all hypernyms of Synsets consistent with $s$:

In [24]:
from nltk.corpus import wordnet as wn
puppies = wn.synsets('puppy')
print([h for ss in puppies for h in ss.hypernyms()])

# A more conservative approach uses just the first-listed 
# Synset, which should be the most frequent sense:
print(puppies[0].hypernyms())
[Synset('dog.n.01'), Synset('pup.n.01'), Synset('young_person.n.01')]
[Synset('dog.n.01'), Synset('pup.n.01')]

A note on performance: in our experience, this feature function (used in isolation) gets a mean F1 of about 0.32. This is not very high, but that's perhaps not surprising given its sparsity.

2. Pretrained RNN inputs [2 points]

In the simple RNN experiment above, we used random input vectors. In the word-entailment bake-off, pretraining was clearly beneficial. What are the effects of using pretrained inputs here?


  1. A function build_glove_embedding that creates an embedding space for all of the words in snli_sample['vocab']. (You can use any GloVe file you like; the 50d one will be fastest.) See randvec_embedding above if you need further guidance on the nature of the data structure to produce. If you encounter any words in snli_sample['vocab'] that are not in GloVe, have your function map them instead to a random vector of the appropriate dimensionality (see utils.randvec).

  2. A function call for rnn_experiment using your GloVe embedding. (You can set the other parameters to rnn_experiment however you like.)

  3. The output of this function. (You won't be evaluated by how strong the performance is. We're just curious.)

You can use utils.glove2dict to read in the GloVe data into a dict.

A note on performance: your numbers will vary widely, depending on how you configure your network and how long you let it train. You will not be evaluated on the performance of your code, but rather only on whether your functions do their assigned jobs.

3. Learning negation [4 points]

The goal of this question is to begin to assess the extent to which RNNs can learn to simulate compositional semantics: the way the meanings of words and phrases combine to form more complex meanings. We're going to do this with simulated data so that we have clear learning targets and so we can track the extent to which the models are truly generalizing in the desired ways.

Data and background

The base dataset is nli_simulated_data.pickle in this directory (the root folder of the cs224u repository). (You'll see below why it's the "base" dataset.)

In [25]:
simulated_data = pickle.load(open('nli_simulated_data.pickle', 'rb'))

This is a list of pairs: the first member is itself a pair of lists (the two arguments) and the second member is a label:

In [26]:
simulated_data[:5]
[([['a'], ['a']], 'equal'),
 ([['a'], ['c']], 'superset'),
 ([['a'], ['b']], 'neutral'),
 ([['a'], ['e']], 'superset'),
 ([['a'], ['d']], 'neutral')]

The letters are arbitrary names, but the dataset was generated in a way that ensures logical consistency. For instance, if ([['x'], ['y']], 'subset') is in the data and ([['y'], ['z']], 'subset') is in the data, then ([['x'], ['z']], 'subset') is as well (transitivity of subset).
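A consistency property like this can be verified mechanically. Here is a small sketch that looks for violations of subset transitivity in data of this format. The function name and the toy examples are invented for illustration; running it on the real dataset would simply mean passing in `simulated_data`.

```python
def subset_transitivity_violations(data):
    """Return (x, z) pairs for which x subset y and y subset z are in
    `data` but x subset z is not."""
    subset_pairs = {(p[0], q[0]) for (p, q), rel in data if rel == 'subset'}
    violations = []
    for x, y in subset_pairs:
        for y2, z in subset_pairs:
            if y == y2 and (x, z) not in subset_pairs:
                violations.append((x, z))
    return violations

# Invented toy data in the same format as `simulated_data`:
toy = [
    ([['x'], ['y']], 'subset'),
    ([['y'], ['z']], 'subset'),
    ([['x'], ['z']], 'subset'),
]

print(subset_transitivity_violations(toy))      # consistent
print(subset_transitivity_violations(toy[:2]))  # missing x subset z
```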

Here's the full label set:

In [27]:
simulated_labels = ['disjoint', 'equal', 'neutral', 'subset', 'superset']

These labels are interpreted as mutually exclusive. For example, 'subset' is proper subset and 'superset' is proper superset – both exclude the case where the two arguments are equal.

As usual, we have to do a little bit of work to prepare the data for use with ClassifierRNN:

In [28]:
def build_sim_dataset(dataset):
    """Map `dataset`, in the same format as `simulated_data`, to a 
    dataset that is suitable for use with `ClassifierRNN`: the input 
    sequences are flattened into a single list and the label string 
    is mapped to the appropriate one-hot vector.    
    """
    rnn_dataset = []
    for (p, q), rel in dataset:
        y_ = np.zeros(len(simulated_labels))
        y_[simulated_labels.index(rel)] = 1.0        
        rnn_dataset.append((p+q, y_))
    return rnn_dataset            

Finally, here is the full vocabulary, which you'll need in order to create embedding spaces:

In [29]:
sim_vocab = ["not"] + sorted(set([p[0] for x,y in simulated_data for p in x]))

['not', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n']

Task 1: Experiment function [2 points]

Complete the function sim_experiment so that it trains a ClassifierRNN on a dataset produced by build_sim_dataset and evaluates that classifier on a dataset produced by build_sim_dataset:

In [30]:

def sim_experiment(train_dataset, test_dataset, word_dim=10, hidden_dim=10, eta=0.001, maxiter=100):
    # Create an embedding for `sim_vocab`:

    # Change the value of `mod` to a `ClassifierRNN` instance using 
    # the user-supplied arguments to `sim_experiment`:

    # Fit the model:

    # Return the evaluation on `test_dataset`:
    return rnn_model_evaluation(mod, test_dataset, labels=simulated_labels)

Submit: Your completed sim_experiment.

Task 2: Memorize the training data [1 point]

Fiddle with sim_experiment until you've found settings that yield perfect accuracy on the training data. In other words, if d is the dataset you created with build_sim_dataset, then sim_experiment(d, d) should yield perfect performance on all classes. (If it's a little off, that's okay.)

Submit: Your function call to sim_experiment showing the values of all the parameters. If you need to write any code to prepare arguments for the function call, then include those lines as well.

Tip: set eta very low. This will lead to slower but more stable learning. You might also pick high word_dim and hidden_dim to ensure that you have sufficient representational power. These settings in turn demand a large number of iterations.

In [31]:

Task 3: Negation and generalization [1 point]

Now that we've established that the model works, we want to start making the data more complex. To do this, we'll simply negate one or both arguments and assign them the relation determined by their original label and the logic of negation. For instance, the training instance

p q, subset

will become

not p not q, superset
p not q, disjoint
not p q, neutral

The full logic of this is a somewhat liberal interpretation of the theory of negation developed by MacCartney and Manning 2007.

$$
\begin{array}{l c c c}
\hline
 & \text{not-}p, \text{not-}q & p, \text{not-}q & \text{not-}p, q \\
\hline
p \text{ disjoint } q & \text{neutral} & \text{subset} & \text{superset} \\
p \text{ equal } q & \text{equal} & \text{disjoint} & \text{disjoint} \\
p \text{ neutral } q & \text{neutral} & \text{neutral} & \text{neutral} \\
p \text{ subset } q & \text{superset} & \text{disjoint} & \text{neutral} \\
p \text{ superset } q & \text{subset} & \text{neutral} & \text{disjoint} \\
\hline
\end{array}
$$

If you don't want to worry about the details, that's fine – you can treat negate_dataset as a black-box. Just think of it as implementing the theory of negation.

In [32]:
def negate_dataset(dataset):
    """Map `dataset` to a new dataset that has been thoroughly negated."""
    new_dataset = []
    for (p, q), rel in dataset:        
        neg_p = ["not"] + p
        neg_q = ["not"] + q
        combos = [[neg_p, neg_q], [p, neg_q], [neg_p, q]]
        new_rels = None
        if rel == "disjoint":
            new_rels = ("neutral", "subset", "superset")
        elif rel == "equal":
            new_rels = ("equal", "disjoint", "disjoint") 
        elif rel == "neutral":
            new_rels = ("neutral", "neutral", "neutral")
        elif rel == "subset":
            new_rels = ("superset", "disjoint", "neutral")
        elif rel == "superset":
            new_rels = ("subset", "neutral", "disjoint") 
        new_dataset += zip(combos, new_rels)
    return new_dataset
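The same logic can be written as a lookup table, which makes it easy to inspect single examples and to ask what happens under repeated negation. This is only a sketch that restates the table above; `NEGATION_TABLE`, `negate_example`, and `double` are names we made up for the illustration.

```python
# The negation table above as a dict. Each value lists the new
# relations for (not-p, not-q), (p, not-q), and (not-p, q), in that order.
NEGATION_TABLE = {
    'disjoint': ('neutral', 'subset', 'superset'),
    'equal': ('equal', 'disjoint', 'disjoint'),
    'neutral': ('neutral', 'neutral', 'neutral'),
    'subset': ('superset', 'disjoint', 'neutral'),
    'superset': ('subset', 'neutral', 'disjoint'),
}

def negate_example(p, q, rel):
    """Return the three negated variants of a single example."""
    neg_p, neg_q = ['not'] + p, ['not'] + q
    combos = [[neg_p, neg_q], [p, neg_q], [neg_p, q]]
    return list(zip(combos, NEGATION_TABLE[rel]))

for combo, rel in negate_example(['a'], ['c'], 'subset'):
    print(combo, rel)

# Relation obtained by negating both arguments twice, per the table:
double = {rel: NEGATION_TABLE[NEGATION_TABLE[rel][0]][0]
          for rel in NEGATION_TABLE}
print(double)
```

Under this table, negating both arguments twice returns the original relation for every label except 'disjoint', which maps to 'neutral' and stays there. That is one concrete sense in which the table is a liberal interpretation of the underlying theory.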

Using negate_dataset, we can map the base dataset to a singly negated one and then create a ClassifierRNN dataset from that:

In [33]:
neg1 = negate_dataset(simulated_data)
neg1_rnn = build_sim_dataset(neg1)

Your task: use your sim_experiment to train a network on train_dataset plus neg1, and evaluate it on a dataset that has been doubly negated by running negate_dataset(neg1) and preparing the result for use with a ClassifierRNN. Use the same hyperparameters that you used to memorize the data for task 2.

Submit: the code you write to run this experiment and the output (which should be from a use of sim_experiment).

A note on performance: our mean F1 dropped to about 0.61, because we stuck to the rules and used exactly the configuration that led to perfect results on the training set above, as is required. You will not be evaluated based on the numbers you achieve, but rather only on whether you successfully run the required experiment.

That's all that's required. Of course, we hope you are now extremely curious to see whether you can find hyperparameters that generalize well to double negation, and how many times you can negate a dataset and still get good predictions out! neg3 and beyond?!

In [34]: