This post is a continuation of my latent tree induction research notes.

A large annotated corpus for learning natural language inference (Stanford NLI Paper, 2015)

Natural Language Inference (NLI) is the task of deciding whether one statement entails, contradicts, or is independent of another, exercising a full range of logical and commonsense inferences. This paper presents the Stanford NLI (SNLI) corpus, a corpus of 570,172 sentence pairs labelled for their human-judged relationship (entailment, contradiction, or independence), aggregated over 5 evaluators. Prior corpora suffered from indeterminacy of entity coreference (e.g. whether "a dog" in one sentence and "the dog" in the other refer to the same individual), which made the entailment/contradiction labelling catastrophically skewed towards one category or the other. To produce a more robust dataset, Amazon Mechanical Turk was used to crowdsource labelled sentence pairs (2 orders of magnitude more than prior corpora), where the premise and hypothesis in each pair describe the same scene (to resolve the entity coreference indeterminacy). The best result on the SNLI test set was obtained by an LSTM (Hochreiter and Schmidhuber, 1997), with 77.6% accuracy. This historical paper is particularly interesting to me, as it would be useful to run NLI on the ON-LSTM given the strong performance of the vanilla LSTM on this task.
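
To make the labelling scheme concrete, here is a minimal sketch of what a single NLI record looks like; the sentences and field names are illustrative assumptions, not an actual entry from SNLI.

    # A hypothetical NLI record illustrating the three-way labelling scheme.
    # The sentences and field names are invented for illustration, not taken from SNLI.
    example = {
        "premise": "A man is playing a guitar on stage.",
        "hypothesis": "A man is performing music.",
        "label": "entailment",  # the other two labels: "contradiction", "neutral"
    }
    print(example["premise"], "->", example["hypothesis"], ":", example["label"])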

ListOps: A Diagnostic Dataset for Latent Tree Learning (2018)

To test the representational abilities of latent tree learning models, Bowman et al. proposed ListOps, which better tests the parsing abilities of such models on a language with a strict syntactic formalism. Concretely, the dataset comprises strings of numbers and (spelled-out) mathematical operators, e.g. [MAX 2 9 [MIN 4 7 ] 0 ], forcing a neural model to learn a successful parsing strategy. Relying on the formal syntax of this mathematical language disambiguates the validity of the parsed output, a problem that plagued previous datasets for testing parsing: exactly one correct answer exists per input sequence (the answer is 9 for the example above). The dataset is essentially solved by TreeLSTM at 98.7% accuracy, which is expected given its access to ground-truth parses. However, existing models without access to ground-truth parses (as of 2018) top out at 74.4% accuracy, achieved by an LSTM. It could be interesting to see whether the ON-LSTM can outperform the LSTM here, given the former's ability to learn latent tree structures in language.
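
To make the task concrete, below is a minimal Python sketch of a recursive evaluator for ListOps-style strings; it only handles MAX and MIN (the actual dataset includes a few more operators), and the tokenisation assumes the bracket convention of the example above.

    def eval_listops(tokens):
        """Recursively evaluate a ListOps-style token list, e.g.
        '[MAX 2 9 [MIN 4 7 ] 0 ]'.split() evaluates to 9.
        Only MAX and MIN are handled in this sketch."""
        token = tokens.pop(0)
        if token.startswith('['):            # e.g. '[MAX' opens a sub-expression
            op = token[1:]
            args = []
            while tokens[0] != ']':
                args.append(eval_listops(tokens))
            tokens.pop(0)                    # consume the closing ']'
            return max(args) if op == 'MAX' else min(args)
        return int(token)                    # a bare digit

    print(eval_listops('[MAX 2 9 [MIN 4 7 ] 0 ]'.split()))  # -> 9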

Deep Learning for Symbolic Mathematics (ICLR 2020)

This paper frames building a computer algebra solver as a tree-to-tree translation problem. Concretely, the model treats an equation as a tree representing the hierarchy of mathematical constituents within the expression; for example, the first + in 2 + 3 × (5 + 2) is the root of the expression, as it is the last operator to be evaluated. These trees are serialised as sequences, and the tree-to-tree mapping is learnt by a seq2seq Transformer model that converts one mathematical expression into another. This matters because computer algebra solvers are particularly useful for solving differential equations, where a mathematical expression (an equation) is itself the solution, and the system can also be used to find the simplest expression constituting the solution. Interestingly, they greatly outperform commercial solvers like Matlab and Wolfram Mathematica, with accuracies of 97-100% versus ~80% for those solvers.
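
As a toy illustration of the tree-to-sequence step, the sketch below serialises a small expression tree into prefix (Polish) notation of the kind fed to the seq2seq model; the tuple layout and token names are my own assumptions, not the paper's exact tokenisation.

    # Represent 2 + 3 x (5 + 2) as a tree of (operator, left, right) tuples,
    # then serialise it into prefix (Polish) notation for a seq2seq model.
    # The tuple layout and token names are illustrative assumptions.
    expr = ('+', 2, ('*', 3, ('+', 5, 2)))   # the root '+' is evaluated last

    def to_prefix(node):
        if isinstance(node, tuple):
            op, left, right = node
            return [op] + to_prefix(left) + to_prefix(right)
        return [str(node)]

    print(to_prefix(expr))                   # ['+', '2', '*', '3', '+', '5', '2']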

Inducing Constituency Trees through Neural Machine Translation (2019)

Previously, empirical evidence suggested that only models trained on the language modeling (LM) task acquire sufficient tree-induction biases. This paper challenges that assumption by showing that models with good tree-induction biases (ON-LSTM, PRPN), when inserted into the decoder and trained on the WMT En-De dataset for machine translation, still induce trees, with the ON-LSTM obtaining an F1 score of 49.4, matching its score when trained on LM. The PRPN model scores even better at 56.1, although the PRPN parsing algorithm can be overly biased towards a right-branching language like English (as shown by Dyer et al., 2019). That being said, further statistical analysis of this paper's results reveals that the variance in F1 scores is too high to be reliable (stdev ~12-16).
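
For reference, the F1 figures here are unlabelled bracketing F1 between induced and gold constituent spans; a minimal sketch of that metric is below (treating spans as sets, a simplification of the usual evalb-style computation).

    def bracketing_f1(pred_spans, gold_spans):
        """Unlabelled bracketing F1 between predicted and gold (start, end) spans.
        Treating spans as sets is a simplification of the usual evalb-style metric."""
        pred, gold = set(pred_spans), set(gold_spans)
        if not pred or not gold:
            return 0.0
        precision = len(pred & gold) / len(pred)
        recall = len(pred & gold) / len(gold)
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)

    # e.g. induced vs. gold spans over a five-token sentence
    print(bracketing_f1({(0, 5), (1, 3)}, {(0, 5), (2, 4), (1, 3)}))  # 0.8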

Combiner: Inductively Learning Tree-Structured Attention in Transformers (ICLR 2020)

The proposed Combiner model is a Transformer that uses Sparse Attention Gates (SAG) to learn sparse attention connections between tokens, combining them hierarchically across layers to form trees. The Hierarchical Attention Block then propagates the gating signal to lower levels, connecting child nodes to their parent nodes. The model performs exceedingly well on the unsupervised constituency parsing task, obtaining an F1 score of 65.1 against ON-LSTM's 49.4.

A simple neural network module for relational reasoning (NIPS 2017)

A classic paper by DeepMind introducing a relational reasoning module, the Relational Network (RN). An RN is expressed mathematically as a function of the sum of pairwise relations between input objects (which can thus be represented as vertices and edges in a graph). RNs have been shown to be agnostic to prior biases, working equally well with spatial or temporal embeddings (i.e. generated by CNNs or RNNs). This is illustrated through good performance on VQA, bAbI, and even understanding physics through dynamic physical system inference; it could be interesting to see how juxtaposing relational reasoning with hierarchical reasoning might improve representational learning.
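
The RN composition is RN(O) = f_phi( sum over pairs (i, j) of g_theta(o_i, o_j) ); below is a minimal PyTorch-style sketch, where the hidden sizes and output dimension are placeholder assumptions rather than the paper's configuration.

    import torch
    import torch.nn as nn

    class RelationalNetwork(nn.Module):
        """Minimal Relational Network sketch: sum g over all object pairs, then apply f.
        Hidden sizes are placeholder assumptions, not the paper's exact configuration."""
        def __init__(self, obj_dim=16, hidden=64, out_dim=10):
            super().__init__()
            self.g = nn.Sequential(nn.Linear(2 * obj_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
            self.f = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                   nn.Linear(hidden, out_dim))

        def forward(self, objects):                        # objects: (batch, n, obj_dim)
            b, n, d = objects.shape
            o_i = objects.unsqueeze(2).expand(b, n, n, d)  # broadcast object i
            o_j = objects.unsqueeze(1).expand(b, n, n, d)  # broadcast object j
            pairs = torch.cat([o_i, o_j], dim=-1)          # all (o_i, o_j) pairs
            relations = self.g(pairs).sum(dim=(1, 2))      # sum relations over pairs
            return self.f(relations)

    out = RelationalNetwork()(torch.randn(2, 5, 16))       # 2 scenes of 5 objects each
    print(out.shape)                                       # torch.Size([2, 10])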

Recurrent Relational Networks (NIPS 2018)

This paper observes that the preceding Relational Networks effectively perform only one-hop relational reasoning, and proposes a multi-hop reasoning mechanism instead. This is done by casting the problem as a graph network unrolled recurrently, where node states are updated iteratively over time, and the message passing along edges is learnt by a neural function. The approach is shown to solve difficult problems, obtaining SOTA on tasks like bAbI QA, a more challenging variant of CLEVR (Pretty-CLEVR), and even Sudoku.
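
A minimal sketch of one such multi-hop step is below: each node sends an MLP-computed message to its neighbours, and node states are updated recurrently. The GRU-cell update, hidden size, and edge list are illustrative assumptions, not the paper's exact design.

    import torch
    import torch.nn as nn

    class MessagePassingStep(nn.Module):
        """One round of multi-hop relational reasoning, sketched: each node sends an
        MLP-computed message to its neighbours, and node states are updated recurrently.
        The GRU-cell update and hidden size are illustrative assumptions."""
        def __init__(self, dim=32):
            super().__init__()
            self.message = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
            self.update = nn.GRUCell(dim, dim)

        def forward(self, h, edges):                  # h: (n, dim); edges: list of (i, j)
            incoming = torch.zeros_like(h)
            for i, j in edges:                        # message from node i to node j
                incoming[j] += self.message(torch.cat([h[i], h[j]]))
            return self.update(incoming, h)           # new node states after one hop

    h = torch.randn(4, 32)                            # 4 nodes
    h = MessagePassingStep()(h, [(0, 1), (1, 2), (2, 3), (3, 0)])
    print(h.shape)                                    # torch.Size([4, 32])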

A Critical Analysis of Biased Parsers in Unsupervised Parsing (2019)

This analysis paper by Chris Dyer & Phil Blunsom (DeepMind researchers in the hierarchical reasoning space) shows that the Ordered Neurons LSTM has a right-branching bias in the unsupervised parsing task, letting it reap performance gains on right-branching languages like English. This stems from the parsing algorithm's inability to reproduce all possible trees, as it cannot produce a closed-open-open bracket sequence '][['. The paper sheds light on the pros and cons of a branching bias (which can be interpreted as a strategy): while the algorithm is indeed biased towards producing right-branching trees, this is the superior strategy when the data has an underlying right-branching bias, as shown when the algorithm is run on top of both ON-LSTM and vanilla LSTM models. The downside is when models with such directional biases are used to evaluate languages branching in the other direction (e.g. a left-branching model on a right-branching language). Some concluding open questions: is there a fairer way to evaluate cross-lingual language models? And are there better unsupervised parsing strategies that do not have directional biases encoded in the algorithm?

Do latent tree learning models identify meaningful structure in sentences? (2018)

This analysis work deeply explores the notion of trees successfully modeling linguistic information, by analysing the ST Gumbel-TreeLSTM and RL-SPINN (see the two papers above for more details). Some key findings of this paper include: the latent trees produced by these models are not consistent across random restarts; the trees do not match human-annotated trees from the Penn Treebank well; and the trees generated tend to be shallow. These insights invite follow-up questions: what do the latent trees represent? And is pre-training on human-annotated trees actually useful for downstream NLP tasks, given good performance regardless?

Towards Better Modeling Hierarchical Structure for Self-Attention with Ordered Neurons (2019)

An interesting follow-up to the ON-LSTM: self-attention is applied as an encoder layered over the baseline ON-LSTM encoder, and the summed outputs of the self-attention network and the ON-LSTM are fed into the Transformer module, a la the short-cut connection method (exposing both the current and the preceding layer to the next). This hybrid of RNNs and self-attention models performs well, beating baselines on WMT14 English-German machine translation, logical inference, and syntactic evaluation tasks. The short-cut mechanism and the hybridisation strategy are worth looking into further.

Relational Recurrent Neural Networks (2018)

This paper shows how standard memory architectures fail at relational tasks (i.e. understanding how entities are connected). The authors propose a Relational Memory Core module, employing multi-head dot-product attention to allow memories to interact, reaching SOTA on language modeling for larger datasets (e.g. WikiText-2).
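
A minimal sketch of the memory-interaction step is below, using multi-head dot-product attention so that memory slots attend over themselves and a new input; slot counts and sizes are placeholder assumptions, and the gating and MLP parts of the full module are omitted.

    import torch
    import torch.nn as nn

    # Minimal sketch of memories interacting through multi-head dot-product attention.
    # Slot count, dimensions, and the single update step are illustrative assumptions;
    # the full Relational Memory Core also adds gating and an MLP, omitted here.
    num_slots, dim = 8, 64
    memory = torch.randn(1, num_slots, dim)          # (batch, slots, dim)
    new_input = torch.randn(1, 1, dim)               # one new token's embedding

    attention = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)
    context = torch.cat([memory, new_input], dim=1)  # memories attend over memory + input
    updated_memory, _ = attention(memory, context, context)
    print(updated_memory.shape)                      # torch.Size([1, 8, 64])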

Conclusion

As my PhD is on improving models on the edge, I wanted to tackle this problem in two ways: improving the performance of models, and reducing the loss incurred by compression methods. Owing to the ubiquitous hierarchies in natural language, I researched the hierarchical inductive bias as a means to improve models' performance on NLP tasks. For my next line of work, I shall look into improving compression methods, in order to reduce the size of models deployed on the edge while maintaining performance.
