Keyphrases:

GloVe. Intrinsic and extrinsic evaluations. Effect of hyperparameters on analogy evaluation tasks. Correlation of human judgment with word vector distances. Dealing with word ambiguity using contexts. Window classification.

description: This set of notes first introduces the GloVe model for training word vectors. Then it extends our discussion of word vectors (interchangeably called word embeddings) by seeing how they can be evaluated intrinsically and extrinsically. As we proceed, we discuss the example of word analogies as an intrinsic evaluation technique and how it can be used to tune word embedding techniques. We then discuss training model weights/parameters and word vectors for extrinsic tasks. Lastly we motivate artificial neural networks as a class of models for natural language processing tasks.


Global Vectors for Word Representation (GloVe)

Comparison with Previous Methods

There are two main classes of methods for finding word embeddings:

  • Count-based methods that rely on matrix factorization (e.g. LSA, HAL)
    • leverage global statistical information
  • Shallow window-based methods (e.g. the skip-gram and CBOW models)
    • capture complex linguistic patterns beyond word similarity
    • fail to make use of global co-occurrence statistics

Co-occurrence Matrix

X: the word-word co-occurrence matrix

$X_{ij}$: the number of times word j occurs in the context of word i.

$P_{ij}=P(w_j\mid w_i)=\frac{X_{ij}}{X_i}$: the probability of word j appearing in the context of word i, where $X_i=\sum_k X_{ik}$.

Populating this matrix requires a single pass through the entire corpus to collect the statistics. For large corpora, this pass can be computationally expensive, but it is a one-time up-front cost.
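A minimal sketch of how such counts can be collected in a single pass (the tokenized toy corpus and window size of 2 are illustrative assumptions, not fixed by the notes):

```python
from collections import defaultdict

def cooccurrence_counts(corpus, window_size=2):
    """Count X[(i, j)]: how often word j appears within `window_size`
    positions of word i, accumulated over all sentences in the corpus."""
    X = defaultdict(float)
    for sentence in corpus:
        for center, w_i in enumerate(sentence):
            left = max(0, center - window_size)
            right = min(len(sentence), center + window_size + 1)
            for context in range(left, right):
                if context != center:
                    X[(w_i, sentence[context])] += 1.0
    return X

# Tiny illustrative corpus (already tokenized).
corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "rug"]]
X = cooccurrence_counts(corpus)
print(X[("sat", "cat")], X[("sat", "dog")])
```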

Least Squares Objective

In the skip-gram model, we use a softmax to compute the probability $Q_{ij}$ of word j appearing in the context of word i:

$Q_{ij}=\frac{\exp(u_j^T v_i)}{\sum_{w=1}^{W}\exp(u_w^T v_i)}$

The implied global cross-entropy loss can be calculated as:

$J=-\sum_{i \in corpus}\sum_{j \in context(i)}\log Q_{ij}$

Since the same words i and j can appear multiple times in the corpus, it is more efficient to first group together the terms with the same i and j:

$J=-\sum_{i=1}^{W}\sum_{j=1}^{W}X_{ij}\log Q_{ij}$

One significant drawback of the cross-entropy loss is that it requires the distribution Q to be properly normalized, which involves an expensive summation over the entire vocabulary. Instead, we use a least-squares objective in which the normalization factors in P and Q are discarded:

$\hat{J}=\sum_{i=1}^{W}\sum_{j=1}^{W}X_i(\hat{P}_{ij}-\hat{Q}_{ij})^2$

where $\hat{P}_{ij}=X_{ij}$ and $\hat{Q}_{ij}=\exp(u_j^T v_i)$ are the unnormalized distributions.
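A remaining issue, following the GloVe paper, is that $X_{ij}$ often takes very large values, which makes the optimization difficult. GloVe therefore minimizes the squared error of the logarithms and replaces the weighting factor $X_i$ with a more general function $f(X_{ij})$ that caps the influence of very frequent co-occurrences (the full model also adds bias terms, omitted here):

$\hat{J}=\sum_{i=1}^{W}\sum_{j=1}^{W}f(X_{ij})\left(u_j^T v_i-\log X_{ij}\right)^2$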

Conclusion

The GloVe model efficiently leverages global statistical information by training only on the nonzero elements in a word-word co-occurrence matrix, and produces a vector space with meaningful sub-structure.
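A minimal sketch of evaluating this weighted least-squares objective over only the nonzero co-occurrence entries (the cutoff $x_{max}=100$ and exponent $3/4$ follow the GloVe paper; the random initialization and toy counts are illustrative assumptions):

```python
import numpy as np

def glove_loss(X, U, V, x_max=100.0, alpha=0.75):
    """Weighted least-squares loss over nonzero co-occurrence counts.
    X: dict mapping (i, j) index pairs to counts X_ij.
    U: context ("output") vectors, V: center ("input") vectors."""
    loss = 0.0
    for (i, j), x_ij in X.items():               # only nonzero entries are stored
        weight = min(1.0, (x_ij / x_max) ** alpha)
        loss += weight * (U[j] @ V[i] - np.log(x_ij)) ** 2
    return loss

vocab_size, dim = 5, 8
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(vocab_size, dim))
V = rng.normal(scale=0.1, size=(vocab_size, dim))
X = {(0, 1): 3.0, (1, 0): 3.0, (2, 4): 1.0}       # sparse co-occurrence counts
print(glove_loss(X, U, V))
```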

Evaluation of Word Vectors

Intrinsic Evaluation

  • Evaluation on a specific, intermediate task (e.g. word analogies; see the sketch after this list)
  • Fast to compute performance
  • Helps understand subsystem
  • Needs positive correlation with real task to determine usefulness
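For example, word analogy questions of the form "a is to b as c is to ?" are a common intrinsic evaluation: the answer is taken to be the word whose vector is closest to $x_b - x_a + x_c$ by cosine similarity. A minimal sketch (the toy embedding matrix and vocabulary are illustrative assumptions):

```python
import numpy as np

def analogy(a, b, c, embeddings, vocab):
    """Return the word d maximizing cosine similarity to x_b - x_a + x_c,
    excluding the three query words themselves."""
    idx = {w: i for i, w in enumerate(vocab)}
    target = embeddings[idx[b]] - embeddings[idx[a]] + embeddings[idx[c]]
    target /= np.linalg.norm(target)
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = normed @ target
    for w in (a, b, c):
        scores[idx[w]] = -np.inf
    return vocab[int(np.argmax(scores))]

# Toy vectors; real evaluations use trained embeddings and large analogy sets.
vocab = ["king", "queen", "man", "woman"]
emb = np.array([[0.8, 0.9], [0.2, 0.9], [0.9, 0.1], [0.1, 0.1]])
print(analogy("man", "king", "woman", emb, vocab))  # ideally "queen"
```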

Extrinsic Evaluation

  • Evaluation on a real task
  • Can be slow to compute performance
  • Unclear whether the problem lies in the subsystem itself, in other subsystems, or in their interactions
  • If replacing the subsystem improves performance, the change is likely good

Training for Extrinsic Tasks

  • We introduce the idea of retraining the input word vectors when we train for extrinsic tasks. Retraining should only be done if the training set is large enough to cover most words from the vocabulary; otherwise, the retrained words shift relative to the rest of the vector space and performance on the final task can degrade.

  • Softmax classification loss over a dataset of N points, where $k(i)$ is the index of the correct class for example i:

    $-\sum_{i=1}^{N}\log\left(\frac{\exp(W_{k(i)\cdot}x^{(i)})}{\sum_{c=1}^{C}\exp(W_{c\cdot}x^{(i)})}\right)$

  • Window Classification: substitute $x^{(i)}$ with $x_{window}^{(i)}$, the concatenation of the word vectors in a window around position i (shown here for window size 2; a code sketch follows this list):

    $x_{window}^{(i)}=\begin{bmatrix}x^{(i-2)}\\x^{(i-1)}\\x^{(i)}\\x^{(i+1)}\\x^{(i+2)}\end{bmatrix}$
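A minimal sketch of forming the window input and scoring it with a softmax classifier (window size 2, the toy dimensions, and the random parameters are illustrative assumptions):

```python
import numpy as np

def window_vector(embeddings, i, window_size=2):
    """Concatenate the word vectors of the window centered at position i.
    The sequence is assumed to be padded so the window never runs off the edge."""
    return np.concatenate(embeddings[i - window_size : i + window_size + 1])

def softmax_classify(x_window, W):
    """Class probabilities p(y = c | x_window) under a softmax classifier."""
    scores = W @ x_window
    scores -= scores.max()                        # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs

dim, num_classes, window_size = 4, 3, 2
sentence_emb = np.random.default_rng(0).normal(size=(7, dim))   # 7 tokens
x_win = window_vector(sentence_emb, i=3, window_size=window_size)
W = np.random.default_rng(1).normal(size=(num_classes, (2 * window_size + 1) * dim))
probs = softmax_classify(x_win, W)
print(probs, probs.argmax())
```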
