Keyphrases:

GloVe. Intrinsic and extrinsic evaluations. Effect of hyperparameters on analogy evaluation tasks. Correlation of human judgment with word vector distances. Dealing with word ambiguity using contexts. Window classification.

description: This set of notes first introduces the GloVe model for training word vectors. Then it extends our discussion of word vectors (interchangeably called word embeddings) by seeing how they can be evaluated intrinsically and extrinsically. As we proceed, we discuss the example of word analogies as an intrinsic evaluation technique and how it can be used to tune word embedding techniques. We then discuss training model weights/parameters and word vectors for extrinsic tasks. Lastly we motivate artificial neural networks as a class of models for natural language processing tasks.


Global Vectors for Word Representation (GloVe)

Comparison with Previous Methods

There are two main classes of methods for finding word embeddings:

  • Count-based methods that rely on matrix factorization (e.g. LSA, HAL)
    • leverage global statistical information
  • Shallow window-based methods (e.g. the skip-gram and CBOW models)
    • capture complex linguistic patterns beyond word similarity
    • fail to make use of global co-occurrence statistics

Co-occurrence Matrix

X: the word-word co-occurrence matrix

$X_{ij}$: the number of times word j occurs in the context of word i.

$P_{ij}=P(w_j\mid w_i)=\frac{X_{ij}}{X_i}$: the probability of word j appearing in the context of word i, where $X_i=\sum_k X_{ik}$.

Populating this matrix requires a single pass through the entire corpus to collect the statistics. For large corpora, this pass can be computationally expensive, but it is a one-time up-front cost.
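A minimal sketch of how such counts can be collected in a single pass (the tokenized toy corpus and window size of 2 are illustrative assumptions, not fixed by the notes):

```python
from collections import defaultdict

def cooccurrence_counts(corpus, window_size=2):
    """Count X[(i, j)]: how often word j appears within `window_size`
    positions of word i, accumulated over all sentences in the corpus."""
    X = defaultdict(float)
    for sentence in corpus:
        for center, w_i in enumerate(sentence):
            left = max(0, center - window_size)
            right = min(len(sentence), center + window_size + 1)
            for context in range(left, right):
                if context != center:
                    X[(w_i, sentence[context])] += 1.0
    return X

# Tiny illustrative corpus (already tokenized).
corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "rug"]]
X = cooccurrence_counts(corpus)
print(X[("sat", "cat")], X[("sat", "dog")])
```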

Least Squares Objective

In the skip-gram model, we use a softmax to compute the probability $Q_{ij}$ of word j appearing in the context of word i:

$Q_{ij}=\frac{\exp(u_j^T v_i)}{\sum_{w=1}^{W}\exp(u_w^T v_i)}$

The implied global cross-entropy loss can be calculated as:

$J=-\sum_{i \in corpus}\sum_{j \in context(i)}\log Q_{ij}$

Since the same words i and j can appear multiple times in the corpus, it is more efficient to first group together the terms with the same i and j:

$J=-\sum_{i=1}^{W}\sum_{j=1}^{W}X_{ij}\log Q_{ij}$

One significant drawback of the cross-entropy loss is that it requires the distribution Q to be properly normalized, which involves an expensive summation over the entire vocabulary. Instead, we use a least-squares objective in which the normalization factors in P and Q are discarded:

$\hat{J}=\sum_{i=1}^{W}\sum_{j=1}^{W}X_i(\hat{P}_{ij}-\hat{Q}_{ij})^2$

where $\hat{P}_{ij}=X_{ij}$ and $\hat{Q}_{ij}=\exp(u_j^T v_i)$ are the unnormalized distributions.
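A remaining issue, following the GloVe paper, is that $X_{ij}$ often takes very large values, which makes the optimization difficult. GloVe therefore minimizes the squared error of the logarithms and replaces the weighting factor $X_i$ with a more general function $f(X_{ij})$ that caps the influence of very frequent co-occurrences (the full model also adds bias terms, omitted here):

$\hat{J}=\sum_{i=1}^{W}\sum_{j=1}^{W}f(X_{ij})\left(u_j^T v_i-\log X_{ij}\right)^2$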

Conclusion

The GloVe model efficiently leverages global statistical information by training only on the nonzero elements in a word-word co-occurrence matrix, and produces a vector space with meaningful sub-structure.
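A minimal sketch of evaluating this weighted least-squares objective over only the nonzero co-occurrence entries (the cutoff $x_{max}=100$ and exponent $3/4$ follow the GloVe paper; the random initialization and toy counts are illustrative assumptions):

```python
import numpy as np

def glove_loss(X, U, V, x_max=100.0, alpha=0.75):
    """Weighted least-squares loss over nonzero co-occurrence counts.
    X: dict mapping (i, j) index pairs to counts X_ij.
    U: context ("output") vectors, V: center ("input") vectors."""
    loss = 0.0
    for (i, j), x_ij in X.items():               # only nonzero entries are stored
        weight = min(1.0, (x_ij / x_max) ** alpha)
        loss += weight * (U[j] @ V[i] - np.log(x_ij)) ** 2
    return loss

vocab_size, dim = 5, 8
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(vocab_size, dim))
V = rng.normal(scale=0.1, size=(vocab_size, dim))
X = {(0, 1): 3.0, (1, 0): 3.0, (2, 4): 1.0}       # sparse co-occurrence counts
print(glove_loss(X, U, V))
```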

Evaluation of Word Vectors

Intrinsic Evaluation

  • Evaluation on a specific, intermediate task (e.g. word analogies; see the sketch after this list)
  • Fast to compute performance
  • Helps understand subsystem
  • Needs positive correlation with real task to determine usefulness
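For example, word analogy questions of the form "a is to b as c is to ?" are a common intrinsic evaluation: the answer is taken to be the word whose vector is closest to $x_b - x_a + x_c$ by cosine similarity. A minimal sketch (the toy embedding matrix and vocabulary are illustrative assumptions):

```python
import numpy as np

def analogy(a, b, c, embeddings, vocab):
    """Return the word d maximizing cosine similarity to x_b - x_a + x_c,
    excluding the three query words themselves."""
    idx = {w: i for i, w in enumerate(vocab)}
    target = embeddings[idx[b]] - embeddings[idx[a]] + embeddings[idx[c]]
    target /= np.linalg.norm(target)
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = normed @ target
    for w in (a, b, c):
        scores[idx[w]] = -np.inf
    return vocab[int(np.argmax(scores))]

# Toy vectors; real evaluations use trained embeddings and large analogy sets.
vocab = ["king", "queen", "man", "woman"]
emb = np.array([[0.8, 0.9], [0.2, 0.9], [0.9, 0.1], [0.1, 0.1]])
print(analogy("man", "king", "woman", emb, vocab))  # ideally "queen"
```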

Extrinsic Evaluation

  • Evaluation on a real task
  • Can be slow to compute performance
  • Unclear whether the problem lies in the subsystem itself, in other subsystems, or in their interactions
  • If replacing the subsystem improves performance, the change is likely good

Training for Extrinsic Tasks

  • We introduce the idea of retraining the input word vectors when we train for extrinsic tasks. Retraining should only be done if the training set is large enough to cover most words from the vocabulary; otherwise, the retrained words shift relative to the rest of the vector space and performance on the final task can degrade.

  • Softmax classification loss over a dataset of N points, where $k(i)$ is the index of the correct class for example i:

    $-\sum_{i=1}^{N}\log\left(\frac{\exp(W_{k(i)\cdot}x^{(i)})}{\sum_{c=1}^{C}\exp(W_{c\cdot}x^{(i)})}\right)$

  • Window Classification: substitute $x^{(i)}$ with $x_{window}^{(i)}$, the concatenation of the word vectors in a window around position i (shown here for window size 2; a code sketch follows this list):

    $x_{window}^{(i)}=\begin{bmatrix}x^{(i-2)}\\x^{(i-1)}\\x^{(i)}\\x^{(i+1)}\\x^{(i+2)}\end{bmatrix}$
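A minimal sketch of forming the window input and scoring it with a softmax classifier (window size 2, the toy dimensions, and the random parameters are illustrative assumptions):

```python
import numpy as np

def window_vector(embeddings, i, window_size=2):
    """Concatenate the word vectors of the window centered at position i.
    The sequence is assumed to be padded so the window never runs off the edge."""
    return np.concatenate(embeddings[i - window_size : i + window_size + 1])

def softmax_classify(x_window, W):
    """Class probabilities p(y = c | x_window) under a softmax classifier."""
    scores = W @ x_window
    scores -= scores.max()                        # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs

dim, num_classes, window_size = 4, 3, 2
sentence_emb = np.random.default_rng(0).normal(size=(7, dim))   # 7 tokens
x_win = window_vector(sentence_emb, i=3, window_size=window_size)
W = np.random.default_rng(1).normal(size=(num_classes, (2 * window_size + 1) * dim))
probs = softmax_classify(x_win, W)
print(probs, probs.argmax())
```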
