class TreeNode(object):
    def __init__(self, x):
        self.val = x
        self.left = None
        self.right = None


def listToTreeNode(inputValues):
    """Build a binary tree from a level-order list where None marks a missing child."""
    root = TreeNode(inputValues[0])
    nodeQueue = [root]
    front = 0
    index = 1
    while index < len(inputValues):
        node = nodeQueue[front]
        front += 1
        # attach the left child, if present
        item = inputValues[index]
        index += 1
        if item is not None:
            node.left = TreeNode(item)
            nodeQueue.append(node.left)
        if index >= len(inputValues):
            break
        # attach the right child, if present
        item = inputValues[index]
        index += 1
        if item is not None:
            node.right = TreeNode(item)
            nodeQueue.append(node.right)
    return root


def printTree(root):
    """Print each node together with its left and right children, in level order."""
    if not root:
        return
    queue = [root]
    ret = []
    while queue:
        for _ in range(len(queue)):
            res = []
            curNode = queue.pop(0)
            res.append(curNode.val)
            if curNode.left:
                queue.append(curNode.left)
                res.append(curNode.left.val)
            else:
                res.append('None')
            if curNode.right:
                queue.append(curNode.right)
                res.append(curNode.right.val)
            else:
                res.append('None')
            ret.append(res)
    for item in ret:
        print('{} -> {}, {}'.format(*item))


if __name__ == "__main__":
    a = [1, 2, 3, 4, None, 6, None]
    root = listToTreeNode(a)
    printTree(root)
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.Queue;

class TreeNode {
    int val;
    TreeNode left;
    TreeNode right;
    TreeNode(int x) { val = x; }
}

public class Solution {
    // Build a binary tree from a comma-separated level-order string where "null" marks a missing child.
    public static TreeNode listToTreeNode(String input) {
        String[] parts = input.split(",");
        String item = parts[0];
        TreeNode root = new TreeNode(Integer.parseInt(item));
        Queue<TreeNode> nodeQueue = new LinkedList<>();
        nodeQueue.add(root);

        int index = 1;
        while (!nodeQueue.isEmpty()) {
            TreeNode node = nodeQueue.remove();

            if (index == parts.length)
                break;
            item = parts[index++].trim();
            if (!item.equals("null")) {
                node.left = new TreeNode(Integer.parseInt(item));
                nodeQueue.add(node.left);
            }

            if (index == parts.length)
                break;
            item = parts[index++].trim();
            if (!item.equals("null")) {
                node.right = new TreeNode(Integer.parseInt(item));
                nodeQueue.add(node.right);
            }
        }
        return root;
    }

    // Print each node together with its left and right children, in level order.
    public static void printTree(TreeNode root) {
        if (root == null)
            return;
        LinkedList<TreeNode> queue = new LinkedList<>();
        queue.add(root);
        List<List<String>> ret = new LinkedList<>();
        while (!queue.isEmpty()) {
            int levelSize = queue.size();   // fix the level size before the queue is modified
            for (int i = 0; i < levelSize; i++) {
                ArrayList<String> res = new ArrayList<>();
                TreeNode curNode = queue.removeFirst();
                res.add(String.valueOf(curNode.val));
                if (curNode.left != null) {
                    queue.add(curNode.left);
                    res.add(String.valueOf(curNode.left.val));
                } else {
                    res.add("null");
                }
                if (curNode.right != null) {
                    queue.add(curNode.right);
                    res.add(String.valueOf(curNode.right.val));
                } else {
                    res.add("null");
                }
                ret.add(res);
            }
        }
        for (List<String> item : ret)
            System.out.printf("%s -> %s, %s%n", item.get(0), item.get(1), item.get(2));
    }

    public static void main(String[] args) {
        TreeNode root = listToTreeNode("1,2,3,4,null,6,null");
        printTree(root);
    }
}
class BasicMatrixMath {
    private int mod = 1000000007;

    // Element-wise addition modulo mod (modifies and returns matrixa).
    public long[][] add(long[][] matrixa, long[][] matrixb) {
        for (int i = 0; i < matrixa.length; i++)
            for (int j = 0; j < matrixa[0].length; j++) {
                matrixa[i][j] = (matrixa[i][j] + matrixb[i][j]) % mod;
            }
        return matrixa;
    }

    // Element-wise subtraction modulo mod; the extra "+ mod" keeps results non-negative.
    public long[][] sub(long[][] matrixa, long[][] matrixb) {
        for (int i = 0; i < matrixa.length; i++)
            for (int j = 0; j < matrixa[0].length; j++) {
                matrixa[i][j] = ((matrixa[i][j] - matrixb[i][j]) % mod + mod) % mod;
            }
        return matrixa;
    }

    // Standard O(n^3) matrix multiplication modulo mod.
    public long[][] mul(long[][] matrixa, long[][] matrixb) {
        long[][] result = new long[matrixa.length][matrixb[0].length];
        for (int i = 0; i < matrixa.length; i++) {
            for (int j = 0; j < matrixb[0].length; j++) {
                result[i][j] = 0;
                for (int k = 0; k < matrixa[0].length; k++) {
                    result[i][j] += matrixa[i][k] * matrixb[k][j];
                    result[i][j] %= mod;
                }
            }
        }
        return result;
    }

    // Exponentiation by squaring: computes a^n with O(log n) multiplications (n >= 1).
    public long[][] power(long[][] a, int n) {
        if (n == 1)
            return a;
        long[][] b = power(a, n / 2);
        long[][] c = mul(b, b);
        if (n % 2 == 0)
            return c;
        return mul(c, a);
    }
}
Recurrence relation:
Matrix transformation:
Matrix exponentiation speeds the computation up from $O(n)$ to $O(\log n)$.
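A minimal Python sketch of the same speed-up, assuming a Fibonacci-style recurrence $f(n) = f(n-1) + f(n-2)$ purely as an example; the modulus mirrors the `mod` field in `BasicMatrixMath` above, and `mat_pow` plays the role of its `power` method:

# Minimal sketch: fast matrix exponentiation for a linear recurrence.
# Assumes the Fibonacci-style recurrence f(n) = f(n-1) + f(n-2) as an example;
# substitute your own transition matrix for a different recurrence.

MOD = 1_000_000_007  # same modulus as the `mod` field in BasicMatrixMath

def mat_mul(a, b):
    """Multiply two square matrices modulo MOD."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) % MOD for j in range(n)]
            for i in range(n)]

def mat_pow(m, p):
    """Compute m**p with O(log p) multiplications (p >= 1)."""
    if p == 1:
        return m
    half = mat_pow(m, p // 2)
    sq = mat_mul(half, half)
    return sq if p % 2 == 0 else mat_mul(sq, m)

def fib(n):
    """f(0)=0, f(1)=1, f(n)=f(n-1)+f(n-2) via matrix power."""
    if n <= 1:
        return n
    # [[f(n)], [f(n-1)]] = T^(n-1) * [[f(1)], [f(0)]] with T = [[1,1],[1,0]]
    t = mat_pow([[1, 1], [1, 0]], n - 1)
    return t[0][0] % MOD

if __name__ == "__main__":
    print(fib(10))  # 55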
Keyphrases:
Seq2Seq and Attention Mechanisms, Neural Machine Translation, Speech Processing
The encoder network’s job is to read the input sequence to our Seq2Seq model and generate a fixed-dimensional context vector C for the sequence.
To make the encoder bidirectional, we simply add another cell but feed the inputs to it in the opposite direction.
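A minimal numpy sketch of this idea: one toy RNN cell reads the embeddings left to right, a second reads them right to left, and the hidden states at each position are concatenated. The tanh cell and the tiny sizes here are placeholders, not the actual model from the notes.

import numpy as np

rng = np.random.default_rng(0)
hidden, emb, T = 4, 3, 5                     # hidden size, embedding size, sequence length
W_f, U_f = rng.normal(size=(hidden, hidden)), rng.normal(size=(hidden, emb))
W_b, U_b = rng.normal(size=(hidden, hidden)), rng.normal(size=(hidden, emb))
inputs = [rng.normal(size=emb) for _ in range(T)]  # pretend word embeddings

def run(cell_W, cell_U, xs):
    """Run a simple tanh RNN cell over xs and return the hidden state at each step."""
    h = np.zeros(hidden)
    states = []
    for x in xs:
        h = np.tanh(cell_W @ h + cell_U @ x)
        states.append(h)
    return states

forward = run(W_f, U_f, inputs)               # reads x_1 ... x_T
backward = run(W_b, U_b, inputs[::-1])[::-1]  # reads x_T ... x_1, then realigned
encoder_states = [np.concatenate([f, b]) for f, b in zip(forward, backward)]
print(encoder_states[0].shape)                # (8,) = forward + backward hidden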
Attention mechanisms make use of this observation by providing the decoder network with a look at the entire input sequence at every decoding step; the decoder can then decide what input words are important at any point in time.
The scores $\alpha_{i,j}$ at decoding step $i$ signify the words in the source sentence that align with word $i$ in the target. We can use the attention scores to build an alignment table, i.e. a table mapping words in the source sentence to corresponding words in the target sentence.
This model can efficiently translate long sentences.
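A hedged numpy sketch of one decoding step with dot-product attention; the score function and the dimensions are illustrative choices, not necessarily the ones used in the notes.

import numpy as np

# Attention at a single decoding step i: score every encoder state against the
# current decoder state, softmax the scores into weights alpha_{i,j}, and take
# the weighted sum as the context vector.

rng = np.random.default_rng(1)
encoder_states = rng.normal(size=(6, 8))   # one 8-dim state per source word
decoder_state = rng.normal(size=8)         # hidden state at decoding step i

scores = encoder_states @ decoder_state                # one score per source word
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                                   # attention weights alpha_{i,j}
context = alpha @ encoder_states                       # context vector for step i

print(alpha.round(3))   # which source words the decoder attends to
print(context.shape)    # (8,)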
Another approach to machine translation comes from statistical machine translation. Consider a model that computes the probability $P(\bar{s}\mid s)$ of a translation $\bar{s}$ given the original sentence $s$. We want to find the translation that maximizes this probability, $\bar{s}^{*} = \operatorname{argmax}_{\bar{s}} P(\bar{s}\mid s)$.
As the search space can be huge, we need to shrink its size.
Beam search: the idea is to maintain K candidates at each time step.
At each time step, we compute $H_{t+1}$ by expanding each hypothesis in $H_t$ by one word and keeping only the best $K$ candidates.
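A minimal sketch of this procedure; the next-token distribution below is a made-up stand-in for the decoder's softmax output, not a real model.

import math

# Toy beam search: keep the K best partial hypotheses H_t at each step and
# expand each by every possible next token to form H_{t+1}.

def next_probs(prefix):
    # Hypothetical next-token distribution; a real decoder would condition on
    # the source sentence and the full prefix.
    return {"a": 0.6, "b": 0.3, "</s>": 0.1}

def beam_search(K=2, max_len=4):
    hypotheses = [([], 0.0)]                       # (tokens, log-probability)
    for _ in range(max_len):
        candidates = []
        for tokens, logp in hypotheses:
            if tokens and tokens[-1] == "</s>":    # finished hypotheses carry over unchanged
                candidates.append((tokens, logp))
                continue
            for tok, p in next_probs(tokens).items():
                candidates.append((tokens + [tok], logp + math.log(p)))
        # H_{t+1}: keep only the K most probable candidates
        hypotheses = sorted(candidates, key=lambda c: c[1], reverse=True)[:K]
    return hypotheses

for tokens, logp in beam_search():
    print(" ".join(tokens), round(logp, 3))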
Let $p_n$ denote the precision of matched $n$-grams, and let $w_n = 1/2^n$ be a geometric weighting for the precision of the $n$-th gram. The brevity penalty is defined as $\beta = e^{\min\left(0,\ 1 - \frac{len_{ref}}{len_{MT}}\right)}$, where $len_{ref}$ is the length of the reference translation and $len_{MT}$ is the length of the machine translation.
The BLEU score is then defined as $BLEU = \beta \prod_{n=1}^{k} p_n^{w_n}$.
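A small sketch of this computation for up to 2-grams, with the geometric weights $w_n = 1/2^n$ and the brevity penalty above; real BLEU implementations handle further details (multiple references, smoothing) that are omitted here.

import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=2):
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        matched = sum(min(c, ref[g]) for g, c in cand.items())  # clipped n-gram matches p_n
        precisions.append(matched / max(sum(cand.values()), 1))
    brevity = math.exp(min(0.0, 1.0 - len(reference) / len(candidate)))
    score = brevity
    for n, p in enumerate(precisions, start=1):
        score *= p ** (1.0 / 2 ** n)                            # weight w_n = 1/2^n
    return score

ref = "the cat sat on the mat".split()
mt = "the cat sat on mat".split()
print(round(bleu(mt, ref), 4))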
Softmax can be quite expensive to compute with a large vocabulary and its complexity also scales proportionally to the vocabulary size.
These approaches only save computation during the training step; at test time one still has to compute probabilities over the full vocabulary.
Another idea is to partition the training data into subsets, each containing $t$ unique target words. This concept is very similar to NCE; the main difference is that the negative samples are drawn from a biased distribution $Q$ for each subset $V'$, where $Q(y_t) = \frac{1}{|V'|}$ if $y_t \in V'$ and $0$ otherwise.
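A toy sketch of why restricting the softmax to a small candidate set $V'$ is cheaper: the normalization runs over $|V'|$ scores instead of $|V|$. The sizes and scores below are made up, and a real system would build $V'$ per training subset as described above.

import numpy as np

rng = np.random.default_rng(2)
V = 50_000                       # full vocabulary size
scores = rng.normal(size=V)      # pretend output-layer scores for one position

full_softmax = np.exp(scores - scores.max())
full_softmax /= full_softmax.sum()            # normalizes over all 50,000 entries

candidate_ids = rng.choice(V, size=500, replace=False)   # V', 500 candidate words
cand_scores = scores[candidate_ids]
cand_softmax = np.exp(cand_scores - cand_scores.max())
cand_softmax /= cand_softmax.sum()            # normalizes over 500 entries only

print(full_softmax.shape, cand_softmax.shape)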
The final prediction is either the word $y_t^w$ chosen by a softmax over the candidate list, as in previous methods, or $y_t^l$ copied from the source text.
We can also deal with rare or unknown words at the sub-word level. One approach is to represent rare and unknown words as a sequence of subword units.
One can choose to either build separate vocabularies for training and test sets or build one vocabulary jointly.
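A minimal sketch of one merge step of a byte-pair-encoding-style subword scheme, one common way to build such subword units; the toy vocabulary below is the usual textbook example, not data from the notes.

from collections import Counter

# One BPE-style merge: find the most frequent adjacent symbol pair in the
# vocabulary and fuse it into a new symbol. Repeating this a fixed number of
# times yields a subword vocabulary.

vocab = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
         ("n", "e", "w", "e", "s", "t"): 6, ("w", "i", "d", "e", "s", "t"): 3}

def most_frequent_pair(vocab):
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge(vocab, pair):
    merged = {}
    for word, freq in vocab.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])   # fuse the pair into one symbol
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

pair = most_frequent_pair(vocab)     # e.g. ('e', 's'), tied with ('s', 't') at frequency 9
print(pair)
print(merge(vocab, pair))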
Word-based Translation as a Backbone: The core of the hybrid NMT is a deep LSTM encoder-decoder that translates at the word level.
Source Character-based Representation: We learn a deep LSTM model over characters of rare words, and use the final hidden state of the LSTM as the representation for the rare word.
Target Character-level Generation: The solution is to have a separate deep LSTM that “translates” at the character level given the current word-level state. The system is trained so that whenever the word-level NMT produces an $\langle unk\rangle$, the character-level decoder is asked to recover the correct surface form of the unknown target word.
Keyphrases:
Language Models. RNN. Bi-directional RNN. Deep RNN. GRU. LSTM.
Language models compute the probability of occurrence of a number of words in a particular sequence.
In some cases, the window of past consecutive $n$ words may not be sufficient to capture the context. Bengio et al. introduced the first large-scale deep learning model for natural language processing, which captures this type of context by learning a distributed representation of words.
In all conventional language models, the memory requirements of the system grow exponentially with the window size $n$, making it nearly impossible to model large word windows without running out of memory.
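A toy count-based bigram model (a minimal sketch, not the notes' model) makes the counting concrete: the model stores one count per observed $n$-gram, and widening the window multiplies the number of possible entries.

from collections import Counter

# P(w_t | w_{t-1}) estimated as count(w_{t-1}, w_t) / count(w_{t-1}).
# Extending the window to n words means storing counts for every observed
# n-gram, which is what blows up memory.

corpus = "the cat sat on the mat the cat ate".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p(word, prev):
    """Maximum-likelihood bigram probability (no smoothing)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p("cat", "the"))   # 2/3: "the" appears 3 times and is followed by "cat" twice
print(p("mat", "the"))   # 1/3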
Recurrent Neural Networks (RNN) are capable of conditioning the model on all previous words in the corpus.
Below are the details associated with each parameter in the network:
The loss function used in RNNs is often the cross entropy error.
Perplexity is related to the cross-entropy error: it is basically 2 raised to the power of the cross-entropy loss $J$ (the average negative log probability), i.e. $\text{Perplexity} = 2^{J}$. Perplexity is a measure of confusion, where lower values imply more confidence in predicting the next word in the sequence.
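A hedged numpy sketch of one forward pass through a vanilla RNN language model with its cross-entropy loss and perplexity; the weight names and tiny sizes are placeholders rather than the notes' exact notation.

import numpy as np

rng = np.random.default_rng(3)
vocab, emb, hidden = 10, 4, 6
W_hh = rng.normal(scale=0.1, size=(hidden, hidden))   # hidden-to-hidden weights
W_hx = rng.normal(scale=0.1, size=(hidden, emb))      # input-to-hidden weights
W_s = rng.normal(scale=0.1, size=(vocab, hidden))     # hidden-to-vocabulary weights
E = rng.normal(scale=0.1, size=(vocab, emb))          # word embeddings

tokens = [1, 4, 7, 2]                              # a toy sentence as word ids
h = np.zeros(hidden)
loss = 0.0
for t in range(len(tokens) - 1):
    x = E[tokens[t]]
    h = np.tanh(W_hh @ h + W_hx @ x)               # hidden state carries all previous words
    logits = W_s @ h
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                           # softmax over the vocabulary
    loss -= np.log(probs[tokens[t + 1]])           # cross-entropy for the next word

loss /= len(tokens) - 1
print("cross-entropy:", round(float(loss), 4))
print("perplexity:", round(float(np.exp(loss)), 4))  # e^J here; with base-2 logs it would be 2^J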
The amount of memory required to run a layer of RNN is proportional to the number of words in the corpus.
To make predictions based on future words as well, a bi-directional RNN adds a second RNN that reads through the corpus backwards.
Although RNNs can theoretically capture long-term dependencies, they are very hard to actually train to do this. Gated recurrent units are designed to have more persistent memory, thereby making it easier for RNNs to capture long-term dependencies.
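A compact numpy sketch of a single GRU step, showing the reset and update gates that give the unit its more persistent memory; the weight names and sizes are placeholders.

import numpy as np

# One GRU step: the reset gate r controls how much past state feeds the
# candidate, and the update gate z interpolates between keeping the old hidden
# state and adopting the new candidate.

rng = np.random.default_rng(4)
hidden, emb = 5, 3
W_z, U_z = rng.normal(size=(hidden, emb)), rng.normal(size=(hidden, hidden))
W_r, U_r = rng.normal(size=(hidden, emb)), rng.normal(size=(hidden, hidden))
W_h, U_h = rng.normal(size=(hidden, emb)), rng.normal(size=(hidden, hidden))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev):
    z = sigmoid(W_z @ x + U_z @ h_prev)                 # update gate
    r = sigmoid(W_r @ x + U_r @ h_prev)                 # reset gate
    h_tilde = np.tanh(W_h @ x + U_h @ (r * h_prev))     # candidate state
    return z * h_prev + (1 - z) * h_tilde               # keep old state where z is near 1

h = np.zeros(hidden)
for x in rng.normal(size=(4, emb)):                     # a toy input sequence
    h = gru_step(x, h)
print(h.round(3))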