Keyphrases:

Language Models. RNN. Bi-directional RNN. Deep RNN. GRU. LSTM.


Language Models

Language models compute the probability of occurrence of a number of words in a particular sequence.
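
For a sequence of $m$ words $w_1,\dots,w_m$, this probability is obtained with the chain rule and, in a conventional $n$-gram model, approximated by conditioning each word only on a window of the previous $n$ words:

$P(w_1,\dots,w_m)=\prod_{i=1}^{m}P(w_i\mid w_1,\dots,w_{i-1})\approx\prod_{i=1}^{m}P(w_i\mid w_{i-n},\dots,w_{i-1})$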

In some cases, a window of the past $n$ consecutive words may not be sufficient to capture the context. Bengio et al. introduced the first large-scale deep learning model for natural language processing, which captures this kind of context by learning a distributed representation of words.

In all conventional language models, the memory requirements of the system grow exponentially with the window size $n$, making it nearly impossible to model large word windows without running out of memory.

Recurrent Neural Networks (RNN)

Unlike conventional fixed-window language models, Recurrent Neural Networks (RNNs) are capable of conditioning the model on all previous words in the corpus.

Below are the details associated with each parameter in the network (a short forward-pass sketch follows the list):

  • $x_1,\dots,x_t,\dots,x_T$: the word vectors corresponding to a corpus with $T$ words
  • $h_t=\sigma(W^{(hh)}h_{t-1}+W^{(hx)}x_t)$: the relationship used to compute the hidden layer output features at each time-step $t$
  • $\hat{y}_t=\mathrm{softmax}(W^{(S)}h_t)$: the output probability distribution over the vocabulary at each time-step $t$
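
Putting these relationships together, a minimal numpy sketch of the forward pass looks roughly like the following (the parameter names W_hh, W_hx, W_s, the toy dimensions, and the random initialization are illustrative assumptions, not values from the notes):

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def rnn_forward(x_seq, W_hh, W_hx, W_s, h0):
    """Run h_t = sigma(W_hh h_{t-1} + W_hx x_t) over the sequence and
    emit y_hat_t = softmax(W_s h_t) at every time-step."""
    h, outputs = h0, []
    for x_t in x_seq:                    # x_seq: list of word vectors
        h = 1.0 / (1.0 + np.exp(-(W_hh @ h + W_hx @ x_t)))   # sigma = sigmoid
        outputs.append(softmax(W_s @ h))
    return outputs

# toy sizes: hidden = 4, word vector = 3, vocabulary = 5
rng = np.random.default_rng(0)
W_hh, W_hx, W_s = rng.normal(size=(4, 4)), rng.normal(size=(4, 3)), rng.normal(size=(5, 4))
y_hats = rnn_forward([rng.normal(size=3) for _ in range(6)], W_hh, W_hx, W_s, np.zeros(4))
```

Note that the same weight matrices are reused at every time-step, which is what allows the model to condition on an arbitrarily long history with a fixed number of parameters.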

The loss function used in RNNs is often the cross entropy error.
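
Written out for a one-hot target $y_t$ over a vocabulary $V$, the error at time-step $t$ and its average over a corpus of $T$ words are:

$J^{(t)}(\theta)=-\sum_{j=1}^{|V|}y_{t,j}\log\hat{y}_{t,j},\qquad J=\frac{1}{T}\sum_{t=1}^{T}J^{(t)}(\theta)$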

Perplexity is related to the cross entropy error: it is $2$ raised to the power of the cross entropy loss (itself a negative log probability), i.e. $\text{Perplexity}=2^{J}$. Perplexity is a measure of confusion, where lower values imply more confidence in predicting the next word in the sequence.

The amount of memory required to run a layer of RNN is proportional to the number of words in the corpus.

Deep Bidirectional RNNs

Predictions can also be made based on future words by having the RNN read through the corpus backwards: a bidirectional RNN maintains a second hidden layer that propagates from right to left, and the two hidden states at each time-step are combined to produce the output. A deep bidirectional RNN stacks several such layers, feeding the hidden states of one layer as inputs to the layer above.
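
A rough sketch of the bidirectional idea, assuming hypothetical parameter names and a simple concatenation of the forward and backward hidden states before the output projection:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_states(x_seq, W_hh, W_hx, h0):
    """Plain left-to-right RNN: return the hidden states h_1..h_T."""
    h, states = h0, []
    for x_t in x_seq:
        h = sigmoid(W_hh @ h + W_hx @ x_t)
        states.append(h)
    return states

def birnn_forward(x_seq, fwd_params, bwd_params, W_s, h0):
    """Run one RNN left-to-right and one right-to-left, align and
    concatenate their hidden states, then project to output scores."""
    fwd = hidden_states(x_seq, *fwd_params, h0)
    bwd = hidden_states(x_seq[::-1], *bwd_params, h0)[::-1]   # read backwards, re-align
    return [W_s @ np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```

In a deep bidirectional RNN, the concatenated hidden states would be fed as the input sequence to the next layer rather than directly to the output projection.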

Gated Recurrent Units

Although RNNs can theoretically capture long-term dependencies, they are very hard to actually train to do this. Gated recurrent units are designed to have more persistent memory, thereby making it easier for RNNs to capture long-term dependencies.
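
A minimal sketch of the standard GRU update, with the usual reset gate $r_t$ and update gate $z_t$ (the parameter names and the single-step function are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, W, U):
    """One GRU time-step: the gates decide how much of the previous
    hidden state to reset and how much of it to carry forward."""
    z = sigmoid(Wz @ x_t + Uz @ h_prev)            # update gate
    r = sigmoid(Wr @ x_t + Ur @ h_prev)            # reset gate
    h_tilde = np.tanh(W @ x_t + U @ (r * h_prev))  # candidate hidden state
    return z * h_prev + (1.0 - z) * h_tilde        # new hidden state
```

When $z_t$ is close to 1 the unit simply copies its previous state, which is what gives the GRU its more persistent memory.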

Long Short-Term Memory (LSTM)
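
For reference, a minimal sketch of the standard LSTM cell, which keeps a separate memory cell $c_t$ controlled by input, forget, and output gates (again, the parameter names are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, Wi, Ui, Wf, Uf, Wo, Uo, Wc, Uc):
    """One LSTM time-step: the forget gate controls how much of the old
    memory cell is kept, the input gate how much new content is written,
    and the output gate how much of the cell is exposed as h_t."""
    i = sigmoid(Wi @ x_t + Ui @ h_prev)        # input gate
    f = sigmoid(Wf @ x_t + Uf @ h_prev)        # forget gate
    o = sigmoid(Wo @ x_t + Uo @ h_prev)        # output gate
    c_tilde = np.tanh(Wc @ x_t + Uc @ h_prev)  # candidate memory
    c = f * c_prev + i * c_tilde               # new memory cell
    return o * np.tanh(c), c                   # new hidden state, new cell
```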

