ShengYg's Blog

June 18th, 2017

cs224n Lecture Notes 4

Keyphrases:

Dependency Parsing.

Dependency Grammar and Dependency Structure

Two main types of structures: constituency structures and dependency structures.

Dependency Parsing

Dependency parsing is the task of analyzing the syntactic dependency structure of a given input sentence S. The output of a dependency parser is a dependency tree where the words of the input sentence are connected by typed dependency relations

Two subproblems:

Learning: Given a training set D of sentences annotated with dependency graphs, induce a parsing model M that can be used to parse new sentences.
Parsing: Given a parsing model M and a sentence S, derive the optimal dependency graph D for S according to M.

Transition-Based Dependency Parsing

It relies on a state machine.Most transition-based systems do not make use of a formal grammar.

learning: induce a model which can predict the next transition in the state machine based on the transition history
parsing: construct the optimal sequence of transitions for the input sentence, given the previously induced model.

Greedy Deterministic Transition-Based Parsing

States:

For any sentence $S=w_0w_1…w_n$, a state can be described with a triple $c=(\sigma, \beta, A)$:

a stack $\sigma$ of words $w_i$ from S
a buffer $\beta$ of words $w_i$ from S
a set of dependency arcs A of the form $(w_i, r, w_j)$, where $w_i$, $w_j$ are from S, and r describes a dependency relation.

initial state:$({[w_0]}{\sigma}, {[w_1,…,w_n]}{\beta}, \phi)$

terminal state:$(\sigma, []_{\beta}, A)$

Transitions:

shift: $\sigma, w_i\mid\beta,A\rightarrow \sigma\mid w_i, \beta,A$
left-arc: $\sigma\mid w_i\mid w_j, \beta,A\rightarrow \sigma\mid w_j, \beta,A\bigcup{r(w_j, w_i)}$
right-arc: $\sigma\mid w_i\mid w_j, \beta,A\rightarrow \sigma\mid w_i, \beta,A\bigcup{r(w_i, w_j)}$

Neural Dependency Parsing

June 14th, 2017

debug network

数据集

保证输入数据有效
随机数输入，应该有不同的结果，确定网络没有在某一层将数据转变为垃圾
保证数据载入过程有效，可以打印第一层结果
数据随机性不能太大（例如股票），这是数据本身的属性
随机检测标注是否准确
打乱数据顺序
类别不平衡？
如果从头训练，需要足够多的样本

数据归一化/扩增

使用预处理模型时，保证归一化一致
对训练集预处理，然后将同样的机制应用到测试集上
避免过量的数据扩增，导致训练欠拟合

实现细节

初始化后，观察网络是否输出有效的结果
调整loss权重
对自定义层进行测试
保证梯度传播的有效性，避免某些层被“冻结”

训练问题

检查权重初始化
改变超参数
降低惩罚因数
不同的优化方法

June 11th, 2017

cs224n Lecture Notes 2

This set of notes first introduces the GloVe model for training word vectors. Then it extends our discussion of word vectors (interchangeably called word embeddings) by seeing how they can be evaluated intrinsically and extrinsically. As we proceed, we discuss the example of word analogies as an intrinsic evaluation technique and how it can be used to tune word embedding techniques. We then discuss training model weights/parameters and word vectors for extrinsic tasks. Lastly we motivate artificial neural networks as a class of models for natural language processing tasks.

June 8th, 2017

cs224n Lecture Notes 1

This set of notes begins by introducing the concept of Natural Language Processing (NLP) and the problems NLP faces today. We then move forward to discuss the concept of representing words as numeric vectors. Lastly, we discuss popular approaches to designing word vectors.

June 3rd, 2017

梯度下降算法

本文摘自梯度下降优化算法综述

梯度下降法的变形形式
挑战
梯度下降优化算法

1. 梯度下降法的变形形式

1.1 批梯度下降法

Vanilla梯度下降法，又称为批梯度下降法（batch gradient descent），在整个训练数据集上计算损失函数关于参数θ的梯度:

$\theta = \theta - \eta\cdot\bigtriangledown_\theta J(\theta)$

因为在执行每次更新时，我们需要在整个数据集上计算所有的梯度，所以批梯度下降法的速度会很慢，同时，批梯度下降法无法处理超出内存容量限制的数据集。批梯度下降法同样也不能在线更新模型，即在运行的过程中，不能增加新的样本。

我们利用梯度的方向和学习率更新参数，学习率决定我们将以多大的步长更新参数。对于凸误差函数，批梯度下降法能够保证收敛到全局最小值，对于非凸函数，则收敛到一个局部最小值。

1.2 随机梯度下降法

随机梯度下降法（stochastic gradient descent, SGD）根据每一条训练样本$x^{(i)}$和标签$y^{(i)}$更新参数：

$\theta = \theta - \eta\cdot\bigtriangledown_\theta J(\theta;x^{(i)};y^{(i)})$

对于大数据集，因为批梯度下降法在每一个参数更新之前，会对相似的样本计算梯度，所以在计算过程中会有冗余。而SGD在每一次更新中只执行一次，从而消除了冗余。因而，通常SGD的运行速度更快，同时，可以用于在线学习。SGD以高方差频繁地更新，导致目标函数出现如图1所示的剧烈波动。

与批梯度下降法的收敛会使得损失函数陷入局部最小相比，由于SGD的波动性，一方面，波动性使得SGD可以跳到新的和潜在更好的局部最优。另一方面，这使得最终收敛到特定最小值的过程变得复杂，因为SGD会一直持续波动。然而，已经证明当我们缓慢减小学习率，SGD与批梯度下降法具有相同的收敛行为，对于非凸优化和凸优化，可以分别收敛到局部最小值和全局最小值。

1.3 小批量梯度下降法

小批量梯度下降法最终结合了上述两种方法的优点，在每次更新时使用n个小批量训练样本：

$\theta = \theta - \eta\cdot\bigtriangledown_\theta J(\theta;x^{(i:i+n)};y^{(i:i+n)})$

2. 挑战

虽然Vanilla小批量梯度下降法并不能保证较好的收敛性，并且，给我们留下了如下的一些挑战：

选择一个合适的学习率可能是困难的
学习率调整算法试图在训练的过程中通过例如退火的方法调整学习率，即根据预定义的策略或者当相邻两代之间的下降值小于某个阈值时减小学习率。然而，策略和阈值需要预先设定好，因此无法适应数据集的特点
高度非凸误差函数普遍出现在神经网络中，在优化这类函数时，另一个关键的挑战是使函数避免陷入无数次优的局部最小值

3. 梯度下降优化算法

3.1 动量法 Momentum

动量法将历史步长的更新向量的一个分量$\gamma$增加到当前的更新向量中:

$\upsilon_t = \gamma\upsilon_{t-1}+\eta\bigtriangledown_\theta J(\theta)$ $\theta = \theta - \upsilon_t$

从本质上说，动量法，就像我们从山上推下一个球，球在滚下来的过程中累积动量，变得越来越快（直到达到终极速度，如果有空气阻力的存在，则$\gamma$<1）。同样的事情也发生在参数的更新过程中：对于在梯度点处具有相同的方向的维度，其动量项增大，对于在梯度点处改变方向的维度，其动量项减小。因此，我们可以得到更快的收敛速度，同时可以减少摇摆。

3.2 Nesterov加速梯度下降法

球从山上滚下的时候，盲目地沿着斜率方向，往往并不能令人满意。我们希望有一个智能的球，这个球能够知道它将要去哪，以至于在重新遇到斜率上升时能够知道减速。

Nesterov加速梯度下降法（Nesterov accelerated gradient，NAG）是一种能够给动量项这样的预知能力的方法。我们知道，我们利用动量项$\gamma\upsilon_{t-1}$来更新参数$\theta$。通过计算$\theta-\gamma\upsilon_{t-1}$能够告诉我们参数未来位置的一个近似值（梯度并不是完全更新），这也就是告诉我们参数大致将变为多少。通过计算关于参数未来的近似位置的梯度，而不是关于当前的参数θ的梯度，我们可以高效的求解：

$\upsilon_t = \gamma\upsilon_{t-1}+\eta\bigtriangledown_\theta J(\theta-\gamma\upsilon_{t-1})$ $\theta = \theta - \upsilon_t$

动量法首先计算当前的梯度值（小的蓝色向量），然后在更新的累积梯度（大的蓝色向量）方向上前进一大步，Nesterov加速梯度下降法NAG首先在先前累积梯度（棕色的向量）方向上前进一大步，计算梯度值，然后做一个修正（绿色的向量）。这个具有预见性的更新防止我们前进得太快，同时增强了算法的响应能力，这一点在很多的任务中对于RNN的性能提升有着重要的意义。

3.3 Adagrad

Adagrad是这样的一种基于梯度的优化算法：让学习率适应参数，对于出现次数较少的特征，我们对其采用更大的学习率，对于出现次数较多的特征，我们对其采用较小的学习率。因此，Adagrad非常适合处理稀疏数据。

前面，我们每次更新所有的参数$\theta$时，每一个参数$\theta_i$都使用的是相同的学习率$\eta$。Adagrad在t时刻对每一个参数$\theta_i$使用了不同的学习率。我们首先介绍Adagrad对每一个参数的更新，然后我们对其向量化。为了简洁，令$g_{t,i}$为在$t$时刻目标函数关于参数$\theta_i$的梯度：

$g_{t,i} = \bigtriangledown_\theta J(\theta_i)$

在t时刻，对每个参数$\theta_i$的更新过程变为：

$\theta_{t+1,i} = \theta_{t,i}-\eta\cdot g_{t,i}$

对于上述的更新规则，在t时刻，基于对$\theta_i$计算过的历史梯度，Adagrad修正了对每一个参数$\theta_i$的学习率：

$\theta_{t+1,i} = \theta_{t,i}-\frac{\eta}{\sqrt{G_{t,ii}+\epsilon}}\cdot g_{t,i}$

其中，$G_t\in R^{d\times d}$是一个对角矩阵，对角线上的元素$i,i$是直到t时刻为止，所有关于$\theta_i$的梯度的平方和，$\epsilon$是平滑项，用于防止除数为0（通常大约设置为1e−8）。比较有意思的是，如果没有平方根的操作，算法的效果会变得很差。

由于$G_t$的对角线上包含了关于所有参数$theta$的历史梯度的平方和，现在，我们可以通过$G_t$和$g_t$之间的元素向量乘法$\odot$向量化上述的操作：

$\theta_{t+1} = \theta_{t}-\frac{\eta}{\sqrt{G_{t}+\epsilon}}\cdot g_{t}$

Adagrad算法的一个主要优点是无需手动调整学习率。在大多数的应用场景中，通常采用常数0.01。

Adagrad的一个主要缺点是它在分母中累加梯度的平方：由于每增加一个正项，在整个训练过程中，累加的和会持续增长。这会导致学习率变小以至于最终变得无限小，在学习率无限小时，Adagrad算法将无法取得额外的信息。接下来的算法旨在解决这个不足。

3.4 Adadelta

Adadelta是Adagrad的一种扩展算法，以处理Adagrad学习速率单调递减的问题。不是计算所有的梯度平方，Adadelta将累计历史梯度的大小限制为一个固定值$\omega$。

在Adadelta中，无需存储先前的$\omega$个平方梯度，而是将梯度的平方递归地表示成所有历史梯度平方的均值。在t时刻的均值$E[g^2]_t$只取决于先前的均值和当前的梯度（分量$\gamma$类似于动量项）:

$E[g^2]_t=\gamma\cdot E[g^2]_{t-1} + (1-\gamma)g^2_t$

我们先前得到的Adagrad参数更新向量变为： $\theta_{t} = -\frac{\eta}{\sqrt{G_{t}+\epsilon}}\odot g_{t}$

现在，我们简单将对角矩阵$G_t$替换成历史梯度的均值$E[g^2]_t$：

$\theta_{t} = -\frac{\eta}{\sqrt{E[g^2]_t+\epsilon}}\odot g_{t}$

由于分母仅仅是梯度的均方根（root mean squared，RMS）误差，我们可以简写为：

$\theta_{t} = -\frac{\eta}{RMS[g]_t}\odot g_{t}$

上述更新公式中的每个部分（与SGD，动量法或者Adagrad）并不一致，即更新规则中必须与参数具有相同的假设单位。为了实现这个要求，定义另一个指数衰减均值，这次不是梯度平方，而是参数的平方的更新：

$E[\Delta\theta^2]_t=\gamma\cdot E[\Delta\theta^2]_{t-1} + (1-\gamma)\Delta\theta^2_t$

因此，参数更新的均方根误差为：

$RMS[\Delta\theta]_t=\sqrt{E[\Delta\theta^2]_t+\epsilon}$

由于$RMS[\Delta\theta]{t}$是未知的，我们利用参数的均方根误差来近似更新。利用$RMS[\Delta\theta]{t-1}$替换先前的更新规则中的学习率$\eta$，最终得到Adadelta的更新规则：

$\Delta\theta_t = \frac{RMS[\Delta\theta]_{t-1}}{RMS[g]_t}g_t$ $\theta_{t+1} = \theta_t + \Delta\theta_t$

使用Adadelta算法，我们甚至都无需设置默认的学习率，因为更新规则中已经移除了学习率。

3.5 Adam

自适应矩估计（Adaptive Moment Estimation，Adam）是另一种自适应学习率的算法，Adam对每一个参数都计算自适应的学习率。除了像Adadelta和RMSprop一样存储一个指数衰减的历史平方梯度的平均$\upsilon_t$，Adam同时还保存一个历史梯度的指数衰减均值$m_t$，类似于动量：

$m_t = \beta_1m_{t-1}+(1-\beta_1)g_t$ $\upsilon_t = \beta_2\upsilon_{t-1}+(1-\beta_2)g^2_t$

$m_t$和$\upsilon_t$分别是对梯度的一阶矩（均值）和二阶矩（非确定的方差）的估计，正如该算法的名称。当$m_t$和$\upsilon_t$初始化为0向量时，Adam的作者发现它们都偏向于0，尤其是在初始化的步骤和当衰减率很小的时候（例如$\beta_1$和$\beta_2$趋向于1）。

通过计算偏差校正的一阶矩和二阶矩估计来抵消偏差：

$\hat{m_t} = \frac{m_t}{1-\beta^t_1}$ $\hat{\upsilon_t} = \frac{\upsilon_t}{1-\beta^t_2}$

正如我们在Adadelta和RMSprop中看到的那样，他们利用上述的公式更新参数，由此生成了Adam的更新规则：

$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{\upsilon_t}+\epsilon}}\hat{m_t}$

3.6 算法可视化

Adagrad，Adadelta和RMSprop能够立即转移到正确的移动方向上并以类似的速度收敛，而动量法和NAG会导致偏离。然而，NAG能够在偏离之后快速修正其路线，因为NAG通过对最优点的预见增强其响应能力。

SGD，动量法和NAG在鞍点处很难打破对称性，尽管后面两个算法最终设法逃离了鞍点。而Adagrad，RMSprop和Adadelta能够快速想着梯度为负的方向移动，其中Adadelta走在最前面。

3.7 选择使用哪种优化算法？

总的来说，RMSprop是Adagrad的扩展形式，用于处理在Adagrad中急速递减的学习率。RMSprop与Adadelta相同，所不同的是Adadelta在更新规则中使用参数的均方根进行更新。最后，Adam是将偏差校正和动量加入到RMSprop中。在这样的情况下，RMSprop、Adadelta和Adam是很相似的算法并且在相似的环境中性能都不错。综合看来，Adam可能是最佳的选择。

May 30th, 2017

Batchnorm

Google在ICML文中描述的非常清晰，即在每次SGD时，通过mini-batch来对相应的activation做规范化操作，使得结果（输出信号各个维度）的均值为0，方差为1。而最后的“scale and shift”操作则是为了让因训练所需而“刻意”加入的BN能够有可能还原最初的输入，从而保证整个network的capacity。

$\mu_\beta = \frac{1}{m}\sum_{i=1}^{m}x_i$ $\sigma_\beta^2 = \frac{1}{m}\sum_{i=1}^{m}(x_i-\mu_\beta)^2$ $\hat{x_i} = \frac{x_i-\mu_\beta}{\sqrt{\sigma_\beta^2+\epsilon}}$ $y_i = \gamma\hat{x_i}+\beta$

Batch Normalization作用:

允许网络使用较高的learning rate
移除或使用较低的dropout
降低L2权重衰减系数
取消Local Response Normalization层
减少图像扭曲的使用

May 25th, 2017

something useful

1. init

ArrayList<Integer> list = new ArrayList<>(Arrays.asList(0,1,2));

// HashMap key1
HashMap<String, Integer> map = new HashMap<>();
map.put(list.toString(), 0);

// HashMap key2
final class Bigram {
    private final String word1, word2;
    public Bigram(String word1, String word2) {
        this.word1 = word1;
        this.word2 = word2;
    }

    public String getWord1() {
        return word1;
    }
    public String getWord2() {
        return word2;
    }

    @Override
    public int hashCode() {
        return word1.hashCode() ^ word2.hashCode();
    }

    @Override
    public boolean equals(Object obj) {
        return (obj instanceof Bigram) && ((Bigram) obj).word1.equals(word1)
                                       && ((Bigram) obj).word2.equals(word2);
    }
}

2. transformation

int y = Integer.parseInt("40");
String y = String.valueOf(40);

char[] ret = (String).toCharArray();
String ret = String.valueOf(char[]);

String[] ret = (String).split(...);
String ret = (String[]).join(...);

import java.util.stream.*;
// list -> Integer[]
Integer[] arr_integer = list.toArray(new Integer[list.size()]);
// list -> int[]
int[] arr_int = list.stream().mapToInt(i->i).toArray();
// int[] -> Integer[]
Integer[] arr_integer = Arrays.stream(arr_int).boxed().toArray(Integer[]::new);
// Integer[] -> list
ArrayList<Integer> list = new ArrayList<Integer>(Arrays.asList(arr_integer));
// int[] -> list
ArrayList<Integer> list = Arrays.stream(arr_int).boxed().collect(Collectors.toCollection(ArrayList::new));

// list -> list
list.stream()...collect(Collectors.toCollection(ArrayList::new));
// list -> int[]
list.stream()...toArray();
// list -> Integer[]
list.stream()...toArray(Integer[]::new);
// int[] -> int[]
Arrays.stream(arr_int)...toArray();
// int[] -> list
Arrays.stream(arr_int).boxed()...collect(Collectors.toCollection(ArrayList::new));

3. BigInteger

Integer.MAX_VALUE = 2147483647;

BigInteger must support values in the range -2^Integer.MAX_VALUE (exclusive) to +2^Integer.MAX_VALUE (exclusive).

import java.math.BigInteger;
BigInteger A = new BigInteger(String.valueOf(a));

4. tuple pairs

import org.apache.commons.lang3.tuple.*;

ImmutableTriple t = new ImmutableTriple(l, m, r);
ImmutablePair p = new ImmutablePair(l, r);

class Pair<L, R>{
    L left;
    R right;
    Pair(L left, R right){
        this.left = left;
        this.right = right;
    }
}

5. lambda

Collections.sort(list, (a, b) -> {
    if(a.left == b.left)
        return a.right - b.right;
    else
        return a.left - b.left;
});

// 2D array sort
Arrays.sort(cir, new Comparator<float[]>() {
    public int compare(float[] a, float[] b) {
        return Float.compare(a[2], b[2]);
    }
});
Arrays.sort(array, Comparator.comparingDouble(a -> a[1]));

6. map, reduce, filter

List<Integer> map_num = list.stream().map(num -> num * 2).collect(Collectors.toList());
     
Optional<Integer> sum = list.stream().reduce((a, b) -> a + b);
sum.orElseGet(() -> 0);
     
List<Integer> filter_num = list.stream().filter(num -> num % 2 == 0).collect(Collectors.toList());

May 19th, 2017

最近公共祖先 lca

最一般的算法是DFS(DFS本是深度优先搜索，在这里姑且把深度优先遍历也叫做DFS，其实是一种不严谨的说法)。算法很简单，从根节点DFS一遍，按DFS层数k给每个节点标上深度deep[i]=k。然后从U点DFS到V点，找到后回溯，在回溯的路径上找到一个deep[i]最小的节点即为LCA。

import java.io.*;
import java.util.*;

public class Main {
    static int N = (int)1e5 + 5;
    static int[] depth;
    static int[][] fa; 	// father array
    static ArrayList<ArrayList<Integer>> es; // 树的链表表示

    static void dfs(int u){	// 由es更新fa
        for(int i = 0; fa[u][i] != 0; i++)
            fa[u][i + 1] = fa[fa[u][i]][i];
        for (int v : es.get(u)) {
            if (fa[u][0] == v)
                continue;
            depth[v] = depth[u] + 1;
            fa[v][0] = u;
            dfs(v);
        }
    }

    static int lca(int _x, int _y){ // 求lca
        if (depth[_x] < depth[_y])
            return lca(_y, _x);
        int x = _x, y = _y;
        int d = depth[x] - depth[y];
        for (int i = 16; i > -1; --i)
            if ((d & (1 << i)) != 0)
                x = fa[x][i];
        if (x == y)
            return x;
        for (int i = 16; i > -1; --i)
            if (fa[x][i] != fa[y][i]) {
                x = fa[x][i];
                y = fa[y][i];
            }
        return fa[x][0];
    }

    public static void main(String[] args) {
        Scanner in = new Scanner(System.in);
        int T = in.nextInt();
        ArrayList<String> res = new ArrayList<>();
        while (T-- > 0){
            depth = new int[N];
            fa = new int[N][20];
            es = new ArrayList<>(N);

            int n = in.nextInt();
            int m = in.nextInt();

            for(int i = 0; i <= n; i++){
                es.add(new ArrayList());
            }

            for(int i = 1; i < n; i++){
                int u = in.nextInt();
                int v = in.nextInt();
                es.get(u).add(v);
                es.get(v).add(u);
            }
            depth[1] = 0;
            dfs(1);

            while (m-- != 0){
                int u = in.nextInt();
                int v = in.nextInt();
                
                int t1 = lca(u, v);
            }
        }
    }
}

当查询量很大时，可采用Tarjan(离线)算法。Tarjan算法的基本思路：

任选一个点为根节点，从根节点开始。
遍历该点u所有子节点v，并标记这些子节点v已被访问过。
若是v还有子节点，返回2，否则下一步。
合并v到u上。
寻找与当前点u有询问关系的点v。
若是v已经被访问过了，则可以确认u和v的最近公共祖先为v被合并到的父亲节点a。

May 10th, 2017

Java String

java String

May 7th, 2017

Java HashSet工作原理及实现

java HashSet, LinkedHashSet, TreeSet