Fundamentals of Natural Language Processing | 5 Semantic Analysis
7 Distributional Semantics
分布语义学 (distributional semantics): family of techniques for representing word meaning based on contexts of use.
- Synonymy: the same meaning in a given context (automobile & car)
- Similarity: close in meaning but not identical (car & truck)
- Relatedness: the meanings are associated (coffee & sugar)
7.1 Vector space models
Model a word by embedding it in a vector space: the representation of a word is a vector of real numbers. We expect the dimensionality to be much lower than the vocabulary size \(V\), i.e., far smaller than a one-hot vector in \(\R^{V}\). Such models are often called Embedding Models or Vector Space Models (VSM).
The goal is to produce dense vector representations based on the contexts/uses of words.
Main approaches: count-based, prediction-based, and task-based.
Use cosine similarity (the cosine of the angle between two vectors) to measure the similarity between two words: \[ \cos(u,v) = \frac{\lr{u,v}}{\|u\|\cdot\|v\|}. \]
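A minimal sketch in NumPy (the toy vectors below are purely illustrative):

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two word vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy example: two arbitrary 4-dimensional "word vectors".
u = np.array([1.0, 2.0, 0.0, 1.0])
v = np.array([2.0, 3.0, 1.0, 0.5])
print(cosine_similarity(u, v))  # value in [-1, 1]; closer to 1 means more similar
```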
7.2 Count-based models
Count-based methods:
- Define a basis vocabulary \(C\) of context words.
- Define a word window size \(w\).
- Count the basis vocabulary words occurring \(w\) words to the left or right of each instance of a target word in the corpus.
- Form a vector representation of the target word based on these counts.
- Raw counts are not good enough on their own: high-frequency function words such as the and of dominate the counts while carrying little information. (A sketch of the counting step appears below.)
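A minimal sketch of the raw counting step described above (the toy corpus and context vocabulary are made up for illustration):

```python
from collections import defaultdict

def cooccurrence_counts(corpus, context_vocab, window=2):
    """Count context words appearing within `window` words of each target word.

    corpus: list of tokens; context_vocab: set of basis/context words.
    Returns {target: {context_word: count}}, i.e. the raw count vectors.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for i, target in enumerate(corpus):
        lo, hi = max(0, i - window), min(len(corpus), i + window + 1)
        for j in range(lo, hi):
            if j != i and corpus[j] in context_vocab:
                counts[target][corpus[j]] += 1
    return counts

tokens = "the cat sat on the mat and the dog sat on the rug".split()
print(dict(cooccurrence_counts(tokens, {"the", "sat", "on"}, window=2)["cat"]))
```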
7.2.1 PPMI & TF-IDF
PMI (pointwise mutual information): Do events X and Y co-occur more than if they were independent? \[ \mathrm{PMI}(X,Y) := \log\frac{P(X,Y)}{P(X)P(Y)}. \]
- Positive PMI (PPMI) simply replaces negative PMI values with \(0\).
Consider the context representation of a word \(a\). Let event \(X\) be that \(a\) co-occurs with a context word \(w\) (where \(w\) occurs \(l_w\) times in total), and let event \(Y\) be that \(a\) occurs in the training data (which contains \(N\) words). The PPMI is then \[ \Align{ \mathrm{PPMI}(X,Y) &= \max\qty{0,\log\frac { \mathsf{count}(w,a) / N } { l_w/N \times \sum_c \mathsf{count}(c,a)/N } } \\ &= \max\qty{0,\log\frac { N\mathsf{count}(w,a) }{ l_w \sum_c \mathsf{count}(c,a) } }. } \] PPMI yields a sparse \(\textsf{num target words}\times\textsf{num context words}\) matrix.
- One problem: PPMI is biased toward infrequent events, so very rare words get very high PMI values. The usual fix is Laplace (add-one) smoothing of the counts, as in the sketch below.
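A minimal sketch of PPMI with add-one smoothing, assuming a dense NumPy count matrix (a real system would keep the matrix sparse):

```python
import numpy as np

def ppmi(counts: np.ndarray, smoothing: float = 1.0) -> np.ndarray:
    """PPMI from a (num target words x num context words) count matrix.

    `smoothing` adds a constant to every cell (Laplace / add-one smoothing)
    to damp the bias toward very rare events.
    """
    c = counts + smoothing
    total = c.sum()
    p_joint = c / total                                # P(target, context)
    p_target = c.sum(axis=1, keepdims=True) / total    # P(target)
    p_context = c.sum(axis=0, keepdims=True) / total   # P(context)
    pmi = np.log(p_joint / (p_target * p_context))
    return np.maximum(pmi, 0.0)                        # clip negative PMI to zero

counts = np.array([[10, 0, 3],
                   [ 2, 8, 1]])
print(ppmi(counts))
```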
Another option is TF-IDF, which yields a sparse \(\textsf{num target words}\times\textsf{num documents}\) matrix (a word-document co-occurrence matrix).
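One standard form of the weighting (this exact variant is an assumption, since the notes do not spell it out): for a term \(t\) in document \(d\), with \(D\) documents in total of which \(\mathrm{df}(t)\) contain \(t\), \[ \operatorname{tf\text{-}idf}(t,d) = \mathsf{count}(t,d)\times\log\frac{D}{\mathrm{df}(t)}. \]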
7.2.2 Dimension reduction
The matrices produced by PPMI / TF-IDF are very sparse. Compressing the sparse vectors into dense ones simplifies the model and speeds up downstream tasks, and it also removes noise and improves generalization.
Latent Semantic Analysis (LSA) turns the sparse word-document co-occurrence matrix into a dense one. Let the sparse matrix \(A\in\R^{V\times D}\) (e.g., a PPMI or TF-IDF matrix) have the SVD \[ A = M \Sigma C\T, \] where \(\Sigma\in\R^{V\times D}\) holds the singular values, \(M\in\R^{V\times V}\) is the word representation matrix, and \(C\in\R^{D\times D}\) is the document representation matrix. Discarding the singular values of small absolute value and keeping only the top \(k\) gives a low-rank approximation of \(A\): \[ \hat{A} = \hat{M} \hat{\Sigma} \hat{C}\T. \] Probabilistic Latent Semantic Analysis...
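A minimal sketch of the truncation step using NumPy's full SVD (the matrix `A` here is random and purely illustrative; at scale one would use a sparse truncated SVD instead):

```python
import numpy as np

def lsa_embeddings(A: np.ndarray, k: int) -> np.ndarray:
    """Dense k-dimensional word vectors from a V x D matrix via truncated SVD.

    Keeps only the top-k singular values; each row of the result is a word embedding.
    """
    M, sigma, Ct = np.linalg.svd(A, full_matrices=False)
    return M[:, :k] * sigma[:k]      # \hat{M}\hat{\Sigma}: one k-dim vector per word

A = np.random.rand(6, 4)             # toy 6-word x 4-document matrix
print(lsa_embeddings(A, k=2).shape)  # (6, 2)
```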
7.3 Prediction-based models
The word2vec models predict between every word and its context words.
Continuous bag of words (CBOW) predicts the current word from its context (similar to Bengio's neural language model): \[ p(w_n \mid w_{n-2}, w_{n-1}, w_{n+1}, w_{n+2}) = \operatorname{softmax}\!\bqty{ b + \sum_{j\neq n} m_jA_j + \tanh\pqty{ u + \sum_{j\neq n} m_jT_j }W }. \] Skip-gram (Mikolov et al., 2013) goes the other way and predicts the context from the current word: \[ p(c \mid w) = \frac1{Z_w} \exp(v_c\T v_w), \] where \(v_c, v_w\) are the vector representations of the context word and the current word, and \(Z_w\) is the normalization term. Both \(v_c\) and \(v_w\) are trained with SGD.
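A minimal sketch of one SGD step on the skip-gram objective with a full softmax (toy vocabulary and dimensions; real word2vec replaces the full softmax with negative sampling or a hierarchical softmax):

```python
import numpy as np

rng = np.random.default_rng(0)
V, dim, lr = 20, 8, 0.1                        # toy vocab size, embedding dim, learning rate
W_in = rng.normal(scale=0.1, size=(V, dim))    # v_w: target-word vectors
W_out = rng.normal(scale=0.1, size=(V, dim))   # v_c: context-word vectors

def sgd_step(w: int, c: int):
    """One SGD update on -log p(c | w) under the full-softmax skip-gram objective."""
    v_w = W_in[w].copy()
    scores = W_out @ v_w                        # v_c^T v_w for every candidate context c
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                        # softmax: exp(v_c^T v_w) / Z_w
    grad = probs.copy()
    grad[c] -= 1.0                              # gradient of -log p(c | w) w.r.t. the scores
    W_in[w] -= lr * (W_out.T @ grad)            # update v_w
    W_out[:] -= lr * np.outer(grad, v_w)        # update every v_c (uses the old v_w)

sgd_step(w=3, c=7)   # e.g. one (target, context) pair drawn from the corpus
```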
8 Compositional Semantics
8.1 Meaning representations
Turn a sentence into a first-order logic formula.
“Cats are mammals”: \[ \forall x.\mathsf{cat}(x)\implies\mathsf{mammal}(x). \]
“Marie owns a cat”: \[ \exists x.\mathsf{cat}(x) \wedge \mathsf{owns}(\textsf{Marie},x). \]
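As a sketch, such formulas can be written and parsed with NLTK's logic machinery (assuming `nltk` is installed; the predicate names simply mirror the formulas above):

```python
from nltk.sem.logic import Expression

read_expr = Expression.fromstring

# "Cats are mammals": universally quantified implication.
cats_are_mammals = read_expr('all x.(cat(x) -> mammal(x))')

# "Marie owns a cat": existentially quantified conjunction.
marie_owns_a_cat = read_expr('exists x.(cat(x) & owns(Marie, x))')

print(cats_are_mammals, marie_owns_a_cat, sep='\n')
```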
8.2 CFGs
Given this way of representing meanings, how do we compute meaning representations from sentences?