Fundamentals of Natural Language Processing | 5 Semantic Analysis

7 Distributional Semantics

Distributional semantics: a family of techniques for representing word meaning based on contexts of use.

  • Synonymy: same meaning in some context (automobile & car)
  • Similarity: similar but not identical meaning (car & truck)
  • Relatedness: related meanings (coffee & sugar)

7.1 Vector space models

Model a word by embedding it in a vector space. The representation of a word is a vector of real numbers. We expect much lower dimensions than \(\R^{V}\) (\(V\) is the vocabulary size). Such models are often called embedding models or vector space models (VSM).

The goal is to produce dense vector representations based on the context/use of words.

Main approaches: count-based, prediction-based, and task-based.

Cosine similarity (the cosine of the angle between two vectors) measures the similarity of two words: \[ \cos(u,v) = \frac{\lr{u,v}}{\|u\|\cdot\|v\|}. \]
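A minimal sketch of this similarity measure with NumPy (the vectors are toy values for illustration only):

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two word vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors, made up for illustration.
car = np.array([0.8, 0.1, 0.3])
truck = np.array([0.7, 0.2, 0.4])
coffee = np.array([0.1, 0.9, 0.2])

print(cosine_similarity(car, truck))   # high: similar words
print(cosine_similarity(car, coffee))  # lower: unrelated words
```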

7.2 Count-based models

Count-based methods:

  • Define a basis vocabulary \(C\) of context words.
  • Define a word window size \(w\).
  • Count the basis vocabulary words occurring within \(w\) words to the left or right of each instance of a target word in the corpus.
  • Form a vector representation of the target word based on these counts (see the sketch after this list).
  • Raw counts alone are not good enough: high-frequency function words such as the and of contribute little to meaning.
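A minimal sketch of the counting step, with a toy corpus and window size \(w=2\) (the corpus, target set, and context set below are made up for illustration):

```python
from collections import defaultdict

def cooccurrence_counts(corpus, targets, contexts, w=2):
    """Count context words appearing within w words of each target word."""
    counts = {t: defaultdict(int) for t in targets}
    for sent in corpus:
        for i, word in enumerate(sent):
            if word not in targets:
                continue
            window = sent[max(0, i - w):i] + sent[i + 1:i + 1 + w]
            for c in window:
                if c in contexts:
                    counts[word][c] += 1
    return {t: dict(c) for t, c in counts.items()}

corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "rug"]]
print(cooccurrence_counts(corpus, targets={"cat", "dog"},
                          contexts={"sat", "mat", "rug", "the"}))
```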

7.2.1 PPMI & TF-IDF

PMI (pointwise mutual information): Do events X and Y co-occur more than if they were independent? \[ \mathrm{PMI}(X,Y) := \log\frac{P(X,Y)}{P(X)P(Y)}. \]

  • Positive PMI (PPMI) replaces negative PMI values with \(0\).

Consider the context representation of a target word \(a\). Let event \(X\) be that \(a\) co-occurs with the context word \(w\) (word \(w\) occurs \(l_w\) times in total), and let event \(Y\) be that \(a\) occurs in the training data (which contains \(N\) words in total). The PPMI is then \[ \Align{ \mathrm{PPMI}(X,Y) &= \max\qty{0,\log\frac { \mathsf{count}(w,a) / N } { l_w/N \times \sum_c \mathsf{count}(c,a)/N } } \\ &= \max\qty{0,\log\frac { N\,\mathsf{count}(w,a) }{ l_w \sum_c \mathsf{count}(c,a) } }. } \] PPMI yields a sparse \(\textsf{num target words}\times\textsf{num context words}\) matrix.

  • One problem: PPMI is biased toward infrequent events, so very rare words get very high PMI values. A common fix is Laplace (add-one) smoothing of the counts.
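A minimal sketch of turning a raw target-by-context count matrix into a PPMI matrix with NumPy; the `smoothing` argument is my addition and implements the add-one idea above (add-one corresponds to `smoothing=1`):

```python
import numpy as np

def ppmi(counts: np.ndarray, smoothing: float = 0.0) -> np.ndarray:
    """PPMI from a (num target words x num context words) count matrix."""
    counts = counts.astype(float) + smoothing              # optional Laplace smoothing
    total = counts.sum()
    p_joint = counts / total                                # P(w, a)
    p_target = counts.sum(axis=1, keepdims=True) / total    # row (target word) marginals
    p_context = counts.sum(axis=0, keepdims=True) / total   # column (context word) marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_joint / (p_target * p_context))
    return np.where(pmi > 0, pmi, 0.0)                      # clip negative (and undefined) PMI to 0
```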

Another approach is TF-IDF, which yields a sparse \(\textsf{num target words}\times\textsf{num documents}\) matrix (a word-document co-occurrence matrix).
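A rough sketch of such a matrix; the weighting used here (raw term frequency times \(\log N/\mathrm{df}\)) is one standard TF-IDF variant, chosen by me since the notes do not fix a formula:

```python
import math
from collections import Counter

def tfidf_matrix(docs, vocab):
    """Rows: target words, columns: documents, entries: tf * idf."""
    N = len(docs)
    tf = [Counter(doc) for doc in docs]                           # term counts per document
    df = {w: sum(1 for doc in docs if w in doc) for w in vocab}   # document frequency
    idf = {w: math.log(N / df[w]) if df[w] else 0.0 for w in vocab}
    return [[tf[d][w] * idf[w] for d in range(N)] for w in vocab]

docs = [["the", "cat", "sat"], ["the", "dog", "barked"], ["a", "cat", "and", "a", "dog"]]
print(tfidf_matrix(docs, vocab=["cat", "dog", "the"]))
```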

7.2.2 Dimension reduction

The matrices produced by PPMI / TF-IDF are very sparse. Compressing the sparse vectors into dense ones simplifies the model and speeds up downstream tasks, and it also removes noise and improves generalization.

Latent Semantic Analysis (LSA) converts the sparse word-document co-occurrence matrix into a dense one. Take the SVD of the (PPMI) matrix \(A\in\R^{V\times D}\): \[ A = M \Sigma C\T, \] where \(\Sigma\in\R^{V\times D}\) holds the singular values, \(M\in\R^{V\times V}\) the word representations, and \(C\in\R^{D\times D}\) the document representations. Dropping the singular values with small magnitude and keeping only the top \(k\) gives a low-rank approximation of \(A\): \[ \hat{A} = \hat{M} \hat{\Sigma} \hat{C}\T. \] Probabilistic Latent Semantic Analysis...
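A minimal sketch of the truncation step with NumPy's SVD, where `A` stands for the sparse matrix converted to a dense array and `k` is the number of singular values kept; the rows of \(\hat{M}\hat{\Sigma}\) serve as \(k\)-dimensional word vectors:

```python
import numpy as np

def lsa_embeddings(A: np.ndarray, k: int) -> np.ndarray:
    """k-dimensional word vectors from a truncated SVD of A (rows = words)."""
    M, sigma, Ct = np.linalg.svd(A, full_matrices=False)   # A = M @ diag(sigma) @ Ct
    return M[:, :k] * sigma[:k]                            # keep the top-k components

# Low-rank reconstruction \hat{A} = \hat{M} \hat{\Sigma} \hat{C}^T, for reference:
# A_hat = (M[:, :k] * sigma[:k]) @ Ct[:k, :]
```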

7.3 Prediction-based models

The word2vec models: predict between every word and its context words.

Continuous bag of words (CBOW) predicts the current word from its context (similar to Bengio's neural language model): \[ p(w_n \mid w_{n-2}, w_{n-1}, w_{n+1}, w_{n+2}) = \operatorname{softmax}\!\bqty{ b + \sum_{j\neq n} m_jA_j + \tanh\pqty{ u + \sum_{j\neq n} m_jT_j }W }. \] Skip-gram (Mikolov et al., 2013) goes the other way, predicting the context from the current word: \[ p(c \mid w) = \frac1{Z_w} \exp(v_c\T v_w), \] where \(v_c,v_w\) are the vector representations of the context word and the current word, and \(Z_w\) is the normalization term. The vectors \(v_c,v_w\) are optimized with SGD.
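A minimal sketch of the skip-gram distribution \(p(c\mid w)\) with NumPy, using randomly initialized toy embeddings; in practice computing the normalizer \(Z_w\) over the full vocabulary is expensive:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 1000, 50                        # toy vocabulary size and embedding dimension
word_vecs = rng.normal(size=(V, d))    # v_w: target-word embeddings
ctx_vecs = rng.normal(size=(V, d))     # v_c: context-word embeddings

def skipgram_probs(w: int) -> np.ndarray:
    """p(c | w) for every context word c (softmax over the vocabulary)."""
    scores = ctx_vecs @ word_vecs[w]        # v_c^T v_w for all c
    scores -= scores.max()                  # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()    # division by Z_w

print(skipgram_probs(42).sum())        # ~1.0
```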

8 Compositional Semantics

8.1 Meaning representations

Turn a sentence into a first-order logic statement.

  • “Cats are mammals”: \[ \forall x.\mathsf{cat}(x)\implies\mathsf{mammal}(x). \]

  • “Marie owns a cat”: \[ \exists x.\mathsf{cat}(x) \wedge \mathsf{owns}(\textsf{Marie},x). \]
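A minimal sketch of writing such representations programmatically, using NLTK's first-order logic parser (this particular tool is my illustration, not something prescribed in the notes):

```python
from nltk.sem import Expression

read_expr = Expression.fromstring

# "Cats are mammals": forall x. cat(x) => mammal(x)
cats = read_expr(r"all x.(cat(x) -> mammal(x))")

# "Marie owns a cat": exists x. cat(x) & owns(Marie, x)
marie = read_expr(r"exists x.(cat(x) & owns(marie, x))")

print(cats)   # prints the parsed formula back
print(marie)
```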

8.2 CFGs

Given this way of representing meanings, how do we compute meaning representations from sentences?

