Kullback–Leibler divergence

0. Preface

Given two distributions $P$ and $Q$, how do we quantify the difference between them? In mathematical statistics, the KL divergence is a standard tool for this.

1. Definition

1.1 Discrete case

For discrete probability distributions $P$ and $Q$ defined on the same sample space $\mathcal{X}$:

$$
D_{KL}(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log\frac{P(x)}{Q(x)}
$$

which is equivalent to

$$
D_{KL}(P \| Q) = -\sum_{x \in \mathcal{X}} P(x) \log\frac{Q(x)}{P(x)}
$$

1.2 Continuous case

$$
D_{KL}(P \| Q) = \int_{x \in \mathcal{X}} p(x) \log\frac{p(x)}{q(x)} \, dx
$$
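As a quick sanity check, the discrete formula takes a few lines of Python. A minimal sketch, where the distributions `p` and `q` are made-up examples and `scipy.special.rel_entr` computes the elementwise terms $P(x)\log\frac{P(x)}{Q(x)}$:

```python
import numpy as np
from scipy.special import rel_entr  # elementwise p * log(p / q)

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions given as probability vectors.

    rel_entr follows the convention 0 * log(0/q) = 0 and returns +inf
    where p > 0 but q == 0 (absolute continuity violated).
    """
    return rel_entr(np.asarray(p, float), np.asarray(q, float)).sum()

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(kl_divergence(p, q))  # ~0.0253 nats (natural log)
```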

2. Understanding

2.1 From the formula

Some people interpret the KL divergence as a kind of "distance", but a distance (metric) must satisfy the following properties:

  • Non-negativity: satisfied

Proof:

$$
\begin{aligned}
D_{KL}(P \| Q) &= -\sum_{x \in \mathcal{X}} P(x) \log\frac{Q(x)}{P(x)} \\
&\geq -\log\left(\sum_{x \in \mathcal{X}} P(x) \cdot \frac{Q(x)}{P(x)}\right) \quad \text{(Jensen's inequality: } \mathbb{E}[f(X)] \geq f(\mathbb{E}[X]) \text{ for the convex function } f = -\log\text{)} \\
&= -\log\left(\sum_{x \in \mathcal{X}} Q(x)\right) \\
&= -\log 1 = 0
\end{aligned}
$$
  • Identity: satisfied

Proof:

$$
D_{KL}(P \| P) = \sum_{x \in \mathcal{X}} P(x) \log\frac{P(x)}{P(x)} = 0
$$
  • Symmetry: not satisfied
$$
\begin{aligned}
D_{KL}(P \| Q) &= \sum_{x \in \mathcal{X}} P(x) \log\frac{P(x)}{Q(x)} \\
D_{KL}(Q \| P) &= \sum_{x \in \mathcal{X}} Q(x) \log\frac{Q(x)}{P(x)}
\end{aligned}
$$

These are unequal in general, so symmetry is not satisfied.

  • Triangle inequality: not satisfied

Suppose we have three distributions $P$, $Q$, $R$, and examine the relationship between $D_{KL}(P \| R)$, $D_{KL}(P \| Q)$, and $D_{KL}(Q \| R)$.

$$
\begin{aligned}
D_{KL}(P \| R) &= \sum_{x \in \mathcal{X}} P(x) \log\frac{P(x)}{R(x)} \\
&= \sum_{x \in \mathcal{X}} P(x) \log\left(\frac{P(x)}{R(x)} \cdot \frac{Q(x)}{Q(x)}\right) \\
&= \sum_{x \in \mathcal{X}} P(x) \log\left(\frac{P(x)}{Q(x)} \cdot \frac{Q(x)}{R(x)}\right) \\
&= D_{KL}(P \| Q) + \sum_{x \in \mathcal{X}} P(x) \log\frac{Q(x)}{R(x)}
\end{aligned}
$$

So it remains to compare the last term

$$
\sum_{x \in \mathcal{X}} P(x) \log\frac{Q(x)}{R(x)}
$$

with $D_{KL}(Q \| R)$. Unfortunately, no general ordering between the two can be established, so the following situation can occur, and the triangle inequality fails (see the numeric check below):

$$
D_{KL}(P \| R) > D_{KL}(P \| Q) + D_{KL}(Q \| R)
$$
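Both failures are easy to exhibit numerically. A minimal sketch with three hand-picked distributions over two outcomes (the values are arbitrary examples):

```python
import numpy as np
from scipy.special import rel_entr

def kl(p, q):
    # D_KL(p || q) = sum_x p(x) * log(p(x) / q(x))
    return rel_entr(np.asarray(p, float), np.asarray(q, float)).sum()

# Three arbitrary distributions over two outcomes.
P, Q, R = [0.9, 0.1], [0.5, 0.5], [0.1, 0.9]

print(kl(P, Q), kl(Q, P))             # 0.368... vs 0.511...: asymmetric
print(kl(P, R), kl(P, Q) + kl(Q, R))  # 1.758... > 0.879...: triangle inequality fails
```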
NOTE

It is therefore inappropriate to call the KL divergence a "distance"; at best it can be said to measure the discrepancy between two distributions.

2.2 From the entropy perspective

"Entropy" here usually refers to Shannon entropy, which originates in information theory; its formula is well known:

$$
H(X) = \sum_{x \in \mathcal{X}} P(x) \log\frac{1}{P(x)}
$$
NOTE

If you are unsure why the formula takes this form rather than another, see the video I posted on Bilibili: https://www.bilibili.com/video/BV1kg411u7RP

Now consider the following derivation:

$$
\begin{aligned}
I(X,Y) &= D_{KL}\big(p(x,y) \,\|\, p(x) \times p(y)\big) \quad \text{(definition)} \\
&= \sum_{x,y} p_{xy} \log\frac{p_{xy}}{p_x p_y} \\
&= \sum_{x,y} p_{xy} \log\frac{p_{xy}}{p_x} - \sum_{x,y} p_{xy} \log p_y \\
&= \sum_{x,y} p_{xy} \log p(y|x) - \sum_{y} \Big(\sum_{x} p_{xy}\Big) \log p_y \\
&= \sum_{x,y} p_x\, p_{y|x} \log p(y|x) - \sum_{y} p_y \log p_y \\
&= \sum_{x} p_x \sum_{y} p_{y|x} \log p(y|x) - \sum_{y} p_y \log p_y \\
&= -\sum_{x} p_x\, H(Y \mid X = x) - \sum_{y} p_y \log p_y \\
&= -H(Y \mid X) + H(Y) \\
&= H(Y) - H(Y \mid X)
\end{aligned}
$$

Here, $I(X,Y)$ is called the mutual information of random variables $X$ and $Y$, and $H(Y \mid X)$ denotes the entropy of $Y$ given the random variable $X$, also known as the conditional entropy.

Let $P = p(x,y)$ and $Q = p(x)\,p(y)$. From the viewpoint of $I(X,Y) = D_{KL}(P \| Q) = H(Y) - H(Y \mid X)$, the statement $I(X,Y) = 0$ says exactly that $H(Y) - H(Y \mid X) = 0$.

NOTE

This means that even with $X$ known, we still have $H(Y \mid X) = H(Y)$, i.e. knowing $X$ does nothing to reduce the entropy of $Y$ <=> $X$ and $Y$ are independent <=> $P = Q$.

Seen this way, minimizing $D_{KL}(P \| Q)$ means making the two distributions as close as possible in terms of information content.
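The identity $I(X,Y) = D_{KL}\big(p(x,y) \,\|\, p(x)p(y)\big) = H(Y) - H(Y \mid X)$ is easy to verify numerically. A small sketch with an arbitrary made-up 2×3 joint table:

```python
import numpy as np
from scipy.special import rel_entr

# A made-up joint distribution p(x, y) over 2 x 3 outcomes.
p_xy = np.array([[0.15, 0.25, 0.10],
                 [0.20, 0.10, 0.20]])

p_x = p_xy.sum(axis=1)  # marginal p(x)
p_y = p_xy.sum(axis=0)  # marginal p(y)

# I(X, Y) = D_KL( p(x, y) || p(x) p(y) )
mi_from_kl = rel_entr(p_xy, np.outer(p_x, p_y)).sum()

# I(X, Y) = H(Y) - H(Y|X)
def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

H_y_given_x = sum(p_x[i] * H(p_xy[i] / p_x[i]) for i in range(len(p_x)))
print(mi_from_kl, H(p_y) - H_y_given_x)  # the two values agree
```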

3. Application

$$
\begin{aligned}
D_{KL}(P \| Q) &= \sum_{x \in \mathcal{X}} P(x) \log\frac{P(x)}{Q(x)} \\
&= \sum_{x \in \mathcal{X}} P(x) \log P(x) - \sum_{x \in \mathcal{X}} P(x) \log Q(x) \\
&= -H(P) + \sum_{x \in \mathcal{X}} P(x) \log\frac{1}{Q(x)} \\
&= -H(P) + H(P, Q)
\end{aligned}
$$

Here $H(P,Q)$ is called the cross entropy of distributions $P$ and $Q$. In machine learning and deep learning, $P$ is typically the true distribution (unknown but fixed), so $-H(P)$ is a constant, while $Q$ is the distribution output by the model. Minimizing $H(P,Q)$ is therefore equivalent to minimizing $D_{KL}(P \| Q)$: this is exactly the cross-entropy loss function.
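The decomposition takes one line to check numerically; a minimal sketch with made-up $P$ and $Q$:

```python
import numpy as np
from scipy.special import rel_entr

P = np.array([0.7, 0.2, 0.1])  # "true" distribution (fixed)
Q = np.array([0.5, 0.3, 0.2])  # model's output distribution

H_P = -np.sum(P * np.log(P))            # entropy of P: a constant w.r.t. Q
cross_entropy = -np.sum(P * np.log(Q))  # H(P, Q)
kl = rel_entr(P, Q).sum()               # D_KL(P || Q)

# H(P, Q) = H(P) + D_KL(P || Q): the two objectives differ by a
# constant, so they share the same minimizer in Q.
print(cross_entropy, H_P + kl)
```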

4. Supplement

KL divergence between two multivariate Gaussian distributions.

The probability density function of the multivariate normal distribution is given by:

$$
p(\mathbf{x}) = \frac{1}{(2\pi)^{k/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)
$$

Suppose the two distributions are $\mathcal{N}(\boldsymbol{\mu}_p, \Sigma_p)$ and $\mathcal{N}(\boldsymbol{\mu}_q, \Sigma_q)$, where $\boldsymbol{\mu}$ is a $k$-dimensional column vector.

$$
\begin{aligned}
D_{KL}(p \| q) &= \mathbb{E}_p\left[\log(p) - \log(q)\right] \\
&= \mathbb{E}_p\left[\frac{1}{2}\log\frac{|\Sigma_q|}{|\Sigma_p|} - \frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_p)^T\Sigma_p^{-1}(\mathbf{x}-\boldsymbol{\mu}_p) + \frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_q)^T\Sigma_q^{-1}(\mathbf{x}-\boldsymbol{\mu}_q)\right] \\
&= \frac{1}{2}\mathbb{E}_p\left[\log\frac{|\Sigma_q|}{|\Sigma_p|}\right] - \frac{1}{2}\mathbb{E}_p\left[(\mathbf{x}-\boldsymbol{\mu}_p)^T\Sigma_p^{-1}(\mathbf{x}-\boldsymbol{\mu}_p)\right] + \frac{1}{2}\mathbb{E}_p\left[(\mathbf{x}-\boldsymbol{\mu}_q)^T\Sigma_q^{-1}(\mathbf{x}-\boldsymbol{\mu}_q)\right] \\
&= \frac{1}{2}\log\frac{|\Sigma_q|}{|\Sigma_p|} - \frac{1}{2}\mathbb{E}_p\left[(\mathbf{x}-\boldsymbol{\mu}_p)^T\Sigma_p^{-1}(\mathbf{x}-\boldsymbol{\mu}_p)\right] + \frac{1}{2}\mathbb{E}_p\left[(\mathbf{x}-\boldsymbol{\mu}_q)^T\Sigma_q^{-1}(\mathbf{x}-\boldsymbol{\mu}_q)\right]
\end{aligned}
$$

Since $(\mathbf{x}-\boldsymbol{\mu}_p)^T\Sigma_p^{-1}(\mathbf{x}-\boldsymbol{\mu}_p)$ is a scalar, it can be rewritten as:

$$
\operatorname{tr}\left\{(\mathbf{x}-\boldsymbol{\mu}_p)^T\Sigma_p^{-1}(\mathbf{x}-\boldsymbol{\mu}_p)\right\}
$$

where $\operatorname{tr}\{\cdot\}$ denotes the trace operator. Using the trace trick (cyclic invariance), the expression can be rewritten as:

$$
\operatorname{tr}\left\{(\mathbf{x}-\boldsymbol{\mu}_p)(\mathbf{x}-\boldsymbol{\mu}_p)^T\Sigma_p^{-1}\right\}
$$

The second term can then be rewritten as:

$$
\frac{1}{2}\mathbb{E}_p\left[\operatorname{tr}\left\{(\mathbf{x}-\boldsymbol{\mu}_p)(\mathbf{x}-\boldsymbol{\mu}_p)^T\Sigma_p^{-1}\right\}\right]
$$

Then, exchanging the expectation and the trace, and using the fact that $\Sigma_p^{-1}$ is a constant matrix:

$$
\begin{aligned}
&= \frac{1}{2}\operatorname{tr}\left\{\mathbb{E}_p\left[(\mathbf{x}-\boldsymbol{\mu}_p)(\mathbf{x}-\boldsymbol{\mu}_p)^T\Sigma_p^{-1}\right]\right\} \\
&= \frac{1}{2}\operatorname{tr}\left\{\mathbb{E}_p\left[(\mathbf{x}-\boldsymbol{\mu}_p)(\mathbf{x}-\boldsymbol{\mu}_p)^T\right]\Sigma_p^{-1}\right\} \\
&= \frac{1}{2}\operatorname{tr}\left\{\Sigma_p\Sigma_p^{-1}\right\} \\
&= \frac{1}{2}\operatorname{tr}\left\{I_k\right\} \\
&= \frac{k}{2}
\end{aligned}
$$
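This result, $\mathbb{E}_p\big[(\mathbf{x}-\boldsymbol{\mu}_p)^T\Sigma_p^{-1}(\mathbf{x}-\boldsymbol{\mu}_p)\big] = k$, can be sanity-checked by Monte Carlo; a sketch under an arbitrary mean and covariance (not part of the derivation):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 3
mu = np.array([1.0, -2.0, 0.5])      # arbitrary mean
A = rng.standard_normal((k, k))
Sigma = A @ A.T + k * np.eye(k)      # arbitrary positive-definite covariance

x = rng.multivariate_normal(mu, Sigma, size=200_000)
d = x - mu
# Quadratic form (x - mu)^T Sigma^{-1} (x - mu) for each sample.
quad = np.einsum('ni,ij,nj->n', d, np.linalg.inv(Sigma), d)
print(quad.mean())                   # close to k = 3
```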

As for the third term (proof at the end):

$$
\mathbb{E}_p\left[(\mathbf{x}-\boldsymbol{\mu}_q)^T\Sigma_q^{-1}(\mathbf{x}-\boldsymbol{\mu}_q)\right] = (\boldsymbol{\mu}_p-\boldsymbol{\mu}_q)^T\Sigma_q^{-1}(\boldsymbol{\mu}_p-\boldsymbol{\mu}_q) + \operatorname{tr}\left\{\Sigma_q^{-1}\Sigma_p\right\}
$$

Therefore:

$$
D_{KL}(p \| q) = \frac{1}{2}\left[\log\frac{|\Sigma_q|}{|\Sigma_p|} - k + (\boldsymbol{\mu}_p-\boldsymbol{\mu}_q)^T\Sigma_q^{-1}(\boldsymbol{\mu}_p-\boldsymbol{\mu}_q) + \operatorname{tr}\left\{\Sigma_q^{-1}\Sigma_p\right\}\right]
$$

When $q \sim \mathcal{N}(\mathbf{0}, I)$:

$$
D_{KL}(p \| q) = \frac{1}{2}\left[\boldsymbol{\mu}_p^T\boldsymbol{\mu}_p + \operatorname{tr}\left\{\Sigma_p\right\} - k - \log|\Sigma_p|\right]
$$
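A direct implementation of the closed form, checked against the $\mathcal{N}(\mathbf{0}, I)$ special case; a minimal sketch, with arbitrary made-up test values:

```python
import numpy as np

def kl_mvn(mu_p, Sigma_p, mu_q, Sigma_q):
    """Closed-form D_KL( N(mu_p, Sigma_p) || N(mu_q, Sigma_q) )."""
    k = mu_p.shape[0]
    Sq_inv = np.linalg.inv(Sigma_q)
    diff = mu_p - mu_q
    logdet_p = np.linalg.slogdet(Sigma_p)[1]  # log|Sigma_p|
    logdet_q = np.linalg.slogdet(Sigma_q)[1]  # log|Sigma_q|
    return 0.5 * (logdet_q - logdet_p - k
                  + diff @ Sq_inv @ diff
                  + np.trace(Sq_inv @ Sigma_p))

mu_p = np.array([0.3, -0.6])
Sigma_p = np.array([[1.5, 0.2],
                    [0.2, 0.8]])
k = 2

# With q = N(0, I), the general formula reduces to the simplified expression.
general = kl_mvn(mu_p, Sigma_p, np.zeros(k), np.eye(k))
special = 0.5 * (mu_p @ mu_p + np.trace(Sigma_p) - k
                 - np.linalg.slogdet(Sigma_p)[1])
print(general, special)  # equal
```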

Proof of the third term:

$$
\begin{aligned}
\mathbb{E}_{\mathbf{x}\sim p(\mathbf{x})}\left[(\mathbf{x}-\boldsymbol{\mu}_q)^{\top}\Sigma_q^{-1}(\mathbf{x}-\boldsymbol{\mu}_q)\right]
&= \mathbb{E}_{\mathbf{x}\sim p(\mathbf{x})}\left[\operatorname{tr}\left((\mathbf{x}-\boldsymbol{\mu}_q)^{\top}\Sigma_q^{-1}(\mathbf{x}-\boldsymbol{\mu}_q)\right)\right] \\
&= \mathbb{E}_{\mathbf{x}\sim p(\mathbf{x})}\left[\operatorname{tr}\left(\Sigma_q^{-1}(\mathbf{x}-\boldsymbol{\mu}_q)(\mathbf{x}-\boldsymbol{\mu}_q)^{\top}\right)\right] \\
&= \operatorname{tr}\left(\Sigma_q^{-1}\,\mathbb{E}_{\mathbf{x}\sim p(\mathbf{x})}\left[(\mathbf{x}-\boldsymbol{\mu}_q)(\mathbf{x}-\boldsymbol{\mu}_q)^{\top}\right]\right) \\
&= \operatorname{tr}\left(\Sigma_q^{-1}\,\mathbb{E}_{\mathbf{x}\sim p(\mathbf{x})}\left[\mathbf{x}\mathbf{x}^{\top}-\boldsymbol{\mu}_q\mathbf{x}^{\top} - \mathbf{x}\boldsymbol{\mu}_q^{\top} + \boldsymbol{\mu}_q\boldsymbol{\mu}_q^{\top}\right]\right) \\
&= \operatorname{tr}\left(\Sigma_q^{-1}\left(\Sigma_p + \boldsymbol{\mu}_p\boldsymbol{\mu}_p^{\top}-\boldsymbol{\mu}_q\boldsymbol{\mu}_p^{\top} - \boldsymbol{\mu}_p\boldsymbol{\mu}_q^{\top} + \boldsymbol{\mu}_q\boldsymbol{\mu}_q^{\top}\right)\right) \\
&= \operatorname{tr}\left(\Sigma_q^{-1}\Sigma_p + \Sigma_q^{-1}(\boldsymbol{\mu}_p-\boldsymbol{\mu}_q)(\boldsymbol{\mu}_p-\boldsymbol{\mu}_q)^{\top}\right) \\
&= \operatorname{tr}\left(\Sigma_q^{-1}\Sigma_p\right) + (\boldsymbol{\mu}_p-\boldsymbol{\mu}_q)^{\top}\Sigma_q^{-1}(\boldsymbol{\mu}_p-\boldsymbol{\mu}_q)
\end{aligned}
$$

Note that when $\boldsymbol{\mu}_q = \boldsymbol{\mu}_p$ and $\Sigma_q = \Sigma_p$, the expression above equals $k$, consistent with the second term computed earlier; this quadratic-form expectation is the one that appears in the entropy of the normal distribution.

