Entropic regularization
The log-sum-exp trick
Gradient descent, coordinate descent
The Iterated Proportional Fitting Procedure (IPFP)
Galichon, Optimal Transport Methods in Economics, Ch. 7.3
Peyré, Cuturi, Computational Optimal Transport, Ch. 4.
Consider the problem
\begin{align*} \max_{\pi\in\mathcal{M}\left( p,q\right) }\sum_{ij}\pi_{ij}\Phi_{ij}-\sigma\sum_{ij}\pi_{ij}\ln\pi_{ij} \end{align*}where $\sigma>0$. The problem coincides with the optimal assignment problem when $\sigma=0$. When $\sigma\rightarrow+\infty$, the solution to this problem approaches the independent coupling, $\pi_{ij}=p_{i}q_{j}$.
Later on, we will provide microfoundations for this problem, and connect it with a number of important methods in economics (BLP, gravity model, Choo-Siow...). For now, let's just view this as an extension of the optimal transport problem.
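Before setting up the data, here is a quick numerical sketch of the $\sigma\rightarrow+\infty$ limit on synthetic data (our own example, using the matrix IPFP/Sinkhorn iteration derived later in this lecture):

```python
import numpy as np

# Synthetic illustration (not the marriage dataset used below) of the
# sigma -> +infinity limit of the entropy-regularized problem.
rng = np.random.default_rng(1)
nbX, nbY = 4, 3
Phi = rng.normal(size=(nbX, nbY))
p = np.ones((nbX, 1)) / nbX
q = np.ones((nbY, 1)) / nbY

def entropic_ot(sigma, nit=2000):
    # matrix IPFP / Sinkhorn iteration, derived later in this lecture
    K = np.exp(Phi / sigma)
    b = np.ones((nbY, 1))
    for _ in range(nit):
        a = p / (K @ b)
        b = q / (K.T @ a)
    return a * K * b.T

# for very large sigma, the solution is close to the independent coupling
print(np.allclose(entropic_ot(1e4), p @ q.T, atol=1e-3))  # True
```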
We shall compute this problem using Python libraries that we have already met. Let us start by loading them.
import numpy as np
import os
import pandas as pd
import time
import scipy.sparse as spr
#!python -m pip install -i https://pypi.gurobi.com gurobipy ## only if Gurobi not here
import gurobipy as grb
Note that Gurobi is loaded for benchmarking against the case $\sigma=0$; it is not suited to solving the nonlinear optimization problem above.
Now, let's load up the affinitymatrix.csv, Xvals.csv and Yvals.csv files that you will recall from the previous block. We will work with a smaller population, with nbX types of men and nbY types of women.
nbX = 5
nbY = 3
tol = 1e-9
maxite = 1e+06
#thepath = os.path.join(os.getcwd(),'data_mec_optim/marriage_personality-traits/')
thepath = 'https://raw.githubusercontent.com/math-econ-code/mec_optim_2021-01/master/data_mec_optim/marriage_personality-traits/'
data_X = pd.read_csv(thepath + "Xvals.csv")
data_Y = pd.read_csv(thepath + "Yvals.csv")
affdf = pd.read_csv(thepath + "affinitymatrix.csv")
nbcar = 10
affmat = affdf.iloc[0:nbcar,1:nbcar+1].to_numpy()
sdX = data_X.std().to_numpy()
sdY = data_Y.std().to_numpy()
mX = data_X.mean().to_numpy()
mY = data_Y.mean().to_numpy()
Xvals = ((data_X-mX)/sdX).to_numpy()
Yvals = ((data_Y-mY)/sdY).to_numpy()
nobs = Xvals.shape[0]
Phi = (Xvals @ affmat @ Yvals.T)[:nbX,:nbY]
p = 1/nbX * np.ones(nbX)
q = 1/nbY * np.ones(nbY)
nrow = min(8, nbX) # number of rows to display
ncol = min(8, nbY) # number of cols to display
As a warm-up, let us solve, as in the previous lecture, the problem with $\sigma=0$ using Gurobi:
ptm = time.time()
A1 = spr.kron(np.ones((1, nbY)), spr.identity(nbX))
A2 = spr.kron(spr.identity(nbY), np.ones((1, nbX)))
A = spr.vstack([A1, A2])
obj = Phi.flatten(order = 'F') # flatten order is important
rhs = np.hstack([p,q])
m = grb.Model('marriage')
x = m.addMVar(len(obj), name='couple')
m.setObjective(obj @ x, grb.GRB.MAXIMIZE)
m.addConstr(A @ x == rhs, name="Constr")
m.setParam( 'OutputFlag', False ) #quiet output
m.optimize()
diff = time.time() - ptm
print(f'Time elapsed (Gurobi) = {diff} s.')
if m.status == grb.GRB.Status.OPTIMAL:
val_gurobi = m.objval
x = m.getAttr('x')
x = np.array(x).reshape([nbX, nbY])
pi = m.getAttr('pi')
u_gurobi = pi[:nbX]
v_gurobi = pi[nbX:nbX + nbY]
print(f"Value of the problem (Gurobi) = {val_gurobi}")
print(np.subtract(u_gurobi[:nrow], u_gurobi[nrow - 1]))
print(np.add(v_gurobi[:ncol], u_gurobi[nrow - 1]))
print('*************************')
Time elapsed (Gurobi) = 0.009987354278564453 s. Value of the problem (Gurobi) = 0.41095324822187485 [-0.63237262 -0.57124993 1.634949 -1.24417523 0. ] [0.62598494 0.61304671 0.48153736] *************************
Let's compute the dual by the minimax approach. We have
\begin{align*} \max_{\pi\geq0}\min_{u,v}\sum_{ij}\pi_{ij}\left( \Phi_{ij}-u_{i}-v_{j}% -\sigma\ln\pi_{ij}\right) +\sum_{i}u_{i}p_{i}+\sum_{j}v_{j}q_{j}% \end{align*}thus
\begin{align*} \min_{u,v}\sum_{i}u_{i}p_{i}+\sum_{j}v_{j}q_{j}+\max_{\pi\geq0}\sum_{ij}% \pi_{ij}\left( \Phi_{ij}-u_{i}-v_{j}-\sigma\ln\pi_{ij}\right) \end{align*}By FOC in the inner problem, one has $\Phi_{ij}-u_{i}-v_{j}-\sigma\ln \pi_{ij}-\sigma=0,$thus
\begin{align*} \pi_{ij}=\exp\left( \frac{\Phi_{ij}-u_{i}-v_{j}-\sigma}{\sigma}\right) \end{align*}and $\pi_{ij}\left( \Phi_{ij}-u_{i}-v_{j}-\sigma\ln\pi_{ij}\right) =\sigma\pi_{ij}$, thus the dual problem is
\begin{align*} \min_{u,v}\sum_{i}u_{i}p_{i}+\sum_{j}v_{j}q_{j}+\sigma\sum_{ij}\exp\left( \frac{\Phi_{ij}-u_{i}-v_{j}-\sigma}{\sigma}\right) . \end{align*}After replacing $v_{j}$ by $v_{j}+\sigma$, the dual is
\begin{align*} \min_{u,v}\sum_{i}u_{i}p_{i}+\sum_{j}v_{j}q_{j}+\sigma\sum_{ij}\exp\left( \frac{\Phi_{ij}-u_{i}-v_{j}}{\sigma}\right) -\sigma. \tag{V1} \end{align*}Claim: the problem is equivalent to
\begin{align*} \min_{u,v}\sum_{i}u_{i}p_{i}+\sum_{j}v_{j}q_{j}+\sigma\log\sum_{i,j} \exp\left( \frac{\Phi_{ij}-u_{i}-v_{j}}{\sigma}\right) \tag{V2} \end{align*}
Indeed, let us go back to the minimax expression
\begin{align*} \min_{u,v}\sum_{i}u_{i}p_{i}+\sum_{j}v_{j}q_{j}+\max_{\pi\geq0}\sum_{ij}\pi_{ij}\left( \Phi_{ij}-u_{i}-v_{j}-\sigma\ln\pi_{ij}\right) \end{align*}we see that the solution $\pi$ has automatically $\sum_{ij}\pi_{ij}=1$; thus we can incorporate the constraint into
\begin{align*} \min_{u,v}\sum_{i}u_{i}p_{i}+\sum_{j}v_{j}q_{j}+\max_{\pi\geq0:\sum_{ij}\pi_{ij}=1}\sum_{ij}\pi_{ij}\left( \Phi_{ij}-u_{i}-v_{j}-\sigma\ln\pi_{ij}\right) \end{align*}which yields our desired result.
This expression is interesting because, taking any $\hat{\pi}\in\mathcal{M}\left( p,q\right)$, it can be rewritten as
\begin{align*} \max_{u,v}\sum_{ij}\hat{\pi}_{ij}\left( \frac{\Phi_{ij}-u_{i}-v_{j}}{\sigma}\right) -\log\sum_{ij}\exp\left( \frac{\Phi_{ij}-u_{i}-v_{j}}{\sigma}\right) \end{align*}therefore if the parameter is $\theta=\left( u,v\right)$, observations are $ij$ pairs, and the likelihood of $ij$ is
\begin{align*} \pi_{ij}^{\theta}=\frac{\exp\left( \frac{\Phi_{ij}-u_{i}-v_{j}}{\sigma }\right) }{\sum_{ij}\exp\left( \frac{\Phi_{ij}-u_{i}-v_{j}}{\sigma}\right) } \end{align*}Hence, our expression will coincide with the maximum likelihood in this model.
Consider
\begin{align*} \min_{u,v} & \sum_{i}u_{i}p_{i}+\sum_{j}v_{j}q_{j} \\ s.t. \quad & \sum_{i,j}\exp\left( \frac{\Phi_{ij}-u_{i}-v_{j}}{\sigma}\right) =1 \end{align*}
It is easy to see that the solutions of this problem coincide with version 2. Indeed, the Lagrange multiplier is forced to be one. In other words,
\begin{align*} \min_{u,v} & \sum_{i}u_{i}p_{i}+\sum_{j}v_{j}q_{j}\\ s.t. \quad & \sigma\log\sum_{i,j}\exp\left( \frac{\Phi_{ij}-u_{i}-v_{j}}{\sigma }\right) =0 \end{align*}Recall that when $\sigma\rightarrow0$, one has
\begin{align*} \sigma\log\left( e^{a/\sigma}+e^{b/\sigma}\right) \rightarrow\max\left( a,b\right) \end{align*}Indeed, letting $m=\max\left( a,b\right)$,
\begin{align*} \sigma\log\left( e^{a/\sigma}+e^{b/\sigma}\right) =m+\sigma\log\left(\exp\left( \frac{a-m}{\sigma}\right) +\exp\left( \frac{b-m}{\sigma}\right)\right), \end{align*} and the argument of the logarithm lies between $1$ and $2$.
This simple remark is actually a useful numerical recipe called the log-sum-exp trick: when $\sigma$ is small, using the formula above to compute $\sigma\log\left( e^{a/\sigma}+e^{b/\sigma}\right)$ ensures the exponentials won't blow up.
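A minimal sketch of the trick in code (the function names are ours, not from the lecture):

```python
import numpy as np

def naive_smooth_max(a, b, sigma):
    # direct evaluation: exp(a/sigma) overflows when sigma is small
    with np.errstate(over='ignore'):
        return sigma * np.log(np.exp(a / sigma) + np.exp(b / sigma))

def stable_smooth_max(a, b, sigma):
    # log-sum-exp trick: factor out m = max(a, b) before exponentiating,
    # so the arguments of the exponentials are nonpositive
    m = max(a, b)
    return m + sigma * np.log(np.exp((a - m) / sigma) + np.exp((b - m) / sigma))

print(naive_smooth_max(1.0, 0.5, 0.001))   # inf: exp(1000) overflows
print(stable_smooth_max(1.0, 0.5, 0.001))  # 1.0, close to max(a, b)
```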
Back to the third expression, with $\sigma\rightarrow0$, one has
\begin{align*} \min_{u,v} & \sum_{i}u_{i}p_{i}+\sum_{j}v_{j}q_{j}\tag{V3}\\ s.t. & \max_{ij}\left( \Phi_{ij}-u_{i}-v_{j}\right) =0\nonumber \end{align*}This is exactly equivalent with the classical Monge-Kantorovich expression
\begin{align*} \min_{u,v} & \sum_{i}u_{i}p_{i}+\sum_{j}v_{j}q_{j}\\ s.t. \quad & \Phi_{ij}-u_{i}-v_{j}\leq0 \end{align*}We can compute $\min F\left( x\right)$ by two methods:
Either by gradient descent: $x_{t+1}=x_{t}-\epsilon_{t}\nabla F\left( x_{t}\right) $. (Steepest descent has $\epsilon_{t}=1/\left\vert \nabla F\left( x_{t}\right) \right\vert $.)
Or by coordinate descent: $x_{i}\left( t+1\right) =\arg\min_{x_{i}}F\left( x_{i},x_{-i}\left( t\right) \right)$.
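As a toy illustration (our own example, not from the lecture), both methods recover the minimizer of a small strictly convex quadratic:

```python
import numpy as np

# F(x) = 0.5 x'Ax - b'x with A positive definite; the minimizer solves Ax = b
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x_star = np.linalg.solve(A, b)

# gradient descent with a fixed step epsilon
x = np.zeros(2)
for _ in range(200):
    x = x - 0.1 * (A @ x - b)

# coordinate descent: solve the FOC exactly in one coordinate at a time
y = np.zeros(2)
for _ in range(200):
    for i in range(2):
        y[i] = (b[i] - A[i] @ y + A[i, i] * y[i]) / A[i, i]

print(x, y, x_star)  # all three agree up to numerical precision
```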
Why do these methods converge? Let's provide some justification. We will decrease $x_{t}$ by $\epsilon d_{t}$, where $d_{t}$ is normalized by $\left\vert d_{t}\right\vert _{p}:=\left( \sum_{i=1}^{n}\left\vert d_{t}^{i}\right\vert ^{p}\right) ^{1/p}=1$. At first order, we have
\begin{align*} F\left( x_{t}-\epsilon d_{t}\right) =F\left( x_{t}\right) -\epsilon d_{t}^{\intercal}\nabla F\left( x_{t}\right) +O\left( \epsilon^{2}\right). \end{align*}We need to maximize $d_{t}^{\intercal}\nabla F\left( x_{t}\right)$ over $\left\vert d_{t}\right\vert _{p}=1$.
For $p=2$, we get $d_{t}=\nabla F\left( x_{t}\right) /\left\vert \nabla F\left( x_{t}\right) \right\vert $
For $p=1$, we get $d_{t}^{i}=\mathrm{sign}\left( \partial F\left( x_{t}\right)/\partial x^{i}\right) $ if $\left\vert \partial F\left( x_{t}\right) /\partial x^{i}\right\vert =\max_{j}\left\vert \partial F\left( x_{t}\right) /\partial x^{j}\right\vert $, and $d_{t}^{i}=0$ otherwise.
In our context, gradient descent is
\begin{align*} u_{i}\left( t+1\right) & =u_{i}\left( t\right) -\epsilon\frac{\partial F}{\partial u_{i}}\left( u\left( t\right) ,v\left( t\right) \right) ,\text{ and }\\ v_{j}\left( t+1\right) & =v_{j}\left( t\right) -\epsilon\frac{\partial F}{\partial v_{j}}\left( u\left( t\right) ,v\left( t\right) \right) \end{align*}while coordinate descent is
\begin{align*} \frac{\partial F}{\partial u_{i}}\left( u_{i}\left( t+1\right) ,u_{-i}\left( t\right) ,v\left( t\right) \right) =0,\text{ and } \frac{\partial F}{\partial v_{j}}\left( u\left( t\right) ,v_{j}\left( t+1\right) ,v_{-j}\left( t\right) \right) =0. \end{align*}Gradient of objective function in version 1 of our problem:
\begin{align*} \left( p_{i}-\sum_{j}\exp\left( \frac{\Phi_{ij}-u_{i}-v_{j}}{\sigma}\right) ,q_{j}-\sum_{i}\exp\left( \frac{\Phi_{ij}-u_{i}-v_{j}}{\sigma}\right) \right) \end{align*}Gradient of objective function in version 2
\begin{align*} \left( p_{i}-\frac{\sum_{j}\exp\left( \frac{\Phi_{ij}-u_{i}-v_{j}}{\sigma }\right) }{\sum_{ij}\exp\left( \frac{\Phi_{ij}-u_{i}-v_{j}}{\sigma}\right) },q_{j}-\frac{\sum_{i}\exp\left( \frac{\Phi_{ij}-u_{i}-v_{j}}{\sigma}\right) }{\sum_{ij}\exp\left( \frac{\Phi_{ij}-u_{i}-v_{j}}{\sigma}\right) }\right) \end{align*}Coordinate descent on objective function in version 1:
\begin{align*} p_{i} & =\sum_{j}\exp\left( \frac{\Phi_{ij}-u_{i}\left( t+1\right) -v_{j}\left( t\right) }{\sigma}\right) ,\\ q_{j} & =\sum_{i}\exp\left( \frac{\Phi_{ij}-u_{i}\left( t\right) -v_{j}\left( t+1\right) }{\sigma}\right) \end{align*}that is
\begin{align*} \left\{ \begin{array} [c]{c} u_{i}\left( t+1\right) =\sigma\log\left( \frac{1}{p_{i}}\sum_{j}\exp\left( \frac{\Phi_{ij}-v_{j}\left( t\right) }{\sigma}\right) \right) \\ v_{j}\left( t+1\right) =\sigma\log\left( \frac{1}{q_{j}}\sum_{i}\exp\left( \frac{\Phi_{ij}-u_{i}\left( t\right) }{\sigma}\right) \right) \end{array} \right. \end{align*}this is called the Iterated Proportional Fitting Procedure (IPFP), or Sinkhorn's algorithm.
Coordinate descent on objective function in version 2 does not yield a closed-form expression.
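The gradient of version (V1) displayed above can also be turned into a working iteration directly. A toy sketch on synthetic data (our own example; the step size and iteration count are ad hoc):

```python
import numpy as np

# Gradient descent on version (V1) of the dual, on synthetic data
# (not the marriage dataset used in this lecture).
rng = np.random.default_rng(0)
nbX, nbY, sigma = 5, 3, 0.5
Phi = rng.normal(size=(nbX, nbY))
p = np.ones(nbX) / nbX
q = np.ones(nbY) / nbY

u, v = np.zeros(nbX), np.zeros(nbY)
eps = 0.2
for _ in range(20000):
    pi = np.exp((Phi - u[:, None] - v[None, :]) / sigma)
    # gradient of the (V1) objective, as displayed above
    u = u - eps * (p - pi.sum(axis=1))
    v = v - eps * (q - pi.sum(axis=0))

pi = np.exp((Phi - u[:, None] - v[None, :]) / sigma)
# at the minimum, both marginal conditions hold
print(np.allclose(pi.sum(axis=1), p, atol=1e-5),
      np.allclose(pi.sum(axis=0), q, atol=1e-5))
```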
Letting $a_{i}=\exp\left( -u_{i}/\sigma\right) $ and $b_{j}=\exp\left( -v_{j}/\sigma\right) $ and $K_{ij}=\exp\left( \Phi_{ij}/\sigma\right) $, one has $\pi_{ij}=a_{i}b_{j}K_{ij}$, and the procedure reexpresses as
\begin{align*} \left\{ \begin{array} [c]{l}% a_{i}\left( t+1\right) =p_{i}/\left( Kb\left( t\right) \right) _{i}\text{ and }\\ b_{j}\left( t+1\right) =q_{j}/\left( K^{\intercal}a\left( t\right) \right) _{j}. \end{array} \right. \end{align*}Because this algorithm involves matrix operations only, and is naturally suited for parallel computation, GPUs are a tool of choice for addressing it. See chap. 4 of Peyré and Cuturi.
Implementation. Let's implement this algorithm.
A quick note beforehand: Numpy doesn't exactly do what one would expect when broadcasting 1d arrays against 2d arrays (for example when adding vectors and matrices). Be careful here, because things can get messy. We advise you to convert all vectors (1d objects) into matrices with a single column, to make sure that broadcasting is done as intended.
Hence we define:
def two_d(X):
return np.reshape(X,(X.size, 1))
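To see the pitfall concretely, compare a 1d vector with an explicit column (a small illustration with synthetic shapes):

```python
import numpy as np

def two_d(X):  # as defined above
    return np.reshape(X, (X.size, 1))

u = np.arange(3)      # shape (3,): a 1d array
M = np.zeros((3, 3))

# A 1d array is aligned with the LAST axis: M + u adds u to every row.
# M + two_d(u) instead adds u[i] to row i: same shapes, different results.
print(np.array_equal(M + u, M + two_d(u)))   # False
print((M + two_d(u)).shape)                  # (3, 3)

# With a non-square matrix, the 1d version fails outright:
try:
    np.zeros((3, 4)) + u
except ValueError:
    print('shapes (3, 4) and (3,) do not broadcast')
```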
Returning to the matrix-IPFP algorithm:
## Matrix IPFP
ptm = time.time()
ite = 0
sigma = 0.1
K = np.exp(Phi/sigma)
B = two_d(np.repeat(1, nbY))
error = tol + 1
while error > tol and ite < maxite:
A = two_d(p/(K @ B).flatten(order='F'))
KA = (A.T @ K)
error = np.max(abs(np.multiply(KA,B.flatten()/q)-1))
B = (q / KA).T
ite = ite + 1
u = - sigma * np.log(A)
v = - sigma * np.log(B)
pi = (K * A) * np.repeat(B, nbX, axis = 1).T
val = np.sum(pi * Phi) - sigma * np.sum(pi * np.log(pi))
end = time.time() - ptm
if ite >= maxite:
print('Maximum number of iterations reached in IPFP1.')
else:
print(f'IPFP1 converged in {ite} steps and {end} s.')
print(f'Value of the problem (IPFP1) = {val}')
print(f'Sum(pi*Phi) (IPFP1) = {np.sum(pi*Phi)}')
print('*************************')
IPFP1 converged in 89 steps and 0.004000425338745117 s. Value of the problem (IPFP1) = 0.6045556509904391 Sum(pi*Phi) (IPFP1) = 0.4001284575694894 *************************
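A useful sanity check on any IPFP implementation: at a fixed point, $\pi$ must lie in $\mathcal{M}\left( p,q\right)$. A self-contained version on synthetic data (our own check, not part of the original notebook):

```python
import numpy as np

# Verify the marginals of the matrix-IPFP output on synthetic data.
rng = np.random.default_rng(0)
nbX, nbY, sigma = 5, 3, 0.5
Phi = rng.normal(size=(nbX, nbY))
p = np.ones((nbX, 1)) / nbX
q = np.ones((nbY, 1)) / nbY

K = np.exp(Phi / sigma)
b = np.ones((nbY, 1))
for _ in range(5000):
    a = p / (K @ b)
    b = q / (K.T @ a)

pi = a * K * b.T
# the column marginal is exact by construction (b was updated last);
# the row marginal matches p up to the convergence error
print(np.allclose(pi.sum(axis=0), q.flatten()),
      np.allclose(pi.sum(axis=1, keepdims=True), p))
```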
To see the benefit of the matrix version, let us recode the same algorithm as above, but in the log-domain, iterating directly over $u$ and $v$:
# log-domain IPFP
sigma = 0.01
ptm = time.time()
ite = 0
v = np.repeat(0, nbY)
mu = - sigma * np.log(p)
nu = - sigma * np.log(q)
error = tol + 1
while error > tol and ite < maxite:
u = mu + sigma * np.log(np.sum(np.exp((Phi - np.repeat(two_d(v), nbX, axis = 1).T)/sigma), axis=1))
KA = np.sum(np.exp((Phi - two_d(u)) / sigma), axis=0)
error = np.max(np.abs(KA * np.exp(-v / sigma) / q - 1))
v = nu + sigma * np.log(KA)
ite = ite + 1
pi = np.exp((Phi - two_d(u) - np.repeat(two_d(v), nbX, axis = 1).T) / sigma)
val = np.sum(pi * Phi) - sigma * np.sum(pi * np.log(pi))
end = time.time() - ptm
if ite >= maxite:
print('Maximum number of iterations reached in IPFP1_logs.')
else:
print(f'IPFP1_logs converged in {ite} steps and {end} s.')
print(f'Value of the problem (IPFP1_logs) = {val}')
print(f'Sum(pi*Phi) (IPFP1_logs) = {np.sum(pi * Phi)}')
print('*************************')
IPFP1_logs converged in 156 steps and 0.006832122802734375 s. Value of the problem (IPFP1_logs) = {val} Sum(pi*Phi) (IPFP1_logs) = 0.4109531251395014 *************************
We see that the log-domain IPFP, while mathematically equivalent to the matrix IPFP, is noticeably slower.
The matrix IPFP is very fast, partly due to the fact that it involves linear algebra operations only. However, it breaks down when $\sigma$ is small; this is best seen by taking a log transform and returning to $u^{k}=-\sigma\log a^{k}$ and $v^{k}=-\sigma\log b^{k}$, that is
\begin{align*} \left\{ \begin{array} [c]{l}% u_{i}^{k}=\mu_{i}+\sigma\log\sum_{j}\exp\left( \frac{\Phi_{ij}-v_{j}^{k-1}% }{\sigma}\right) \\ v_{j}^{k}=\zeta_{j}+\sigma\log\sum_{i}\exp\left( \frac{\Phi_{ij}-u_{i}^{k}% }{\sigma}\right) \end{array} \right. \end{align*}where $\mu_{i}=-\sigma\log p_{i}$ and $\zeta_{j}=-\sigma\log q_{j}$.
One sees what may go wrong: if $\Phi_{ij}-v_{j}^{k-1}$ is positive in the exponential in the first sum, then the exponential blows up due to the small $\sigma$ in the denominator. However, the log-sum-exp trick can be used in order to avoid this issue.
Consider
\begin{align*} \left\{ \begin{array} [c]{l}% \tilde{v}_{i}^{k}=\max_{j}\left\{ \Phi_{ij}-v_{j}^{k}\right\} \\ \tilde{u}_{j}^{k}=\max_{i}\left\{ \Phi_{ij}-u_{i}^{k}\right\} \end{array} \right. \end{align*}(the indexing is not a typo: $\tilde{v}$ is indexed by $i$ and $\tilde{u}$ by $j$).
One has
\begin{align*} \left\{ \begin{array} [c]{l}% u_{i}^{k}=\mu_{i}+\tilde{v}_{i}^{k-1}+\sigma\log\sum_{j}\exp\left( \frac {\Phi_{ij}-v_{j}^{k-1}-\tilde{v}_{i}^{k-1}}{\sigma}\right) \\ v_{j}^{k}=\zeta_{j}+\tilde{u}_{j}^{k}+\sigma\log\sum_{i}\exp\left( \frac {\Phi_{ij}-u_{i}^{k}-\tilde{u}_{j}^{k}}{\sigma}\right) \end{array} \right. \end{align*}and now the arguments of the exponentials are always nonpositive, ensuring the exponentials don't blow up.
Both the matrix version and the log-domain version of the IPFP will break down when $\sigma$ is small, e.g. $\sigma=0.001$ (Try!). However if we modify the second procedure using the log-sum-exp trick, things work again:
# IPFP with log-sum-exp trick
sigma = 0.001
ptm = time.time()
ite = 0
v = np.repeat(0, nbY)
mu = - sigma * np.log(p)
nu = - sigma * np.log(q)
error = tol + 1
uprec = -np.inf
while error > tol and ite < maxite:
vstar = np.max(Phi.T - two_d(v), axis = 0)
u = mu + vstar + sigma * np.log(np.sum(np.exp((Phi - np.repeat(two_d(v), nbX, axis = 1).T -
two_d(vstar))/sigma), axis=1))
error = np.max(abs(u - uprec))
uprec = u
ustar = np.max(Phi - two_d(u), axis = 0)
KA = np.sum(np.exp((Phi - two_d(u) - np.repeat(two_d(ustar), nbX, axis = 1).T) / sigma), axis=0)
v = nu + ustar + sigma * np.log(KA)
ite = ite + 1
pi = np.exp((Phi - two_d(u) - np.repeat(two_d(v), nbX, axis = 1).T) / sigma)
val = np.sum(pi * Phi) - sigma * np.sum(pi * np.log(pi))
end = time.time() - ptm
if ite >= maxite:
print('Maximum number of iterations reached in IPFP1_logs.')
else:
print(f'IPFP1_logs converged in {ite} steps and {end} s.')
print(f'Value of the problem (IPFP1_logs) = {val}')
print(f'Sum(pi*Phi) (IPFP1_logs) = {np.sum(pi *Phi)}')
print('*************************')
IPFP1_logs converged in 315 steps and 0.02502751350402832 s. Value of the problem (IPFP1_logs) = nan Sum(pi*Phi) (IPFP1_logs) = 0.41095352613797476 *************************
C:\Users\antoi\Anaconda3\lib\site-packages\ipykernel_launcher.py:23: RuntimeWarning: divide by zero encountered in log C:\Users\antoi\Anaconda3\lib\site-packages\ipykernel_launcher.py:23: RuntimeWarning: invalid value encountered in multiply
The nan in the value of the problem arises because, with $\sigma$ this small, some entries of $\pi$ underflow to exactly zero, so that $\pi_{ij}\ln\pi_{ij}$ evaluates to $0\times\left( -\infty\right)$; the term $\sum_{ij}\pi_{ij}\Phi_{ij}$ is still computed correctly.