This notebook visualizes the geometry between two views called epipolar geometry.
Subjects are covered:
Section 1 is meant as a theoretical underpinning of the methods used. It can be skipped by readers only interested in the algorithm's implementation. Many of the descriptions in this section I have taken, sometimes verbatim, from Multiple View Geometry in Computer Vision by Richard Hartley and Andrew Zisserman, epipolar geometry Wikipedia articles, and Computer Vision Algorithms and Applications by Richard Szeliski.
Why are we interested in these subjects?
Structure and depth are inherently ambiguous from single views.
Understanding stereo geometry will allow us to overcome this ambiguity and estimate many useful values, such as:
Further reading
For a comprehensive treatment of epipolar geometry, see chapters nine to eleven of the textbook Multiple view geometry in computer vision by Richard Hartley and Andrew Zisserman. For this subject, I would not recommend video lectures. The notations in many lectures was inconsistent, and I eventually found it was best understood through reading and drawing.
Epipolar geometry is the intrinsic geometry between two views. It is independent of scene structure and only depends on the cameras' internal parameters and relative pose. This geometry is essentially the geometry of the intersection of the two camera image planes with a pencil of planes having the baseline as axis. Here, the baseline is the line joining the camera centers. (images taken from Multiple View Geometry in Computer Vision).
$ \mathbf {C,C}'$ - The camera projection centers
$ \mathbf {X}$ - A 3D point
$ \mathbf {x,x'}$ - The 2D images of $X$
$ \mathbf {\pi}$ - The epipolar plane
$ \mathbf {e, e'}$ - The epipoles
$ \mathbf {l, l'}$ - The epipolar lines
$ \mathbf {P, P'}$ - The camera matrices
The fundamental matrix is the algebraic representation of epipolar geometry. That is the above variables and their interrelations are captured in this matrix.
Mostly simply:
An epipolar line is a projection in the second image of the ray from the point $ \mathbf {x}$ through the camera center $ \mathbf {C}$ of the first camera. Thus, there is a map
$$ \mathbf {x \rightarrow l'}$$This mapping is a projective mapping from points to lines, which can be represented by the Fundamental matrix $\mathbf{F}$.
$$ \mathbf{l'} = \mathbf{Fx} $$Because each corresponding point $\mathbf{x'}$ lies on $\mathbf{l'}$ and has a dot product of 0, $\mathbf{F}$ defines the constraint
$$ \mathbf{x'} ^{\mathbf{T}} \mathbf{F} \mathbf{x} = 0$$This derivation is taken from the book Multiple View Geometry in Computer Vision.
The ray back-projected from $\mathbf{x}$ is obtained by solving $\mathbf{PX = x}$. The one-parameter family of solutions is of the form
$$ \mathbf{X}(\lambda) = \mathbf{P^{+} x} + \lambda \mathbf{C}$$Keep in mind these are homogenous coordinates, meaning the last coordinate is rescaled to 1. Essentially, we are choosing a point on a line segment between camera center $\mathbf{C}$ and 3D point $\mathbf{P^{+} x}$. In particular, two 3D points on the ray are $\mathbf{P^{+} x}$ at $\lambda = 0$, and $\mathbf{C}$ at $\lambda = \infty$. Here $\mathbf{P^{+}}$ is the psuedo-inverse of $\mathbf{P}$, and $\mathbf{C}$ is its null-vector, namely the camera center, defined by $\mathbf{PC}=\mathbf{0}$. These two points are imaged by the second camera $\mathbf{P'}$ at $\mathbf{P'}\mathbf{P^{+} x}$ and $\mathbf{P'C}$ respectively. The epipolar line is the line joining these two projeted points, namely $\mathbf{l'} = (\mathbf{P'C})\times (\mathbf{P'}\mathbf{P^{+} x})$. The point $\mathbf{P'C}$ is the epipole in the second image, namely the projection of the first camera center, and may be denoted by $\mathbf{e'}$. Thus, $\mathbf{l'} = [\mathbf{e'}]_{\times} (\mathbf{P'}\mathbf{P^{+}) x} = \mathbf{Fx} $, where $\mathbf{F}$ is the matrix
$$ \mathbf{F} = [\mathbf{e'}]_{\times} (\mathbf{P'}\mathbf{P^{+})} $$Thus formalizing $ \mathbf {x \rightarrow l'}$ to the equation
$$ \mathbf{l'} = \mathbf{Fx}$$Given that any point $w$ on $\mathbf{l'}$ will have the dot product $w \cdot \mathbf{l'} = 0$, we can infer the following: The fundamental matrix satisfies the condition that for any pair of corresponding points $\mathbf{x} \leftrightarrow \mathbf{x'}$ in the two images
$$ \mathbf{x'} ^{\mathbf{T}} \mathbf{F} \mathbf{x} = 0$$The importance of the relation is that it gives a way of characterizing the fundamental matrix without reference to the camera matrices, i.e. only in terms of corresponding image points. This enables $\mathbf{F}$ to be computed from image correspondences alone. To compute matrix $\mathbf{F}$ at least $7$ correspondences are required.
$\mathbf{F^{T}}$ is the fundamental matrix of the pair in the opposite order $(\mathbf{P'}, \mathbf{P})$.
contains epipole $\mathbf{e'}$
The essential matrix is the specialization of the fundamental matrix to the case of normalized image coordinates. Historically, the essential matrix was introduced (by Longuet-Higgins) before the fundamental matrix, and the fundamental matrix may be thought of as the generalization of the essential matrix in which the (inessential) assumption of calibrated cameras is removed. The essential matrix is defined as
$$ \mathbf{E} = \mathbf{K^{'T}FK}$$The relation between the essential matrix and corresponding points is essentially ( ;] ) the same as for the fundamental matrix.
$$ \mathbf{x'} ^{\mathbf{T}} \mathbf{F} \mathbf{x} = 0$$$$ \mathbf{x'} \mathbf{K^{'-T}K^{'T}FKK^{-1}} \mathbf{x} = 0$$$$ \mathbf{y'} \mathbf{K^{'T}FK} \mathbf{y} = 0 $$$$ \mathbf{y'} \mathbf{E} \mathbf{y} = 0 $$Where $\mathbf{y}$ and $\mathbf{y'}$ are the normalized coordinates. These can be thought of as the image plane coordinates for a camera with the identity matrix as camera intrinsics.
We will now visualize epipolar geometry by plotting the epipolar lines for stereo point correspondences.
For this demonstration, we use a stereo setup from notebook 1.
from notebook_functions import (init_3d_plot,
plot_chessboard,
plot_camera_wireframe,
plot_picture,
project_points_to_picture,
object_points,
images,
get_stereo_setup_with_correspondences,
triangulate)
import ipyvolume as ipv
import numpy as np
import cv2
import matplotlib.pyplot as plt
images, extrinsics, cam_centers, intrinsics, match_coords, object_points = get_stereo_setup_with_correspondences()
init_3d_plot()
plot_chessboard(object_points)
vis_scale = 5
for idx, (cam_center, extrinsic, image) in enumerate(zip(cam_centers, extrinsics, images)):
inv_extrinsic = np.linalg.inv(extrinsic)
inv_intrinsics = np.linalg.inv(intrinsics)
plot_camera_wireframe(cam_center, vis_scale, inv_extrinsic)
plot_picture(image, inv_extrinsic, vis_scale)
ipv.show()
We will calculate the fundamental matrix using $ \mathbf{F} = [\mathbf{e'}]_{\times} (\mathbf{P'}\mathbf{P^{+})} $ and $\mathbf{e'} = \mathbf{P'C}$.
Where $ [T_{\times}] = \begin{pmatrix} 0 & -t_z & t_y\\ t_z & 0 & -t_x\\ -t_y & t_x & 0 \end{pmatrix} $
extrinsic_1, extrinsic_2 = extrinsics
extrinsic_1_orig = extrinsic_1
extrinsic_2_orig = extrinsic_2
cam_center_1, cam_center_2 = cam_centers
cam_center_1_orig = cam_center_1
cam_center_2_orig = cam_center_2
proj_1 = intrinsics[:3, :3] @ extrinsic_1[:3, :4]
proj_2 = intrinsics[:3, :3] @ extrinsic_2[:3, :4]
pinv_proj_1 = np.linalg.pinv(proj_1)
pinv_proj_2 = np.linalg.pinv(proj_2)
e_1 = proj_1 @ cam_center_2[:4]
e_2 = proj_2 @ cam_center_1[:4]
e_1 = e_1 / e_1[2]
e_2 = e_2 / e_2[2]
x, y, z = e_2
cross_e_2 = np.array([[ 0, -z, y],
[ z, 0, -x],
[-y, x, 0]])
F = cross_e_2 @ (proj_2 @ pinv_proj_1)
print(f'Rank of F is very near 2 (ignoring numerical rounding errors): {np.isclose(np.linalg.svd(F)[1][2], 0)}')
print(f'Fe≈0 and F trans e\'≈ 0: {np.all(F @ e_1 < 0.0001), np.all(F.T @ e_2 < 0.0001)}')
The epipolar lines will now be drawn using the equation $ \mathbf{l'} = \mathbf{Fx}$ and $ \mathbf{l} = \mathbf{F^Tx'}$.
def draw_homogenous_line(image, line):
""" Draws a line represented by homogenous coordinates onto an image.
Args:
image (np.ndarray): An image array.
line (np.ndarray): A 2D line represented by homgenous coordinates.
Returns:
image (np.ndarray): The image with the line drawn on it.
"""
height, width, _ = image.shape
a, b, c = line
y_at_x0 = int(-c/b)
y_at_xwidth = int((-a * width - c)/b)
image = cv2.line(image, (0, y_at_x0), (width, y_at_xwidth), [255, 255, 255], 5)
return image
def plot_epipolar_lines(image_1, image_2, points_1, points_2, F):
""" Plots point correspondences along with epipolar lines.
Args:
image_{1|2} (np.ndarray): The stereo images.
points_{1|2} (np.ndarray): The stereo point correspondences.
F (no.ndarray): The fundamental matrix.
"""
for point_1, point_2 in zip(points_1, points_2):
line_1 = F.T @ point_2[:3]
line_2 = F @ point_1[:3]
image_1 = cv2.circle(image_1, point_1[:2].astype(int), 15, [255,255,255], -1)
image_2 = cv2.circle(image_2, point_2[:2].astype(int), 15, [255,255,255], -1)
image_1 = draw_homogenous_line(image_1, line_1)
image_2 = draw_homogenous_line(image_2, line_2)
image_concat = cv2.hconcat([image_1, image_2])
image_concat = cv2.cvtColor(image_concat, cv2.COLOR_BGR2RGB)
plt.figure(figsize=(20,20))
plt.imshow(image_concat)
plt.show()
im_1_points = match_coords[0]
im_2_points = match_coords[1]
image_1 = images[0]
image_2 = images[1]
plot_epipolar_lines(image_1.copy(), image_2.copy(), im_1_points, im_2_points, F)
The 8 point algorithm is a straightforward approach for calculating the Fundamental matrix $\mathbf{F}$.
*-See the appendix for an intuitive explanation and proof of why:
$
\begin{align}
\mathbf{x'} ^{\mathbf{T}} \mathbf{F} \mathbf{x} &= 0 \\
\begin{bmatrix}
x' & y' & w'
\end{bmatrix}
\begin{bmatrix}
f_1 & f_2 & f_3 \\
f_4 & f_5 & f_6 \\
f_7 & f_8 & f_9
\end{bmatrix}
\begin{bmatrix}
x \\
y \\
w
\end{bmatrix} &= 0 \\
x' x f_1 + x' y f_2 + x' w f_3 \
&= \begin{bmatrix} 0 \end{bmatrix} \end{align} $
The last equation is a linear homogeneous equation of form $\mathbf{A}b = 0$. Adding $7$ more constraints to $\mathbf{A}$ allows us to compute $b$. In this case, $b$ is the rolled-out fundamental matrix $\mathbf{F}$.
# Linear solution
constraint_matrix = list()
for point_1, point_2 in zip(im_1_points, im_2_points):
x1, y1, w1, _ = point_1
x2, y2, w2, _ = point_2
constraint = [x2*x1, x2*y1, x2*w1, y2*x1, y2*y1, y2*w1, w2*x1, w2*y1, w2*w1]
constraint_matrix.append(constraint)
constraint_matrix = np.array(constraint_matrix)
u, s, v_t = np.linalg.svd(constraint_matrix)
least_squares = v_t[-1] # last column of V
F = least_squares.reshape((3, 3))
# Enforce rank 2
u, s, v_t = np.linalg.svd(F)
s = np.diag(s)
s[2, 2] = 0
F = u @ s @ v_t
plot_epipolar_lines(image_1.copy(), image_2.copy(), im_1_points, im_2_points, F)
For a proof of this result refer to 9.6.2 in Multiple View Geometry in Computer Vision and this Wikipedia article. The proof is out-of-scope and would require too much space.
For a given essential matrix $\mathbf{E}$ with SVD decomposition $\mathbf{E} = \mathbf{U} \text{diag}(1,1,0)\mathbf{V^T}$ and first camera matrix $\mathbf{P} = [\mathbf{I} | \mathbf{0}]$, there are our possible choices for the second camera $\mathbf{P'}$, namely
$$ \mathbf{P'} = [\mathbf{UWV^T} | +\mathbf{u_3}] \hspace{0.5em} \text{ or } \hspace{0.5em} [\mathbf{UWV^T} | -\mathbf{u_3}] \hspace{0.5em} \text{ or } \hspace{0.5em} [\mathbf{UW^TV^T} | +\mathbf{u_3}] \hspace{0.5em} \text{ or } \hspace{0.5em} [\mathbf{UW^TV^T} | -\mathbf{u_3}]$$Where $\mathbf{W} = \begin{bmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix} $
This 4 solution ambiguity can be reduced to a single solution. This is the solution that places all points in front of the cameras after triangulation.
We compute all 4 solutions and check which solution has points with only positive z coordinates relative to the camera.
def decompose_essential(essential):
""" Decomposes an essential matrix into a relative camera pose.
Args:
essential (np.ndarray): The essential matrix.
Returns:
pose (np.ndarray): The extrinsic matrix of the second camera.
"""
w = np.array([[0, -1, 0],
[1, 0, 0],
[0, 0, 1]])
u, s, v_t = np.linalg.svd(essential)
# Translation hypothesis
u_3 = u[:, 2, np.newaxis]
# Rotation hypothesis
rot_a = u @ w @ v_t
rot_b = u @ w.T @ v_t
rot_a = -rot_a if np.linalg.det(rot_a) < 0 else rot_a
rot_b = -rot_b if np.linalg.det(rot_b) < 0 else rot_b
# 4 motion hypothoses as extrinsic matrices
extr_2_a = np.hstack([rot_a, u_3])
extr_2_b = np.hstack([rot_a, -u_3])
extr_2_c = np.hstack([rot_b, u_3])
extr_2_d = np.hstack([rot_b, -u_3])
# Assume cam 1 = [I | 0]
extr_1 = np.array([[1, 0, 0, 0],
[0, 1, 0, 0],
[0, 0, 1, 0]])
cam_center_1 = np.zeros(3)
inv_intrinsics = np.linalg.inv(intrinsics[:3, :3])
inv_extrinsic_1 = np.linalg.pinv(extr_1)
vec_in_cam_ref_1 = inv_intrinsics @ im_1_points.T[:3]
vec_in_world_1 = inv_extrinsic_1 @ vec_in_cam_ref_1
pose = None
# Triangulate points for each of the 4 solutions
for extr_2 in [extr_2_a, extr_2_b, extr_2_c, extr_2_d]:
cam_center_2 = -extr_2[:3, :3] @ extr_2[:, 3].T
inv_extrinsic_2 = np.linalg.pinv(extr_2)
vec_in_cam_ref_2 = inv_intrinsics @ im_2_points.T[:3]
vec_in_world_2 = inv_extrinsic_2 @ vec_in_cam_ref_2
rel_zs = np.array([])
for vec_1, vec_2 in zip(vec_in_world_1.T, vec_in_world_2.T):
triangulated, _, _ = triangulate(cam_center_1, vec_1, cam_center_2, vec_2)
triangulated = np.append(triangulated, 1)
x1, y1, z1 = extr_1 @ triangulated # Triangulated point in camera 1 ref
x2, y2, z2 = extr_2 @ triangulated # Triangulated point in camera 2 ref
rel_zs = np.append(rel_zs, [z1, z2])
if (rel_zs > 0).all():
pose = extr_2
return pose
The ground truth pose of the second camera is shown in red.
The inferred pose of the second camera is shown in blue.
As can be seen, the pose can be very decently inferred from point correspondences.
The careful reader will notice that the relative translation is artificially scaled by a "magic" number.
This is because the essential matrix has no information on the scale of the relative camera translation.
For details on why this is, see this post.
We do however know the scale of the scene due to the chessboard calibration object, which is how I manually found the translation scale 8.6.
init_3d_plot()
plot_chessboard(object_points)
plot_camera_wireframe(cam_center_2_orig, vis_scale, np.linalg.inv(extrinsic_2_orig), color='red')
essential = intrinsics[:3, :3].T @ F @ intrinsics[:3, :3]
rel_extrinsic_2 = decompose_essential(essential)
# Convert relative pose from essential matrix decomposition to absolute poses.
rel_extrinsic_2[:, 3] *= 8.6
extrinsic_2 = rel_extrinsic_2 @ extrinsic_1
cam_center_2 = -extrinsic_2[:3, :3].T @ extrinsic_2[:3, 3]
extrinsic_2 = np.vstack([extrinsic_2, [0, 0, 0, 1]])
extrinsics[1] = extrinsic_2
cam_centers[1] = cam_center_2
for idx, (cam_center, extrinsic, image) in enumerate(zip(cam_centers, extrinsics, images)):
inv_extrinsic = np.linalg.pinv(extrinsic)
inv_intrinsics = np.linalg.inv(intrinsics)
plot_camera_wireframe(cam_center, vis_scale, inv_extrinsic)
plot_picture(image, inv_extrinsic, vis_scale)
ipv.show()
In this notebook we
Having an intuition of how epipolar geometry works will serve us well.
It is both a tool for problem-solving and a theoretical basis for many computer vision algorithms.
This section gives an intuitive proof of why for a homogenous equation (or data) matrix $\mathbf{A}$ in $\mathbf{A}b = 0$, the last column of the SVD's $\mathbf{V}$ matrix is the least-squares solution. For a homogenous least-squares problem, we are interested in finding a parameter vector $b$ such that
$$ \mathbf{A}b = 0 $$subject to the constraint $ \lVert b \rVert ^{2}=1$ to avoid the arbitrary solution. Because there may be no exact solution, this is phrased as a minimization problem.
$$ \arg \underset{b}{\min} \lVert \mathbf{A}b \rVert^2_2 $$The SVD decomposes a matrix into two orthonormal matrices $ \mathbf {U, V^{T}} $ and one diagonal matrix with non-negative real numbers $\mathbf{\Sigma}$. Interpreting the SVD as a PCA, $\mathbf{V}$ encodes the principal axes of the data and $\mathbf{\Sigma}$ the amount of variance of the data when projected onto these principal axes in descending order, i.e. $\sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_n$(for details on the SVD $\leftrightarrow$PCA connection, see this Princeton tutorial).
$$ \mathbf{A} \rightarrow \mathbf {U\Sigma V^{T}} $$Matrix multiplication with an orthonormal matrix does not affect the norm of a vector, thus we can remove $\mathbf{U}$ from the equation.
$$ \arg \underset{b}{\min} \lVert \mathbf {U \Sigma V^{T}}b \rVert = \arg \underset{b}{\min} \lVert \mathbf {\Sigma V^{T}}b \rVert$$We further substitute with $y = \mathbf{V^{T}} b$, with constraint $ \lVert y \rVert ^{2}=1$ due to orthonormality of $\mathbf{V^{T}}$.
$$ \arg \underset{b}{\min} \lVert \mathbf {\Sigma V^{T}}b \rVert =arg \underset{y}{\min} \lVert \mathbf {\Sigma} y \rVert $$Because the singular values $\sigma$ in $\mathbf{\Sigma}$ encode the variance in descending order, the optimal $y$ is trivially $(0,\dots,0,1)^T$. $$ y = (0,\dots,0,1)^T$$ $$ \mathbf{V^{T}} b = (0,\dots,0,1)^T$$ $$ \mathbf{V V^{T}}b = \mathbf{V} (0,\dots,0,1)^T $$ $$ b = \mathbf{V} (0,\dots,0,1)^T $$ $$ b = \text{last column of } \mathbf{V} $$ making the optimal $b$ the last column of $\mathbf{V}$.
This proof was taken from this Quora post by Samarth Brahmbhatt. Additional references to other proofs have been added where needed.
A fundamental matrix is given by the equation $ \mathbf{x'}^T \mathbf{F} \mathbf{x} = 0 $. Now consider an epipolar line $ l' = \mathbf{F} \mathbf{x} $. The right epipole $ \mathbf{e'} $ lies on this line, so $ \mathbf{e'}^T l' = 0 $ or $ \mathbf{e'} ^T \mathbf{F} \mathbf{x} = 0 $ for all $ \mathbf{x} $. This implies that $ \mathbf{e'} ^T \mathbf{F} = 0 $. Similarly, one can prove that $ \mathbf{F} \mathbf{e} = 0 $. Hence $ \mathbf{F} $ has a null space which is not just the zero vector. So $ \mathbf{F} $ is rank deficient.
The proof that $ \mathbf{F} $ has of rank exactly 2 comes from the fact that $ \mathbf{F} $ is constructed from the essential matrix $ \mathbf{E}$:
$$ \mathbf{F} = (\mathbf{K'^{-1}})^T \mathbf{E} \mathbf{K^{-1}} $$where the $ \mathbf{K} $'s are intrinsic matrices of the two cameras. Now, $ \mathbf{E} = [T_{\times}]R $ where $ R $ is the rotation matrix relating the two camera co-ordinate systems and $ [T_{\times}] = \begin{pmatrix} 0 & -t_z & t_y\\ t_z & 0 & -t_x\\ -t_y & t_x & 0 \end{pmatrix} $. For a proof of this identity, see the wikipedia page. A little bit of manipulation will show that one column of $ [T_{\times}] $ is a linear combination of the other two columns. So $ [T_{\times}] $ has rank 2. For a proof of this see this row reduced solution.
Hence any matrix that you construct by multiplying other matrices with $ [T_{\times}] $ (such as $ \mathbf{E} $ and $ \mathbf{F} $) will also have at most rank 2.