Abstract: The relationship between physical systems and intelligence has long fascinated researchers in computer science and physics. This talk explores fundamental connections between thermodynamic systems and intelligent decision-making through the lens of free energy principles.
We examine how concepts from statistical mechanics - particularly the relationship between total energy, free energy, and entropy - might provide novel insights into the nature of intelligence and learning. By drawing parallels between physical systems and information processing, we consider how measurement and observation can be viewed as processes that modify available energy. The discussion encompasses how model approximations and uncertainties might be understood through thermodynamic analogies, and explores the implications of treating intelligence as an energy-efficient state-change process.
While these connections remain speculative, they offer intriguing perspectives for discussing the fundamental nature of intelligence and learning systems. The talk aims to stimulate discussion about these potential relationships rather than present definitive conclusions.
import notutils as nu
nu.display_google_book(id='3yRVAAAAcAAJ', page='PP7')
Figure: Daniel Bernoulli’s Hydrodynamica published in 1738. It was one of the first works to use the idea of conservation of energy. It used Newton’s laws to predict the behaviour of gases.
Daniel Bernoulli described a kinetic theory of gases, but it wasn’t until 170 years later that these ideas were confirmed, after Einstein proposed a model of Brownian motion that was experimentally verified by Jean Baptiste Perrin.
import notutils as nu
nu.display_google_book(id='3yRVAAAAcAAJ', page='PA200')
Figure: Daniel Bernoulli’s chapter on the kinetic theory of gases, for a review on the context of this chapter see Mikhailov (n.d.). For 1738 this is extraordinary thinking. The notion of kinetic theory of gases wouldn’t become fully accepted in Physics until 1908 when a model of Einstein’s was verified by Jean Baptiste Perrin.
import numpy as np
p = np.random.randn(10000, 1)
xlim = [-4, 4]
x = np.linspace(xlim[0], xlim[1], 200)
y = 1/np.sqrt(2*np.pi)*np.exp(-0.5*x*x)
import matplotlib.pyplot as plt
import mlai.plot as plot
import mlai
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
ax.plot(x, y, 'r', linewidth=3)
ax.hist(p, 100, density=True)
ax.set_xlim(xlim)
mlai.write_figure('gaussian-histogram.svg', directory='./ml')
Another important figure for Cambridge was the first to derive the probability distribution that results from small balls banging together in this manner. In doing so, James Clerk Maxwell founded the field of statistical physics.
Figure: James Clerk Maxwell (1831-1879), who derived the distribution of velocities of particles in an ideal gas (elastic fluid).
Figure: James Clerk Maxwell (1831-1879), Ludwig Boltzmann (1844-1906) and Josiah Willard Gibbs (1839-1903)
Many of the ideas of early statistical physicists were rejected by a cadre of physicists who didn’t believe in the notion of a molecule. The stress of trying to have his ideas established caused Boltzmann to commit suicide in 1906, only two years before the same ideas became widely accepted.
import notutils as nu
nu.display_google_book(id='Vuk5AQAAMAAJ', page='PA373')
Figure: Boltzmann’s paper Boltzmann (n.d.) which introduced the relationship between entropy and probability. A translation with notes is available in Sharp and Matschinsky (2015).
The important point about the uncertainty being represented here is that it is not genuine stochasticity, it is a lack of knowledge about the system. The techniques proposed by Maxwell, Boltzmann and Gibbs allow us to exactly represent the state of the system through a set of parameters that represent the sufficient statistics of the physical system. We know these values as the volume, temperature, and pressure. The challenge for us, when approximating the physical world with the techniques we will use, is that we will have to sit somewhere between the deterministic and purely stochastic worlds that these different scientists described.
One ongoing characteristic of people who study probability and uncertainty is the confidence with which they hold opinions about it. Another leader of the Cavendish laboratory expressed his support of the second law of thermodynamics (which can be proven through the work of Gibbs/Boltzmann) with an emphatic statement at the beginning of his book.
Figure: Eddington’s book on the Nature of the Physical World (Eddington, 1929)
The same Eddington is also famous for dismissing the ideas of a young Chandrasekhar, who had come to Cambridge to study in the Cavendish lab. Chandrasekhar demonstrated the limit at which a star would collapse under its own weight to a singularity, but when he presented the work, Eddington was dismissive, suggesting that there “must be some natural law that prevents this abomination from happening”.
Figure: Chandrasekhar (1910-1995) derived the limit at which a star collapses in on itself. Eddington’s confidence in the 2nd law may have been what drove him to dismiss Chandrasekhar’s ideas, humiliating a young scientist who would later receive a Nobel prize for the work.
Figure: Eddington makes his feelings about the primacy of the second law clear. This primacy is perhaps because the second law can be demonstrated mathematically, building on the work of Maxwell, Gibbs and Boltzmann. Eddington (1929)
Presumably he meant that the creation of a black hole seemed to transgress the second law of thermodynamics. Later Hawking was able to show that black holes do evaporate, but the time scales at which this evaporation occurs are many orders of magnitude slower than other processes in the universe.
Maxwell’s demon is a thought experiment described by James Clerk Maxwell in his book, Theory of Heat (Maxwell, 1871) on page 308.
But if we conceive a being whose faculties are so sharpened that he can follow every molecule in its course, such a being, whose attributes are still as essentially finite as our own, would be able to do what is at present impossible to us. For we have seen that the molecules in a vessel full of air at uniform temperature are moving with velocities by no means uniform, though the mean velocity of any great number of them, arbitrarily selected, is almost exactly uniform. Now let us suppose that such a vessel is divided into two portions, A and B, by a division in which there is a small hole, and that a being, who can see the individual molecules, opens and closes this hole, so as to allow only the swifter molecules to pass from A to B, and only the slower ones to pass from B to A. He will thus, without expenditure of work, raise the temperature of B and lower that of A, in contradiction to the second law of thermodynamics.
James Clerk Maxwell in Theory of Heat (Maxwell, 1871) page 308
He goes on to say:
This is only one of the instances in which conclusions which we have drawn from our experience of bodies consisting of an immense number of molecules may be found not to be applicable to the more delicate observations and experiments which we may suppose made by one who can perceive and handle the individual molecules which we deal with only in large masses.
import notutils as nu
nu.display_google_book(id='0p8AAAAAMAAJ', page='PA308')
Figure: Maxwell’s demon was designed to highlight the statistical nature of the second law of thermodynamics.
Figure: Maxwell’s Demon. The demon decides balls are either cold (blue) or hot (red) according to their velocity. Balls are allowed to pass the green membrane from right to left only if they are cold, and from left to right, only if they are hot.
Maxwell’s demon allows us to connect thermodynamics with information theory (see e.g. Hosoya et al. (2015); Hosoya et al. (2011); Bub (2001); Brillouin (1951); Szilard (1929)). The connection arises due to a fundamental connection between information erasure and energy consumption Landauer (1961).
Alemi and Fischer (2019)
Information theory provides a mathematical framework for quantifying information. Many of information theory’s core concepts parallel those found in thermodynamics. The theory was developed by Claude Shannon, who spoke extensively with MIT’s Norbert Wiener while it was in development (Conway and Siegelman, 2005). Wiener’s own ideas about information were inspired by Willard Gibbs, one of the pioneers of the mathematical understanding of free energy and entropy. Deep connections between physical systems and information processing have connected information and energy from the start.
Shannon’s entropy measures the uncertainty or unpredictability of information content. This mathematical formulation is inspired by thermodynamic entropy, which describes the dispersal of energy in physical systems. Both concepts quantify the number of possible states and their probabilities.
Figure: Maxwell’s demon thought experiment illustrates the relationship between information and thermodynamics.
In thermodynamics, free energy represents the energy available to do work. A system naturally evolves to minimize its free energy, finding equilibrium between total energy and entropy. Free energy principles are also pervasive in variational methods in machine learning. They emerge from Bayesian approaches to learning and have been heavily promoted by e.g. Karl Friston as a model for the brain.
The relationship between entropy and Free Energy can be explored through the Legendre transform. This is most easily reviewed if we restrict ourselves to distributions in the exponential family.
The exponential family has the form
$$\rho(Z) = h(Z) \exp\left(\theta^\top T(Z) - A(\theta)\right),$$
where $h(Z)$ is the base measure, $\theta$ are the natural parameters, $T(Z)$ are the sufficient statistics and $A(\theta)$ is the log-partition function.
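As a concrete illustration (a sketch added here, not part of the original talk), the Bernoulli distribution sits in this family with natural parameter equal to the log odds and log-partition function $A(\theta) = \log(1 + e^\theta)$:

import numpy as np

def bernoulli_exponential_family(z, theta):
    """Bernoulli written in exponential family form: h(z) = 1, T(z) = z."""
    A = np.log(1 + np.exp(theta))      # log-partition function
    return np.exp(theta * z - A)

p = 0.3
theta = np.log(p / (1 - p))            # natural parameter is the log odds
print(bernoulli_exponential_family(1, theta))  # recovers p = 0.3
print(bernoulli_exponential_family(0, theta))  # recovers 1 - p = 0.7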
In machine learning and Bayesian inference, the Markov blanket of a variable is the minimal set of variables that, once conditioned on, renders the variable of interest independent of all the remaining variables. To introduce this idea into our information system, we first split the system into two parts, the variables, X, and the memory M.
The variables are the portion of the system that is stochastically evolving over time. The memory is a low entropy partition of the system that will give us knowledge about this evolution.
We can now write the joint entropy of the system in terms of the mutual information between the variables and the memory,
$$S(Z) = S(X,M) = S(X|M) + S(M) = S(X) - I(X;M) + S(M).$$
If M is viewed as a measurement then the change in entropy of the system before and after measurement is given by $S(X|M) - S(X)$, which is $-I(X;M)$. This implies that measurement increases the amount of available energy we can obtain from the system (Parrondo et al., 2015).
The difference in available energy is given by
$$\Delta A = A(X) - A(Z|M) = I(X;M).$$
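As a quick numerical check (an illustrative sketch, not part of the original material), we can verify the entropy decomposition above for a toy joint distribution over X and M:

import numpy as np

# Toy joint distribution rho(x, m) over two binary variables (rows: x, columns: m)
rho = np.array([[0.4, 0.1],
                [0.1, 0.4]])
rho_x = rho.sum(axis=1)   # marginal over x
rho_m = rho.sum(axis=0)   # marginal over m

def H(p):
    """Shannon entropy in bits."""
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

S_XM = H(rho.flatten())                # joint entropy S(X, M)
I_XM = H(rho_x) + H(rho_m) - S_XM      # mutual information I(X; M)

# Check S(X, M) = S(X) - I(X; M) + S(M)
print(S_XM, H(rho_x) - I_XM + H(rho_m))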
In the game of 20 Questions player one (Alice) thinks of an object and player two (Bob) must identify it by asking at most 20 yes/no questions. The optimal strategy is to divide the possibility space in half with each question. The binary search approach ensures maximum information gain with each inquiry and can distinguish $2^{20}$, or about a million, different objects.
Figure: The optimal strategy in the Entropy Game resembles a binary search, dividing the search space in half with each question.
From an information-theoretic perspective, decisions can be taken in a way that efficiently reduces entropy, our uncertainty about the state of the world. Each observation or action an intelligent agent takes should maximize expected information gain, optimally reducing uncertainty given available resources.
The entropy before the question is $S(X)$. The entropy after the question is $S(X|M)$. The information gain is the difference between the two, $I(X;M) = S(X) - S(X|M)$. Optimal decision making systems maximize this information gain per unit cost.
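A small simulation (illustrative only, assuming a uniform prior over $2^{20}$ objects) shows that each ideal yes/no question removes exactly one bit of entropy:

import numpy as np

remaining = 2**20            # assumed size of the possibility space
entropy = np.log2(remaining)

for question in range(1, 21):
    remaining = remaining // 2               # an ideal question halves the space
    new_entropy = np.log2(remaining)
    info_gain = entropy - new_entropy        # I(X; M) = S(X) - S(X|M)
    entropy = new_entropy
    print(f"Question {question}: entropy {entropy:5.1f} bits, gain {info_gain:.1f} bit")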
The entropy game connects decision-making to thermodynamics.
This perspective suggests a profound connection: intelligence might be understood as a special case of systems that efficiently extract, process, and utilize free energy from their environments, with thermodynamic principles setting fundamental constraints on what’s possible.
The second law of thermodynamics was generalised to include the effect of measurement by Sagawa and Ueda (Sagawa and Ueda, 2008). They showed that the maximum extractable work from a system can be increased by $k_B T I(X;M)$, where $k_B$ is Boltzmann’s constant, $T$ is temperature and $I(X;M)$ is the information gained by making a measurement, $M$,
$$I(X;M) = \sum_{x,m} \rho(x,m) \log \frac{\rho(x,m)}{\rho(x)\rho(m)}.$$
The measurements can be seen as a thermodynamic process. In theory measurement, like computation, is reversible. In practice the process of measurement is likely to erode the free energy somewhat, but as long as the energy gained from information, $k_B T I(X;M)$, is greater than that spent in measurement, the process can be thermodynamically efficient.
The modified second law shows that the maximum additional extractable work is proportional to the information gained. So information acquisition creates extractable work potential. Thermodynamic consistency is maintained by properly accounting for information-entropy relationships.
Sagawa and Ueda extended this relationship to provide a generalised Jarzynski equality for feedback processes (Sagawa and Ueda, 2010). The Jarzynski equality is an important result from nonequilibrium thermodynamics that relates the average work done across an ensemble to the free energy difference between initial and final states (Jarzynski, 1997),
$$\left\langle \exp\left(-\frac{W}{k_B T}\right) \right\rangle = \exp\left(-\frac{\Delta F}{k_B T}\right).$$
Sagawa and Ueda introduce an efficacy term, $\gamma$, that captures the effect of feedback on the system. In the presence of feedback they show
$$\left\langle \exp\left(-\frac{W}{k_B T}\right) \exp\left(\frac{\Delta F}{k_B T}\right) \right\rangle = \gamma.$$
When viewing M as an information channel between past and future states, Shannon’s channel coding theorems apply (Shannon, 1948). The channel capacity $C$ represents the maximum rate of reliable information transmission,
$$C = \max_{\rho(M)} I(X_1; M),$$
and for a memory of $n$ bits we have
$$C \leq n,$$
as the mutual information is upper bounded by the entropy of $\rho(M)$, which is at most $n$ bits.
This relationship seems to align with Ashby’s Law of Requisite Variety (pg 229 Ashby (1952)), which states that a control system must have at least as much ‘variety’ as the system it aims to control. In the context of memory systems, this means that to maintain temporal correlations effectively, the memory’s state space must be at least as large as the information content it needs to preserve. This provides a lower bound on the necessary memory capacity that complements the bound we get from Shannon for channel capacity.
This helps determine the required memory size for maintaining temporal correlations, optimal coding strategies, and fundamental limits on temporal correlation preservation.
Intelligent systems must balance measurement against energy efficiency and time requirements. A perfect model of the world would require infinite computational resources and speed, so approximations are necessary. This leads to uncertainties. Thermodynamics might be thought of as the physics of uncertainty: at equilibrium thermodynamic systems find thermodynamic states that minimize free energy, equivalent to maximising entropy.
To introduce some structure to the model assumption, we split $X$ into $X_0$ and $X_1$, where $X_0$ is the past and present of the system and $X_1$ is its future. The conditional mutual information $I(X_0; X_1 | M)$ is zero if $X_1$ and $X_0$ are independent conditioned on $M$.
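An illustrative sketch (not from the original talk) computes this conditional mutual information for a toy joint distribution in which the memory M screens off the past from the future, so the quantity is zero:

import numpy as np

# Toy Markov chain X0 -> M -> X1 over binary variables, so M screens off X0 from X1
p_x0 = np.array([0.5, 0.5])
p_m_given_x0 = np.array([[0.9, 0.1],   # rows: x0, columns: m
                         [0.2, 0.8]])
p_x1_given_m = np.array([[0.7, 0.3],   # rows: m, columns: x1
                         [0.4, 0.6]])

# Joint distribution p(x0, m, x1)
joint = p_x0[:, None, None] * p_m_given_x0[:, :, None] * p_x1_given_m[None, :, :]

def H(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

# I(X0; X1 | M) = S(X0, M) + S(M, X1) - S(M) - S(X0, M, X1)
S_x0m = H(joint.sum(axis=2).flatten())
S_mx1 = H(joint.sum(axis=0).flatten())
S_m = H(joint.sum(axis=(0, 2)))
S_all = H(joint.flatten())
print(S_x0m + S_mx1 - S_m - S_all)   # approximately zero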
The equipartition theorem tells us that at equilibrium the average energy is $k_B T/2$ per degree of freedom. This means that for systems that operate at “human scale” the energy involved is many orders of magnitude larger than the amount of information we can store in memory. For a car engine producing 70 kW of power at 370 Kelvin, this implies
$$\frac{2 \times 70{,}000}{370 \times k_B} = \frac{2 \times 70{,}000}{370 \times 1.380649 \times 10^{-23}} = 2.74 \times 10^{25}$$
degrees of freedom engaged per second.
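The arithmetic can be checked directly (a quick illustrative computation added here):

# Degrees of freedom per second needed to carry 70 kW at 370 K via equipartition (k_B T / 2 each)
k_B = 1.380649e-23   # Boltzmann's constant, J/K
power = 70e3         # W
temperature = 370.0  # K

dof_per_second = 2 * power / (temperature * k_B)
print(f"{dof_per_second:.2e}")   # about 2.74e25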
While macroscopic systems operate in regimes where traditional thermodynamics dominates, microscopic biological systems operate at scales where information and thermal fluctuations become critically important. Here we examine how the framework applies to molecular machines and processes that have evolved to operate efficiently at these scales.
Molecular machines like ATP synthase, kinesin motors, and the photosynthetic apparatus can be viewed as sophisticated information engines that convert energy while processing information about their environment. These systems have evolved to exploit thermal fluctuations rather than fight against them, using information processing to extract useful work.
ATP synthase functions as a rotary molecular motor that synthesizes ATP from ADP and inorganic phosphate using a proton gradient. The system uses the proton gradient as both an energy source and an information source about the cell’s energetic state and exploits Brownian motion through a ratchet mechanism. It converts information about proton locations into mechanical rotation and ultimately chemical energy with approximately 3-4 protons required per ATP.
from IPython.lib.display import YouTubeVideo
YouTubeVideo('kXpzp4RDGJI')
Estimates suggest that one synapse firing may require $10^4$ ATP molecules, so around $4 \times 10^4$ protons. If we take the human brain as containing around $10^{14}$ synapses, and if we suggest each synapse only fires about once every five seconds, we would require approximately $10^{18}$ protons per second to power the synapses in our brain. With each proton having six degrees of freedom, under these rough calculations the memory capacity distributed across the ATP synthase in our brain must be of order $6 \times 10^{18}$ bits per second, or 750 petabytes of information per second. Of course this memory capacity would be distributed across the billions of neurons, within hundreds or thousands of mitochondria that each can contain thousands of ATP synthase molecules. By composition of extremely small systems we can see it’s possible to improve efficiencies in ways that seem very impractical for a car engine.
Quick note to clarify, here we’re referring to the information requirements to make our brain more energy efficient in its information processing rather than the information processing capabilities of the neurons themselves!
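These back-of-envelope numbers are easy to reproduce (an illustrative calculation only, using the rough figures quoted above):

# Rough estimate of proton flux and implied memory capacity across the brain's ATP synthase
atp_per_firing = 1e4            # ATP molecules per synapse firing (rough estimate)
protons_per_atp = 4             # protons per ATP synthesised
synapses = 1e14                 # synapses in the human brain
firing_rate = 1 / 5             # firings per synapse per second

protons_per_second = atp_per_firing * protons_per_atp * synapses * firing_rate
bits_per_second = 6 * protons_per_second        # six degrees of freedom per proton
petabytes_per_second = bits_per_second / 8 / 1e15

# Rounding the proton flux up to ~1e18 protons/s gives the 750 PB/s figure quoted above
print(f"{protons_per_second:.1e} protons/s, {petabytes_per_second:.0f} PB/s")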
In his seminal 1957 paper (Jaynes, 1957), Ed Jaynes proposed a foundation for statistical mechanics based on information theory. Rather than relying on ergodic hypotheses or ensemble interpretations, Jaynes recast the problem of assigning probabilities in statistical mechanics as a problem of inference with incomplete information.
A central problem in statistical mechanics is assigning initial probabilities when our knowledge is incomplete. For example, if we know only the average energy of a system, what probability distribution should we use? Jaynes argued that we should use the distribution that maximizes entropy subject to the constraints of our knowledge.
Jaynes illustrated the approach with a simple example: suppose a die has been tossed many times, with an average result of 4.5 rather than the 3.5 expected for a fair die. What probability assignment $P_n$ ($n = 1, 2, \ldots, 6$) should we make for the next toss?
We need to satisfy two constraints: the probabilities must sum to one, $\sum_{n=1}^{6} P_n = 1$, and they must reproduce the observed average, $\sum_{n=1}^{6} n P_n = 4.5$.
Many distributions could satisfy these constraints, but which one makes the fewest unwarranted assumptions? Jaynes argued that we should choose the distribution that is maximally noncommittal with respect to the missing information, the one that maximizes the entropy. This principle leads to the exponential family of distributions, which in statistical mechanics gives us the canonical ensemble and other familiar distributions.
For a more general case, suppose a quantity $x$ can take values $(x_1, x_2, \ldots, x_n)$ and we know the average values of several functions $f_k(x)$. The problem is to find the probability assignment $p_i = p(x_i)$ that satisfies the constraints and maximizes the entropy $S_I = -\sum_{i=1}^{n} p_i \log p_i$.
Using Lagrange multipliers, the solution is the generalized canonical distribution,
$$p_i = \frac{1}{Z(\lambda_1, \ldots, \lambda_m)} \exp\left(-\sum_{k=1}^{m} \lambda_k f_k(x_i)\right),$$
where
$$Z(\lambda_1, \ldots, \lambda_m) = \sum_{i=1}^{n} \exp\left(-\sum_{k=1}^{m} \lambda_k f_k(x_i)\right)$$
is the partition function. The Lagrange multipliers $\lambda_k$ are determined by the constraints,
$$\langle f_k(x) \rangle = -\frac{\partial}{\partial \lambda_k} \log Z,$$
and the maximum attainable entropy is
$$S_{\max} = \log Z + \sum_{k=1}^{m} \lambda_k \langle f_k(x) \rangle.$$
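As an illustrative sketch (added here, not code from the original talk), we can solve Jaynes’ dice example numerically by searching for the Lagrange multiplier that reproduces the observed mean of 4.5:

import numpy as np
from scipy.optimize import brentq

n = np.arange(1, 7)

def mean_for_lambda(lam):
    """Mean of the maximum entropy distribution p_n proportional to exp(-lam * n)."""
    w = np.exp(-lam * n)
    p = w / w.sum()
    return (n * p).sum()

# Find lambda so that the constrained mean is 4.5
lam = brentq(lambda l: mean_for_lambda(l) - 4.5, -5, 5)
p = np.exp(-lam * n)
p /= p.sum()
print("lambda =", lam)
print("maximum entropy probabilities:", np.round(p, 3))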
Jaynes’ World is a zero-player game that implements a version of the entropy game. The dynamical system is defined by a distribution, ρ(Z), over a state space Z. The state space is partitioned into observable variables X and memory variables M. The memory variables are considered to be in an information reservoir, a thermodynamic system that maintains information in an ordered state (see e.g. Barato and Seifert (2014)). The entropy of the whole system is bounded below by 0 and above by N, so the entropy forms a compact manifold with respect to its parameters.
Unlike the animal game, where decisions are made by reducing entropy at each step, our system evolves mathematically by maximising the instantaneous entropy production. Conceptually we can think of this as ascending the gradient of the entropy, S(Z).
In the animal game the questioner starts with maximum uncertainty and targets minimal uncertainty. Jaynes’ world starts with minimal uncertainty and aims for maximum uncertainty.
We can phrase this as a thought experiment. Imagine you are in the game, at a given turn. You want to see where the game came from, so you look back across turns. The direction the game came from is now the direction of steepest descent. Regardless of where the game actually started it looks like it started at a minimal entropy configuration that we call the origin. Similarly, wherever the game is actually stopped there will nevertheless appear to be an end point we call end that will be a configuration of maximal entropy, N.
This speculation allows us to impose the functional form of our probability distribution. As Jaynes has shown (Jaynes, 1957), the stationary points of a free-form optimisation (minimum or maximum) will place the distribution, ρ(Z), in the exponential family,
$$\rho(Z) = h(Z) \exp\left(\theta^\top T(Z) - A(\theta)\right).$$
This constraint to the exponential family is highly convenient as we will rely on it heavily for the dynamics of the game. In particular, by focussing on the natural parameters we find that we are optimising within an information geometry (Amari, 2016). In exponential family distributions, the entropy gradient is given by
$$\nabla_\theta S(Z) = \mathbf{g} = \nabla^2_\theta A(\theta(M)).$$
We are now in a position to summarise the start state and the end state of our system, as well as to speculate on the nature of the transition between the two states.
The origin configuration is a low entropy state, with value near the lower bound of 0. The information is highly structured; by definition we place all variables in M, the information reservoir, at this time. The uncertainty principle is present to handle the competing needs of precision in parameters (giving us the near-singular form for θ(M)) and capacity in the information channel that M provides (the capacity c(θ) is upper bounded by S(M)).
The end configuration is a high entropy state, near the upper bound. Both the minimal entropy and maximal entropy states are revealed by Ed Jaynes’ variational minimisation approach and are in the exponential family. In many cases a version of Zeno’s paradox will arise where the system asymptotes to the final state, taking smaller steps at each time. At this point the system is at equilibrium.
Jaynes formulated his principle in terms of maximizing entropy, but we can also view certain problems as minimizing entropy under appropriate constraints. The duality becomes apparent when we consider the relationship between entropy and information.
The maximum entropy principle finds the distribution that is maximally noncommittal given certain constraints. Conversely, we can seek the distribution that minimizes entropy subject to different constraints - this represents the distribution with maximum structure or information.
Consider the uncertainty principle. When we seek states that minimize the product of position and momentum uncertainties, we are seeking minimal entropy states subject to the constraint of the uncertainty principle.
The mathematical formalism remains the same, but with different constraints and a different optimization direction: we minimize the entropy subject to constraints expressed through functions $g_k$ that differ from simple averages.
The solution still takes the form of an exponential family distribution, with Lagrange multipliers $\mu_k$ corresponding to the constraints.
The pure states of quantum mechanics are those that minimize the von Neumann entropy $S = -\mathrm{Tr}(\rho \log \rho)$ subject to the constraints of quantum mechanics.
For example, coherent states minimize the entropy subject to constraints on the expectation values of position and momentum operators. These states achieve the minimum uncertainty allowed by quantum mechanics.
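A short sketch (illustrative only) computes the von Neumann entropy for a pure state and for the maximally mixed state of a single qubit, showing that pure states carry zero entropy:

import numpy as np

def von_neumann_entropy(rho):
    """S = -Tr(rho log rho), computed from the eigenvalues (0 log 0 treated as 0)."""
    eigvals = np.linalg.eigvalsh(rho)
    eigvals = eigvals[eigvals > 1e-12]
    return -(eigvals * np.log(eigvals)).sum()

pure = np.array([[1.0, 0.0], [0.0, 0.0]])    # pure state |0><0|
mixed = np.array([[0.5, 0.0], [0.0, 0.5]])   # maximally mixed state

print(von_neumann_entropy(pure))    # 0: maximum information
print(von_neumann_entropy(mixed))   # log 2: incomplete information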
import numpy as np
First we write some helper code to plot the histogram and compute its entropy.
import matplotlib.pyplot as plt
import mlai.plot as plot
def plot_histogram(ax, p, max_height=None):
heights = p
if max_height is None:
max_height = 1.25*heights.max()
# Safe entropy calculation that handles zeros
nonzero_p = p[p > 0] # Filter out zeros
S = - (nonzero_p*np.log2(nonzero_p)).sum()
# Define bin edges
bins = [1, 2, 3, 4, 5] # Bin edges
# Create the histogram
if ax is None:
fig, ax = plt.subplots(figsize=(6, 4)) # Adjust figure size
ax.hist(bins[:-1], bins=bins, weights=heights, align='left', rwidth=0.8, edgecolor='black') # Use weights for probabilities
# Customize the plot for better slide presentation
ax.set_xlabel("Bin")
ax.set_ylabel("Probability")
ax.set_title(f"Four Bin Histogram (Entropy {S:.3f})")
ax.set_xticks(bins[:-1]) # Show correct x ticks
ax.set_ylim(0,max_height) # Set y limit for visual appeal
We can compute the entropy of any given histogram.
# Define probabilities
p = np.zeros(4)
p[0] = 4/13
p[1] = 3/13
p[2] = 3.7/13
p[3] = 1 - p.sum()
# Safe entropy calculation
nonzero_p = p[p > 0] # Filter out zeros
entropy = - (nonzero_p*np.log2(nonzero_p)).sum()
print(f"The entropy of the histogram is {entropy:.3f}.")
import matplotlib.pyplot as plt
import mlai.plot as plot
import mlai
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
fig.tight_layout()
plot_histogram(ax, p)
ax.set_title(f"Four Bin Histogram (Entropy {entropy:.3f})")
mlai.write_figure(filename='four-bin-histogram.svg',
directory = './information-game')
Figure: The entropy of a four bin histogram.
We can play the entropy game by starting with a histogram with all the probability mass in the first bin and then ascending the gradient of the entropy function.
The simplest possible example of Jaynes’ World is a two-bin histogram with probabilities p and 1−p. This minimal system allows us to visualize the entire entropy landscape.
The natural parameter is the log odds, $\theta = \log \frac{p}{1-p}$, and the update given by the entropy gradient is
$$\Delta\theta_{\text{steepest}} = \eta \frac{\mathrm{d}S}{\mathrm{d}\theta} = \eta\, p(1-p)\left(\log(1-p) - \log p\right).$$
import numpy as np
# Python code for gradients
p_values = np.linspace(0.000001, 0.999999, 10000)
theta_values = np.log(p_values/(1-p_values))
entropy = -p_values * np.log(p_values) - (1-p_values) * np.log(1-p_values)
fisher_info = p_values * (1-p_values)
gradient = fisher_info * (np.log(1-p_values) - np.log(p_values))
import matplotlib.pyplot as plt
import mlai.plot as plot
import mlai
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=plot.big_wide_figsize)
ax1.plot(theta_values, entropy)
ax1.set_xlabel('$\\theta$')
ax1.set_ylabel('Entropy $S(p)$')
ax1.set_title('Entropy Landscape')
ax2.plot(theta_values, gradient)
ax2.set_xlabel('$\\theta$')
ax2.set_ylabel('$\\nabla_\\theta S(p)$')
ax2.set_title('Entropy Gradient vs. Position')
mlai.write_figure(filename='two-bin-histogram-entropy-gradients.svg',
directory = './information-game')
Figure: Entropy gradients of the two-bin histogram against position.
This example reveals the entropy extrema at p=0, p=0.5, and p=1. At minimal entropy (p≈0 or p≈1), the gradient approaches zero, creating natural information reservoirs. The dynamics slow dramatically near these points - these are the areas of critical slowing that create information reservoirs.
We can visualize the entropy maximization process by performing gradient ascent in the natural parameter space θ. Starting from a low-entropy state, we follow the gradient of entropy with respect to θ to reach the maximum entropy state.
import numpy as np
# Helper functions for two-bin histogram
def theta_to_p(theta):
"""Convert natural parameter theta to probability p"""
return 1.0 / (1.0 + np.exp(-theta))
def p_to_theta(p):
"""Convert probability p to natural parameter theta"""
# Add small epsilon to avoid numerical issues
p = np.clip(p, 1e-10, 1-1e-10)
return np.log(p/(1-p))
def entropy(theta):
"""Compute entropy for given theta"""
p = theta_to_p(theta)
# Safe entropy calculation
return -p * np.log2(p) - (1-p) * np.log2(1-p)
def entropy_gradient(theta):
"""Compute gradient of entropy with respect to theta"""
p = theta_to_p(theta)
return p * (1-p) * (np.log2(1-p) - np.log2(p))
def plot_histogram(ax, theta, max_height=None):
"""Plot two-bin histogram for given theta"""
p = theta_to_p(theta)
heights = np.array([p, 1-p])
if max_height is None:
max_height = 1.25
# Compute entropy
S = entropy(theta)
# Create the histogram
bins = [1, 2, 3] # Bin edges
if ax is None:
fig, ax = plt.subplots(figsize=(6, 4))
ax.hist(bins[:-1], bins=bins, weights=heights, align='left', rwidth=0.8, edgecolor='black')
# Customize the plot
ax.set_xlabel("Bin")
ax.set_ylabel("Probability")
ax.set_title(f"Two-Bin Histogram (Entropy {S:.3f})")
ax.set_xticks(bins[:-1])
ax.set_ylim(0, max_height)
# Parameters for gradient ascent
theta_initial = -9.0 # Start with low entropy
learning_rate = 1
num_steps = 1500
# Initialize
theta_current = theta_initial
theta_history = [theta_current]
p_history = [theta_to_p(theta_current)]
entropy_history = [entropy(theta_current)]
# Perform gradient ascent in theta space
for step in range(num_steps):
# Compute gradient
grad = entropy_gradient(theta_current)
# Update theta
theta_current = theta_current + learning_rate * grad
# Store history
theta_history.append(theta_current)
p_history.append(theta_to_p(theta_current))
entropy_history.append(entropy(theta_current))
if step % 100 == 0:
print(f"Step {step+1}: θ = {theta_current:.4f}, p = {p_history[-1]:.4f}, Entropy = {entropy_history[-1]:.4f}")
import matplotlib.pyplot as plt
import mlai.plot as plot
import mlai
# Create a figure showing the evolution
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
fig.tight_layout(pad=3.0)
# Select steps to display
steps_to_show = [0, 300, 600, 900, 1200, 1500]
# Plot histograms for selected steps
for i, step in enumerate(steps_to_show):
row, col = i // 3, i % 3
plot_histogram(axes[row, col], theta_history[step])
axes[row, col].set_title(f"Step {step}: θ = {theta_history[step]:.2f}, p = {p_history[step]:.3f}")
mlai.write_figure(filename='two-bin-histogram-evolution.svg',
directory = './information-game')
# Plot entropy evolution
plt.figure(figsize=(10, 6))
plt.plot(range(num_steps+1), entropy_history, 'o-')
plt.xlabel('Gradient Ascent Step')
plt.ylabel('Entropy')
plt.title('Entropy Evolution During Gradient Ascent')
plt.grid(True)
mlai.write_figure(filename='two-bin-entropy-evolution.svg',
directory = './information-game')
# Plot trajectory in theta space
plt.figure(figsize=(10, 6))
theta_range = np.linspace(-5, 5, 1000)
entropy_curve = [entropy(t) for t in theta_range]
plt.plot(theta_range, entropy_curve, 'b-', label='Entropy Landscape')
plt.plot(theta_history, entropy_history, 'ro-', label='Gradient Ascent Path')
plt.xlabel('Natural Parameter θ')
plt.ylabel('Entropy')
plt.title('Gradient Ascent Trajectory in Natural Parameter Space')
plt.axvline(x=0, color='k', linestyle='--', alpha=0.3)
plt.legend()
plt.grid(True)
mlai.write_figure(filename='two-bin-trajectory.svg',
directory = './information-game')
Figure: Evolution of the two-bin histogram during gradient ascent in natural parameter space.
Figure: Entropy evolution during gradient ascent for the two-bin histogram.
Figure: Gradient ascent trajectory in the natural parameter space for the two-bin histogram.
The gradient ascent visualization shows how the system evolves in the natural parameter space θ. Starting from a negative θ (corresponding to a low-entropy state with p<<0.5), the system follows the gradient of entropy with respect to θ until it reaches θ=0 (corresponding to p=0.5), which is the maximum entropy state.
Note that the maximum entropy occurs at θ=0, which corresponds to p=0.5. The gradient of entropy with respect to θ is zero at this point, making it a stable equilibrium for the gradient ascent process.
One challenge is how to parameterise our exponential family. We’ve mentioned that the variables Z are partitioned into observable variables X and memory variables M. Given the minimal entropy initial state, the obvious initial choice is that at the origin all variables, Z, should be in the information reservoir, M. This implies that they are well determined and present a sensible choice for the source of our parameters.
We define a mapping, θ(M), that maps the information reservoir to a set of values that are equivalent to the natural parameters. If the entropy of these parameters is low, and the distribution ρ(θ) is sharply peaked, then we can move from treating the memory mapping, θ(⋅), as a random process to an assumption that it is a deterministic function. We can then follow gradients with respect to these θ values.
This allows us to rewrite the distribution over Z in a conditional form,
$$\rho(X|M) = h(X) \exp\left(\theta(M)^\top T(X) - A(\theta(M))\right).$$
Unfortunately this assumption implies that θ(⋅) is a delta function, and since our representation is a compact manifold (bounded below by 0 and above by N), it does not admit any such singularities.
This creates an apparent paradox: at minimal entropy states, the information reservoir must simultaneously maintain precision in the parameters θ(M) (for accurate system representation) and provide sufficient capacity c(M) (for information storage).
The trade-off can be expressed as
$$\Delta\theta(M) \cdot \Delta c(M) \geq k,$$
where $k$ is a constant.
This trade-off between precision and capacity directly parallels Shannon’s insights about information transmission (Shannon, 1948), where he demonstrated that increasing the precision of a signal requires increasing bandwidth or reducing noise immunity—creating an inherent trade-off in any communication system. Our formulation extends this principle to the information reservoir’s parameter space.
In practice this means that the parameters θ(M) and capacity variables c(M) must form a Fourier-dual pair,
$$c(M) = \mathcal{F}[\theta(M)],$$
where $\mathcal{F}$ denotes the Fourier transform.
The mathematical formulation of the uncertainty principle comes from Hirschman Jr (1957) and later refined by Beckner (1975) and Białynicki-Birula and Mycielski (1975). These works demonstrated that Shannon’s information-theoretic entropy provides a natural framework for expressing quantum uncertainty, establishing a direct bridge between quantum mechanics and information theory. Our capacity-precision trade-off follows this tradition, expressing the fundamental limits of information processing in our system.
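To give a feel for why a Fourier-dual pair cannot be simultaneously sharp, here is an illustrative numerical sketch (an analogy only, assuming Gaussian profiles): the root-mean-square widths of a function and of its Fourier transform have a lower-bounded product, however narrow we make one of them.

import numpy as np

x = np.linspace(-50, 50, 2**14)
dx = x[1] - x[0]
k = 2 * np.pi * np.fft.fftshift(np.fft.fftfreq(len(x), d=dx))

def rms_width(vals, amplitude):
    """Root-mean-square width of |amplitude|^2 over the grid vals."""
    w = np.abs(amplitude)**2
    w = w / w.sum()
    mean = (vals * w).sum()
    return np.sqrt(((vals - mean)**2 * w).sum())

for sigma in [0.25, 1.0, 4.0]:
    f = np.exp(-x**2 / (2 * sigma**2))           # narrow or broad Gaussian profile
    F = np.fft.fftshift(np.fft.fft(f))           # its Fourier dual
    print(f"sigma={sigma}: width product = {rms_width(x, f) * rms_width(k, F):.3f}")  # ~0.5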
The uncertainty principle means that the game can exhibit quantum-like information processing regimes during evolution. This inspires an information-theoretic perspective on the quantum-classical transition.
At minimal entropy states near the origin, the information reservoir has characteristics reminiscent of quantum systems.
Wave-like information encoding: The information reservoir near the origin necessarily encodes information in distributed, interference-capable patterns due to the uncertainty principle between parameters θ(M) and capacity variables c(M).
Non-local correlations: Parameters are highly correlated through the Fisher information matrix, creating structures where information is stored in relationships rather than individual variables.
Uncertainty-saturated regime: The uncertainty relationship Δθ(M)⋅Δc(M)≥k is nearly saturated (approaches equality), similar to Heisenberg’s uncertainty principle in quantum systems and the entropic uncertainty relations established by Białynicki-Birula and Mycielski (1975).
As the system evolves towards higher entropy states, a transition occurs where some variables exhibit classical behavior.
From wave-like to particle-like: Variables transitioning from M to X shift from storing information in interference patterns to storing it in definite values with statistical uncertainty.
Decoherence-like process: The uncertainty product Δθ(M)⋅Δc(M) for these variables grows significantly larger than the minimum value k, indicating a departure from quantum-like behavior.
Local information encoding: Information becomes increasingly encoded in local variables rather than distributed correlations.
The saddle points in our entropy landscape mark critical transitions between quantum-like and classical information processing regimes. Near these points:
The critically slowed modes maintain quantum-like characteristics, functioning as coherent memory that preserves information through interference patterns.
The rapidly evolving modes exhibit classical characteristics, functioning as incoherent processors that manipulate information through statistical operations.
This natural separation creates a hybrid computational architecture where quantum-like memory interfaces with classical-like processing.
The quantum-classical transition can be quantified using the moment generating function $M_Z(t)$. In quantum-like regimes, the MGF exhibits oscillatory behavior with complex analytic structure, whereas in classical regimes, it grows monotonically with simple analytic structure. The transition between these behaviors identifies variables moving between quantum-like and classical information processing modes.
This perspective suggests that what we recognize as “quantum” versus “classical” behavior may fundamentally reflect different regimes of information processing - one optimized for coherent information storage (quantum-like) and the other for flexible information manipulation (classical-like). The emergence of both regimes from our entropy-maximizing model indicates that nature may exploit this computational architecture to optimize information processing across multiple scales.
This formulation of the uncertainty principle in terms of information capacity and parameter precision follows the tradition established by Shannon (1948) and expanded upon by Hirschman Jr (1957) and others who connected information entropy uncertainty to Heisenberg’s uncertainty.
In Jaynes (1957) Jaynes showed how the maximum entropy formalism is applied; in later papers, such as Jaynes (1963), he showed how his maximum entropy formalism could be applied to the von Neumann entropy of a density matrix.
As Jaynes noted in his 1962 Brandeis lectures: “Assignment of initial probabilities must, in order to be useful, agree with the initial information we have (i.e., the results of measurements of certain parameters). For example, we might know that at time t=0, a nuclear spin system having total (measured) magnetic moment M(0), is placed in a magnetic field H, and the problem is to predict the subsequent variation M(t)… What initial density matrix for the spin system ρ(0), should we use?”
Jaynes recognized that we should choose the density matrix that maximizes the von Neumann entropy, $S = -\mathrm{Tr}(\rho \log \rho)$, subject to constraints from our measurements, $\mathrm{Tr}(\rho M_{\mathrm{op}}) = M(0)$, where $M_{\mathrm{op}}$ is the operator corresponding to total magnetic moment.
The solution is the quantum version of the maximum entropy distribution,
$$\rho = \frac{1}{Z} \exp\left(-\lambda_1 A_1 - \cdots - \lambda_m A_m\right),$$
where $A_i$ are the operators corresponding to measured observables, $\lambda_i$ are Lagrange multipliers, and $Z = \mathrm{Tr}\left[\exp\left(-\lambda_1 A_1 - \cdots - \lambda_m A_m\right)\right]$ is the partition function.
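An illustrative sketch (an assumed single-spin example, not from Jaynes’ lectures) constructs this density matrix for a constraint on the expected moment along the z axis:

import numpy as np
from scipy.linalg import expm
from scipy.optimize import brentq

sigma_z = np.array([[1.0, 0.0], [0.0, -1.0]])   # Pauli z operator

def rho_for_lambda(lam):
    """Maximum entropy density matrix rho = exp(-lam * sigma_z) / Z."""
    unnorm = expm(-lam * sigma_z)
    return unnorm / np.trace(unnorm)

def expected_moment(lam):
    return np.trace(rho_for_lambda(lam) @ sigma_z).real

# Choose the Lagrange multiplier so that <sigma_z> matches a measured value, say 0.6
lam = brentq(lambda l: expected_moment(l) - 0.6, -10, 10)
rho = rho_for_lambda(lam)
print("lambda =", lam)
print("rho =\n", np.round(rho, 3))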
This unifies classical entropies and density matrix entropies under the same information-theoretic principle. It clarifies that quantum states with minimum entropy (pure states) represent maximum information, while mixed states represent incomplete information.
Jaynes further noted that “strictly speaking, all this should be restated in terms of quantum theory using the density matrix formalism. This will introduce the N! permutation factor, a natural zero for entropy, alteration of numerical values if discreteness of energy levels becomes comparable to $k_B T$, etc.”
The minimal entropy quantum states provide a connection between density matrices and exponential family distributions. This connection enables us to use many classical techniques from information geometry and apply them to the game in the case where the uncertainty principle is present.
The minimal entropy density matrix belongs to an exponential family, just like many classical distributions.
The matrix G in the minimal entropy state is directly related to the ‘quantum Fisher information matrix’, $G = \mathrm{QFIM}/4$.
This creates a link between the classical Fisher information geometry used in the game and its quantum counterpart.
The relationship implies
$$V \cdot \mathrm{QFIM} \geq \frac{\hbar^2}{4},$$
where $V$ is the covariance matrix.
These minimal entropy states may have physical relationships to squeezed states in quantum optics. They are the states that achieve the ultimate precision allowed by quantum mechanics.
In Jaynes’ World, we begin at a minimal entropy configuration - the “origin” state. Understanding the properties of these minimal entropy states is crucial for characterizing how the system evolves. These states are constrained by the uncertainty principle we previously identified: Δθ(M)⋅Δc(M)≥k.
This constraint is reminiscent of the Heisenberg uncertainty principle in quantum mechanics, where $\Delta x \cdot \Delta p \geq \hbar/2$. This isn’t a coincidence; both represent limitations on precision arising from the mathematical structure of information. The total entropy of the system is constrained to be between 0 and N, forming a compact manifold with respect to its parameters. This upper bound N ensures that as the system evolves from minimal to maximal entropy, it remains within a well-defined entropy space.
The minimal entropy configuration under the uncertainty constraint takes a specific mathematical form. It is a pure state (in the sense of having minimal possible entropy) that exactly saturates the uncertainty bound. For a system with multiple degrees of freedom, the distribution takes a Gaussian form,
$$\rho(Z) = \frac{1}{Z} \exp\left(-R^\top G R\right),$$
where $R$ is the vector of state variables.
This form is an exponential family distribution, in line with Jaynes’ principle that entropy-optimized distributions belong to the exponential family. The matrix G determines how uncertainty is distributed among different variables and their correlations.
In our exploration of information dynamics, we now turn to the relationship between gradient ascent on entropy and uncertainty principles. This section demonstrates how systems naturally evolve from quantum-like states (with minimal uncertainty) toward classical-like states (with excess uncertainty) through entropy maximization.
For simplicity, we’ll focus on multivariate Gaussian distributions, where the uncertainty relationships are particularly elegant. In this setting, the precision matrix Λ (inverse of the covariance matrix) fully characterizes the distribution. The entropy of a multivariate Gaussian is directly related to the determinant of the covariance matrix,
$$S = \frac{1}{2} \log \det(V) + \text{constant}.$$
For conjugate variables like position and momentum, the Heisenberg uncertainty principle imposes constraints on the minimum product of their uncertainties. In our information-theoretic framework, this appears as a constraint on the determinant of certain submatrices of the covariance matrix.
import numpy as np
from scipy.linalg import eigh
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse
The code below implements gradient ascent on the entropy of a multivariate Gaussian system while respecting uncertainty constraints. We’ll track how the system evolves from minimal uncertainty states (quantum-like) to states with excess uncertainty (classical-like).
First, we define key functions for computing entropy and its gradient.
# Constants
hbar = 1.0 # Normalized Planck's constant
min_uncertainty_product = hbar/2
# Compute entropy of a multivariate Gaussian with precision matrix Lambda
def compute_entropy(Lambda):
"""
Compute entropy of multivariate Gaussian with precision matrix Lambda.
Parameters:
-----------
Lambda: array
Precision matrix
Returns:
--------
entropy: float
Entropy value
"""
# Covariance matrix is inverse of precision matrix
V = np.linalg.inv(Lambda)
# Entropy formula for multivariate Gaussian
n = Lambda.shape[0]
entropy = 0.5 * np.log(np.linalg.det(V)) + 0.5 * n * (1 + np.log(2*np.pi))
return entropy
# Compute gradient of entropy with respect to precision matrix
def compute_entropy_gradient(Lambda):
"""
Compute gradient of entropy with respect to precision matrix.
Parameters:
-----------
Lambda: array
Precision matrix
Returns:
--------
gradient: array
Gradient of entropy
"""
# Gradient is -0.5 * inverse of Lambda
V = np.linalg.inv(Lambda)
gradient = -0.5 * V
return gradient
The `compute_entropy` function calculates the entropy of a multivariate Gaussian distribution from its precision matrix. The `compute_entropy_gradient` function computes the gradient of entropy with respect to the precision matrix, which is essential for our gradient ascent procedure.
Next, we implement functions to handle the constraints imposed by the uncertainty principle:
# Project gradient to respect uncertainty constraints
def project_gradient(eigenvalues, gradient):
"""
Project gradient to respect minimum uncertainty constraints.
Parameters:
-----------
eigenvalues: array
Eigenvalues of precision matrix
gradient: array
Gradient vector
Returns:
--------
projected_gradient: array
Gradient projected to respect constraints
"""
n_pairs = len(eigenvalues) // 2
projected_gradient = gradient.copy()
# For each position-momentum pair
for i in range(n_pairs):
idx1, idx2 = 2*i, 2*i+1
# Check if we're at the uncertainty boundary
product = 1.0 / (eigenvalues[idx1] * eigenvalues[idx2])
if product <= min_uncertainty_product * 1.01:
# We're at or near the boundary
# Project gradient to maintain the product
avg_grad = 0.5 * (gradient[idx1]/eigenvalues[idx1] + gradient[idx2]/eigenvalues[idx2])
projected_gradient[idx1] = avg_grad * eigenvalues[idx1]
projected_gradient[idx2] = avg_grad * eigenvalues[idx2]
return projected_gradient
# Initialize a multidimensional state with position-momentum pairs
def initialize_multidimensional_state(n_pairs, squeeze_factors=None, with_cross_connections=False):
"""
Initialize a precision matrix for multiple position-momentum pairs.
Parameters:
-----------
n_pairs: int
Number of position-momentum pairs
squeeze_factors: list or None
Factors determining the position-momentum squeezing
with_cross_connections: bool
Whether to initialize with cross-connections between pairs
Returns:
--------
Lambda: array
Precision matrix
"""
if squeeze_factors is None:
squeeze_factors = [0.1 + 0.05*i for i in range(n_pairs)]
# Total dimension (position + momentum)
dim = 2 * n_pairs
# Initialize with diagonal precision matrix
eigenvalues = np.zeros(dim)
# Set eigenvalues based on squeeze factors
for i in range(n_pairs):
squeeze = squeeze_factors[i]
eigenvalues[2*i] = 1.0 / (squeeze * min_uncertainty_product)
eigenvalues[2*i+1] = 1.0 / (min_uncertainty_product / squeeze)
# Initialize with identity eigenvectors
eigenvectors = np.eye(dim)
# If requested, add cross-connections by mixing eigenvectors
if with_cross_connections:
# Create a random orthogonal matrix for mixing
Q, _ = np.linalg.qr(np.random.randn(dim, dim))
# Apply moderate mixing - not fully random to preserve some structure
mixing_strength = 0.3
eigenvectors = (1 - mixing_strength) * eigenvectors + mixing_strength * Q
# Re-orthogonalize
eigenvectors, _ = np.linalg.qr(eigenvectors)
# Construct precision matrix from eigendecomposition
Lambda = eigenvectors @ np.diag(eigenvalues) @ eigenvectors.T
return Lambda
The `project_gradient` function ensures that our gradient ascent respects the uncertainty principle by projecting the gradient to maintain minimum uncertainty products when necessary. The `initialize_multidimensional_state` function creates a starting state with multiple position-momentum pairs, each initialized to the minimum uncertainty allowed by the uncertainty principle, but with different “squeeze factors” that determine the shape of the uncertainty ellipse.
# Add gradient check function
def check_entropy_gradient(Lambda, epsilon=1e-6):
"""
Check the analytical gradient of entropy against numerical gradient.
Parameters:
-----------
Lambda: array
Precision matrix
epsilon: float
Small perturbation for numerical gradient
Returns:
--------
analytical_grad: array
Analytical gradient with respect to eigenvalues
numerical_grad: array
Numerical gradient with respect to eigenvalues
"""
# Get eigendecomposition
eigenvalues, eigenvectors = eigh(Lambda)
# Compute analytical gradient with respect to eigenvalues:
# S = -0.5 * sum(log(eigenvalues)) + const, so dS/d(lambda_i) = -0.5 / lambda_i
analytical_grad = -0.5 / eigenvalues
# Compute numerical gradient
numerical_grad = np.zeros_like(eigenvalues)
for i in range(len(eigenvalues)):
# Perturb eigenvalue up
eigenvalues_plus = eigenvalues.copy()
eigenvalues_plus[i] += epsilon
Lambda_plus = eigenvectors @ np.diag(eigenvalues_plus) @ eigenvectors.T
entropy_plus = compute_entropy(Lambda_plus)
# Perturb eigenvalue down
eigenvalues_minus = eigenvalues.copy()
eigenvalues_minus[i] -= epsilon
Lambda_minus = eigenvectors @ np.diag(eigenvalues_minus) @ eigenvectors.T
entropy_minus = compute_entropy(Lambda_minus)
# Compute numerical gradient
numerical_grad[i] = (entropy_plus - entropy_minus) / (2 * epsilon)
# Compare
print("Analytical gradient:", analytical_grad)
print("Numerical gradient:", numerical_grad)
print("Difference:", np.abs(analytical_grad - numerical_grad))
return analytical_grad, numerical_grad
Now we implement the main gradient ascent procedure.
# Perform gradient ascent on entropy
def gradient_ascent_entropy(Lambda_init, n_steps=100, learning_rate=0.01):
"""
Perform gradient ascent on entropy while respecting uncertainty constraints.
Parameters:
-----------
Lambda_init: array
Initial precision matrix
n_steps: int
Number of gradient steps
learning_rate: float
Learning rate for gradient ascent
Returns:
--------
Lambda_history: list
History of precision matrices
entropy_history: list
History of entropy values
"""
Lambda = Lambda_init.copy()
Lambda_history = [Lambda.copy()]
entropy_history = [compute_entropy(Lambda)]
for step in range(n_steps):
# Compute gradient of entropy
grad_matrix = compute_entropy_gradient(Lambda)
# Diagonalize Lambda to work with eigenvalues
eigenvalues, eigenvectors = eigh(Lambda)
# Transform gradient to eigenvalue space
grad = np.diag(eigenvectors.T @ grad_matrix @ eigenvectors)
# Project gradient to respect constraints
proj_grad = project_gradient(eigenvalues, grad)
# Update eigenvalues
eigenvalues += learning_rate * proj_grad
# Ensure eigenvalues remain positive
eigenvalues = np.maximum(eigenvalues, 1e-10)
# Reconstruct Lambda from updated eigenvalues
Lambda = eigenvectors @ np.diag(eigenvalues) @ eigenvectors.T
# Store history
Lambda_history.append(Lambda.copy())
entropy_history.append(compute_entropy(Lambda))
return Lambda_history, entropy_history
The `gradient_ascent_entropy` function implements the core optimization procedure. It performs gradient ascent on the entropy while respecting the uncertainty constraints. The algorithm works in the eigenvalue space of the precision matrix, which makes it easier to enforce constraints and ensure the matrix remains positive definite.
To analyze the results, we implement functions to track uncertainty metrics and detect interesting dynamics:
# Track uncertainty products and regime classification
def track_uncertainty_metrics(Lambda_history):
"""
Track uncertainty products and classify regimes for each conjugate pair.
Parameters:
-----------
Lambda_history: list
History of precision matrices
Returns:
--------
metrics: dict
Dictionary containing uncertainty metrics over time
"""
n_steps = len(Lambda_history)
n_pairs = Lambda_history[0].shape[0] // 2
# Initialize tracking arrays
uncertainty_products = np.zeros((n_steps, n_pairs))
regimes = np.zeros((n_steps, n_pairs), dtype=object)
for step, Lambda in enumerate(Lambda_history):
# Get covariance matrix
V = np.linalg.inv(Lambda)
# Calculate Fisher information matrix
G = Lambda / 2
# For each conjugate pair
for i in range(n_pairs):
# Extract 2x2 submatrix for this pair
idx1, idx2 = 2*i, 2*i+1
V_sub = V[np.ix_([idx1, idx2], [idx1, idx2])]
# Compute uncertainty product (determinant of submatrix)
uncertainty_product = np.sqrt(np.linalg.det(V_sub))
uncertainty_products[step, i] = uncertainty_product
# Classify regime
if abs(uncertainty_product - min_uncertainty_product) < 0.1*min_uncertainty_product:
regimes[step, i] = "Quantum-like"
else:
regimes[step, i] = "Classical-like"
return {
'uncertainty_products': uncertainty_products,
'regimes': regimes
}
The `track_uncertainty_metrics` function analyzes the evolution of uncertainty products for each position-momentum pair and classifies them as either “quantum-like” (near minimum uncertainty) or “classical-like” (with excess uncertainty). This classification helps us understand how the system transitions between these regimes during entropy maximization.
We also implement a function to detect saddle points in the gradient flow, which are critical for understanding the system’s dynamics:
# Detect saddle points in the gradient flow
def detect_saddle_points(Lambda_history):
"""
Detect saddle-like behavior in the gradient flow.
Parameters:
-----------
Lambda_history: list
History of precision matrices
Returns:
--------
saddle_metrics: dict
Metrics related to saddle point behavior
"""
n_steps = len(Lambda_history)
n_pairs = Lambda_history[0].shape[0] // 2
# Track eigenvalues and their gradients
eigenvalues_history = np.zeros((n_steps, 2*n_pairs))
gradient_ratios = np.zeros((n_steps, n_pairs))
for step, Lambda in enumerate(Lambda_history):
# Get eigenvalues
eigenvalues, _ = eigh(Lambda)
eigenvalues_history[step] = eigenvalues
# For each pair, compute ratio of gradients
if step > 0:
for i in range(n_pairs):
idx1, idx2 = 2*i, 2*i+1
# Change in eigenvalues
delta1 = abs(eigenvalues_history[step, idx1] - eigenvalues_history[step-1, idx1])
delta2 = abs(eigenvalues_history[step, idx2] - eigenvalues_history[step-1, idx2])
# Ratio of max to min (high ratio indicates saddle-like behavior)
max_delta = max(delta1, delta2)
min_delta = max(1e-10, min(delta1, delta2)) # Avoid division by zero
gradient_ratios[step, i] = max_delta / min_delta
# Identify candidate saddle points (where some gradients are much larger than others)
saddle_candidates = []
for step in range(1, n_steps):
if np.any(gradient_ratios[step] > 10): # Threshold for saddle-like behavior
saddle_candidates.append(step)
return {
'eigenvalues_history': eigenvalues_history,
'gradient_ratios': gradient_ratios,
'saddle_candidates': saddle_candidates
}
The `detect_saddle_points` function identifies points in the gradient flow where some eigenvalues change much faster than others, indicating saddle-like behavior. These saddle points are important because they represent critical transitions in the system’s evolution.
Finally, we implement visualization functions to help us understand the system’s behavior:
# Visualize uncertainty ellipses for multiple pairs
def plot_multidimensional_uncertainty(Lambda_history, step_indices, pairs_to_plot=None):
"""
Plot the evolution of uncertainty ellipses for multiple position-momentum pairs.
Parameters:
-----------
Lambda_history: list
History of precision matrices
step_indices: list
Indices of steps to visualize
pairs_to_plot: list, optional
Indices of position-momentum pairs to plot
"""
n_pairs = Lambda_history[0].shape[0] // 2
if pairs_to_plot is None:
pairs_to_plot = range(min(3, n_pairs)) # Plot up to 3 pairs by default
fig, axes = plt.subplots(len(pairs_to_plot), len(step_indices),
figsize=(4*len(step_indices), 3*len(pairs_to_plot)))
# Handle case of single pair or single step
if len(pairs_to_plot) == 1:
axes = axes.reshape(1, -1)
if len(step_indices) == 1:
axes = axes.reshape(-1, 1)
for row, pair_idx in enumerate(pairs_to_plot):
for col, step in enumerate(step_indices):
ax = axes[row, col]
Lambda = Lambda_history[step]
covariance = np.linalg.inv(Lambda)
# Extract 2x2 submatrix for this pair
idx1, idx2 = 2*pair_idx, 2*pair_idx+1
cov_sub = covariance[np.ix_([idx1, idx2], [idx1, idx2])]
# Get eigenvalues and eigenvectors of submatrix
values, vectors = eigh(cov_sub)
# Calculate ellipse parameters
angle = np.degrees(np.arctan2(vectors[1, 0], vectors[0, 0]))
width, height = 2 * np.sqrt(values)
# Create ellipse
ellipse = Ellipse((0, 0), width=width, height=height, angle=angle,
edgecolor='blue', facecolor='lightblue', alpha=0.5)
# Add to plot
ax.add_patch(ellipse)
ax.set_xlim(-3, 3)
ax.set_ylim(-3, 3)
ax.set_aspect('equal')
ax.grid(True)
# Add minimum uncertainty circle
min_circle = plt.Circle((0, 0), min_uncertainty_product,
fill=False, color='red', linestyle='--')
ax.add_patch(min_circle)
# Compute uncertainty product
uncertainty_product = np.sqrt(np.linalg.det(cov_sub))
# Determine regime
if abs(uncertainty_product - min_uncertainty_product) < 0.1*min_uncertainty_product:
regime = "Quantum-like"
color = 'red'
else:
regime = "Classical-like"
color = 'blue'
# Add labels
if row == 0:
ax.set_title(f"Step {step}")
if col == 0:
ax.set_ylabel(f"Pair {pair_idx+1}")
# Add uncertainty product text
ax.text(0.05, 0.95, f"ΔxΔp = {uncertainty_product:.2f}",
transform=ax.transAxes, fontsize=10, verticalalignment='top')
# Add regime text
ax.text(0.05, 0.85, regime, transform=ax.transAxes,
fontsize=10, verticalalignment='top', color=color)
ax.set_xlabel("Position")
ax.set_ylabel("Momentum")
plt.tight_layout()
return fig
The plot_multidimensional_uncertainty
function visualizes the
uncertainty ellipses for multiple position-momentum pairs at different
steps of the gradient ascent process. These visualizations help us
understand how the system transitions from quantum-like to
classical-like regimes.
This implementation builds on the InformationReservoir
class we saw
earlier, but generalizes to multiple position-momentum pairs and focuses
specifically on the uncertainty relationships. The key connection is
that both implementations track how systems naturally evolve from
minimal entropy states (with quantum-like uncertainty relations) toward
maximum entropy states (with classical-like uncertainty relations).
As the system evolves through gradient ascent, we observe three characteristic transitions:
Uncertainty desaturation: The system begins with a minimal entropy state that exactly saturates the uncertainty bound (Δx⋅Δp=ℏ/2). As entropy increases, this bound becomes less tightly saturated.
Shape transformation: The initial highly squeezed uncertainty ellipse (with small position uncertainty and large momentum uncertainty) gradually becomes more circular, representing a more balanced distribution of uncertainty.
Quantum-to-classical transition: The system transitions from a quantum-like regime (where uncertainty is at the minimum allowed by quantum mechanics) to a more classical-like regime (where statistical uncertainty dominates over quantum uncertainty).
This evolution reveals how information naturally flows from highly ordered configurations toward maximum entropy states, while still respecting the fundamental constraints imposed by the uncertainty principle.
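To make the uncertainty desaturation concrete, here is a minimal numerical sketch (using the normalised ℏ = 1 convention and the same squeeze-factor convention as the initialisation code above): a squeezed minimal-entropy Gaussian has the same uncertainty product and entropy for any squeeze factor, while adding isotropic noise raises both.

import numpy as np

hbar = 1.0  # normalized Planck's constant, as elsewhere in this note
min_uncertainty_product = hbar / 2

def gaussian_entropy(cov):
    """Differential entropy of a zero-mean Gaussian with covariance `cov`."""
    d = cov.shape[0]
    return 0.5 * (d * (1 + np.log(2 * np.pi)) + np.log(np.linalg.det(cov)))

for squeeze in [0.1, 0.5, 1.0]:
    # Minimal-entropy squeezed state (same convention as the initialisation above)
    cov = np.diag([squeeze * min_uncertainty_product,
                   min_uncertainty_product / squeeze])
    noisy = cov + 0.2 * np.eye(2)  # add isotropic 'classical' noise
    print(f"squeeze={squeeze:.1f}: "
          f"ΔxΔp={np.sqrt(np.linalg.det(cov)):.2f}, S={gaussian_entropy(cov):.3f} | "
          f"noisy ΔxΔp={np.sqrt(np.linalg.det(noisy)):.2f}, S={gaussian_entropy(noisy):.3f}")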
In systems with multiple position-momentum pairs, the gradient ascent process encounters saddle points that trigger a natural slowdown: some eigenvalue pairs evolve quickly while others hardly change. These saddle points represent partially equilibrated states where some degrees of freedom have reached maximum entropy while others remain ordered. At these critical points, some variables maintain quantum-like characteristics (uncertainty saturation) while others exhibit classical-like behavior (excess uncertainty).
This natural separation creates a hybrid system where quantum-like memory interfaces with classical-like processing - emerging naturally from the geometry of the entropy landscape under uncertainty constraints.
import numpy as np
from scipy.linalg import eigh
# Constants
hbar = 1.0 # Normalized Planck's constant
min_uncertainty_product = hbar/2
# Verify gradient calculation
print("Testing gradient calculation:")
test_Lambda = np.array([[2.0, 0.5], [0.5, 1.0]]) # Example precision matrix
analytical_grad, numerical_grad = check_entropy_gradient(test_Lambda)
# Verify if we're ascending or descending
entropy_before = compute_entropy(test_Lambda)
eigenvalues, eigenvectors = eigh(test_Lambda)
step_size = 0.01
eigenvalues_after = eigenvalues + step_size * analytical_grad
test_Lambda_after = eigenvectors @ np.diag(eigenvalues_after) @ eigenvectors.T
entropy_after = compute_entropy(test_Lambda_after)
print(f"Entropy before step: {entropy_before}")
print(f"Entropy after step: {entropy_after}")
print(f"Change in entropy: {entropy_after - entropy_before}")
if entropy_after > entropy_before:
print("We are ascending the entropy gradient")
else:
print("We are descending the entropy gradient")
test_grad = compute_entropy_gradient(test_Lambda)
print(f"Precision matrix:\n{test_Lambda}")
print(f"Entropy gradient:\n{test_grad}")
print(f"Entropy: {compute_entropy(test_Lambda):.4f}")
# Initialize system with 2 position-momentum pairs
n_pairs = 2
Lambda_init = initialize_multidimensional_state(n_pairs, squeeze_factors=[0.1, 0.5])
# Run gradient ascent
n_steps = 100
Lambda_history, entropy_history = gradient_ascent_entropy(Lambda_init, n_steps, learning_rate=0.01)
# Track metrics
uncertainty_metrics = track_uncertainty_metrics(Lambda_history)
saddle_metrics = detect_saddle_points(Lambda_history)
# Print results
print("\nFinal entropy:", entropy_history[-1])
print("Initial uncertainty products:", uncertainty_metrics['uncertainty_products'][0])
print("Final uncertainty products:", uncertainty_metrics['uncertainty_products'][-1])
print("Saddle point candidates at steps:", saddle_metrics['saddle_candidates'])
# Plot entropy evolution
plt.figure(figsize=plot.big_wide_figsize)
plt.plot(entropy_history)
plt.xlabel('Gradient Ascent Step')
plt.ylabel('Entropy')
plt.title('Entropy Evolution During Gradient Ascent')
plt.grid(True)
mlai.write_figure(filename='entropy-evolution-during-gradient-ascent.svg',
directory='./information-game')
# Plot uncertainty products evolution
plt.figure(figsize=plot.big_wide_figsize)
for i in range(n_pairs):
plt.plot(uncertainty_metrics['uncertainty_products'][:, i],
label=f'Pair {i+1}')
plt.axhline(y=min_uncertainty_product, color='k', linestyle='--',
label='Minimum uncertainty')
plt.xlabel('Gradient Ascent Step')
plt.ylabel('Uncertainty Product (ΔxΔp)')
plt.title('Evolution of Uncertainty Products')
plt.legend()
plt.grid(True)
mlai.write_figure(filename='uncertainty-products-evolution.svg',
directory='./information-game')
# Plot uncertainty ellipses at key steps
step_indices = [0, 20, 50, 99] # Initial, early, middle, final
plot_multidimensional_uncertainty(Lambda_history, step_indices)
# Plot eigenvalues evolution
plt.subplots(figsize=plot.big_wide_figsize)
for i in range(2*n_pairs):
plt.semilogy(saddle_metrics['eigenvalues_history'][:, i],
label=f'$\\lambda_{i+1}$')
plt.xlabel('Gradient Ascent Step')
plt.ylabel('Eigenvalue (log scale)')
plt.title('Evolution of Precision Matrix Eigenvalues')
plt.legend()
plt.grid(True)
plt.tight_layout()
mlai.write_figure(filename='eigenvalue-evolution.svg',
directory='./information-game')
Figure: Eigenvalue evolution during gradient ascent.
Figure: Uncertainty products evolution during gradient ascent.
Figure: Entropy evolution during gradient ascent.
Figure: Uncertainty ellipses for each position-momentum pair at the selected gradient ascent steps.
The uncertainty principle between parameters θ and capacity variables c is a fundamental feature of information reservoirs. We can visualize this uncertainty relation using phase space plots.
We can demonstrate how the uncertainty principle manifests in different regimes:
Quantum-like regime: Near minimal entropy, the uncertainty product Δθ⋅Δc approaches the lower bound k, creating wave-like interference patterns in probability space.
Transitional regime: As entropy increases, uncertainty relations begin to decouple, with Δθ⋅Δc>k.
Classical regime: At high entropy, parameter uncertainty dominates, creating diffusion-like dynamics with minimal influence from uncertainty relations.
The visualization shows probability distributions for these three regimes in both parameter space and capacity space.
import numpy as np
import matplotlib.pyplot as plt
import mlai.plot as plot
import mlai
from matplotlib.patches import Ellipse
# Visualization of uncertainty ellipses
fig, ax = plt.subplots(figsize=plot.big_figsize)
# Parameters for uncertainty ellipses
k = 1 # Uncertainty constant
centers = [(0, 0), (2, 2), (4, 4)]
widths = [0.25, 0.5, 2]
heights = [4, 2.5, 2]
#heights = [k/w for w in widths]
colors = ['blue', 'green', 'red']
labels = ['Quantum-like', 'Transitional', 'Classical']
# Plot uncertainty ellipses
for center, width, height, color, label in zip(centers, widths, heights, colors, labels):
ellipse = Ellipse(center, width, height,
edgecolor=color, facecolor='none',
linewidth=2, label=label)
ax.add_patch(ellipse)
# Add text label
ax.text(center[0], center[1] + height/2 + 0.2,
label, ha='center', color=color)
# Add area label (uncertainty product)
area = width * height
ax.text(center[0], center[1] - height/2 - 0.3,
f'Area = {width:.2f} $\\times$ {height: .2f} $\\pi$', ha='center')
# Set axis labels and limits
ax.set_xlabel('Parameter $\\theta$')
ax.set_ylabel('Capacity $C$')
ax.set_xlim(-3, 7)
ax.set_ylim(-3, 7)
ax.set_aspect('equal')
ax.grid(True, linestyle='--', alpha=0.7)
ax.set_title('Parameter-Capacity Uncertainty Relation')
# Add diagonal line representing constant uncertainty product
x = np.linspace(0.25, 6, 100)
y = k/x
ax.plot(x, y, 'k--', alpha=0.5, label='Minimum uncertainty: $\\Delta \\theta \\Delta C = k$')
ax.legend(loc='upper right')
mlai.write_figure(filename='uncertainty-ellipses.svg',
directory = './information-game')
Figure: Visualisation of the uncertainty trade-off between parameter precision and capacity.
This visualization helps explain why information reservoirs with quantum-like properties naturally emerge at minimal entropy. The uncertainty principle is not imposed but arises naturally from the constraints of Shannon information theory applied to physical systems operating at minimal entropy.
We now extend our analysis to much larger systems with thousands of position-momentum pairs. This allows us to observe emergent statistical behaviors and phase transitions that aren’t apparent in smaller systems.
Large-scale systems reveal how microscopic uncertainty constraints lead to macroscopic statistical patterns. By analyzing thousands of position-momentum pairs simultaneously, we can identify emergent behaviors and natural clustering of dynamical patterns.
# Optimized implementation for very large systems
def large_scale_gradient_ascent(n_pairs, steps=100, learning_rate=1, sample_interval=5):
"""
Memory-efficient implementation of gradient ascent for very large systems.
Parameters:
-----------
n_pairs: int
Number of position-momentum pairs
steps: int
Number of gradient steps to take
learning_rate: float
Step size for gradient ascent
sample_interval: int
Store state every sample_interval steps to save memory
Returns:
--------
sampled_states: list
Sparse history of states at sampled intervals
entropy_history: list
Complete history of entropy values
uncertainty_metrics: dict
Metrics tracking uncertainty products over time
"""
# Initialize with diagonal precision matrix (no need to store full matrix)
dim = 2 * n_pairs
eigenvalues = np.zeros(dim)
# Initialize with minimal entropy state
for i in range(n_pairs):
squeeze = 0.1 * (1 + (i % 10)) # Cycle through 10 different squeeze factors
eigenvalues[2*i] = 1.0 / (squeeze * min_uncertainty_product)
eigenvalues[2*i+1] = 1.0 / (min_uncertainty_product / squeeze)
# Storage for results (sparse to save memory)
sampled_states = []
entropy_history = []
uncertainty_products = np.zeros((steps+1, n_pairs))
# Initial entropy and uncertainty
entropy = 0.5 * (dim * (1 + np.log(2*np.pi)) - np.sum(np.log(eigenvalues)))
entropy_history.append(entropy)
# Track initial uncertainty products
for i in range(n_pairs):
uncertainty_products[0, i] = 1.0 / np.sqrt(eigenvalues[2*i] * eigenvalues[2*i+1])
# Store initial state
sampled_states.append(eigenvalues.copy())
# Gradient ascent loop
for step in range(steps):
# Compute gradient with respect to eigenvalues (diagonal precision)
grad = -1.0 / (2.0 * eigenvalues)
# Project gradient to respect constraints
for i in range(n_pairs):
idx1, idx2 = 2*i, 2*i+1
# Current uncertainty product (in eigenvalue space, this is inverse)
current_product = eigenvalues[idx1] * eigenvalues[idx2]
# If we're already at minimum uncertainty, project gradient
if abs(current_product - 1/min_uncertainty_product**2) < 1e-6:
                # Tangent direction that preserves the product λ1·λ2 (constant uncertainty)
                tangent = np.array([eigenvalues[idx1], -eigenvalues[idx2]])
tangent = tangent / np.linalg.norm(tangent)
# Project the gradient onto this tangent
pair_gradient = np.array([grad[idx1], grad[idx2]])
projection = np.dot(pair_gradient, tangent) * tangent
grad[idx1] = projection[0]
grad[idx2] = projection[1]
# Update eigenvalues
eigenvalues += learning_rate * grad
# Ensure eigenvalues remain positive
eigenvalues = np.maximum(eigenvalues, 1e-10)
# Compute entropy
entropy = 0.5 * (dim * (1 + np.log(2*np.pi)) - np.sum(np.log(eigenvalues)))
entropy_history.append(entropy)
# Track uncertainty products
for i in range(n_pairs):
uncertainty_products[step+1, i] = 1.0 / np.sqrt(eigenvalues[2*i] * eigenvalues[2*i+1])
# Store state at sampled intervals
if step % sample_interval == 0 or step == steps-1:
sampled_states.append(eigenvalues.copy())
# Compute regime classifications
regimes = np.zeros((steps+1, n_pairs), dtype=object)
for step in range(steps+1):
for i in range(n_pairs):
if abs(uncertainty_products[step, i] - min_uncertainty_product) < 0.1*min_uncertainty_product:
regimes[step, i] = "Quantum-like"
else:
regimes[step, i] = "Classical-like"
uncertainty_metrics = {
'uncertainty_products': uncertainty_products,
'regimes': regimes
}
return sampled_states, entropy_history, uncertainty_metrics
# Add gradient check function for large systems
def check_large_system_gradient(n_pairs=10, epsilon=1e-6):
"""
Check the analytical gradient against numerical gradient for a large system.
Parameters:
-----------
n_pairs: int
Number of position-momentum pairs to test
epsilon: float
Small perturbation for numerical gradient
Returns:
--------
max_diff: float
Maximum difference between analytical and numerical gradients
"""
# Initialize a small test system
dim = 2 * n_pairs
eigenvalues = np.zeros(dim)
# Initialize with minimal entropy state
for i in range(n_pairs):
squeeze = 0.1 * (1 + (i % 10))
eigenvalues[2*i] = 1.0 / (squeeze * min_uncertainty_product)
eigenvalues[2*i+1] = 1.0 / (min_uncertainty_product / squeeze)
# Compute analytical gradient
analytical_grad = -1.0 / (2.0 * eigenvalues)
# Compute numerical gradient
numerical_grad = np.zeros_like(eigenvalues)
# Function to compute entropy from eigenvalues
def compute_entropy_from_eigenvalues(evals):
return 0.5 * (dim * (1 + np.log(2*np.pi)) - np.sum(np.log(evals)))
# Initial entropy
base_entropy = compute_entropy_from_eigenvalues(eigenvalues)
# Compute numerical gradient
for i in range(dim):
# Perturb eigenvalue up
eigenvalues_plus = eigenvalues.copy()
eigenvalues_plus[i] += epsilon
entropy_plus = compute_entropy_from_eigenvalues(eigenvalues_plus)
# Perturb eigenvalue down
eigenvalues_minus = eigenvalues.copy()
eigenvalues_minus[i] -= epsilon
entropy_minus = compute_entropy_from_eigenvalues(eigenvalues_minus)
# Compute numerical gradient
numerical_grad[i] = (entropy_plus - entropy_minus) / (2 * epsilon)
# Compare
diff = np.abs(analytical_grad - numerical_grad)
max_diff = np.max(diff)
avg_diff = np.mean(diff)
print(f"Gradient check for {n_pairs} position-momentum pairs:")
print(f"Maximum difference: {max_diff:.8f}")
print(f"Average difference: {avg_diff:.8f}")
# Verify gradient ascent direction
step_size = 0.01
eigenvalues_after = eigenvalues + step_size * analytical_grad
entropy_after = compute_entropy_from_eigenvalues(eigenvalues_after)
print(f"Entropy before step: {base_entropy:.6f}")
print(f"Entropy after step: {entropy_after:.6f}")
print(f"Change in entropy: {entropy_after - base_entropy:.6f}")
if entropy_after > base_entropy:
print("✓ Gradient ascent confirmed: entropy increases")
else:
print("✗ Error: entropy decreases with gradient step")
return max_diff
# Analyze statistical properties of large-scale system
def analyze_large_system(uncertainty_metrics, n_pairs, steps):
"""
Analyze statistical properties of a large-scale system.
Parameters:
-----------
uncertainty_metrics: dict
Metrics from large_scale_gradient_ascent
n_pairs: int
Number of position-momentum pairs
steps: int
Number of gradient steps taken
Returns:
--------
analysis: dict
Statistical analysis results
"""
uncertainty_products = uncertainty_metrics['uncertainty_products']
regimes = uncertainty_metrics['regimes']
# Compute statistics over time
mean_uncertainty = np.mean(uncertainty_products, axis=1)
std_uncertainty = np.std(uncertainty_products, axis=1)
min_uncertainty_over_time = np.min(uncertainty_products, axis=1)
max_uncertainty_over_time = np.max(uncertainty_products, axis=1)
# Count regime transitions
quantum_count = np.zeros(steps+1)
for step in range(steps+1):
quantum_count[step] = np.sum(regimes[step] == "Quantum-like")
# Identify clusters of similar behavior
from sklearn.cluster import KMeans
# Reshape to have each pair as a sample with its uncertainty trajectory as features
pair_trajectories = uncertainty_products.T # shape: (n_pairs, steps+1)
# Use fewer clusters for very large systems
    n_clusters = max(2, min(10, n_pairs // 100))
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
cluster_labels = kmeans.fit_predict(pair_trajectories)
# Count pairs in each cluster
cluster_counts = np.bincount(cluster_labels, minlength=n_clusters)
# Get representative pairs from each cluster (closest to centroid)
representative_pairs = []
for i in range(n_clusters):
cluster_members = np.where(cluster_labels == i)[0]
if len(cluster_members) > 0:
# Find pair closest to cluster centroid
centroid = kmeans.cluster_centers_[i]
distances = np.linalg.norm(pair_trajectories[cluster_members] - centroid, axis=1)
closest_idx = cluster_members[np.argmin(distances)]
representative_pairs.append(closest_idx)
return {
'mean_uncertainty': mean_uncertainty,
'std_uncertainty': std_uncertainty,
'min_uncertainty': min_uncertainty_over_time,
'max_uncertainty': max_uncertainty_over_time,
'quantum_count': quantum_count,
'quantum_fraction': quantum_count / n_pairs,
'cluster_counts': cluster_counts,
'representative_pairs': representative_pairs,
'cluster_labels': cluster_labels
}
# Visualize results for large-scale system
def visualize_large_system(sampled_states, entropy_history, uncertainty_metrics, analysis, n_pairs, steps):
"""
Create visualizations for large-scale system results.
Parameters:
-----------
sampled_states: list
Sparse history of eigenvalues
entropy_history: list
History of entropy values
uncertainty_metrics: dict
Uncertainty metrics over time
analysis: dict
Statistical analysis results
n_pairs: int
Number of position-momentum pairs
steps: int
Number of gradient steps taken
"""
# Plot entropy evolution
plt.figure(figsize=(10, 6))
plt.plot(entropy_history)
plt.xlabel('Gradient Ascent Step')
plt.ylabel('Entropy')
plt.title(f'Entropy Evolution for {n_pairs} Position-Momentum Pairs')
plt.grid(True)
# Plot uncertainty statistics
plt.figure(figsize=(10, 6))
plt.plot(analysis['mean_uncertainty'], label='Mean uncertainty')
plt.fill_between(range(steps+1),
analysis['mean_uncertainty'] - analysis['std_uncertainty'],
analysis['mean_uncertainty'] + analysis['std_uncertainty'],
alpha=0.3, label='±1 std dev')
plt.plot(analysis['min_uncertainty'], 'g--', label='Min uncertainty')
plt.plot(analysis['max_uncertainty'], 'r--', label='Max uncertainty')
plt.axhline(y=min_uncertainty_product, color='k', linestyle=':', label='Quantum limit')
plt.xlabel('Gradient Ascent Step')
plt.ylabel('Uncertainty Product (ΔxΔp)')
plt.title(f'Uncertainty Evolution Statistics for {n_pairs} Pairs')
plt.legend()
plt.grid(True)
# Plot quantum-classical transition
plt.figure(figsize=(10, 6))
plt.plot(analysis['quantum_fraction'] * 100)
plt.xlabel('Gradient Ascent Step')
plt.ylabel('Percentage of Pairs (%)')
plt.title('Percentage of Pairs in Quantum-like Regime')
plt.ylim(0, 100)
plt.grid(True)
# Plot representative pairs from each cluster
plt.figure(figsize=(12, 8))
for i, pair_idx in enumerate(analysis['representative_pairs']):
cluster_idx = analysis['cluster_labels'][pair_idx]
count = analysis['cluster_counts'][cluster_idx]
plt.plot(uncertainty_metrics['uncertainty_products'][:, pair_idx],
label=f'Cluster {i+1} ({count} pairs, {count/n_pairs*100:.1f}%)')
plt.axhline(y=min_uncertainty_product, color='k', linestyle=':', label='Quantum limit')
plt.xlabel('Gradient Ascent Step')
    plt.ylabel('Uncertainty Product ($\\Delta x \\Delta p$)')
plt.title('Representative Uncertainty Trajectories from Each Cluster')
plt.legend()
plt.grid(True)
# Visualize uncertainty ellipses for representative pairs
if len(sampled_states) > 0:
# Get indices of sampled steps
sampled_steps = list(range(0, steps+1, (steps+1)//len(sampled_states)))
if sampled_steps[-1] != steps:
sampled_steps[-1] = steps
# Only visualize a few representative pairs
pairs_to_visualize = analysis['representative_pairs'][:min(4, len(analysis['representative_pairs']))]
fig, axes = plt.subplots(len(pairs_to_visualize), len(sampled_states),
figsize=(4*len(sampled_states), 3*len(pairs_to_visualize)))
# Handle case of single pair or single step
if len(pairs_to_visualize) == 1:
axes = axes.reshape(1, -1)
if len(sampled_states) == 1:
axes = axes.reshape(-1, 1)
for row, pair_idx in enumerate(pairs_to_visualize):
for col, step_idx in enumerate(range(len(sampled_states))):
ax = axes[row, col]
eigenvalues = sampled_states[step_idx]
# Extract eigenvalues for this pair
idx1, idx2 = 2*pair_idx, 2*pair_idx+1
pos_eigenvalue = eigenvalues[idx1]
mom_eigenvalue = eigenvalues[idx2]
# Convert precision eigenvalues to covariance eigenvalues
cov_eigenvalues = np.array([1/pos_eigenvalue, 1/mom_eigenvalue])
# Calculate ellipse parameters (assuming principal axes aligned with coordinate axes)
width, height = 2 * np.sqrt(cov_eigenvalues)
# Create ellipse
ellipse = Ellipse((0, 0), width=width, height=height, angle=0,
edgecolor='blue', facecolor='lightblue', alpha=0.5)
# Add to plot
ax.add_patch(ellipse)
ax.set_xlim(-3, 3)
ax.set_ylim(-3, 3)
ax.set_aspect('equal')
ax.grid(True)
# Add minimum uncertainty circle
min_circle = plt.Circle((0, 0), min_uncertainty_product,
fill=False, color='red', linestyle='--')
ax.add_patch(min_circle)
# Compute uncertainty product
uncertainty_product = np.sqrt(1/(pos_eigenvalue * mom_eigenvalue))
# Determine regime
if abs(uncertainty_product - min_uncertainty_product) < 0.1*min_uncertainty_product:
regime = "Quantum-like"
color = 'red'
else:
regime = "Classical-like"
color = 'blue'
# Add labels
if row == 0:
step_num = sampled_steps[step_idx]
ax.set_title(f"Step {step_num}")
if col == 0:
cluster_idx = analysis['cluster_labels'][pair_idx]
count = analysis['cluster_counts'][cluster_idx]
ax.set_ylabel(f"Cluster {row+1}\n({count} pairs)")
# Add uncertainty product text
ax.text(0.05, 0.95, f"ΔxΔp = {uncertainty_product:.2f}",
transform=ax.transAxes, fontsize=10, verticalalignment='top')
# Add regime text
ax.text(0.05, 0.85, regime, transform=ax.transAxes,
fontsize=10, verticalalignment='top', color=color)
ax.set_xlabel("Position")
ax.set_ylabel("Momentum")
plt.tight_layout()
In large-scale systems, we observe several emergent phenomena that aren’t apparent in smaller systems:
Statistical phase transitions: As the system evolves, we observe a gradual transition from predominantly quantum-like behavior to predominantly classical-like behavior. This transition resembles a phase transition in statistical physics.
Natural clustering: The thousands of position-momentum pairs naturally organize into clusters with similar dynamical behaviors. Some clusters maintain quantum-like characteristics for longer periods, while others quickly transition to classical-like behavior.
Scale-invariant patterns: The statistical properties of the system show remarkable consistency across different scales, suggesting underlying universal principles in the entropy-uncertainty relationship.
The quantum-classical boundary, which appears sharp in small systems, becomes a statistical property in large systems. At any given time, some fraction of the system exhibits quantum-like behavior while the remainder shows classical-like characteristics. This fraction evolves over time, creating a dynamic boundary between quantum and classical regimes.
The clustering analysis reveals natural groupings of position-momentum pairs based on their dynamical trajectories. These clusters represent different “modes” of behavior within the large system, with some modes maintaining quantum coherence for longer periods while others quickly decohere into classical-like states.
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans
# Constants
hbar = 1.0 # Normalized Planck's constant
min_uncertainty_product = hbar/2
# Perform gradient check on a smaller test system
print("Performing gradient check for large system implementation:")
gradient_error = check_large_system_gradient(n_pairs=10)
print(f"Gradient check completed with maximum error: {gradient_error:.8f}")
# Run large-scale simulation
n_pairs = 5000 # 5000 position-momentum pairs (10,000×10,000 matrix)
steps = 100 # Fewer steps for large system
# Run the optimized implementation
sampled_states, entropy_history, uncertainty_metrics = large_scale_gradient_ascent(
n_pairs=n_pairs, steps=steps, learning_rate=0.01, sample_interval=5)
# Analyze results
analysis = analyze_large_system(uncertainty_metrics, n_pairs, steps)
import matplotlib.pyplot as plt
import mlai.plot as plot
import mlai
from matplotlib.patches import Ellipse, Circle
# Visualize results
visualize_large_system(sampled_states, entropy_history, uncertainty_metrics,
analysis, n_pairs, steps)
# Additional plot: Phase transition visualization
plt.figure(figsize=(10, 6))
quantum_fraction = analysis['quantum_fraction'] * 100
classical_fraction = 100 - quantum_fraction
plt.stackplot(range(steps+1),
[quantum_fraction, classical_fraction],
labels=['Quantum-like', 'Classical-like'],
colors=['red', 'blue'], alpha=0.7)
plt.xlabel('Gradient Ascent Step')
plt.ylabel('Percentage of System (%)')
plt.title('Quantum-Classical Phase Transition')
plt.legend(loc='center right')
plt.ylim(0, 100)
plt.grid(True)
mlai.write_figure(filename='large-scale-gradient-ascent-quantum-classical-phase-transition.svg',
directory='./information-game')
Figure: Large-scale gradient ascent reveals a quantum-classical phase transition.
The large-scale simulation reveals how microscopic uncertainty constraints lead to macroscopic statistical patterns. The system naturally organizes into regions of quantum-like and classical-like behavior, with a dynamic boundary that evolves over time.
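As a rough way to quantify this statistical transition, and assuming the `analysis` dictionary returned by `analyze_large_system` above, one can estimate the step at which half of the pairs have left the quantum-like regime:

import numpy as np

# 'quantum_fraction' is the fraction of pairs within 10% of the minimum
# uncertainty product at each step (computed by analyze_large_system above).
below_half = np.where(analysis['quantum_fraction'] < 0.5)[0]
if len(below_half) > 0:
    print(f"Half of the pairs are classical-like by step {below_half[0]}")
else:
    print("A majority of pairs remain quantum-like throughout the run")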
This perspective provides a new way to understand the quantum-classical transition not as a sharp boundary, but as a statistical property of large systems. The fraction of the system exhibiting quantum-like behavior gradually decreases as entropy increases, creating a smooth transition between quantum and classical regimes.
This approach to large-scale quantum-classical systems provides a powerful framework for understanding how microscopic quantum constraints manifest in macroscopic statistical behaviors. It bridges quantum mechanics and statistical physics through the common language of information theory and entropy.
To illustrate saddle points and information reservoirs, we need at least a 4-bin system. This creates a 3-dimensional parameter space where we can observe genuine saddle points.
Consider a 4-bin system parameterized by natural parameters θ1, θ2, and θ3 (with one constraint). A saddle point occurs where the gradient ∇θS=0, but the Hessian has mixed eigenvalues - some positive, some negative.
At these points, the eigendecomposition of the Fisher information matrix G(θ) reveals the structure of the critical point: the eigenvectors of G(θ) at the saddle point determine which parameter combinations form information reservoirs.
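Before running the gradient ascent, it can help to look at the Fisher information of the 4-bin exponential family directly. For a categorical distribution with natural parameters θ the Fisher information matrix is G(θ) = diag(p) − ppᵀ; its eigendecomposition exposes one exactly-zero eigenvalue (the normalisation constraint) alongside the informative directions. The following is a small standalone sketch, separate from the main simulation below:

import numpy as np

def fisher_information(theta):
    """Fisher information of the 4-bin exponential family: G = diag(p) - p p^T."""
    p = np.exp(theta - np.max(theta))
    p = p / np.sum(p)
    return np.diag(p) - np.outer(p, p)

theta = np.array([0.5, -0.3, 0.1, -0.3])
theta = theta - np.mean(theta)  # respect the sum constraint
eigenvalues, eigenvectors = np.linalg.eigh(fisher_information(theta))
print("Fisher information eigenvalues:", np.round(eigenvalues, 4))
# The eigenvector with (near-)zero eigenvalue is the constrained direction; the
# remaining eigenvectors are the parameter combinations that can become
# information reservoirs when their eigenvalues become small.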
import numpy as np
# Exponential family entropy functions for 4-bin system
def exponential_family_entropy(theta):
"""
Compute entropy of a 4-bin exponential family distribution
parameterized by natural parameters theta
"""
# Compute the log-partition function (normalization constant)
log_Z = np.log(np.sum(np.exp(theta)))
# Compute probabilities
p = np.exp(theta - log_Z)
# Compute entropy: -sum(p_i * log(p_i))
entropy = -np.sum(p * np.log(p), where=p>0)
return entropy
def entropy_gradient(theta):
"""
Compute the gradient of the entropy with respect to theta
"""
# Compute the log-partition function (normalization constant)
log_Z = np.log(np.sum(np.exp(theta)))
# Compute probabilities
p = np.exp(theta - log_Z)
    # Gradient of the entropy: dS/dθ_i = -p_i θ_i + p_i (θ·p)
    return -p*theta + p*(np.dot(p, theta))
# Add a gradient check function
def check_gradient(theta, epsilon=1e-6):
"""
Check the analytical gradient against numerical gradient
"""
# Compute analytical gradient
analytical_grad = entropy_gradient(theta)
# Compute numerical gradient
numerical_grad = np.zeros_like(theta)
for i in range(len(theta)):
theta_plus = theta.copy()
theta_plus[i] += epsilon
entropy_plus = exponential_family_entropy(theta_plus)
theta_minus = theta.copy()
theta_minus[i] -= epsilon
entropy_minus = exponential_family_entropy(theta_minus)
numerical_grad[i] = (entropy_plus - entropy_minus) / (2 * epsilon)
# Compare
print("Analytical gradient:", analytical_grad)
print("Numerical gradient:", numerical_grad)
print("Difference:", np.abs(analytical_grad - numerical_grad))
return analytical_grad, numerical_grad
# Project gradient to respect constraints (sum of theta is constant)
def project_gradient(theta, grad):
"""
Project gradient to ensure sum constraint is respected
"""
# Project to space where sum of components is zero
return grad - np.mean(grad)
# Perform gradient ascent on entropy
def gradient_ascent_four_bin(theta_init, steps=100, learning_rate=1):
"""
Perform gradient ascent on entropy for 4-bin system
"""
theta = theta_init.copy()
theta_history = [theta.copy()]
entropy_history = [exponential_family_entropy(theta)]
for _ in range(steps):
# Compute gradient
grad = entropy_gradient(theta)
proj_grad = project_gradient(theta, grad)
# Update parameters
theta += learning_rate * proj_grad
# Store history
theta_history.append(theta.copy())
entropy_history.append(exponential_family_entropy(theta))
return np.array(theta_history), np.array(entropy_history)
# Test the gradient calculation
test_theta = np.array([0.5, -0.3, 0.1, -0.3])
test_theta = test_theta - np.mean(test_theta) # Ensure constraint is satisfied
print("Testing gradient calculation:")
analytical_grad, numerical_grad = check_gradient(test_theta)
# Verify if we're ascending or descending
entropy_before = exponential_family_entropy(test_theta)
step_size = 0.01
test_theta_after = test_theta + step_size * analytical_grad
entropy_after = exponential_family_entropy(test_theta_after)
print(f"Entropy before step: {entropy_before}")
print(f"Entropy after step: {entropy_after}")
print(f"Change in entropy: {entropy_after - entropy_before}")
if entropy_after > entropy_before:
print("We are ascending the entropy gradient")
else:
print("We are descending the entropy gradient")
# Initialize with asymmetric distribution (away from saddle point)
theta_init = np.array([1.0, -0.5, -0.2, -0.3])
theta_init = theta_init - np.mean(theta_init) # Ensure constraint is satisfied
# Run gradient ascent
theta_history, entropy_history = gradient_ascent_four_bin(theta_init, steps=100, learning_rate=1.0)
# Create a grid for visualization
x = np.linspace(-2, 2, 100)
y = np.linspace(-2, 2, 100)
X, Y = np.meshgrid(x, y)
# Compute entropy at each grid point (with constraint on theta3 and theta4)
Z = np.zeros_like(X)
for i in range(X.shape[0]):
for j in range(X.shape[1]):
# Create full theta vector with constraint that sum is zero
theta1, theta2 = X[i,j], Y[i,j]
theta3 = -0.5 * (theta1 + theta2)
theta4 = -0.5 * (theta1 + theta2)
theta = np.array([theta1, theta2, theta3, theta4])
Z[i,j] = exponential_family_entropy(theta)
# Compute gradient field
dX = np.zeros_like(X)
dY = np.zeros_like(Y)
for i in range(X.shape[0]):
for j in range(X.shape[1]):
# Create full theta vector with constraint
theta1, theta2 = X[i,j], Y[i,j]
theta3 = -0.5 * (theta1 + theta2)
theta4 = -0.5 * (theta1 + theta2)
theta = np.array([theta1, theta2, theta3, theta4])
# Get full gradient and project
grad = entropy_gradient(theta)
proj_grad = project_gradient(theta, grad)
# Store first two components
dX[i,j] = proj_grad[0]
dY[i,j] = proj_grad[1]
# Normalize gradient vectors for better visualization
norm = np.sqrt(dX**2 + dY**2)
# Avoid division by zero
norm = np.where(norm < 1e-10, 1e-10, norm)
dX_norm = dX / norm
dY_norm = dY / norm
# A few gradient vectors for visualization
stride = 10
import matplotlib.pyplot as plt
import mlai.plot as plot
import mlai
fig = plt.figure(figsize=plot.big_wide_figsize)
# Create contour lines only (no filled contours)
contours = plt.contour(X, Y, Z, levels=15, colors='black', linewidths=0.8)
plt.clabel(contours, inline=True, fontsize=8, fmt='%.2f')
# Add gradient vectors (normalized for direction, but scaled by magnitude for visibility)
plt.quiver(X[::stride, ::stride], Y[::stride, ::stride],
dX_norm[::stride, ::stride], dY_norm[::stride, ::stride],
color='r', scale=30, width=0.003, scale_units='width')
# Plot the gradient ascent trajectory
plt.plot(theta_history[:, 0], theta_history[:, 1], 'b-', linewidth=2,
label='Gradient Ascent Path')
plt.scatter(theta_history[0, 0], theta_history[0, 1], color='green', s=100,
marker='o', label='Start')
plt.scatter(theta_history[-1, 0], theta_history[-1, 1], color='purple', s=100,
marker='*', label='End')
# Add labels and title
plt.xlabel('$\\theta_1$')
plt.ylabel('$\\theta_2$')
plt.title('Entropy Contours with Gradient Field')
# Mark the saddle point (approximately at origin for this system)
plt.scatter([0], [0], color='yellow', s=100, marker='*',
edgecolor='black', zorder=10, label='Saddle Point')
plt.legend()
mlai.write_figure(filename='simplified-saddle-point-example.svg',
directory = './information-game')
# Plot entropy evolution during gradient ascent
plt.figure(figsize=plot.big_figsize)
plt.plot(entropy_history)
plt.xlabel('Gradient Ascent Step')
plt.ylabel('Entropy')
plt.title('Entropy Evolution During Gradient Ascent')
plt.grid(True)
mlai.write_figure(filename='four-bin-entropy-evolution.svg',
directory = './information-game')
Figure: Visualisation of a saddle point projected down to two dimensions.
Figure: Entropy evolution during gradient ascent on the four-bin system.
The animation of system evolution would show initial rapid movement along high-eigenvalue directions, progressive slowing in directions with low eigenvalues and formation of information reservoirs in the critically slowed directions. Parameter-capacity uncertainty emerges naturally at the saddle point.
Saddle points represent critical transitions in the game’s evolution where the gradient $\nabla_\theta S \approx 0$ but the game is not at a maximum or minimum. At these points, the eigenvalues of the Fisher information matrix $G(\theta)$ separate into near-zero and large values.
This creates a natural separation between “memory” variables (associated with near-zero eigenvalues) and “processing” variables (associated with large eigenvalues). The game’s behavior becomes highly non-isotropic in parameter space.
At saddle points, direct gradient ascent stalls, and the game must leverage the Fourier duality between parameters and capacity variables, $c(M) = \mathcal{F}[\theta(M)]$, to continue entropy production.
These saddle points often coincide with phase transitions between parameter-dominated and capacity-dominated regimes, where the game’s fundamental character changes in terms of information processing capabilities.
At saddle points, we see the first manifestation of the uncertainty principle that will be explored in more detail. The relationship between parameters and capacity variables becomes important as the game navigates these critical regions, mediated by the Fourier duality relationship $c(M) = \mathcal{F}[\theta(M)]$.
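A crude numerical illustration of this duality intuition (not the game’s actual transform) is the familiar Fourier pair: a profile that is narrow in one representation is necessarily broad in the dual representation, with the product of widths roughly constant.

import numpy as np

n = 1024
x = np.linspace(-10, 10, n)
dx = x[1] - x[0]
for width in [0.2, 1.0, 3.0]:
    f = np.exp(-0.5 * (x / width)**2)            # narrow or broad 'parameter' profile
    F = np.abs(np.fft.fftshift(np.fft.fft(f)))   # magnitude of its Fourier dual
    k = 2 * np.pi * np.fft.fftshift(np.fft.fftfreq(n, d=dx))
    w_theta = np.sqrt(np.sum(x**2 * f) / np.sum(f))   # width in the parameter representation
    w_dual = np.sqrt(np.sum(k**2 * F) / np.sum(F))    # width in the dual representation
    print(f"width in θ-space: {w_theta:.2f}, width in dual space: {w_dual:.2f}, "
          f"product: {w_theta * w_dual:.2f}")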
The emergence of critically slowed directions at saddle points directly leads to the formation of information reservoirs that we’ll explore in depth. These reservoirs form when certain parameter combinations become effectively “frozen” due to near-zero eigenvalues in the Fisher information matrix. This natural separation of timescales creates a hierarchical memory structure that resembles biological information processing systems, where different variables operate at different temporal scales. The game’s deliberate use of steepest ascent rather than natural gradient ensures these reservoirs form organically as the system evolves.
In the game’s evolution, we follow steepest ascent in parameter space to maximize entropy. Let’s contrast with the natural gradient approach that is often used in information geometry.
The steepest ascent direction in Euclidean space is given by $\Delta \theta_{\text{steepest}} = \eta \nabla_\theta S = \eta g$.
In contrast, the natural gradient adjusts the update direction according to the Fisher information geometry, $\Delta \theta_{\text{natural}} = \eta G(\theta)^{-1} \nabla_\theta S = \eta G(\theta)^{-1} g$.
Steepest ascent slows dramatically in directions where the gradient is small, leading to extremely slow progress along the critically slowed modes. This actually helps the game by preserving information in these modes while allowing continued evolution in other directions.
Natural gradient would normalize the updates by the Fisher information, potentially accelerating progress in critically slowed directions. This would destroy the natural emergence of information reservoirs that we desire.
The use of steepest ascent rather than natural gradient is deliberate in our game. It allows the Fisher information matrix’s eigenvalue structure to directly influence the temporal dynamics, creating a natural separation of timescales that preserves information in critically slowed modes while allowing rapid evolution in others.
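A toy calculation, with illustrative numbers for the eigenvalues and gradient, makes the contrast concrete: steepest ascent barely moves the critically slowed direction, while the natural gradient rescales it and erases the timescale separation.

import numpy as np

# Toy example: one fast and one critically slowed direction near a saddle point.
# Illustrative values only; in the game these come from the Fisher matrix G(θ).
G = np.diag([10.0, 1e-4])            # Fisher information eigenvalues (fast, slow)
grad_S = np.array([1.0, 1e-4])       # entropy gradient: small along the slow mode
eta = 0.1

steepest_step = eta * grad_S                     # Δθ = η ∇_θ S
natural_step = eta * np.linalg.solve(G, grad_S)  # Δθ = η G(θ)^{-1} ∇_θ S

print("Steepest ascent step:  ", steepest_step)  # slow mode barely moves -> reservoir forms
print("Natural gradient step: ", natural_step)   # slow mode rescaled -> separation lost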
As the game approaches a saddle point:

- The gradient $\nabla_\theta S$ approaches zero in some directions but remains non-zero in others.
- The eigendecomposition of the Fisher information matrix $G(\theta) = V \Lambda V^\top$ reveals which directions are critically slowed.
- Update magnitudes in different directions become proportional to their corresponding eigenvalues.
- This creates the hierarchical timescale separation that forms the basis of our memory structure.
This behavior creates a computational architecture where different variables naturally assume different functional roles based on their update dynamics, without requiring explicit design. The information geometry of the parameter space, combined with steepest ascent dynamics, self-organizes the game into memory and processing components.
The saddle point dynamics in Jaynes’ World provide a mathematical framework for understanding how the game navigates the information landscapes. The balance between fast-evolving “processing” variables and slow-evolving “memory” variables offers insights into how complexity might emerge in environments that instantaneously maximise entropy.
The steepest ascent dynamics in our system naturally connect to least action principles in physics. We can demonstrate this connection through visualizing how our uncertainty ellipses evolve along paths of steepest entropy increase.
For our entropy game, we can define an information-theoretic action, $\mathcal{A}[\gamma] = \int_0^T \left( \dot{\theta} \cdot \nabla_\theta S - \tfrac{1}{2} \|\dot{\theta}\|^2 \right) \text{d}t$.
This is exactly what our steepest ascent dynamics implement: the system follows the entropy gradient, with the learning rate controlling the size of parameter updates. As the system evolves, it naturally creates information reservoirs in directions where the gradient is small but non-zero.
import numpy as np
def simulate_action_path(G_init, steps=100, learning_rate=0.01):
"""
Simulate path through parameter space following entropy gradient.
Returns both the path and the entropy production rate.
"""
G = G_init.copy()
path_history = []
entropy_production = []
for _ in range(steps):
# Get current state
eigenvalues, eigenvectors = eigh(G)
        # Entropy gradient with respect to the precision eigenvalues
        # (Gaussian entropy gives dS/dλ_i = -1/(2 λ_i))
        grad = -1.0 / (2.0 * eigenvalues)
        proj_grad = grad  # no additional constraint projection in this sketch
# Store current point and entropy production rate
path_history.append(eigenvalues.copy())
entropy_production.append(np.dot(proj_grad, grad))
# Update eigenvalues
eigenvalues += learning_rate * proj_grad
eigenvalues = np.maximum(eigenvalues, 1e-10)
# Reconstruct G
G = eigenvectors @ np.diag(eigenvalues) @ eigenvectors.T
return np.array(path_history), np.array(entropy_production)
# Initialize system with 2 position-momentum pairs
n_pairs = 2
G_init = initialize_multidimensional_state(n_pairs, squeeze_factors=[0.1, 0.2])
# Simulate path
path_history, entropy_production = simulate_action_path(G_init)
# Create figure with two subplots
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
# Plot 1: Path through eigenvalue space
ax.plot(path_history[:, 0], path_history[:, 1], 'r-', label='Pair 1')
ax.plot(path_history[:, 2], path_history[:, 3], 'b-', label='Pair 2')
ax.set_xlabel('Position eigenvalue')
ax.set_ylabel('Momentum eigenvalue')
ax.set_title('Path Through Parameter Space')
ax.legend()
# Add minimum uncertainty hyperbolas
x = np.linspace(0.1, 5, 100)
ax.plot(x, min_uncertainty_product/x, 'k--', alpha=0.5, label='Min uncertainty')
ax.set_xscale('log')
ax.set_yscale('log')
ax.grid(True)
mlai.write_figure(filename='gradient-flow-least-action.svg',
directory='./information-game')
# Plot 2: Uncertainty ellipses at selected points
steps_to_show = [0, 25, 50, -1]
# Reuse Lambda_history from the earlier gradient_ascent_entropy run for the ellipse plots
fig2 = plot_multidimensional_uncertainty(Lambda_history, step_indices=steps_to_show, pairs_to_plot=[0, 1])
fig2.suptitle('Evolution of Uncertainty Ellipses')
plt.tight_layout()
mlai.write_figure(filename='gradient-flow-least-action-uncertainty-ellipses.svg',
directory='./information-game')
Figure: Visualization of the gradient flow through parameter space.
Figure: Visualization of the corresponding evolution of uncertainty ellipses (right). The dashed lines show minimum uncertainty bounds.
The action integral governing this evolution can be written $\mathcal{A}[\gamma] = \int_0^T \left( \dot{\theta} \cdot \nabla_\theta S - \tfrac{1}{2} \dot{\theta}^\top G \dot{\theta} \right) \text{d}t$, where the Fisher information matrix $G$ now supplies the metric for the kinetic term.
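As a rough numerical check, and reusing `path_history` from `simulate_action_path` above, the Euclidean-metric form of the action introduced earlier can be accumulated along the simulated path with finite differences:

import numpy as np

# Discrete approximation of A[γ] ≈ Σ_t (θ̇·∇_θ S − ½‖θ̇‖²) along the simulated path.
# Assumes path_history (the eigenvalue trajectory) from simulate_action_path above.
theta_dot = np.diff(path_history, axis=0)           # per-step velocity
grad_S = -1.0 / (2.0 * path_history[:-1])           # Gaussian entropy gradient, dS/dλ = -1/(2λ)
action_density = np.sum(theta_dot * grad_S, axis=1) - 0.5 * np.sum(theta_dot**2, axis=1)
print(f"Approximate information action along the path: {np.sum(action_density):.4f}")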
The gradient flow dynamics we’ve been exploring have interesting connections to Roy Frieden’s Extreme Physical Information (EPI) principle. This section explores the formal equivalence between entropy maximization in exponential families and EPI optimization.
In our entropy game, we’ve been maximizing entropy through gradient ascent on the natural parameters $\theta$. For exponential family distributions, the entropy gradient has a particularly elegant form, $\nabla_\theta S(Z) = \mathbb{E}[T(Z)] - \nabla_\theta A(\theta)$, where $T(Z)$ denotes the sufficient statistics and $A(\theta)$ the log-partition function.
Roy Frieden’s Extreme Physical Information (EPI) principle proposes that physical laws arise from the optimization of an information-theoretic functional. The EPI functional is defined as $\Delta = I(X|\theta) - J(M|\theta)$, where $I(X|\theta)$ is the Fisher information carried by the observable variables and $J(M|\theta)$ is the corresponding bound information in the memory variables.
For exponential families, the Fisher information can be expressed in terms of the Fisher information matrix, $I(X|\theta) = \mathrm{Tr}[G_X(\theta)]$, the trace of the block of $G(\theta)$ associated with the observable variables.
The formal equivalence between entropy maximization and EPI optimization can be established by examining their equilibrium conditions. For entropy maximization, equilibrium occurs when $\nabla_\theta S(Z) = \mathbb{E}[T(Z)] - \nabla_\theta A(\theta) = 0$.
For EPI optimization, equilibrium occurs when $\frac{\delta \Delta}{\delta \rho} = \text{constant}$.
When we express this in terms of natural parameters and apply the calculus of variations, we arrive at the same moment-matching condition, $\mathbb{E}[T(Z)] = \nabla_\theta A(\theta)$.
This equivalence holds specifically when the system respects certain Markov properties, namely when X and X′ are conditionally independent given M. Under these conditions, both approaches lead to the same equilibrium distribution.
This equivalence can also be understood through the lens of information geometry. The Fisher information matrix G(θ) defines a Riemannian metric on the manifold of probability distributions.
The entropy gradient flow follows geodesics in the dual geometry (mixture geometry), while the EPI optimization follows geodesics in the primal geometry (exponential family geometry). At equilibrium, these paths converge to the same point on the manifold - the maximum entropy distribution subject to the given constraints.
To demonstrate this equivalence computationally, we’ll implement both optimization processes and compare their trajectories through parameter space. The following code simulates a system with observable variables and memory variables, tracking how they evolve under both entropy maximization and EPI optimization.
import numpy as np
from scipy.linalg import eigh
import matplotlib.pyplot as plt
def compute_epi_functional(G, partition_indices):
"""
Compute the EPI functional for a given Fisher information matrix
and partition of variables into observables and memory.
Parameters:
-----------
G: array
Fisher information matrix
partition_indices: list
Indices of observable variables (complement is memory variables)
Returns:
--------
Delta: float
EPI functional value
"""
n = G.shape[0]
obs_indices = np.array(partition_indices)
mem_indices = np.array([i for i in range(n) if i not in obs_indices])
# Extract submatrices
G_obs = G[np.ix_(obs_indices, obs_indices)]
G_mem = G[np.ix_(mem_indices, mem_indices)]
# Compute trace of each submatrix
I_obs = np.trace(G_obs)
J_mem = np.trace(G_mem)
return I_obs - J_mem
The compute_epi_functional
function calculates Frieden’s EPI
functional (Δ = I - J) by extracting the Fisher information submatrices
for observable and memory variables, computing their traces, and
returning the difference. This implements the mathematical definition of
the EPI functional for our computational experiment.
def epi_gradient(G, partition_indices):
"""
Compute gradient of EPI functional with respect to parameters.
Parameters:
-----------
G: array
Fisher information matrix
partition_indices: list
Indices of observable variables
Returns:
--------
gradient: array
Gradient of EPI functional
"""
n = G.shape[0]
gradient = np.zeros(n)
obs_indices = np.array(partition_indices)
mem_indices = np.array([i for i in range(n) if i not in obs_indices])
# Set gradient components (simplified model)
gradient[obs_indices] = -1.0 # Minimize Fisher information for observables
gradient[mem_indices] = 1.0 # Maximize Fisher information for memory
return gradient
The epi_gradient
function computes the direction for minimizing the
EPI functional. For observable variables, we want to minimize Fisher
information (reducing uncertainty), while for memory variables, we want
to maximize it (increasing capacity). This gradient guides the system
toward the equilibrium where observable and memory variables reach an
optimal information balance.
def compare_entropy_and_epi_paths(G_init, partition_indices, steps=100, learning_rate=0.01):
"""
Compare paths of entropy maximization and EPI optimization.
Parameters:
-----------
G_init: array
Initial Fisher information matrix
partition_indices: list
Indices of observable variables
steps: int
Number of gradient steps
learning_rate: float
Step size for gradient updates
Returns:
--------
entropy_path: array
Path through parameter space under entropy maximization
epi_path: array
Path through parameter space under EPI optimization
"""
# Initialize
G_entropy = G_init.copy()
G_epi = G_init.copy()
entropy_path = []
epi_path = []
for _ in range(steps):
# Entropy maximization step
eigenvalues_entropy, eigenvectors_entropy = eigh(G_entropy)
        # Gaussian entropy gradient with respect to the precision eigenvalues: dS/dλ = -1/(2λ)
        entropy_grad = -1.0 / (2.0 * eigenvalues_entropy)
eigenvalues_entropy += learning_rate * entropy_grad
eigenvalues_entropy = np.maximum(eigenvalues_entropy, 1e-10)
G_entropy = eigenvectors_entropy @ np.diag(eigenvalues_entropy) @ eigenvectors_entropy.T
entropy_path.append(eigenvalues_entropy.copy())
# EPI optimization step
eigenvalues_epi, eigenvectors_epi = eigh(G_epi)
epi_grad = epi_gradient(G_epi, partition_indices)
projected_epi_grad = project_gradient(eigenvalues_epi, epi_grad)
eigenvalues_epi += learning_rate * projected_epi_grad
eigenvalues_epi = np.maximum(eigenvalues_epi, 1e-10)
G_epi = eigenvectors_epi @ np.diag(eigenvalues_epi) @ eigenvectors_epi.T
epi_path.append(eigenvalues_epi.copy())
return np.array(entropy_path), np.array(epi_path)
The compare_entropy_and_epi_paths
function is our main simulation
function. It runs both optimization processes in parallel, tracking the
eigenvalues of the Fisher information matrix at each step. This allows
us to compare how the two approaches navigate through parameter space.
While they may take different paths, our theoretical analysis suggests
they should reach similar equilibrium states.
This implementation builds on the InformationReservoir
class from
previous examples, but generalizes to higher dimensions with multiple
position-momentum pairs. It extends the concept of uncertainty relations
to track how these uncertainties evolve under both entropy maximization
and EPI optimization.
# Initialize system with 2 position-momentum pairs
n_pairs = 2
G_init = initialize_multidimensional_state(n_pairs, squeeze_factors=[0.1, 0.2])
# Define partition: first pair is observable, second pair is memory
partition_indices = [0, 1] # Indices of first position-momentum pair
# Compare paths
entropy_path, epi_path = compare_entropy_and_epi_paths(G_init, partition_indices)
# Create figure with two subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
# Plot 1: Paths through eigenvalue space for first pair
ax1.plot(entropy_path[:, 0], entropy_path[:, 1], 'r-', label='Entropy Max')
ax1.plot(epi_path[:, 0], epi_path[:, 1], 'b--', label='EPI Min')
ax1.set_xlabel('Position eigenvalue')
ax1.set_ylabel('Momentum eigenvalue')
ax1.set_title('Observable Variables Path')
ax1.legend()
ax1.set_xscale('log')
ax1.set_yscale('log')
ax1.grid(True)
# Plot 2: Paths through eigenvalue space for second pair
ax2.plot(entropy_path[:, 2], entropy_path[:, 3], 'r-', label='Entropy Max')
ax2.plot(epi_path[:, 2], epi_path[:, 3], 'b--', label='EPI Min')
ax2.set_xlabel('Position eigenvalue')
ax2.set_ylabel('Momentum eigenvalue')
ax2.set_title('Memory Variables Path')
ax2.legend()
ax2.set_xscale('log')
ax2.set_yscale('log')
ax2.grid(True)
plt.tight_layout()
mlai.write_figure(filename='epi-entropy-comparison.svg',
directory='./information-game')
Figure: Comparison of parameter paths under entropy maximization (red solid line) and EPI optimization (blue dashed line) for observable variables (left) and memory variables (right).
The figure above compares the paths taken through parameter space under entropy maximization versus EPI optimization. For observable variables (left), both approaches lead to similar equilibrium states but may follow different trajectories. For memory variables (right), the paths can diverge more significantly depending on the specific constraints and initial conditions.
Despite these potential differences in trajectories, both approaches ultimately identify information-theoretic optima that balance uncertainty between different parts of the system. This computational demonstration supports the theoretical equivalence we established in the mathematical derivation.
This visualization shows that while the paths taken by entropy maximization (red solid line) and EPI optimization (blue dashed line) may differ, they ultimately reach similar equilibrium states. This provides concrete evidence for the abstract mathematical equivalence discussed in the theoretical section.
This connection to Frieden’s work also relates to our earlier discussion of least action principles. The EPI principle can be viewed as a special case of a more general variational principle, where the “action” being minimized is an information-theoretic functional rather than a physical one. This reinforces the idea that information geometry provides a natural framework for understanding both physical and information-theoretic dynamics.
For the system to ‘spontaneously organise’ we need to understand how mutual information evolves under our dynamics.
We’re maximizing entropy in the natural parameter space θ, not directly in probability space. This distinction is crucial - while maximizing entropy in probability space would lead to independence between variables, maximizing entropy in natural parameter space can simultaneously increase both joint entropy and mutual information.
To make this notion of “organization” more concrete, we should consider:
Joint distribution over variables $Z = (X, M)$, where $M$ represents memory variables in an information reservoir (at a saddle point in the dynamics) and $X$ represents observable variables. The system evolves by maximizing entropy $S$ in the natural parameter space $\theta$, $\frac{\text{d}\theta}{\text{d}t} = \eta \nabla_\theta S[p(z,t)]$.
We know $\frac{\text{d}S}{\text{d}t} > 0$ (entropy is being maximized), and because $M$ are at saddle points we know that $\frac{\text{d}S(M)}{\text{d}t} \approx 0$. Therefore we can rearrange to find $\frac{\text{d}I(X;M)}{\text{d}t} \approx \frac{\text{d}S(X)}{\text{d}t} - \frac{\text{d}S}{\text{d}t}$.
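The identity being rearranged here, $I(X;M) = S(X) + S(M) - S(X,M)$, is easy to check numerically. A minimal sketch for a bivariate Gaussian with one observable and one memory variable:

import numpy as np

def gaussian_entropy(cov):
    """Differential entropy of a Gaussian with covariance matrix `cov`."""
    cov = np.atleast_2d(cov)
    d = cov.shape[0]
    return 0.5 * (d * (1 + np.log(2 * np.pi)) + np.log(np.linalg.det(cov)))

# Joint covariance of (X, M) with correlation rho
rho = 0.8
Sigma = np.array([[1.0, rho],
                  [rho, 1.0]])

S_joint = gaussian_entropy(Sigma)
S_X = gaussian_entropy(Sigma[:1, :1])
S_M = gaussian_entropy(Sigma[1:, 1:])
I_XM = S_X + S_M - S_joint
print(f"I(X;M) = {I_XM:.4f}, analytic -0.5*log(1-rho^2) = {-0.5*np.log(1 - rho**2):.4f}")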
We introduce the Fisher information and the effect of multiple timescales to analyze when the gradient condition $\frac{\text{d}S(X)}{\text{d}t} > \frac{\text{d}S}{\text{d}t}$ holds.
The Fisher information matrix $G(\theta)$ provides a natural metric on the statistical manifold of probability distributions. For our joint distribution $p(z|\theta)$, the Fisher information is defined as $G_{ij}(\theta) = \mathbb{E}\left[\frac{\partial \log p(z|\theta)}{\partial \theta_i} \frac{\partial \log p(z|\theta)}{\partial \theta_j}\right]$.
When we partition our variables into fast variables $X$ and slow variables $M$ (representing the information reservoir), we are suggesting a timescale separation in the natural parameter dynamics, $\frac{\text{d}\theta_X}{\text{d}t} = \eta_X \nabla_{\theta_X} S[p(z,t)]$ and $\frac{\text{d}\theta_M}{\text{d}t} = \eta_M \nabla_{\theta_M} S[p(z,t)]$, with $\eta_M \ll \eta_X$.
This timescale separation reflects an asymmetry that would drive spontaneous organization. The entropy dynamics can be expressed in terms of the Fisher information matrix and the natural parameter velocities, $\frac{\text{d}S}{\text{d}t} = \nabla_\theta S \cdot \frac{\text{d}\theta}{\text{d}t} = \nabla_{\theta_X} S \cdot \frac{\text{d}\theta_X}{\text{d}t} + \nabla_{\theta_M} S \cdot \frac{\text{d}\theta_M}{\text{d}t}$.
Substituting our gradient ascent dynamics with different learning rates gives $\frac{\text{d}S}{\text{d}t} = \nabla_{\theta_X} S \cdot (\eta_X \nabla_{\theta_X} S) + \nabla_{\theta_M} S \cdot (\eta_M \nabla_{\theta_M} S) = \eta_X \|\nabla_{\theta_X} S\|^2 + \eta_M \|\nabla_{\theta_M} S\|^2$.
Similarly, the marginal entropy of $X$ evolves according to $\frac{\text{d}S(X)}{\text{d}t} = \nabla_{\theta_X} S(X) \cdot \frac{\text{d}\theta_X}{\text{d}t} = \nabla_{\theta_X} S(X) \cdot (\eta_X \nabla_{\theta_X} S) = \eta_X \nabla_{\theta_X} S(X) \cdot \nabla_{\theta_X} S$.
Note that this is not generally equal to $\eta_X \|\nabla_{\theta_X} S(X)\|^2$ unless $\nabla_{\theta_X} S = \nabla_{\theta_X} S(X)$, which is not typically the case when variables are correlated.
The gradient condition for spontaneous organization, $\frac{\text{d}I(X;M)}{\text{d}t} > 0$, can be rewritten using our earlier relation $\frac{\text{d}I(X;M)}{\text{d}t} \approx \frac{\text{d}S(X)}{\text{d}t} - \frac{\text{d}S}{\text{d}t}$ as $\eta_X \nabla_{\theta_X} S(X) \cdot \nabla_{\theta_X} S - \eta_X \|\nabla_{\theta_X} S\|^2 - \eta_M \|\nabla_{\theta_M} S\|^2 > 0$.
Since $\eta_M \|\nabla_{\theta_M} S\|^2 > 0$ (except exactly at saddle points), this inequality requires $\nabla_{\theta_X} S(X) \cdot \nabla_{\theta_X} S > \|\nabla_{\theta_X} S\|^2$.
This is a stronger condition than simply requiring the gradients to be aligned. By the Cauchy-Schwarz inequality, we know that $\nabla_{\theta_X} S(X) \cdot \nabla_{\theta_X} S \leq \|\nabla_{\theta_X} S(X)\| \cdot \|\nabla_{\theta_X} S\|$. Therefore, the condition can only be satisfied when $\|\nabla_{\theta_X} S(X)\| > \|\nabla_{\theta_X} S\|$ and the gradients are sufficiently aligned.
This inequality suggests that spontaneous organization occurs when the gradient of marginal entropy $S(X)$ with respect to $\theta_X$ has a larger magnitude than the gradient of joint entropy $S$ with respect to the same parameters.
This condition can be satisfied when $X$ variables are strongly coupled to $M$ variables in a specific way. We express the mutual information gradient $\nabla_{\theta_X} I(X;M) = \nabla_{\theta_X} S(X) + \nabla_{\theta_X} S(M) - \nabla_{\theta_X} S$.
Since $M$ evolves slowly, we can approximate $\nabla_{\theta_X} S(M) \approx 0$, yielding $\nabla_{\theta_X} I(X;M) \approx \nabla_{\theta_X} S(X) - \nabla_{\theta_X} S$.
Our condition for spontaneous organization can be rewritten as $\|\nabla_{\theta_X} S(X)\|^2 > \|\nabla_{\theta_X} S\|^2$.
We can expand this condition using the relationship between these gradients. Since $\nabla_{\theta_X} I(X;M) \approx \nabla_{\theta_X} S(X) - \nabla_{\theta_X} S$, we can write $\|\nabla_{\theta_X} S(X)\|^2 = \|\nabla_{\theta_X} S + \nabla_{\theta_X} I(X;M)\|^2$.
Expanding this squared norm we have $\|\nabla_{\theta_X} S(X)\|^2 = \|\nabla_{\theta_X} S\|^2 + \|\nabla_{\theta_X} I(X;M)\|^2 + 2 \nabla_{\theta_X} S \cdot \nabla_{\theta_X} I(X;M)$.
For our condition $\|\nabla_{\theta_X} S(X)\|^2 > \|\nabla_{\theta_X} S\|^2$ to be satisfied, we need $\|\nabla_{\theta_X} I(X;M)\|^2 + 2 \nabla_{\theta_X} S \cdot \nabla_{\theta_X} I(X;M) > 0$.
To analyze when this condition holds, we must examine the Fisher information geometry near saddle points. At a saddle point of the entropy landscape, the Hessian matrix of the entropy has both positive and negative eigenvalues. The Fisher information matrix G(θ) provides the natural metric on this statistical manifold.
Near a saddle point, the Fisher information matrix exhibits a characteristic eigenvalue spectrum with a separation between large and small eigenvalues. The eigenvectors corresponding to small eigenvalues define the slow manifold (associated with memory variables M), while those with large eigenvalues correspond to fast-evolving directions (associated with observable variables X).
The gradient of joint entropy can be decomposed into components along these eigendirections. Due to the timescale separation, the gradient components along fast directions quickly equilibrate, while components along slow directions persist.
Under these conditions, the dot product $\nabla_{\theta_X} S \cdot \nabla_{\theta_X} I(X;M)$ can become positive when the entropy gradient aligns with directions that increase mutual information. This alignment is not random but emerges deterministically in specific regions of the parameter space, particularly near saddle points where the eigenvalue spectrum of the Fisher information matrix exhibits a clear separation between fast and slow modes. As the system evolves toward these saddle points, it naturally enters configurations where the alignment condition is satisfied due to the geometric properties of the entropy landscape.
This analysis identifies the conditions under which spontaneous organisation becomes possible within the framework of entropy maximization in natural parameter space. The key insight is that the geometry of the Fisher information near saddle points creates regions where entropy maximization and mutual information may occur simultaneously.
This timescale separation enables an adiabatic elimination process where fast variables X reach a quasi-equilibrium for each slow configuration of M. This creates effective dynamics where M adapts to encode statistical regularities in the behavior of X.
Mathematically, we can express this using Hessian matrices such as $H_X = \frac{\partial^2 S}{\partial \theta_X \partial \theta_X}$.
The condition for spontaneous organization becomes
$$\frac{\text{d}I(X;M)}{\text{d}t} \approx \eta_X \text{tr}(H_{S(X)}) - \eta_X \text{tr}(H_S) - \eta_M \text{tr}(H_{XM}) = -\eta_M \text{tr}(H_{XM}).$$
This approximation is valid when the system has reached a quasi-equilibrium state for the fast variables $X$, where $\nabla_{\theta_X} S \approx \nabla_{\theta_X} S(X)$. In this regime, the first two terms approximately cancel out, leaving the cross-correlation term dominant. Here, $H_{S(X)}$ is the Hessian of the marginal entropy $S(X)$ with respect to $\theta_X$, $H_S$ is the Hessian of the joint entropy, and $H_{XM}$ is the cross-correlation Hessian measuring how changes in $\theta_X$ affect gradients with respect to $\theta_M$.
Thus, mutual information increases when $\text{tr}(H_{XM}) < 0$, which occurs when the cross-correlation Hessian between $X$ and $M$ has predominantly negative eigenvalues. This represents configurations where joint entropy increases more efficiently by strengthening correlations rather than breaking them.
This provides a precise mathematical characterization of when spontaneous organization emerges from entropy maximization in natural parameter space under multiple timescales.
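A minimal numerical sketch of this trace condition, assuming for illustration that $\theta_X$ and $\theta_M$ have the same dimension so that $\text{tr}(H_{XM})$ is defined; the Hessian block below is synthetic rather than derived from a real system.
import numpy as np
# Illustrative check of the trace condition above: in the quasi-equilibrium regime
# dI/dt is approximately -eta_M * tr(H_XM), so a cross-correlation Hessian with
# predominantly negative eigenvalues gives growing mutual information.
rng = np.random.default_rng(1)
n = 4            # assume dim(theta_X) == dim(theta_M) so that tr(H_XM) is defined
eta_M = 0.01
A = rng.standard_normal((n, n))
H_XM = -(A @ A.T)                      # negative-definite cross-correlation Hessian
dI_dt = -eta_M * np.trace(H_XM)
print(f"tr(H_XM) = {np.trace(H_XM):.3f}, approximate dI/dt = {dI_dt:.4f}")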
One way to formalize the notion of locality in our information-theoretic framework is through conditional independence structures.
When we have a small number of slow modes (M) that act as information reservoirs, they can induce conditional independence between subsets of fast variables (X), creating a form of locality.
This approach connects our abstract information-theoretic framework to more intuitive notions of spatial organization and modularity without requiring an explicit spatial embedding.
We partition our fast variables $X$ into subsets $X = \{X_1, X_2, \ldots, X_K\}$, where each $X_k$ might represent variables that are “close” to each other in some abstract sense.
The joint entropy of the entire system can be decomposed as
$$S(X,M) = S(M) + S(X|M) = S(M) + \sum_{k=1}^{K} S(X_k|M) - \sum_{k=1}^{K}\sum_{j<k} I(X_k; X_j|M).$$
When the slow modes $M$ capture the global structure of the system, the conditional mutual information terms become very small, $I(X_k; X_j|M) \approx 0$ for $j \neq k$.
For multivariate Gaussian systems, we can formalize this connection precisely. If we consider the precision matrix (inverse covariance) of the joint distribution $\Lambda$ and partition it according to slow modes $M$ and fast variables $X$,
$$\Lambda = \begin{bmatrix} \Lambda_{MM} & \Lambda_{MX} \\ \Lambda_{XM} & \Lambda_{XX} \end{bmatrix}.$$
The degree to which this factorization holds can be quantified by the off-diagonal blocks in $\Lambda_{X|M}$. The magnitude of these elements directly determines the conditional mutual information terms $I(X_k; X_j|M)$. The eigenvalue gap between slow and fast modes determines how effectively the slow modes can absorb the dependencies, leading to smaller off-diagonal elements and thus conditional independence.
Importantly, this same principle applies to systems represented by density matrices with quadratic Hamiltonians. For a system with density matrix $\rho$, we can decompose it as $\rho = \exp(-H)/Z$.
The eigendecomposition of the coupling matrix $J$ identifies the normal modes of the system, $J = U \Sigma U^\top$.
For conditional independence in density matrix formalism, when we partition the system into subsystems and condition on the slow modes, the residual couplings between subsystems are determined by the block structure of J after “integrating out” the slow modes. This produces an effective J′ for the subsystems given the slow modes, and the off-diagonal blocks of this effective J′ determine the conditional mutual information between subsystems.
The eigenvalue gap again plays the crucial role: a larger separation between slow and fast eigenvalues allows the slow modes to more effectively absorb the cross-system couplings, leading to an effective J′ that is more block-diagonal and thus creating stronger conditional independence.
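A small sketch of the “integrating out” step with a synthetic coupling matrix $J$ (not one derived from an actual Hamiltonian): the effective coupling among the fast variables is the Schur complement of the slow block, and its off-diagonal entries carry the residual couplings between subsystems.
import numpy as np
# Sketch: effective coupling among fast variables after integrating out the slow modes
# of a quadratic Hamiltonian. J is a synthetic symmetric coupling matrix partitioned as
# [[J_ss, J_sf], [J_fs, J_ff]] with the slow modes first.
rng = np.random.default_rng(2)
n_slow, n_fast = 2, 6
A = rng.standard_normal((n_slow + n_fast, n_slow + n_fast))
J = A @ A.T + np.eye(n_slow + n_fast)   # symmetric positive definite coupling matrix
J_ss = J[:n_slow, :n_slow]
J_sf = J[:n_slow, n_slow:]
J_ff = J[n_slow:, n_slow:]
# Schur complement of the slow block: the effective J' for the fast subsystems
J_eff = J_ff - J_sf.T @ np.linalg.solve(J_ss, J_sf)
# The off-diagonal entries of J_eff are the residual couplings between subsystems
off_diag = J_eff - np.diag(np.diag(J_eff))
print("largest residual off-diagonal coupling:", np.max(np.abs(off_diag)))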
For readers interested in the quantum Fisher information perspective, note that for systems with quadratic Hamiltonians, the quantum Fisher information matrix is directly related to the coupling matrix $J$. Specifically, for a Gaussian quantum state with density matrix $\rho = \exp(-H)/Z$, the quantum Fisher information matrix $F_Q$ can be expressed in terms of the second derivatives of the Hamiltonian, $[F_Q]_{ij} \propto \frac{\partial^2 H}{\partial \theta_i \partial \theta_j}.$
The non-commutative nature of quantum operators is embedded in the structure of J and consequently in FQ, which affects how information is distributed and how conditional independence structures form in quantum systems compared to classical ones. The symmetry properties of FQ reflect the uncertainty relations inherent in quantum mechanics, providing additional constraints on how effectively slow modes can induce conditional independence.
The connection to the eigenvalue spectrum provides a formal link between the abstract mathematics of the game and intuitive notions of spatial organization.
When the Fisher information matrix has a few eigenvalues that are much smaller than the rest (large separation in the timescales over which the system evolves), the corresponding eigenvectors define the slow modes M. These slow modes act as sufficient statistics for the interactions between different regions of the system.
The conditional independence structure induced by these slow modes creates a graph structure of dependencies. Variables that remain conditionally dependent given M are “closer” to each other than those that become conditionally independent.
This is analogous to how in physics, systems with long-range interactions often have a small number of conserved quantities or order parameters (slow modes) that govern the large-scale behavior, while local fluctuations (fast modes) can be treated as approximately independent when conditioned on these global variables.
import numpy as np
import matplotlib.pyplot as plt
import networkx as nx
import scipy.stats as stats
import mlai.plot as plot
import matplotlib.gridspec as gridspec
from matplotlib.colors import LinearSegmentedColormap
class ConditionalIndependenceDemo:
def __init__(self, n_clusters=4, n_vars_per_cluster=5, n_slow_modes=2):
"""
Demonstrate how slow modes induce conditional independence structures.
Parameters:
-----------
n_clusters: int
Number of variable clusters (regions)
n_vars_per_cluster: int
Number of variables in each cluster
n_slow_modes: int
Number of slow modes (information reservoir variables)
"""
self.n_clusters = n_clusters
self.n_vars_per_cluster = n_vars_per_cluster
self.n_slow_modes = n_slow_modes
self.n_total_vars = n_clusters * n_vars_per_cluster + n_slow_modes
# Generate a precision matrix with block structure
self.precision = self._generate_precision_matrix()
self.covariance = np.linalg.inv(self.precision)
# Compute eigendecomposition of the precision matrix
self.eigenvalues, self.eigenvectors = np.linalg.eigh(self.precision)
# Identify slow modes (smallest eigenvalues)
self.slow_indices = np.argsort(self.eigenvalues)[:n_slow_modes]
self.fast_indices = np.argsort(self.eigenvalues)[n_slow_modes:]
def _generate_precision_matrix(self):
"""Generate a precision matrix with block structure and slow modes."""
n = self.n_total_vars
# Start with a block diagonal structure for fast variables
precision = np.zeros((n, n))
# Create blocks for each cluster
for i in range(self.n_clusters):
start_idx = i * self.n_vars_per_cluster
end_idx = start_idx + self.n_vars_per_cluster
# Within-cluster connections (strong precision = strong direct dependencies)
block = np.random.uniform(0.7, 1.0,
(self.n_vars_per_cluster, self.n_vars_per_cluster))
block = (block + block.T) / 2 # Make symmetric
np.fill_diagonal(block, 1.0) # Set diagonal to 1
precision[start_idx:end_idx, start_idx:end_idx] = block
# Add slow modes that connect across clusters
slow_start = self.n_clusters * self.n_vars_per_cluster
slow_end = n
# Slow modes have connections to all fast variables
for i in range(slow_start, slow_end):
for j in range(slow_start):
# Weaker connections but present
precision[i, j] = precision[j, i] = np.random.uniform(0.1, 0.3)
# Slow modes are also connected to each other
slow_block = np.random.uniform(0.2, 0.4,
(self.n_slow_modes, self.n_slow_modes))
slow_block = (slow_block + slow_block.T) / 2
np.fill_diagonal(slow_block, 0.5) # Smaller diagonal values = slower modes
precision[slow_start:slow_end, slow_start:slow_end] = slow_block
# Ensure the matrix is positive definite
min_eig = np.min(np.linalg.eigvalsh(precision))
if min_eig <= 0:
precision += np.eye(n) * (abs(min_eig) + 0.01)
return precision
def compute_mutual_information_matrix(self, conditional_on_slow=False):
"""
Compute pairwise mutual information between variables.
Parameters:
-----------
conditional_on_slow: bool
If True, compute conditional mutual information given slow modes
Returns:
--------
mi_matrix: numpy array
Matrix of (conditional) mutual information values
"""
n_fast = self.n_clusters * self.n_vars_per_cluster
mi_matrix = np.zeros((n_fast, n_fast))
if conditional_on_slow:
# Compute conditional mutual information given slow modes
# Using Schur complement to get conditional distribution
slow_idx = slice(n_fast, self.n_total_vars)
fast_idx = slice(0, n_fast)
# Extract blocks
P_ff = self.precision[fast_idx, fast_idx]
# Conditional precision of fast variables given slow variables is
# just the fast block of the precision matrix
cond_precision = P_ff
cond_covariance = np.linalg.inv(cond_precision)
# Compute conditional mutual information from conditional covariance
for i in range(n_fast):
for j in range(i+1, n_fast):
# For multivariate Gaussian, conditional MI is related to partial correlation
partial_corr = -cond_precision[i, j] / np.sqrt(cond_precision[i, i] * cond_precision[j, j])
# Convert to mutual information
mi = -0.5 * np.log(1 - partial_corr**2)
mi_matrix[i, j] = mi_matrix[j, i] = mi
else:
# Compute unconditional mutual information
for i in range(n_fast):
for j in range(i+1, n_fast):
# Extract the 2x2 covariance submatrix
subcov = self.covariance[[i, j]][:, [i, j]]
# For bivariate Gaussian, MI is related to correlation
corr = subcov[0, 1] / np.sqrt(subcov[0, 0] * subcov[1, 1])
# Convert correlation to mutual information
mi = -0.5 * np.log(1 - corr**2)
mi_matrix[i, j] = mi_matrix[j, i] = mi
return mi_matrix
def visualize_conditional_independence(self):
"""Visualize how slow modes induce conditional independence."""
# Compute mutual information matrices
mi_unconditional = self.compute_mutual_information_matrix(conditional_on_slow=False)
mi_conditional = self.compute_mutual_information_matrix(conditional_on_slow=True)
# Create a visualization
fig = plt.figure(figsize=plot.big_wide_figsize)
gs = gridspec.GridSpec(2, 3, height_ratios=[1, 1], width_ratios=[1, 1, 0.1])
# Plot the precision matrix with block structure
ax1 = plt.subplot(gs[0, 0])
im1 = ax1.imshow(self.precision, cmap='viridis')
ax1.set_title('Precision Matrix\nBlock structure with slow modes')
ax1.set_xlabel('Variable index')
ax1.set_ylabel('Variable index')
# Add lines to delineate the blocks
for i in range(1, self.n_clusters):
idx = i * self.n_vars_per_cluster - 0.5
ax1.axhline(y=idx, color='red', linestyle='-', linewidth=0.5)
ax1.axvline(x=idx, color='red', linestyle='-', linewidth=0.5)
# Add line to delineate slow modes
idx = self.n_clusters * self.n_vars_per_cluster - 0.5
ax1.axhline(y=idx, color='red', linestyle='-', linewidth=1.5)
ax1.axvline(x=idx, color='red', linestyle='-', linewidth=1.5)
# Plot eigenvalue spectrum
ax2 = plt.subplot(gs[0, 1])
ax2.plot(range(self.n_total_vars), np.sort(self.eigenvalues), 'o-')
ax2.set_title('Eigenvalue Spectrum\nSmall eigenvalues = slow modes')
ax2.set_xlabel('Index')
ax2.set_ylabel('Eigenvalue')
ax2.set_yscale('log')
ax2.grid(True, alpha=0.3)
# Indicate the separation of eigenvalues
ax2.axvline(x=self.n_slow_modes-0.5, color='red', linestyle='--')
ax2.axhspan(-0.1, self.eigenvalues[self.slow_indices].max()*1.1,
color='blue', alpha=0.2)
ax2.text(self.n_slow_modes/2, self.eigenvalues[self.slow_indices].max()/2,
'Slow Modes', ha='center')
# Create a custom colormap that shows difference more clearly
cmap = LinearSegmentedColormap.from_list('mi_diff',
[(0, 'blue'), (0.5, 'white'), (1, 'red')])
# Plot unconditional mutual information
ax3 = plt.subplot(gs[1, 0])
im3 = ax3.imshow(mi_unconditional, cmap='inferno')
ax3.set_title('Unconditional Mutual Information\nStrong dependencies between regions')
ax3.set_xlabel('Fast variable index')
ax3.set_ylabel('Fast variable index')
# Add lines to delineate the clusters
for i in range(1, self.n_clusters):
idx = i * self.n_vars_per_cluster - 0.5
ax3.axhline(y=idx, color='white', linestyle='-', linewidth=0.5)
ax3.axvline(x=idx, color='white', linestyle='-', linewidth=0.5)
# Plot conditional mutual information
ax4 = plt.subplot(gs[1, 1])
im4 = ax4.imshow(mi_conditional, cmap='inferno')
ax4.set_title('Conditional Mutual Information\nWeaker dependencies after conditioning on slow modes')
ax4.set_xlabel('Fast variable index')
ax4.set_ylabel('Fast variable index')
# Add lines to delineate the clusters
for i in range(1, self.n_clusters):
idx = i * self.n_vars_per_cluster - 0.5
ax4.axhline(y=idx, color='white', linestyle='-', linewidth=0.5)
ax4.axvline(x=idx, color='white', linestyle='-', linewidth=0.5)
# Add colorbar
cax = plt.subplot(gs[1, 2])
cbar = plt.colorbar(im4, cax=cax)
cbar.set_label('Mutual Information')
plt.tight_layout()
return fig
def visualize_dependency_graphs(self, threshold=0.1):
"""
Visualize dependency graphs with and without conditioning on slow modes.
Parameters:
-----------
threshold: float
Threshold for including edges in the graph
"""
# Compute mutual information matrices
mi_unconditional = self.compute_mutual_information_matrix(conditional_on_slow=False)
mi_conditional = self.compute_mutual_information_matrix(conditional_on_slow=True)
# Create dependency graphs
n_fast = self.n_clusters * self.n_vars_per_cluster
G_uncond = nx.Graph()
G_cond = nx.Graph()
# Add nodes
for i in range(n_fast):
cluster_id = i // self.n_vars_per_cluster
# Position nodes in clusters
angle = 2 * np.pi * (i % self.n_vars_per_cluster) / self.n_vars_per_cluster
radius = 1.0
x = (2 + cluster_id % 2) * 3 + radius * np.cos(angle)
y = (cluster_id // 2) * 3 + radius * np.sin(angle)
G_uncond.add_node(i, pos=(x, y), cluster=cluster_id)
G_cond.add_node(i, pos=(x, y), cluster=cluster_id)
# Add edges based on mutual information
for i in range(n_fast):
for j in range(i+1, n_fast):
# Unconditional graph
if mi_unconditional[i, j] > threshold:
G_uncond.add_edge(i, j, weight=mi_unconditional[i, j])
# Conditional graph
if mi_conditional[i, j] > threshold:
G_cond.add_edge(i, j, weight=mi_conditional[i, j])
# Create a visualization
fig = plt.figure(figsize=plot.big_wide_figsize)
# Plot unconditional dependency graph
ax1 = plt.subplot(1, 2, 1)
pos_uncond = nx.get_node_attributes(G_uncond, 'pos')
# Color nodes by cluster
node_colors = [G_uncond.nodes[i]['cluster'] for i in G_uncond.nodes]
# Draw nodes
nx.draw_networkx_nodes(G_uncond, pos_uncond,
node_color=node_colors,
node_size=100,
cmap=plt.cm.tab10,
ax=ax1)
# Draw edges with width proportional to mutual information
edges = G_uncond.edges()
edge_weights = [G_uncond[u][v]['weight']*3 for u, v in edges]
nx.draw_networkx_edges(G_uncond, pos_uncond,
width=edge_weights,
alpha=0.6,
edge_color='gray',
ax=ax1)
# Add node labels
nx.draw_networkx_labels(G_uncond, pos_uncond, font_size=8, ax=ax1)
ax1.set_title('Unconditional Dependency Graph\nMany cross-cluster dependencies')
ax1.set_axis_off()
# Plot conditional dependency graph
ax2 = plt.subplot(1, 2, 2)
pos_cond = nx.get_node_attributes(G_cond, 'pos')
# Color nodes by cluster
node_colors = [G_cond.nodes[i]['cluster'] for i in G_cond.nodes]
# Draw nodes
nx.draw_networkx_nodes(G_cond, pos_cond,
node_color=node_colors,
node_size=100,
cmap=plt.cm.tab10,
ax=ax2)
# Draw edges with width proportional to conditional mutual information
edges = G_cond.edges()
edge_weights = [G_cond[u][v]['weight']*3 for u, v in edges]
nx.draw_networkx_edges(G_cond, pos_cond,
width=edge_weights,
alpha=0.6,
edge_color='gray',
ax=ax2)
# Add node labels
nx.draw_networkx_labels(G_cond, pos_cond, font_size=8, ax=ax2)
ax2.set_title('Conditional Dependency Graph\nMostly within-cluster dependencies remain')
ax2.set_axis_off()
plt.tight_layout()
return fig
# Run the demonstration
np.random.seed(42) # For reproducibility
demo = ConditionalIndependenceDemo(n_clusters=4, n_vars_per_cluster=5, n_slow_modes=2)
import matplotlib.pyplot as plt
import mlai.plot as plot
import mlai
fig1 = demo.visualize_conditional_independence()
mlai.write_figure(filename='conditional-independence-matrices.svg',
directory='./information-game')
fig2 = demo.visualize_dependency_graphs(threshold=0.1)
mlai.write_figure(filename='conditional-independence-graphs.svg',
directory='./information-game')
Figure: Visualization of how conditioning on slow modes induces independence between clusters of fast variables.
Figure: Dependency graphs before and after conditioning on slow modes, showing the emergence of modularity.
The visualisation demonstrates how conditioning on slow modes drastically reduces the mutual information between variables in different clusters, while preserving dependencies within clusters. This creates a modular structure where each cluster becomes nearly independent given the state of the slow modes.
This modular organization emerges from the eigenvalue structure of the Fisher information matrix, without requiring any explicit spatial embedding or pre-defined notion of locality. The slow modes act as an information bottleneck that encodes the necessary global information while allowing local regions to operate semi-independently.
In a physical system, structures like this manifest as the emergence of spatial patterns or functional modules that interact primarily through a small number of global variables. In a neural network, such structures correspond to the formation of specialized modules that handle different aspects of processing while communicating through a compressed global representation.
The notion of locality in our framework is not about physical distance, but about the conditional independence structure induced by the slow modes of the system. This gives an abstract notion of locality that can be applied to any system where information flows are important.
Building on the conditional independence structure, we can define an “information topography” - a conceptual landscape that characterizes how information flows through the system.
This topography emerges from the pattern of mutual information between variables and their dependency on the slow modes. We can visualize this as a landscape in which distances reflect how strongly variables remain dependent once the slow modes are taken into account.
Mathematically, we can define a distance metric between variables based on their conditional mutual information,
$$d_I(X_i, X_j) = \frac{1}{I(X_i; X_j|M) + \epsilon},$$
where $\epsilon$ is a small constant that avoids division by zero when variables are conditionally independent.
The information topography has several important properties.
The eigenvalue spectrum of the Fisher information matrix directly shapes this topography. The larger the separation between the few smallest eigenvalues and the rest, the more pronounced the “ridges” in the topography, leading to stronger locality and modularity.
This perspective allows us to quantify notions like “information distance” and “information barriers” without requiring an explicit spatial embedding, providing a framework for understanding modularity across different types of complex systems.
class InformationTopographyDemo:
def __init__(self, n_clusters=4, n_vars_per_cluster=5, n_slow_modes=2):
"""
Visualize the information topography based on minimal entropy gradient framework.
Parameters:
-----------
n_clusters: int
Number of variable clusters
n_vars_per_cluster: int
Number of variables per cluster
n_slow_modes: int
Number of slow modes that induce conditional independence
"""
self.n_clusters = n_clusters
self.n_vars_per_cluster = n_vars_per_cluster
self.n_vars = n_clusters * n_vars_per_cluster
self.n_slow_modes = n_slow_modes
self.hbar = 1.0
self.min_uncertainty_product = self.hbar / 2
# Initialize system with position-momentum pairs
# Instead of working with CMI directly, we'll build from precision matrix
self.dim = 2 * self.n_vars
self.initialize_precision_matrix()
def initialize_precision_matrix(self):
"""Initialize precision matrix with eigenvalue structure that creates clusters."""
# Create a precision matrix that has a clear eigenvalue structure
# with a few small eigenvalues (slow modes) and many large eigenvalues (fast modes)
# Start with identity matrix
Lambda = np.eye(self.dim)
# Generate random eigenvectors
Q, _ = np.linalg.qr(np.random.randn(self.dim, self.dim))
# Set eigenvalues: a few small ones (slow modes) and many large ones (fast modes)
eigenvalues = np.ones(self.dim)
# Slow modes - small eigenvalues
eigenvalues[:self.n_slow_modes] = 0.1 + 0.1 * np.random.rand(self.n_slow_modes)
# Fast modes - larger eigenvalues organized in clusters
for i in range(self.n_clusters):
cluster_start = self.n_slow_modes + i * self.n_vars_per_cluster
cluster_end = cluster_start + self.n_vars_per_cluster
# Each cluster has similar eigenvalues
base_value = 1.0 + i * 0.5
eigenvalues[cluster_start:cluster_end] = base_value + 0.2 * np.random.rand(self.n_vars_per_cluster)
# Construct precision matrix with this eigenstructure
self.Lambda = Q @ np.diag(eigenvalues) @ Q.T
# Get inverse (covariance matrix)
self.covariance = np.linalg.inv(self.Lambda)
# Store eigendecomposition
self.eigenvalues, self.eigenvectors = np.linalg.eigh(self.Lambda)
self.slow_mode_vectors = self.eigenvectors[:, :self.n_slow_modes]
def compute_conditional_mutual_information(self):
"""Compute conditional mutual information matrix given slow modes."""
# Compute full mutual information from covariance
mi_full = np.zeros((self.n_vars, self.n_vars))
# Compute conditional mutual information given slow modes
mi_conditional = np.zeros((self.n_vars, self.n_vars))
# For each pair of variables (considering position only for simplicity)
for i in range(self.n_vars):
for j in range(i+1, self.n_vars):
# Extract positions from covariance matrix
pos_i, pos_j = i*2, j*2
cov_ij = self.covariance[np.ix_([pos_i, pos_j], [pos_i, pos_j])]
# Compute unconditional mutual information
var_i = self.covariance[pos_i, pos_i]
var_j = self.covariance[pos_j, pos_j]
mi = 0.5 * np.log(var_i * var_j / np.linalg.det(cov_ij))
mi_full[i, j] = mi
mi_full[j, i] = mi
# Compute residual covariance after conditioning on slow modes
cov_i_slow = self.covariance[pos_i, :self.n_slow_modes]
cov_j_slow = self.covariance[pos_j, :self.n_slow_modes]
cov_slow = self.covariance[:self.n_slow_modes, :self.n_slow_modes]
# Schur complement formula for conditional covariance
cov_ij_given_slow = cov_ij - np.array([
[cov_i_slow @ np.linalg.solve(cov_slow, cov_i_slow),
cov_i_slow @ np.linalg.solve(cov_slow, cov_j_slow)],
[cov_j_slow @ np.linalg.solve(cov_slow, cov_i_slow),
cov_j_slow @ np.linalg.solve(cov_slow, cov_j_slow)]
])
# Compute conditional mutual information
var_i_given_slow = cov_ij_given_slow[0, 0]
var_j_given_slow = cov_ij_given_slow[1, 1]
if np.linalg.det(cov_ij_given_slow) > 0: # Numerical stability check
cmi = 0.5 * np.log(var_i_given_slow * var_j_given_slow / np.linalg.det(cov_ij_given_slow))
mi_conditional[i, j] = cmi
mi_conditional[j, i] = cmi
return mi_full, mi_conditional
def compute_information_distance(self, conditional_mi, epsilon=1e-6):
"""Convert conditional mutual information to a distance metric."""
# Higher CMI = closer in information space
# Lower CMI = further apart (conditional independence)
distance = 1.0 / (conditional_mi + epsilon)
np.fill_diagonal(distance, 0) # No self-distance
return distance
def visualize_information_landscape(self):
"""Visualize the information topography as a landscape."""
# Compute conditional mutual information
mi_full, mi_conditional = self.compute_conditional_mutual_information()
# Compute information distance matrix
distance = self.compute_information_distance(mi_conditional)
# Use multidimensional scaling to project the distance matrix to 2D
from sklearn.manifold import MDS
# Apply MDS to embed in 2D space
mds = MDS(n_components=2, dissimilarity='precomputed', random_state=42)
pos = mds.fit_transform(distance)
# Create a visualization
fig = plt.figure(figsize=plot.big_wide_figsize)
# Plot the embedded points with cluster colors
ax = fig.add_subplot(111)
# Assign colors by cluster
colors = plt.cm.tab10(np.linspace(0, 1, self.n_clusters))
for i in range(self.n_vars):
cluster_id = i // self.n_vars_per_cluster
ax.scatter(pos[i, 0], pos[i, 1], c=[colors[cluster_id]],
s=100, label=f"Cluster {cluster_id}" if i % self.n_vars_per_cluster == 0 else "")
ax.text(pos[i, 0] + 0.02, pos[i, 1] + 0.02, str(i), fontsize=9)
# Add connections based on conditional mutual information
# Stronger connections = higher CMI = lower distance
threshold = np.percentile(mi_conditional[mi_conditional > 0], 70) # Only show top 30% strongest connections
for i in range(self.n_vars):
for j in range(i+1, self.n_vars):
if mi_conditional[i, j] > threshold:
# Line width proportional to mutual information
width = mi_conditional[i, j] * 5
ax.plot([pos[i, 0], pos[j, 0]], [pos[i, 1], pos[j, 1]],
'k-', alpha=0.5, linewidth=width)
# Add slow mode projections as gradient in background
# This shows how the slow modes influence the information landscape
grid_resolution = 100
x_min, x_max = pos[:, 0].min() - 0.5, pos[:, 0].max() + 0.5
y_min, y_max = pos[:, 1].min() - 0.5, pos[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.linspace(x_min, x_max, grid_resolution),
np.linspace(y_min, y_max, grid_resolution))
# Interpolate slow mode projection values to the grid
from scipy.interpolate import Rbf
# Use just the first slow mode for visualization
slow_mode_projection = self.slow_mode_vectors[:, 0]
# Extract position variables (even indices)
pos_indices = np.arange(0, self.dim, 2)
pos_slow_projection = slow_mode_projection[pos_indices]
# Normalize for visualization
pos_slow_projection = (pos_slow_projection - pos_slow_projection.min()) / (pos_slow_projection.max() - pos_slow_projection.min())
# Create RBF interpolation
rbf = Rbf(pos[:, 0], pos[:, 1], pos_slow_projection, function='multiquadric')
slow_mode_grid = rbf(xx, yy)
# Plot slow mode influence as background gradient
im = ax.imshow(slow_mode_grid, extent=[x_min, x_max, y_min, y_max],
origin='lower', cmap='viridis', alpha=0.3)
plt.colorbar(im, ax=ax, label='Slow Mode Influence')
# Remove duplicate legend entries
handles, labels = ax.get_legend_handles_labels()
by_label = dict(zip(labels, handles))
ax.legend(by_label.values(), by_label.keys(), loc='best')
ax.set_title('Information Topography: Variables Positioned by Information Distance')
ax.set_xlabel('Dimension 1')
ax.set_ylabel('Dimension 2')
ax.grid(True, alpha=0.3)
return fig
def visualize_information_landscape_3d(self):
"""Visualize the information topography as a 3D landscape."""
# Compute conditional mutual information
mi_full, mi_conditional = self.compute_conditional_mutual_information()
# Compute information distance matrix
distance = self.compute_information_distance(mi_conditional)
# Use multidimensional scaling to project the distance matrix to 2D
from sklearn.manifold import MDS
from scipy.interpolate import griddata
# Apply MDS to embed in 2D space
mds = MDS(n_components=2, dissimilarity='precomputed', random_state=42)
pos = mds.fit_transform(distance)
# Create a visualization
fig = plt.figure(figsize=plot.big_wide_figsize)
ax = fig.add_subplot(111, projection='3d')
# Assign colors by cluster
colors = plt.cm.tab10(np.linspace(0, 1, self.n_clusters))
# Calculate "elevation" based on connection to slow modes
# Higher elevation = more strongly coupled to slow modes (more global influence)
# First, compute coupling strength to slow modes
slow_mode_coupling = np.zeros(self.n_vars)
for i in range(self.n_vars):
pos_i = i*2
# Project onto slow modes
coupling = np.sum(np.abs(self.eigenvectors[pos_i, :self.n_slow_modes]))
slow_mode_coupling[i] = coupling
# Normalize to [0,1] range
elevation = (slow_mode_coupling - slow_mode_coupling.min()) / (slow_mode_coupling.max() - slow_mode_coupling.min())
# Plot the points in 3D
for i in range(self.n_vars):
cluster_id = i // self.n_vars_per_cluster
ax.scatter(pos[i, 0], pos[i, 1], elevation[i],
c=[colors[cluster_id]], s=100,
label=f"Cluster {cluster_id}" if i % self.n_vars_per_cluster == 0 else "")
ax.text(pos[i, 0], pos[i, 1], elevation[i] + 0.05, str(i), fontsize=9)
# Create a surface representing the information landscape
# Grid the data
xi = np.linspace(pos[:, 0].min(), pos[:, 0].max(), 100)
yi = np.linspace(pos[:, 1].min(), pos[:, 1].max(), 100)
X, Y = np.meshgrid(xi, yi)
# Interpolate elevation for the grid
Z = griddata((pos[:, 0], pos[:, 1]), elevation, (X, Y), method='cubic')
# Plot the surface
surf = ax.plot_surface(X, Y, Z, cmap='viridis', alpha=0.6, linewidth=0)
# Add connections based on conditional mutual information
threshold = np.percentile(mi_conditional[mi_conditional > 0], 80)
for i in range(self.n_vars):
for j in range(i+1, self.n_vars):
if mi_conditional[i, j] > threshold:
ax.plot([pos[i, 0], pos[j, 0]],
[pos[i, 1], pos[j, 1]],
[elevation[i], elevation[j]],
'k-', alpha=0.5, linewidth=mi_conditional[i, j]*3)
# Remove duplicate legend entries
handles, labels = ax.get_legend_handles_labels()
by_label = dict(zip(labels, handles))
ax.legend(by_label.values(), by_label.keys(), loc='best')
ax.set_title('3D Information Topography: Elevation = Slow Mode Coupling')
ax.set_xlabel('Dimension 1')
ax.set_ylabel('Dimension 2')
ax.set_zlabel('Slow Mode Coupling')
return fig
def visualize_eigenvalue_spectrum(self):
"""Visualize the eigenvalue spectrum showing slow vs. fast modes."""
fig = plt.figure(figsize=plot.big_wide_figsize)
ax = fig.add_subplot(111)
# Plot eigenvalues
eigenvalues = np.sort(self.eigenvalues)
ax.semilogy(range(1, len(eigenvalues)+1), eigenvalues, 'o-')
# Highlight slow modes
ax.semilogy(range(1, self.n_slow_modes+1), eigenvalues[:self.n_slow_modes], 'ro', ms=10, label='Slow Modes')
# Add vertical line separating slow from fast modes
ax.axvline(x=self.n_slow_modes + 0.5, color='k', linestyle='--')
ax.text(self.n_slow_modes + 1, eigenvalues[self.n_slow_modes-1], 'Slow Modes',
ha='left', va='center', fontsize=12)
ax.text(self.n_slow_modes, eigenvalues[self.n_slow_modes], 'Fast Modes',
ha='right', va='center', fontsize=12)
ax.set_xlabel('Index')
ax.set_ylabel('Eigenvalue (log scale)')
ax.set_title('Eigenvalue Spectrum Showing Slow and Fast Modes')
ax.grid(True)
return fig
# Create the information topography visualization
topo_demo = InformationTopographyDemo(n_clusters=4, n_vars_per_cluster=5, n_slow_modes=2)
# Visualize eigenvalue spectrum
fig0 = topo_demo.visualize_eigenvalue_spectrum()
mlai.write_figure(filename='information-topography-eigenspectrum.svg',
directory='./information-game')
fig3 = topo_demo.visualize_information_landscape()
mlai.write_figure(filename='information-topography-2d.svg',
directory='./information-game')
fig4 = topo_demo.visualize_information_landscape_3d()
mlai.write_figure(filename='information-topography-3d.svg',
directory='./information-game')
Figure: Eigenvalue spectrum showing separation between slow and fast modes that shapes the information topography.
Figure: Information topography visualized as a 2D landscape with points positioned according to information distance.
Figure: 3D visualization of the information landscape where elevation represents coupling to slow modes.
The information topography visualizations now directly connect to the minimal entropy gradient framework. The eigenvalue spectrum shows the clear separation between slow and fast modes that shapes the entire information landscape. Variables that are strongly coupled to the same slow modes remain conditionally dependent even after accounting for slow modes, forming natural clusters in the topography.
The 2D landscape reveals how variables cluster based on their conditional information distances, with the background gradient showing the influence of the primary slow mode. The 3D visualization adds another dimension where elevation represents coupling strength to slow modes - variables with higher elevation have more global influence across the system.
This approach demonstrates how the conditional independence structure emerges naturally from the eigenvalue spectrum of the Fisher information matrix. The slow modes act as common causes that induce dependencies between otherwise independent variables, creating a rich information topography with valleys (strong dependencies) and ridges (conditional independence).
The conditional independence framework we’ve developed for spatial or structural organization can be extended naturally to the temporal domain. Just as slow modes induce conditional independence between different regions in space, they also mediate dependencies between different points in time.
If we divide $X$ into past/present $X_0$ and future $X_1$, we can analyze how information flows across time through the slow modes $M$. The entropy can be decomposed into a Markovian component, where $X_0$ and $X_1$ are conditionally independent given $M$, and a non-Markovian component. The conditional mutual information is
$$I(X_0; X_1|M) = \sum_{x_0, x_1, m} p(x_0, x_1, m) \log \frac{p(x_0, x_1|m)}{p(x_0|m)\,p(x_1|m)}.$$
When $I(X_0; X_1|M) = 0$, the system becomes perfectly Markovian - the slow modes capture all dependencies between past and future. This is analogous to how these same slow modes create conditional independence between spatial regions. The eigenvalue structure of the Fisher information matrix that gives rise to spatial modularity also determines the temporal memory capacity of the system.
Just as there is an information topography in space, we can define a temporal information landscape where “distance” corresponds to conditional mutual information between variables at different time points given M. Temporal watersheds emerge where the slow modes fail to bridge temporal dependencies, creating effective boundaries in the system’s dynamics.
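As an illustrative sketch (a hand-built scalar Gaussian chain, not derived from the game's dynamics), we can check numerically that when $X_0 \rightarrow M \rightarrow X_1$ forms a Markov chain the conditional mutual information $I(X_0; X_1|M)$ is approximately zero while the unconditional $I(X_0; X_1)$ is not.
import numpy as np
# Sketch: a scalar Gaussian chain X0 -> M -> X1. Conditioning on the slow mode M
# should remove the dependence between past (X0) and future (X1).
rng = np.random.default_rng(3)
n = 200000
x0 = rng.standard_normal(n)
m = 0.9 * x0 + 0.3 * rng.standard_normal(n)    # slow mode summarises the past
x1 = 0.8 * m + 0.4 * rng.standard_normal(n)    # future depends on the past only through m
Sigma = np.cov(np.vstack([x0, x1, m]))         # joint covariance, order: X0, X1, M
def gaussian_mi(Sigma, a, b, c=()):
    """(Conditional) mutual information I(a; b | c) for a Gaussian with covariance Sigma."""
    det = np.linalg.det
    sub = lambda s: Sigma[np.ix_(list(s), list(s))]
    num = det(sub(list(a) + list(c))) * det(sub(list(b) + list(c)))
    den = det(sub(list(a) + list(b) + list(c))) * (det(sub(list(c))) if c else 1.0)
    return 0.5 * np.log(num / den)
print("I(X0; X1)     =", gaussian_mi(Sigma, [0], [1]))        # clearly positive
print("I(X0; X1 | M) =", gaussian_mi(Sigma, [0], [1], [2]))   # approximately zero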
This framework highlights the tension in information processing systems. The slow modes must simultaneously:
1. Maintain minimal entropy (for efficiency)
2. Induce conditional independence between spatial regions (for modularity)
3. Capture temporal dependencies between past and future (for memory)
These competing objectives create an uncertainty principle: systems cannot simultaneously optimize for all three without trade-offs. Systems with strong spatial modularity may sacrifice temporal memory, while systems with excellent memory may require more complex slow mode structure.
So far, we have analyzed conditional independence structures given a predefined eigenvalue structure. A natural question is: can such structures emerge naturally from more fundamental principles? To address this, we can leverage the gradient ascent framework we developed earlier to demonstrate how conditional independence patterns emerge as the system evolves towards maximum entropy states.
This integration completes our theoretical picture: the eigenvalue structures that lead to locality through conditional independence are not arbitrary mathematical constructions, but natural consequences of entropy maximization under uncertainty constraints.
# Run a large-scale gradient ascent simulation to generate eigenvalue structure
n_clusters = 4
n_vars_per_cluster = 5
n_slow_modes = 2
n_pairs = n_clusters * n_vars_per_cluster # Total number of position-momentum pairs
total_dims = 2 * n_pairs + n_slow_modes # Total system dimensionality
steps = 100
# Initialize with minimal entropy state but with cross-cluster connections
Lambda_init = initialize_multidimensional_state(n_pairs,
squeeze_factors=[0.1 + 0.1*i for i in range(n_pairs)],
with_cross_connections=True)
# Run gradient ascent
eigenvalues_history, entropy_history = gradient_ascent_entropy(Lambda_init, steps, learning_rate=0.01)
# At different stages of gradient ascent, compute conditional independence metrics
stage_indices = [0, steps//4, steps//2, steps-1] # Initial, early, middle, final stages
# Create a conditional independence demo using the evolved eigenvalue structure
ci_demo = ConditionalIndependenceDemo(n_clusters=n_clusters,
n_vars_per_cluster=n_vars_per_cluster,
n_slow_modes=n_slow_modes)
# Track conditional mutual information at different gradient ascent stages
mi_stages = []
for stage in stage_indices:
# Use evolved eigenvalues to construct precision matrix
precision = ci_demo.precision.copy()
    # Update diagonal with evolved eigenvalues (truncated to this demo's dimensionality)
    np.fill_diagonal(precision, eigenvalues_history[stage][:precision.shape[0]])
    # Keep the matrix positive definite, as in ConditionalIndependenceDemo
    min_eig = np.min(np.linalg.eigvalsh(precision))
    if min_eig <= 0:
        precision += np.eye(precision.shape[0]) * (abs(min_eig) + 0.01)
    # Use the evolved precision so the mutual information reflects this stage
    ci_demo.precision = precision
    ci_demo.covariance = np.linalg.inv(precision)
# Compute mutual information matrices
mi_unconditional = ci_demo.compute_mutual_information_matrix(conditional_on_slow=False)
mi_conditional = ci_demo.compute_mutual_information_matrix(conditional_on_slow=True)
mi_stages.append({
'step': stage,
'unconditional': mi_unconditional,
'conditional': mi_conditional
})
# Visualize how conditional independence emerges through gradient ascent
fig = plt.figure(figsize=(15, 12))
gs = gridspec.GridSpec(2, 2, height_ratios=[1, 1])
# Plot eigenvalue evolution
ax1 = plt.subplot(gs[0, 0])
# Plot the trajectory of each eigenvalue over the gradient ascent steps
# (assumes eigenvalues_history stacks to an array of shape (steps, dims))
eig_hist = np.asarray(eigenvalues_history)
for i in range(eig_hist.shape[1]):
    if i >= 2*n_pairs:  # Slow modes
        ax1.semilogy(eig_hist[:, i], 'r-', alpha=0.7)
    else:  # Fast variables
        ax1.semilogy(eig_hist[:, i], 'b-', alpha=0.4)
# Highlight representative eigenvalues
ax1.semilogy(eig_hist[:, 0], 'b-', linewidth=2, label='Fast variable')
ax1.semilogy(eig_hist[:, -1], 'r-', linewidth=2, label='Slow mode')
ax1.set_xlabel('Gradient Ascent Step')
ax1.set_ylabel('Eigenvalue (log scale)')
ax1.set_title('Eigenvalue Evolution During Gradient Ascent')
ax1.legend()
ax1.grid(True, alpha=0.3)
# Plot entropy evolution
ax2 = plt.subplot(gs[0, 1])
ax2.plot(entropy_history)
ax2.set_xlabel('Gradient Ascent Step')
ax2.set_ylabel('Entropy')
ax2.set_title('Entropy Evolution')
ax2.grid(True, alpha=0.3)
# Plot conditional vs unconditional mutual information matrices for final stage
final_stage = mi_stages[-1]
# Plot unconditional mutual information
ax3 = plt.subplot(gs[1, 0])
im3 = ax3.imshow(final_stage['unconditional'], cmap='inferno')
ax3.set_title('Final Unconditional\nMutual Information')
ax3.set_xlabel('Fast variable index')
ax3.set_ylabel('Fast variable index')
# Add lines to delineate the clusters
for i in range(1, n_clusters):
idx = i * n_vars_per_cluster - 0.5
ax3.axhline(y=idx, color='white', linestyle='-', linewidth=0.5)
ax3.axvline(x=idx, color='white', linestyle='-', linewidth=0.5)
plt.colorbar(im3, ax=ax3)
# Plot conditional mutual information
ax4 = plt.subplot(gs[1, 1])
im4 = ax4.imshow(final_stage['conditional'], cmap='inferno')
ax4.set_title('Final Conditional Mutual Information\nGiven Slow Modes')
ax4.set_xlabel('Fast variable index')
ax4.set_ylabel('Fast variable index')
# Add lines to delineate the clusters
for i in range(1, n_clusters):
idx = i * n_vars_per_cluster - 0.5
ax4.axhline(y=idx, color='white', linestyle='-', linewidth=0.5)
ax4.axvline(x=idx, color='white', linestyle='-', linewidth=0.5)
plt.colorbar(im4, ax=ax4)
plt.tight_layout()
mlai.write_figure(filename='emergent-conditional-independence.svg',
directory='./information-game')
Figure: Through gradient ascent on entropy, we observe the emergence of eigenvalue structures that lead to conditional independence patterns. The top row shows eigenvalue and entropy evolution during gradient ascent. The bottom row shows the unconditional mutual information (left) and conditional mutual information given slow modes (right) at the final stage.
The experiment results reveal:
Natural eigenvalue separation: As the system evolves toward maximum entropy, we observe a natural separation of eigenvalues into “slow” and “fast” modes. The slow modes (those with small eigenvalues and thus large variances) tend to develop connections across different regions of the system.
Emergent conditional independence: The conditional mutual information matrix shows that, after conditioning on the slow modes, the dependencies between variables from different clusters are significantly reduced. This confirms that the conditional independence structure emerges naturally through entropy maximization.
Block structure in mutual information: Without conditioning, the mutual information matrix shows significant dependencies across different regions. After conditioning on the slow modes, a block structure emerges where variables within the same cluster remain dependent, but cross-cluster dependencies are minimized.
This demonstrates a profound connection: the mathematical structure required for locality through conditional independence is not an artificial construction, but emerges naturally from entropy maximization subject to uncertainty constraints. The slow modes that act as information reservoirs connecting different parts of the system arise as a consequence of the system seeking its maximum entropy configuration while respecting fundamental constraints.
This emergent locality provides a potential explanation for how complex systems can maintain both global coherence (through slow modes) and local autonomy (through conditional independence structures). It suggests that the hierarchical organization observed in many natural and artificial systems may be a natural consequence of information-theoretic principles rather than requiring explicit design.
The game exhibits three properties that emerge from the characteristic structure of the Fisher information matrix: information capacity, modularity, and memory.
Information Capacity: Mathematically expressed through the variances of the slow modes, where $\sigma_i^2 \propto 1/\lambda_i$. Smaller eigenvalues permit higher variance in corresponding directions, allowing more information to be carried. This capacity arises directly from entropy maximization under uncertainty constraints.
Modularity: Formalized through conditional independence relations $I(X_i; X_j|M) \approx 0$ between variables in different modules given the slow modes. When this conditional mutual information approaches zero, the precision matrix develops block structures that mathematically define spatial or functional modules.
Memory: Characterized by the temporal Markov property, where $I(X_0; X_1|M) = 0$ indicates that slow modes completely mediate dependencies between past and future states. This mathematical condition defines the system’s capacity to preserve relevant information across time.
The interrelationship between these properties can be understood by examining their mathematical definitions. All three depend on the same underlying eigenstructure of the Fisher information matrix, creating inherent constraints. This leads to a mathematical uncertainty relation between the three properties, governed by a system-dependent constant $k$.
Here, $C(M)$ is defined as the sum of the variances $\sigma_i^2$ of the slow modes, which is equivalently the sum of reciprocals of the eigenvalues $\lambda_i$ of the Fisher information matrix corresponding to these modes. This quantity mathematically represents the total information capacity of the slow modes - how much information they can effectively store or transmit. Higher capacity allows the slow modes to capture more complex dependencies across the system, but may require more physical resources to maintain.
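As a small illustration of this definition (using an arbitrary synthetic Fisher spectrum rather than one produced by the game), the capacity $C(M)$ can be read off directly from the slow-mode eigenvalues.
import numpy as np
# Sketch: information capacity of the slow modes from a synthetic Fisher spectrum.
eigenvalues = np.array([0.05, 0.08, 1.2, 1.5, 2.0, 2.4])   # assumed eigenvalues, smallest = slow modes
n_slow_modes = 2
slow = np.sort(eigenvalues)[:n_slow_modes]
capacity = np.sum(1.0 / slow)            # C(M) = sum_i sigma_i^2 = sum_i 1 / lambda_i
print(f"slow-mode eigenvalues: {slow}, capacity C(M) = {capacity:.1f}")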
This uncertainty relation emerges from the shared dependence on the eigenstructure. When a system increases the information capacity of slow modes to improve memory, these modes necessarily couple more variables across space, reducing modularity. Conversely, strong modularity requires specific eigenvalue patterns that may constrain the slow modes’ ability to capture temporal dependencies.
When examining the Markov property specifically, we observe that it emerges naturally when the eigenstructure allocates sufficient information capacity to slow modes to mediate temporal dependencies. The emergence or failure of Markovianity can be precisely quantified through I(X0;X1|M), where non-zero values indicate direct information pathways between past and future that bypass the slow mode bottleneck.
This mathematical framework reveals why no system can simultaneously maximize information capacity, modularity, and memory - the constraints are not design limitations but fundamental properties of information geometry. The eigenstructure must balance these properties based on the underlying physics of information propagation through the system.
Modularity and memory represent a duality in information processing systems. While they appear distinct - modularity concerns spatial/functional organization while memory concerns temporal dependencies - they are two manifestations of the same underlying mathematical structure.
Both properties emerge from conditional independence relationships mediated by the slow modes:
- Modularity: $I(X_i; X_j|M) \approx 0$ for variables in different spatial/functional modules
- Memory: $I(X_0; X_1|M) \approx 0$ for variables at different time points
This reveals a symmetry: modularity can be viewed as “spatial memory” where the slow modes maintain information about the relationships between different parts of the system. Conversely, memory can be viewed as “temporal modularity” where the slow modes create effective independence between past and future states, mediated by the present state of the slow modes.
The mathematical structures that support this duality are apparent when we examine dynamical systems over time. The same slow modes that create effective boundaries between spatial modules create bridges across time.
The eigenvalue structure of the Fisher information matrix determines both:
1. How effectively the system partitions into modules (spatial organization)
2. How effectively the system retains relevant information over time (temporal organization)
In hierarchical systems, the slow modes at each level of the hierarchy simultaneously fulfil both roles, inducing modularity among the variables at their level while preserving memory across time.
This perspective provides a unified framework for understanding how information is organized across both space and time in complex systems.
Digital memory can be viewed as a communication channel through time - storing a bit is equivalent to transmitting information to a future moment. This perspective immediately suggests that we look for a connection between Landauer’s erasure principle and Shannon’s channel capacity. The connection might arise because both these systems are about maintaining reliable information against thermal noise.
The Landauer limit (Landauer, 1961) is the minimum amount of heat energy that is dissipated when a bit of information is erased. Conceptually it is the potential energy associated with holding a bit at an identifiable single value that is distinguishable from the background thermal noise (represented by temperature).
The Gaussian channel capacity (Shannon, 1948) represents how identifiable a signal, $S$, is relative to the background noise, $N$. Here we briefly explore a potential relationship between these two quantities.
When we store a bit in memory, we maintain a signal that can be reliably distinguished from thermal noise, just as in a communication channel. This suggests a connection between Landauer’s limit for the erasure of one bit of information, $E_{\min} = k_B T$, and Shannon’s Gaussian channel capacity, $C = \frac{1}{2}\log_2\left(1 + \frac{S}{N}\right)$.
Landauer’s limit states that erasing one bit of information requires a minimum energy of $E_{\min} = k_B T$. For a communication channel operating over time $1/B$, the signal power is $S = EB$ and the noise power is $N = k_B T B$. This gives us
$$C = \frac{1}{2}\log_2\left(1 + \frac{S}{N}\right) = \frac{1}{2}\log_2\left(1 + \frac{E}{k_B T}\right).$$
When we operate at Landauer’s limit, setting $E = k_B T$, we get a signal-to-noise ratio of exactly one,
$$\frac{S}{N} = \frac{E}{k_B T} = 1.$$
The factor of 1/2 appears in Shannon’s formula because of Nyquist’s theorem - we need two samples per cycle at bandwidth B to represent a signal. The bandwidth B appears in both signal and noise power but cancels in their ratio, showing how Landauer’s energy-per-bit limit connects to Shannon’s bits-per-second capacity.
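To make the numbers concrete, here is a small sketch evaluating the capacity at the Landauer scale $E = k_B T$ (illustrative only; the temperature value is arbitrary): the signal-to-noise ratio is exactly one and the capacity is half a bit per channel use.
import numpy as np
# Sketch: Gaussian channel capacity with the signal energy at the Landauer scale E = k_B T.
k_B = 1.380649e-23   # Boltzmann constant, J/K
T = 300.0            # an arbitrary illustrative temperature, in kelvin
E = k_B * T          # operate at the Landauer scale discussed above
snr = E / (k_B * T)               # signal-to-noise ratio, exactly 1 here
C = 0.5 * np.log2(1 + snr)        # capacity in bits per channel use
print(f"SNR = {snr:.1f}, capacity = {C:.2f} bits per channel use")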
This connection suggests that Landauer’s limit may correspond to the energy needed to establish a signal-to-noise ratio sufficient to transmit one bit of information per second. The temperature T may set both the minimum energy scale for information erasure and the noise floor for information transmission.
This connection suggests that the fundamental limits on information processing may arise from the need to maintain signals above the thermal noise floor. Whether we’re erasing information (Landauer) or transmitting it (Shannon), we need to overcome the same fundamental noise threshold set by temperature.
This perspective suggests that both memory operations (erasure) and communication operations (transmission) are limited by the same physical principles. The temperature T emerges as a fundamental parameter that sets the scale for both energy requirements and information capacity.
The connection between Landauer’s limit and Shannon’s channel capacity is intriguing but remains speculative. For Landauer’s original work see Landauer (1961), for Bennett’s review and developments see Bennett (1982), and for a more recent overview and its connection to developments in non-equilibrium thermodynamics see Parrondo et al. (2015).
How can we detect transitions between quantum-like and classical behaviour? The moment generating function (MGF) of the entropy provides a potential route for analyzing these changes and detecting when variables transition between X and M.
For each variable in our system, we can compute its moment generating function (MGF),
$$M_{Z_i}(t) = E\left[e^{tZ_i}\right] = \exp\left(A(\theta + t e_i) - A(\theta)\right).$$
The behavior of this MGF reveals which regime each variable is operating in.
This provides a diagnostic tool to identify which variables are functioning as quantum-like information reservoirs versus classical processing components.
import numpy as np
# Visualizing MGF differences between quantum-like and classical variables
# Define example MGF functions
def quantum_like_mgf(t):
"""MGF with oscillatory behavior (quantum-like)"""
return np.exp(t**2/2) * np.cos(2*t)
def classical_mgf(t):
"""MGF with monotonic growth (classical)"""
return np.exp(t**2/2)
# Create a range of t values
t = np.linspace(-3, 3, 1000)
# Compute MGFs
qm_mgf = quantum_like_mgf(t)
cl_mgf = classical_mgf(t)
import matplotlib.pyplot as plt
import mlai.plot as plot
import mlai
# Create figure
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
# Plot MGFs
ax1.plot(t, qm_mgf, 'b-', label='Quantum-like variable')
ax1.plot(t, cl_mgf, 'r-', label='Classical variable')
ax1.set_xlabel('t')
ax1.set_ylabel('M(t)')
ax1.set_title('Moment (Cumulant) Generating Functions')
ax1.legend()
ax1.grid(True, linestyle='--', alpha=0.7)
# Plot log derivatives (oscillation index)
d_qm_log_mgf = np.gradient(np.log(np.abs(qm_mgf) + 1e-12), t)  # small offset avoids log(0) at the zeros of the cosine
d_cl_log_mgf = np.gradient(np.log(cl_mgf), t)
ax2.plot(t, d_qm_log_mgf, 'b-', label='Quantum-like variable')
ax2.plot(t, d_cl_log_mgf, 'r-', label='Classical variable')
ax2.set_xlabel('t')
ax2.set_ylabel('$\\frac{\\mathrm{d} \\log M(t)}{\\mathrm{d}t}$')
ax2.set_title('Log-Derivative of MGF')
ax2.legend()
ax2.grid(True, linestyle='--', alpha=0.7)
mlai.write_figure(filename='oscillation-in-cummulant-generating-function.svg',
directory = './information-game')
The oscillation in the derivative of the log-MGF provides a clear signature of quantum-like behavior. This “oscillation index” can be used to quantify how much a variable displays quantum versus classical characteristics.
This analysis offers a practical method to detect the quantum-classical transition in our information reservoirs without needing to directly observe the system’s internal state. It connects directly to information-theoretic channel properties and provides a bridge between our abstract model and experimentally observable quantities.
The uncertainty principle means that the game can exhibit quantum-like information processing regimes during evolution.
At minimal entropy states near the origin, the information reservoir has characteristics reminiscent of quantum systems.
Wave-like information encoding: The information reservoir near the origin necessarily encodes information in distributed, interference-capable patterns due to the uncertainty principle between parameters $\theta(M)$ and capacity variables $c(M)$.
Non-local correlations: Parameters are highly correlated through the Fisher information matrix, creating structures where information is stored in relationships rather than individual variables.
Uncertainty-saturated regime: The uncertainty relationship $\Delta\theta(M) \cdot \Delta c(M) \geq k$ is nearly saturated (approaches equality), similar to Heisenberg’s uncertainty principle in quantum systems.
As the system evolves towards higher entropy states, a transition occurs where some variables exhibit classical behavior.
From wave-like to particle-like: Variables transitioning from M to X shift from storing information in interference patterns to storing it in definite values with statistical uncertainty.
Decoherence-like process: The uncertainty product $\Delta\theta(M) \cdot \Delta c(M)$ for these variables grows significantly larger than the minimum value $k$, indicating a departure from quantum-like behavior.
Local information encoding: Information becomes increasingly encoded in local variables rather than distributed correlations.
The saddle points in our entropy landscape mark critical transitions between quantum-like and classical information processing regimes. Near these points, two kinds of behaviour coexist.
The critically slowed modes maintain quantum-like characteristics, functioning as coherent memory that preserves information through interference patterns.
The rapidly evolving modes exhibit classical characteristics, functioning as incoherent processors that manipulate information through statistical operations.
This natural separation creates a hybrid computational architecture where quantum-like memory interfaces with classical-like processing.
As the system evolves further toward higher entropy, a purely classical hierarchical memory structure can emerge. Unlike quantum-like reservoirs that rely on interference patterns and non-local correlations, classical information reservoirs in our system organize hierarchically:
Timescale Hierarchy: Variables separate into distinct timescale bands based on eigenvalues of the Fisher information matrix. Slower-changing variables (smaller eigenvalues) act as context for faster-changing variables (larger eigenvalues), creating a natural temporal hierarchy.
Markov Blanket Formation: Groups of variables form statistical “shields” or Markov blankets that conditionally separate one part of the system from another. This creates modular information processing units with relative statistical independence.
Mean-Field Dynamics: Fast variables respond to the average or “mean field” of slow variables, while slow variables integrate the statistics of fast variables. This two-way coupling creates stable hierarchical processing without requiring quantum coherence.
Scale-Free Organization: The hierarchical structure often exhibits scale-free properties, with similar statistical relationships appearing across different scales of organization. This enables efficient information compression and retrieval.
This classical hierarchical structure might be evident in systems with many variables and complex parameter spaces. It would emerge alongside the formation of conditional independence structures, $p(X|M) \approx \prod_i p(X_i|M_{\text{pa}(i)})$, where $\text{pa}(i)$ indexes the parent variables of $X_i$ in the hierarchy.
Such a hierarchical memory structure would maintain high information capacity through multiplexing across timescales rather than through quantum-like uncertainty relations. Variables at different levels of the hierarchy would simultaneously encode different aspects of information.
This classical hierarchical structure provides a powerful information processing architecture that emerges naturally from entropy maximization, without requiring quantum effects. Complex, efficient memory systems can develop purely through classical statistical mechanics when operating far from the minimal entropy regime.
The moment generating function $M_Z(t)$ still provides the diagnostic: classical hierarchical systems show distinct factorization patterns in the MGF that reflect the conditional independence structure, with each level of the hierarchy contributing characteristic timescales to the overall dynamics.
How do the zi variables transition between X and M? We need an approach to identifying when the character of variables has changed.
The moment generating function (MGF) can help identify transition candidates,
$$M_Z(t) = E\left[e^{t \cdot Z}\right] = \exp\left(A(\theta + t) - A(\theta)\right).$$
This transition can also be understood as a change in the Shannon channel characteristics of the variable - from a low-noise, precision-optimized channel (in M) to a high-bandwidth, high-entropy channel (in X).
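As a quick check of this expression (a sketch that assumes a one-dimensional Gaussian exponential family with unit variance, so the log-partition function is $A(\theta) = \theta^2/2$), the formula $\exp(A(\theta + t) - A(\theta))$ reproduces the familiar Gaussian moment generating function $\exp(\theta t + t^2/2)$.
import numpy as np
# Sketch: MGF of an exponential-family variable from its log-partition function.
# Assumes a 1-D Gaussian with unit variance and natural parameter theta (the mean),
# so that A(theta) = theta**2 / 2.
def log_partition(theta):
    return 0.5 * theta**2
def mgf_from_log_partition(theta, t):
    return np.exp(log_partition(theta + t) - log_partition(theta))
theta = 1.3
t = np.linspace(-1.0, 1.0, 5)
analytic = np.exp(theta * t + 0.5 * t**2)   # standard Gaussian MGF with mean theta, variance 1
assert np.allclose(mgf_from_log_partition(theta, t), analytic)
print(mgf_from_log_partition(theta, t))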
As the game evolves to classical regimes, a hierarchical memory structure can emerge. We illustrate the idea with a simple dynamical system example.
Consider a system with 8 variables that undergoes steepest ascent entropy maximization. As the system evolves, assume the eigenvalue spectrum of the Fisher information matrix separates into distinct timescales.
This implies a natural hierarchy where slow variables provide context for faster variables, and faster variables are guided by slower variables.
import numpy as np
import matplotlib.pyplot as plt
import mlai.plot as plot
import mlai
import networkx as nx

# Visualise the hierarchical memory structure as a directed graph
G = nx.DiGraph()

# Add nodes for different timescales (the eigenvalue sets the characteristic rate)
timescales = {
    'context': {'color': 'blue', 'size': 800, 'eigenvalue': 0.01},
    'long-term': {'color': 'green', 'size': 500, 'eigenvalue': 0.1},
    'intermediate': {'color': 'orange', 'size': 300, 'eigenvalue': 1.0},
    'processing': {'color': 'red', 'size': 100, 'eigenvalue': 10.0}
}

# Add nodes and hierarchical connections (slower levels feed faster levels)
for level in timescales:
    G.add_node(level, **timescales[level])

G.add_edge('context', 'long-term')
G.add_edge('context', 'intermediate')
G.add_edge('context', 'processing')
G.add_edge('long-term', 'intermediate')
G.add_edge('long-term', 'processing')
G.add_edge('intermediate', 'processing')

# Create figure
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)

# Get node attributes for drawing
node_colors = [G.nodes[n]['color'] for n in G.nodes]
node_sizes = [G.nodes[n]['size'] for n in G.nodes]

# Create a hierarchical layout with the slowest level at the top
pos = {
    'context': (0, 3),
    'long-term': (0, 2),
    'intermediate': (0, 1),
    'processing': (0, 0)
}

# Draw the network
nx.draw_networkx(G, pos, with_labels=True, node_color=node_colors,
                 node_size=node_sizes, font_color='white',
                 font_weight='bold', ax=ax, arrowsize=20)

# Add eigenvalue labels next to each node
for node, position in pos.items():
    eigenvalue = G.nodes[node]['eigenvalue']
    ax.text(position[0] + 0.2, position[1],
            f'$\\lambda = {eigenvalue}$',
            fontsize=12)

ax.set_title('Hierarchical Memory Organization')
ax.set_axis_off()

# Add a second plot showing update dynamics at each timescale
ax2 = fig.add_axes([0.6, 0.2, 0.35, 0.6])
t = np.linspace(0, 100, 1000)
for node, info in timescales.items():
    eigenvalue = info['eigenvalue']
    color = info['color']
    ax2.plot(t, np.sin(eigenvalue * t) * np.exp(-0.01 * t),
             color=color, label=f"{node} (λ={eigenvalue})")
ax2.set_xlabel('Time')
ax2.set_ylabel('Parameter value')
ax2.set_title('Update Dynamics at Different Timescales')
ax2.legend()
ax2.grid(True, linestyle='--', alpha=0.7)

mlai.write_figure(filename='hierarchical-memory-organisation-example.svg',
                  directory='./information-game')
Figure: A hierarchical memory structure emerges naturally during entropy maximization. The timescale separation creates a computational architecture where different levels operate at different characteristic timescales.
The hierarchy is important in understanding how information reservoirs can achieve high capacity (entropy) without underlying quantum-like interference effects. Different variables are characterised by their eigenvalues in the Fisher information matrix.
The Jaynes’ world game illustrates fundamental principles of information dynamics.
Information Conservation: Total information remains constant but redistributes between structure and randomness. This follows from the fundamental uncertainty principle between parameters and capacity. As parameters become less precisely specified, capacity increases.
Uncertainty Principle: Precision in parameters trades off with entropy capacity. This is not merely a mathematical constraint but a necessary feature of any physical information reservoir that must maintain both stability and sufficient capacity.
Self-Organization: The system autonomously navigates toward maximum entropy while maintaining necessary structure through critically slowed modes. These modes function as information reservoirs that preserve essential constraints while allowing maximum entropy production elsewhere.
Information-Energy Duality: The framework connects to thermodynamic concepts through the relationship between entropy production and available work. As shown by Sagawa and Ueda, information gain can be translated into extractable work, suggesting that our entropy game has a direct thermodynamic interpretation.
The information-modified second law indicates that the maximum extractable work is increased by $k_B T \cdot I(X;M)$, where $I(X;M)$ is the mutual information between observable variables and memory. This creates a direct connection between our information reservoir model and physical thermodynamic systems.
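To make the bound concrete, here is a small numerical sketch (the temperature and error probabilities are assumed values): for a uniform binary variable X recorded into memory M through a symmetric channel with error probability ε, the mutual information is $I(X;M) = \ln 2 - h(\varepsilon)$ nats, and the information-modified second law caps the extra extractable work at $k_B T \cdot I(X;M)$. With ε = 0 this recovers the familiar $k_B T \ln 2$ of a Szilard engine.
import numpy as np

k_B = 1.380649e-23   # Boltzmann constant, J/K
T = 300.0            # assumed temperature in kelvin

def mutual_information_binary(eps):
    # I(X;M) in nats for a uniform binary X recorded through a symmetric
    # channel that flips the stored value with probability eps.
    def h(p):
        # binary entropy in nats, with the 0 log 0 = 0 convention
        return 0.0 if p in (0.0, 1.0) else -p * np.log(p) - (1 - p) * np.log(1 - p)
    return np.log(2) - h(eps)

for eps in [0.0, 0.1, 0.25, 0.5]:
    I = mutual_information_binary(eps)
    extra_work = k_B * T * I   # information-modified second law bound, in joules
    print(f"error prob {eps:.2f}: I(X;M) = {I:.3f} nats, "
          f"max extra work = {extra_work:.3e} J")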
Jaynes’ world, a zero-player game, provides a mathematical model for studying how complex systems evolve when they instantaneously maximize entropy production.
Our analysis suggests the game could illustrate the fundamental principles of information dynamics, including information conservation, an uncertainty principle, self-organization, and information-energy duality.
The game’s architecture should naturally organize into memory and processing components, without requiring explicit design.
The game’s temporal dynamics are based on steepest ascent in parameter space. This allows analysis through the eigenvalue structure of the Fisher information matrix, which creates a natural separation of timescales and the natural emergence of information reservoirs.
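As a minimal sketch of this claim (assuming independent zero-mean Gaussian variables parameterised by their precisions, with an arbitrary initial spectrum and step size), we can run steepest ascent on the entropy and watch the Fisher information eigenvalues: high-precision directions have tiny eigenvalues and barely move (memory-like), while low-precision directions have large eigenvalues and change quickly (processing-like).
import numpy as np

# Steepest ascent of entropy for independent zero-mean Gaussians parameterised
# by precisions lambda_i. Entropy H = sum_i 0.5*(log(2*pi*e) - log(lambda_i)),
# so dH/dlambda_i = -1/(2*lambda_i). The Fisher information about lambda_i is
# 1/(2*lambda_i^2), so widely separated precisions give widely separated
# eigenvalues and hence widely separated natural timescales.
lam = np.array([100.0, 10.0, 1.0, 0.5])   # assumed initial precisions
eta = 0.01                                 # assumed step size

def entropy(lam):
    return np.sum(0.5 * (np.log(2 * np.pi * np.e) - np.log(lam)))

for step in range(5):
    grad = -1.0 / (2.0 * lam)              # dH/dlambda
    lam = lam + eta * grad                 # steepest ascent step: entropy increases
    fisher_eigs = 1.0 / (2.0 * lam**2)     # diagonal of the Fisher information matrix
    print(f"step {step}: H = {entropy(lam):.4f}, "
          f"Fisher eigenvalues = {np.round(fisher_eigs, 5)}")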
There are multiple perspectives we can take on optimal decision making: entropy games, thermodynamic information engines, least action principles (and optimal control), and Schrödinger’s bridge each provide a different view. Through introducing Jaynes’ world we aim to explore the relationship between these views of decision making, building a more complete picture of the limitations and possibilities for making optimal decisions.
The multiple perspectives we’ve explored - entropy games, information engines, least action principles, and Schrödinger’s bridge - provide complementary views of intelligence as optimal information processing. Each framework highlights different aspects of this fundamental process:
The Entropy Game shows us that intelligence can be measured by how efficiently a system reduces uncertainty through strategic questioning or observation.
Information Engines reveal how intelligence converts information into useful work, subject to thermodynamic constraints.
Least Action Principles demonstrate that intelligence follows optimal paths through information space, minimizing cumulative uncertainty.
Schrödinger’s Bridge illuminates how intelligence can be viewed as optimal transport of probability distributions, finding the most likely paths between states of knowledge.
These perspectives converge on a unified view: intelligence is fundamentally about optimal information processing. Whether we’re discussing human cognition, artificial intelligence, or biological systems, the capacity to efficiently acquire, process, and utilize information lies at the core of intelligent behavior.
This unified perspective offers promising directions for both theoretical research and practical applications. By understanding intelligence through the lens of information theory and thermodynamics, we may develop more principled approaches to artificial intelligence, gain deeper insights into cognitive processes, and discover fundamental limits on what intelligence can achieve.
For more information on these subjects and more you might want to check the following resources.
Alemi, A.A., Fischer, I., 2019. TherML: The thermodynamics of machine learning. arXiv Preprint arXiv:1807.04162.
Amari, S., 2016. Information geometry and its applications, Applied mathematical sciences. Springer, Tokyo. https://doi.org/10.1007/978-4-431-55978-8
Ashby, W.R., 1952. Design for a brain: The origin of adaptive behaviour. Chapman & Hall, London.
Barato, A.C., Seifert, U., 2014. Stochastic thermodynamics with information reservoirs. Physical Review E 90, 042150. https://doi.org/10.1103/PhysRevE.90.042150
Beckner, W., 1975. Inequalities in Fourier analysis. Annals of Mathematics 159–182. https://doi.org/10.2307/1970980
Bennett, C.H., 1982. The thermodynamics of computation—a review. International Journal of Theoretical Physics 21, 906–940.
Białynicki-Birula, I., Mycielski, J., 1975. Uncertainty relations for information entropy in wave mechanics. Communications in Mathematical Physics 44, 129–132. https://doi.org/10.1007/BF01608825
Boltzmann, L., n.d. Über die Beziehung zwischen dem zweiten Hauptsatze der mechanischen Wärmetheorie und der Wahrscheinlichkeitsrechnung respektive den Sätzen über das Wärmegleichgewicht. Sitzungsberichte der Kaiserlichen Akademie der Wissenschaften. Mathematisch-Naturwissenschaftliche Classe. Abt. II LXXVI, 373–435.
Brillouin, L., 1951. Maxwell’s demon cannot operate: Information and entropy. i. Journal of Applied Physics 22, 334–337. https://doi.org/10.1063/1.1699951
Bub, J., 2001. Maxwell’s demon and the thermodynamics of computation. Studies in History and Philosophy of Science Part B: Modern Physics 32, 569–579. https://doi.org/10.1016/S1355-2198(01)00023-5
Conway, F., Siegelman, J., 2005. Dark hero of the information age: In search of Norbert Wiener, the father of cybernetics. Basic Books, New York.
Eddington, A.S., 1929. The nature of the physical world. Dent (London). https://doi.org/10.2307/2180099
Hirschman Jr, I.I., 1957. A note on entropy. American Journal of Mathematics 79, 152–156. https://doi.org/10.2307/2372390
Hosoya, A., Maruyama, K., Shikano, Y., 2015. Operational derivation of Boltzmann distribution with Maxwell’s demon model. Scientific Reports 5, 17011. https://doi.org/10.1038/srep17011
Hosoya, A., Maruyama, K., Shikano, Y., 2011. Maxwell’s demon and data compression. Phys. Rev. E 84, 061117. https://doi.org/10.1103/PhysRevE.84.061117
Jarzynski, C., 1997. Nonequilibrium equality for free energy differences. Physical Review Letters 78, 2690–2693. https://doi.org/10.1103/PhysRevLett.78.2690
Jaynes, E.T., 1963. Information theory and statistical mechanics, in: Ford, K.W. (Ed.), Brandeis University Summer Institute Lectures in Theoretical Physics, Vol. 3: Statistical Physics. W. A. Benjamin, Inc., New York, pp. 181–218.
Jaynes, E.T., 1957. Information theory and statistical mechanics. Physical Review 106, 620–630. https://doi.org/10.1103/PhysRev.106.620
Landauer, R., 1961. Irreversibility and heat generation in the computing process. IBM Journal of Research and Development 5, 183–191. https://doi.org/10.1147/rd.53.0183
Maxwell, J.C., 1871. Theory of heat. Longmans, Green; Co, London.
Mikhailov, G.K., n.d. Daniel Bernoulli, Hydrodynamica (1738).
Parrondo, J.M.R., Horowitz, J.M., Sagawa, T., 2015. Thermodynamics of information. Nature Physics 11, 131–139. https://doi.org/10.1038/nphys3230
Sagawa, T., Ueda, M., 2010. Generalized Jarzynski equality under nonequilibrium feedback control. Physical Review Letters 104, 090602. https://doi.org/10.1103/PhysRevLett.104.090602
Sagawa, T., Ueda, M., 2008. Second law of thermodynamics with discrete quantum feedback control. Physical Review Letters 100, 080403. https://doi.org/10.1103/PhysRevLett.100.080403
Shannon, C.E., 1948. A mathematical theory of communication. The Bell System Technical Journal 27, 379–423, 623–656. https://doi.org/10.1002/j.1538-7305.1948.tb01338.x
Sharp, K., Matschinsky, F., 2015. Translation of Ludwig Boltzmann’s paper “on the relationship between the second fundamental theorem of the mechanical theory of heat and probability calculations regarding the conditions for thermal equilibrium.” Entropy 17, 1971–2009. https://doi.org/10.3390/e17041971
Szilard, L., 1929. Über die Entropieverminderung in einem thermodynamischen System bei Eingriffen intelligenter Wesen. Zeitschrift für Physik 53, 840–856. https://doi.org/10.1007/BF01341281