A real-valued random variable is a mapping from the outcome space $S$ to the real line $\Re$. A real-valued random variable $X$ can be characterized by its probability distribution, which specifies, for a suitable collection of subsets of the real line $\Re$ (a sigma-algebra), the chance that the value of $X$ will be in each such subset. There are technical requirements regarding measurability, which we will generally ignore. Perhaps the most natural mathematical setting for probability theory involves Lebesgue integration; we will largely ignore the difference between a Riemann integral and a Lebesgue integral.
Let $\mathbb{P}_X$ denote the probability distribution of the random variable $X$. Then if $A \subset \Re$, $\mathbb{P}_X(A) = \mathbb{P}\{X \in A\}$. We write $X \sim \mathbb{P}_X$, pronounced "X is distributed as $\mathbb{P}_X$" or "X has distribution $\mathbb{P}_X$."
If two random variables X and Y have the same distribution, we write X∼Y and we say that X and Y are identically distributed.
Real-valued random variables can be continuous, discrete, or mixed (general).
Continuous random variables have probability density functions with respect to Lebesgue measure. If $X$ is a continuous random variable, there is some nonnegative function $f(x)$, the probability density of $X$, such that for any (suitable) set $A \subset \Re$, $\mathbb{P}\{X \in A\} = \int_A f(x)\,dx$.
Example. Let $f(x) = \lambda e^{-\lambda x}$ for $x \ge 0$, with $\lambda > 0$ fixed, and $f(x) = 0$ otherwise. Clearly $f(x) \ge 0$, and $\int_{-\infty}^{\infty} f(x)\,dx = \int_0^{\infty} \lambda e^{-\lambda x}\,dx = -e^{-\lambda x}\Big|_0^{\infty} = -0 + 1 = 1$.
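As a quick numerical check (not part of the original notes), the sketch below integrates this density with scipy.integrate.quad and confirms the result is 1 up to numerical error; the choice of rate, here 2, is arbitrary.

import numpy as np
from scipy import integrate

lam = 2.0    # arbitrary rate for the check
total, err = integrate.quad(lambda x: lam*np.exp(-lam*x), 0, np.inf)
print(total)  # should be 1 up to numerical error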
Example. Let $a$ and $b$ be real numbers with $a < b$, and let $f(x) = \frac{1}{b-a}$ for $x \in [a,b]$ and $f(x) = 0$ otherwise. Then $f(x) \ge 0$ and $\int_{-\infty}^{\infty} f(x)\,dx = \int_a^b \frac{1}{b-a}\,dx = 1$, so $f(x)$ can be the probability density function of a continuous random variable. A random variable with this density is said to be uniformly distributed on the interval $[a,b]$.
Discrete random variables assign all their probability to some countable set of points $\{x_i\}_{i=1}^n$, where $n$ might be infinite. Discrete random variables have probability mass functions. If $X$ is a discrete random variable, there is a nonnegative function $p$, the probability mass function of $X$, such that for any set $A \subset \Re$, $\mathbb{P}\{X \in A\} = \sum_{i: x_i \in A} p(x_i)$.
Example. Fix $\lambda > 0$. Let $x_i = i-1$ for $i = 1, 2, \ldots$, and let $p(x_i) = e^{-\lambda}\lambda^{x_i}/x_i!$. Then $p(x_i) > 0$ and $\sum_{i=1}^{\infty} p(x_i) = e^{-\lambda}\sum_{j=0}^{\infty} \lambda^j/j! = e^{-\lambda}e^{\lambda} = 1$.
Example. Let $x_i = i$ for $i = 1, \ldots, n$, and let $p(x_i) = 1/n$ and $p(x) = 0$ otherwise. Then $p(x) \ge 0$ and $\sum_i p(x_i) = 1$. Hence, $p(x)$ can be the probability mass function of a discrete random variable. A random variable with this probability mass function is said to be uniformly distributed on $\{1, \ldots, n\}$.
Example. Fix an integer $n \ge 1$ and $p \in [0,1]$. Let $x_i = i-1$ for $i = 1, \ldots, n+1$, let $p(x_i) = \binom{n}{x_i} p^{x_i}(1-p)^{n-x_i}$, and let $p(x) = 0$ otherwise. Then $p(x) \ge 0$ and $\sum_i p(x_i) = \sum_{j=0}^{n} \binom{n}{j} p^j (1-p)^{n-j} = 1$, so $p(x)$ can be the probability mass function of a discrete random variable. A random variable with this probability mass function is said to be binomially distributed with parameters $n$ and $p$.
For general random variables, the chance that X is in some subset of ℜ cannot be written as a sum or as a Riemann integral; it is more naturally represented as a Lebesgue integral (with respect to a measure other than Lebesgue measure). For example, imagine a random variable X that has probability α of being equal to zero; and if X is not zero, it has a uniform distribution on the interval [0,1]. Such a random variable is neither continuous nor discrete.
Most of the random variables in this class are either discrete or continuous.
If $X$ is a random variable such that, for some constant $x_1 \in \Re$, $\mathbb{P}(X = x_1) = 1$, then $X$ is called a constant random variable.
Exercise: check numerically that the binomial probabilities sum to 1: for values of $p$ equispaced on the interval $(0,1)$, find the maximum absolute value of the difference between the sum and 1.
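A minimal sketch of one way to do this exercise (the choices of $n = 10$ trials and a 99-point grid for $p$ are mine, not from the notes): compute the binomial probabilities with scipy.special.comb for each $p$ on the grid and record how far each sum is from 1.

import numpy as np
from scipy import special

n = 10                                # assumed number of trials
ps = np.linspace(0.01, 0.99, 99)      # values of p equispaced on (0,1)
worst = 0.0
for p in ps:
    j = np.arange(n+1)
    total = np.sum(special.comb(n, j) * p**j * (1-p)**(n-j))
    worst = max(worst, abs(total - 1.0))
print(worst)   # maximum absolute difference between the sum and 1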
Example. Fix $p \in (0,1)$. Let $x_i = i$ for $i = 1, 2, \ldots$, and let $p(x_i) = p(1-p)^{x_i - 1}$ and $p(x) = 0$ otherwise. Then $p(x) \ge 0$ and $\sum_i p(x_i) = p\sum_{j=0}^{\infty} (1-p)^j = p \cdot \frac{1}{1-(1-p)} = 1$. (A random variable with this probability mass function is said to be geometrically distributed with parameter $p$.)
The cumulative distribution function or cdf of a real-valued random variable $X$ is the chance that the variable is less than or equal to $x$, as a function of $x$. Cumulative distribution functions are often denoted with capital Roman letters ($F$ is especially common notation):
$$F_X(x) \equiv \mathbb{P}(X \le x).$$ Clearly: $0 \le F_X(x) \le 1$ for all $x$; $F_X$ is nondecreasing; and $\lim_{x \to -\infty} F_X(x) = 0$ while $\lim_{x \to +\infty} F_X(x) = 1$.
The cdf of a continuous real-valued random variable is a continuous function. The cdf of a discrete real-valued random variable is piecewise constant, with jumps at the possible values of the random variable. If the cdf of a real-valued random variable has jumps and also regions where it is not constant, the random variable is neither continuous nor discrete.
[To Do]
# boilerplate
%matplotlib inline
from __future__ import division
import math
import numpy as np
import scipy as sp
from scipy import stats # distributions
from scipy import special # special functions
import matplotlib.pyplot as plt
from ipywidgets import interact, interactive, FloatRangeSlider, fixed # interactive stuff
# Examples of densities and cdfs
# U[0,1]
def pltUnif(a, b):
    # plot the U[a,b] density and cdf
    ffac = 0.1
    s = b - a
    fudge = ffac*s
    x = np.arange(a-fudge, b+fudge, s/200)
    y = np.ones(len(x))/s       # density is 1/(b-a) on [a,b]
    y[x < a] = 0.0              # zero for x < a
    y[x > b] = 0.0              # zero for x > b
    Y = (x - a)/s               # uniform cdf is linear on [a,b]
    Y[x < a] = 0.0
    Y[x >= b] = 1.0
    plt.plot(x, y, 'b-', x, Y, 'r-', linewidth=2)
    plt.plot((a-fudge, b+fudge), (0.5, 0.5), 'g--')  # horizontal green dashed line at 0.5
    plt.plot((a-fudge, b+fudge), (0, 0), 'k-')       # horizontal black line at 0
    plt.xlabel('$x$')  # axis labels. Can use LaTeX math markup
    plt.ylabel(r'$f(x) = 1_{[a,b]}/(b-a)$; $F(x)$')
    plt.axis([a-fudge, b+fudge, -0.1, (1+ffac)*max(1, 1/s)])  # axis limits
    plt.title('The $U[$' + str(a) + ',' + str(b) + '$]$ density and cdf')
    plt.show()
# interact() needs one control per argument; use a FloatRangeSlider for [a, b] and unpack its value
interact(lambda ab: pltUnif(*ab),
         ab=FloatRangeSlider(min=-5, max=5, step=0.05, value=[-1, 1], description='[a, b]'))
# Exponential(lambda)
def plotExp(lam):
    # plot the exponential(lambda) density and cdf
    ffac = 0.05
    x = np.arange(0, 5/lam, (5/lam)/200)
    y = sp.stats.expon.pdf(x, scale=1/lam)
    Y = sp.stats.expon.cdf(x, scale=1/lam)
    plt.plot(x, y, 'b-', x, Y, 'r-', linewidth=2)
    plt.plot((-.1, (1+ffac)*np.max(x)), (0.5, 0.5), 'g--')  # horizontal line at 0.5
    plt.plot((-.1, (1+ffac)*np.max(x)), (1, 1), 'k:')       # horizontal line at 1
    plt.xlabel('$x$')  # axis labels. Can use LaTeX math markup
    plt.ylabel(r'$f(x) = \lambda e^{-\lambda x}$; $F(x) = 1-e^{-\lambda x}$')
    plt.title(r'The exponential density and cdf for $\lambda=$' + str(lam))
    plt.axis([-.1, (1+ffac)*np.max(x), -0.1, (1+ffac)*max(1, lam)])  # axis limits
    plt.show()

interact(plotExp, lam=(0.5, 10, 0.5))  # keep lambda > 0 to avoid dividing by zero
Often we work with more than one random variable at a time. Indeed, much of this course concerns random vectors, the components of which are individual real-valued random variables.
The joint probability distribution of a collection of random variables $\{X_i\}_{i=1}^n$ gives the probability that the variables simultaneously fall in subsets of their possible values. That is, for every (suitable) subset $A \subset \Re^n$, the joint probability distribution of $\{X_i\}_{i=1}^n$ gives $\mathbb{P}\{(X_1, \ldots, X_n) \in A\}$.
An event determined by the random variable $X$ is an event of the form $\{X \in A\}$, where $A \subset \Re$.
An event determined by the random variables $\{X_j\}_{j \in J}$ is an event of the form $\{(X_j)_{j \in J} \in A\}$, where $A \subset \Re^{\#J}$.
Two random variables X1 and X2 are independent if every event determined by X1 is independent of every event determined by X2. If two random variables are not independent, they are dependent.
A collection of random variables $\{X_i\}_{i=1}^n$ is independent if every event determined by every subset of those variables is independent of every event determined by any disjoint subset of those variables. If a collection of random variables is not independent, it is dependent.
Loosely speaking, a collection of random variables is independent if learning the values of some of them tells you nothing about the values of the rest of them. If learning the values of some of them tells you anything about the values of the rest of them, the collection is dependent.
For instance, imagine tossing a fair coin twice and rolling a fair die. Let X1 be the number of times the coin lands heads, and X2 be the number of spots that show on the die. Then X1 and X2 are independent: learning how many times the coin lands heads tells you nothing about what the die did.
On the other hand, let X1 be the number of times the coin lands heads, and let X2 be the sum of the number of heads and the number of spots that show on the die. Then X1 and X2 are dependent. For instance, if you know the coin landed heads twice, you know that the sum of the number of heads and the number of spots must be at least 3.
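A small simulation sketch (not in the original notes) of these two cases; the seed and the number of replications are arbitrary. In the independent case, conditioning on two heads leaves the distribution of the die unchanged; in the dependent case, conditioning on two heads forces the sum to be at least 3.

import numpy as np

rng = np.random.default_rng(12345)     # arbitrary seed
reps = 100000
coins = rng.integers(0, 2, size=(reps, 2)).sum(axis=1)   # X1: heads in two fair tosses
die = rng.integers(1, 7, size=reps)                      # spots on a fair die
sums = coins + die                                       # X2 in the dependent example

# independent case: learning X1 doesn't change the distribution of the die
print(np.mean(die == 6), np.mean(die[coins == 2] == 6))    # both near 1/6

# dependent case: given X1 = 2, the sum is always at least 3
print(np.mean(sums >= 3), np.mean(sums[coins == 2] >= 3))  # the second is 1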
See SticiGui: The Long Run and the Expected Value for an elementary introduction to expectation.
The expectation or expected value of a random variable $X$, denoted $\mathbb{E}X$, is a probability-weighted average of its possible values. From a frequentist perspective, it is the long-run limit (in probability) of the average of its values in repeated experiments. The expected value of a real-valued random variable (when it exists) is a fixed number, not a random value. The expected value depends on the probability distribution of $X$ but not on any realized value of $X$. If two random variables have the same probability distribution, they have the same expected value.
The expected value of a constant random variable is that constant.
If $X$ is a continuous real-valued random variable with density $f(x)$, then the expected value of $X$ is $$\mathbb{E}X = \int_{-\infty}^{\infty} x f(x)\,dx,$$ provided the integral exists.
If $X$ is a discrete real-valued random variable with probability mass function $p$, then the expected value of $X$ is $$\mathbb{E}X = \sum_{i=1}^{\infty} x_i p(x_i),$$ provided the sum exists.
Suppose X can take only two values, 0 and 1, and the probability that X=1 is p. Then
$$\mathbb{E}X = 1 \times p + 0 \times (1-p) = p.$$
[To do.] Derive the binomial distribution as the number of successes in $n$ iid Bernoulli trials.
The number of successes X in n trials is equivalent to the sum of indicators for the success in each trial. That is,
$$X = \sum_{i=1}^n X_i,$$ where $X_i = 1$ if the $i$th trial results in success, and $X_i = 0$ otherwise. By the additive property of expectation,
$$\mathbb{E}X = \mathbb{E}\sum_{i=1}^n X_i = \sum_{i=1}^n \mathbb{E}X_i = \sum_{i=1}^n p = np.$$
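A quick check of this calculation (not in the original notes), summing $j \cdot \mathbb{P}\{X = j\}$ directly from the binomial pmf in scipy.stats; the values $n = 10$ and $p = 0.3$ are arbitrary.

import numpy as np
from scipy import stats

n, p = 10, 0.3                    # arbitrary parameters for the check
j = np.arange(n+1)
print(np.sum(j * stats.binom.pmf(j, n, p)), n*p)   # both should be 3.0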
The number of trials to the first success in iid Bernoulli($p$) trials has a geometric distribution with parameter $p$.
[To do.] Derive the geometric distribution and calculate its expectation.
The number of trials to the kth success in iid Bernoulli(p) trials has a negative binomial distribution with parameters p and k.
[To do.] Derive the negative binomial.
The number of trials $X$ until the $k$th success in iid Bernoulli($p$) trials can be written as the number of trials until the first success, plus the number of additional trials until the second success, plus …, plus the number of additional trials until the $k$th success. Each of those $k$ "waiting times" $X_i$ has a geometric distribution with parameter $p$. Hence
$$\mathbb{E}X = \mathbb{E}\sum_{i=1}^k X_i = \sum_{i=1}^k \mathbb{E}X_i = \sum_{i=1}^k 1/p = k/p.$$
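A simulation sketch (mine, not from the notes) of this waiting-time argument: numpy's geometric generator returns the number of trials to a single success, so summing $k$ of them gives the number of trials to the $k$th success. The seed and the values $k = 5$, $p = 0.25$ are arbitrary; the sample mean should be close to $k/p = 20$.

import numpy as np

rng = np.random.default_rng(0)        # arbitrary seed
k, p, reps = 5, 0.25, 100000          # arbitrary parameters for the check
trials = rng.geometric(p, size=(reps, k)).sum(axis=1)   # trials to the kth success
print(trials.mean(), k/p)             # sample mean should be close to 20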
[To do.] Derive the hypergeometric distribution.
Population of $N$ numbers, of which $G$ equal 1 and $N-G$ equal 0. $X$ is the number of 1s in a sample of size $n$ drawn without replacement.
$$\mathbb{P}\{X = x\} = \frac{\binom{G}{x}\binom{N-G}{n-x}}{\binom{N}{n}}.$$
[To do.] Calculate the expected value. Use random permutations of "tickets" to show that the expected value in each position is $G/N$.
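A sketch of one way to carry out the permutation illustration in the to-do item (the parameter values and seed are my choices): shuffle a population of $G$ ones and $N-G$ zeros, take the first $n$ positions as the sample, and compare the average count of 1s to $nG/N$. Each single position has expected value $G/N$ by symmetry, so the expected count in the sample is $nG/N$.

import numpy as np

rng = np.random.default_rng(1)          # arbitrary seed
N, G, n, reps = 20, 8, 5, 50000         # arbitrary parameters for the illustration
tickets = np.array([1]*G + [0]*(N-G))   # the population of 0/1 "tickets"
counts = np.array([rng.permutation(tickets)[:n].sum() for _ in range(reps)])
print(counts.mean(), n*G/N)             # sample mean should be close to 2.0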
See SticiGui: Standard Error for an elementary introduction to variance and standard error.
The variance of a random variable $X$ is $\mathrm{Var}\,X = \mathbb{E}(X - \mathbb{E}X)^2$.
Algebraically, the following identity holds: $$\mathrm{Var}\,X = \mathbb{E}(X - \mathbb{E}X)^2 = \mathbb{E}\left(X^2 - 2X\mathbb{E}X + (\mathbb{E}X)^2\right) = \mathbb{E}X^2 - 2(\mathbb{E}X)^2 + (\mathbb{E}X)^2 = \mathbb{E}X^2 - (\mathbb{E}X)^2.$$
The standard error of a random variable $X$ is $\mathrm{SE}\,X = \sqrt{\mathrm{Var}\,X}$.
If $\{X_i\}_{i=1}^n$ are independent, then $\mathrm{Var}\sum_{i=1}^n X_i = \sum_{i=1}^n \mathrm{Var}\,X_i$.
If $X$ and $Y$ have a joint distribution, then $\mathrm{cov}(X,Y) = \mathbb{E}\left[(X - \mathbb{E}X)(Y - \mathbb{E}Y)\right]$. It follows from this definition (and the commutativity of multiplication) that $\mathrm{cov}(X,Y) = \mathrm{cov}(Y,X)$. Also, $\mathrm{Var}(X+Y) = \mathrm{Var}\,X + \mathrm{Var}\,Y + 2\,\mathrm{cov}(X,Y)$.
If $X$ and $Y$ are independent, $\mathrm{cov}(X,Y) = 0$. However, the converse is not necessarily true: $\mathrm{cov}(X,Y) = 0$ does not in general imply that $X$ and $Y$ are independent.
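A standard counterexample (not from the original notes), worked exactly from the probability mass function: take $X$ uniform on $\{-1, 0, 1\}$ and $Y = X^2$. Then $\mathrm{cov}(X,Y) = 0$, yet $Y$ is completely determined by $X$, so $X$ and $Y$ are dependent.

import numpy as np

xs = np.array([-1, 0, 1])              # X uniform on {-1, 0, 1}
px = np.array([1/3, 1/3, 1/3])
ys = xs**2                             # Y = X^2

EX = np.sum(xs*px)
EY = np.sum(ys*px)
cov = np.sum((xs - EX)*(ys - EY)*px)
print(cov)                             # 0 (up to floating point)

# but X and Y are dependent: P{Y=1 | X=1} = 1, while P{Y=1} = 2/3
print(1.0, np.sum(px[ys == 1]))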
Suppose $\{X_i\}_{i=1}^n$ are jointly distributed random variables, and let $X = \begin{pmatrix} X_1 \\ \vdots \\ X_n \end{pmatrix}$.
The expected value of $X$ is $\mathbb{E}X \equiv \begin{pmatrix} \mathbb{E}X_1 \\ \vdots \\ \mathbb{E}X_n \end{pmatrix}$.
The covariance matrix of $X$ is
$$\mathrm{cov}\,X \equiv \mathbb{E}\left[\begin{pmatrix} X_1 - \mathbb{E}X_1 \\ \vdots \\ X_n - \mathbb{E}X_n \end{pmatrix}\begin{pmatrix} X_1 - \mathbb{E}X_1 & \cdots & X_n - \mathbb{E}X_n \end{pmatrix}\right] = \mathbb{E}\begin{pmatrix} (X_1-\mathbb{E}X_1)^2 & (X_1-\mathbb{E}X_1)(X_2-\mathbb{E}X_2) & \cdots & (X_1-\mathbb{E}X_1)(X_n-\mathbb{E}X_n) \\ (X_1-\mathbb{E}X_1)(X_2-\mathbb{E}X_2) & (X_2-\mathbb{E}X_2)^2 & \cdots & (X_2-\mathbb{E}X_2)(X_n-\mathbb{E}X_n) \\ \vdots & \vdots & \ddots & \vdots \\ (X_1-\mathbb{E}X_1)(X_n-\mathbb{E}X_n) & (X_2-\mathbb{E}X_2)(X_n-\mathbb{E}X_n) & \cdots & (X_n-\mathbb{E}X_n)^2 \end{pmatrix}$$
Covariance matrices are always positive semidefinite. (If $x'Ax \ge 0$ for all $x \in \Re^n$, $A$ is nonnegative definite, or positive semidefinite.) Here is a review of linear algebra.
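A small numerical illustration (mine, not from the notes): form the sample covariance matrix of some arbitrary data with np.cov and check that its eigenvalues are nonnegative up to floating-point error, which is what positive semidefiniteness means for a symmetric matrix. The seed and data dimensions are arbitrary.

import numpy as np

rng = np.random.default_rng(2)             # arbitrary seed
data = rng.standard_normal((3, 1000))      # 3 variables, 1000 observations
C = np.cov(data)                           # 3x3 sample covariance matrix
print(np.linalg.eigvalsh(C))               # all eigenvalues >= 0 (up to rounding)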
The notation $X \sim \mathcal{N}(\mu, \sigma^2)$ means that $X$ has a normal distribution with mean $\mu$ and variance $\sigma^2$. This distribution is continuous, with probability density function $$\frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}.$$
If $X \sim \mathcal{N}(\mu, \sigma^2)$, then $\frac{X-\mu}{\sigma} \sim \mathcal{N}(0,1)$, the standard normal distribution. The probability density function of the standard normal distribution is $$\phi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}.$$
## Plot the normal density and cdf
def plotNorm(mu, sigma):
    x = np.arange(mu-4*sigma, mu+4*sigma, 8*sigma/200)
    y = np.exp(-(x-mu)**2/(2*sigma**2))/(sigma*math.sqrt(2*math.pi))  # density written out for clarity
    Y = sp.stats.norm.cdf(x, loc=mu, scale=sigma)  # cdf, using scipy for convenience
    plt.plot(x, y, 'b-', x, Y, 'r-', linewidth=2)
    plt.plot((mu-4.1*sigma, mu+4.1*sigma), (0.5, 0.5), 'g--')  # horizontal line at 0.5
    plt.xlabel('$x$')  # axis labels. Can use LaTeX math markup
    plt.ylabel(r'$f(x) = \frac{1}{\sqrt{2\pi}\sigma} e^{-(x-\mu)^2/(2\sigma^2)}$; $F(x)$')
    plt.axis([mu-4.1*sigma, mu+4.1*sigma, 0, max(1.1, max(y))])  # axis limits
    plt.title(r'The $\mathcal{N}($' + str(mu) + ',' + str(sigma**2) + '$)$ density and cdf')
    plt.show()

interact(plotNorm, mu=(-5, 5, .05), sigma=(0.1, 10, .1))
A collection of random variables $\{X_1, X_2, \ldots, X_n\} = \{X_j\}_{j=1}^n$ is jointly normal if all linear combinations of those variables have normal distributions. That is, the collection is jointly normal if for all $\alpha \in \Re^n$, $\sum_{j=1}^n \alpha_j X_j$ has a normal distribution.
If $\{X_j\}_{j=1}^n$ are independent, normally distributed random variables, they are jointly normal.
If for some $\mu \in \Re^n$ and positive-definite matrix $G$, the joint density of $\{X_j\}_{j=1}^n$ is $$\left(\frac{1}{\sqrt{2\pi}}\right)^n \frac{1}{\sqrt{|G|}} \exp\left\{-\frac{1}{2}(x-\mu)'G^{-1}(x-\mu)\right\},$$ then $\{X_j\}_{j=1}^n$ are jointly normal, with mean vector $\mu$ and covariance matrix $G$.
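As a check of this formula (not in the original notes), the sketch below evaluates it directly at one point and compares it to scipy.stats.multivariate_normal.pdf; the mean vector, covariance matrix $G$, and evaluation point are arbitrary choices.

import numpy as np
from scipy import stats

mu = np.array([1.0, -2.0])                     # arbitrary mean vector
G = np.array([[2.0, 0.5], [0.5, 1.0]])         # arbitrary positive-definite covariance matrix
x = np.array([0.5, -1.0])                      # arbitrary point

n = len(mu)
quad = (x - mu) @ np.linalg.inv(G) @ (x - mu)
by_formula = (2*np.pi)**(-n/2) / np.sqrt(np.linalg.det(G)) * np.exp(-quad/2)
print(by_formula, stats.multivariate_normal.pdf(x, mean=mu, cov=G))   # should agree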
For an elementary discussion, see SticiGui: The Normal Curve, The Central Limit Theorem, and Markov's and Chebychev's Inequalities for Random Variables.
Suppose $\{X_j\}_{j=1}^{\infty}$ are independent and identically distributed (iid), have finite expected value $\mathbb{E}X_j = \mu$, and have finite variance $\mathrm{Var}\,X_j = \sigma^2$.
Define the sum $S_n \equiv \sum_{j=1}^n X_j$. Then $$\mathbb{E}S_n = \mathbb{E}\sum_{j=1}^n X_j = \sum_{j=1}^n \mathbb{E}X_j = \sum_{j=1}^n \mu = n\mu,$$ and, since the $\{X_j\}$ are independent, $\mathrm{Var}\,S_n = \sum_{j=1}^n \mathrm{Var}\,X_j = n\sigma^2$.
Define $$Z_n \equiv \frac{S_n - n\mu}{\sqrt{n}\,\sigma}.$$ Then for every $a, b \in \Re$ with $a \le b$, $$\lim_{n \to \infty} \mathbb{P}\{a \le Z_n \le b\} = \frac{1}{\sqrt{2\pi}}\int_a^b e^{-x^2/2}\,dx.$$
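A simulation sketch of this statement (not part of the notes), using iid exponential(1) summands so that $\mu = \sigma = 1$; the seed, $n$, the number of replications, and the interval $[a,b]$ are arbitrary. The empirical frequency of $\{a \le Z_n \le b\}$ should be close to $\Phi(b) - \Phi(a)$.

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)               # arbitrary seed
n, reps = 100, 50000                         # arbitrary sample size and replications
a, b = -1.0, 1.0                             # arbitrary interval

S = rng.exponential(scale=1.0, size=(reps, n)).sum(axis=1)
Z = (S - n*1.0)/(np.sqrt(n)*1.0)             # standardize: mu = sigma = 1
print(np.mean((Z >= a) & (Z <= b)))          # empirical probability
print(stats.norm.cdf(b) - stats.norm.cdf(a)) # normal limit, about 0.6827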
The conditional distribution of a random variable or random vector X given the event A is
$$\mathbb{P}_{X|A}(B) = \mathbb{P}\{X \in B \mid A\},$$ as a function of $B$, provided $\mathbb{P}(A) > 0$.
[To do]
[To do]
Conditional expectation is a random variable... The expectation of the conditional expectation is the unconditional expectation: $\mathbb{E}(\mathbb{E}(X|Y)) = \mathbb{E}X$. This is essentially another expression of the law of total probability.
[To do] Use random permutation of a list of numbers to illustrate: $\mathbb{E}X_j$, $\mathbb{E}(X_j \mid X_k = x)$, $\mathbb{E}(X_j \mid X_k)$, and $\mathbb{E}(\mathbb{E}(X_j \mid X_k)) = \mathbb{E}X_j$.
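A sketch of the illustration described in the to-do item (the list of numbers, the positions $j$ and $k$, and the seed are my choices): permute a fixed list repeatedly, let $X_j$ and $X_k$ be the values in positions $j$ and $k$, and compare $\mathbb{E}X_j$, $\mathbb{E}(X_j \mid X_k = x)$, and the average of the conditional expectations. Since $\mathbb{P}\{X_k = x\} = 1/5$ for each number in the list, averaging the conditional expectations with equal weights recovers $\mathbb{E}X_j$.

import numpy as np

rng = np.random.default_rng(4)                     # arbitrary seed
values = np.array([1, 2, 6, 8, 13])                # arbitrary list of numbers
j, k, reps = 0, 1, 100000                          # look at positions 0 and 1

perms = np.array([rng.permutation(values) for _ in range(reps)])
Xj, Xk = perms[:, j], perms[:, k]

print(Xj.mean(), values.mean())                    # E(X_j) = mean of the list
for x in values:                                   # E(X_j | X_k = x) = mean of the other numbers
    print(x, Xj[Xk == x].mean(), values[values != x].mean())
# E(E(X_j | X_k)) = E(X_j): average the conditional expectations, each with P{X_k = x} = 1/5
print(np.mean([Xj[Xk == x].mean() for x in values]), values.mean())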
Point processes formalize the notion of something occurring at a random place or time (or both).
For instance, imagine the radioactive decay of a mass of uranium; the particular times at which an atom decays are modeled well as a Poisson process (described below).
Point processes can be temporal or spatiotemporal. For a temporal Poisson process, the waiting times (inter-arrival times) between events are iid exponentially distributed. There are alternative characterizations.
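A minimal simulation sketch (mine, not from the notes) of a temporal Poisson process with rate $\lambda$ on $[0, T]$: cumulative sums of iid exponential($\lambda$) waiting times give the arrival times, and the number of arrivals in $[0, T]$ should be approximately Poisson with mean $\lambda T$ (so its mean and variance are both about $\lambda T$). The rate, horizon, replications, and seed are arbitrary.

import numpy as np

rng = np.random.default_rng(5)                     # arbitrary seed
lam, T, reps = 2.0, 10.0, 50000                    # arbitrary rate, horizon, replications
m = int(3*lam*T) + 20                              # enough waiting times to pass T with near certainty

waits = rng.exponential(scale=1/lam, size=(reps, m))   # iid exponential inter-arrival times
arrivals = np.cumsum(waits, axis=1)                    # arrival times of the process
counts = np.sum(arrivals <= T, axis=1)                 # the counting function N(T)
print(counts.mean(), counts.var(), lam*T)              # mean and variance both close to 20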
Temporal point processes: the counting function. [To Do]
[To Do]
[To Do]
[To Do]
[To Do]