11. Statistics Preliminaries
11.1. The Normal Distribution
We say that a real-valued random variable (RV) \(X\) is normally distributed with mean \(\mu\) and standard deviation \(\sigma\) if its probability density function (PDF) is:
\[
f(x) = \frac{1}{\sqrt{2 \pi \sigma^{2}}} e^{-\frac{(x - \mu)^{2}}{2 \sigma^{2}}},
\]
and we usually write \(X \sim \Normal(\mu, \sigma^{2}).\) The parameters \(\mu\) and \(\sigma\) are related to the first and second moments of \(X.\)
(Moments of the Normal Distribution)
The parameter \(\mu\) is the mean or expectation of \(X\) while \(\sigma\) denotes its standard deviation. The variance of \(X\) is given by \(\sigma^{2}.\)
Proof
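Let \(X = \mu + \sigma Z\) where \(Z \sim \Normal(0, 1)\). Start by defining \(f(z) = e^{-\frac{1}{2} z^{2}},\) which implies that \(f^{\prime}(z) = -z e^{-\frac{1}{2} z^{2}}\) and \(f^{\prime \prime}(z) = z^{2} e^{-\frac{1}{2} z^{2}} - e^{-\frac{1}{2} z^{2}}.\) We can then write:
\[
\ev(Z) = \int_{-\infty}^{\infty} z \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2} z^{2}} \, dz = -\frac{1}{\sqrt{2 \pi}} \int_{-\infty}^{\infty} f^{\prime}(z) \, dz = -\frac{1}{\sqrt{2 \pi}} \Big[ f(z) \Big]_{-\infty}^{\infty} = 0.
\]
Also, note that:
\[
\int_{-\infty}^{\infty} f^{\prime \prime}(z) \, dz = \Big[ f^{\prime}(z) \Big]_{-\infty}^{\infty} = 0.
\]
Then,
\[
\ev(Z^{2}) = \frac{1}{\sqrt{2 \pi}} \int_{-\infty}^{\infty} z^{2} e^{-\frac{1}{2} z^{2}} \, dz = \frac{1}{\sqrt{2 \pi}} \int_{-\infty}^{\infty} \left( f^{\prime \prime}(z) + f(z) \right) dz = 0 + 1 = 1,
\]
so that \(\var(Z) = \ev(Z^{2}) - \ev(Z)^{2} = 1.\)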
We can now compute \(\ev(X) = \mu + \sigma \ev(Z) = \mu\) and \(\var(X) = \sigma^{2} \var(Z) = \sigma^{2}\).
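The two moments used in the proof can also be checked numerically. The following is a minimal sketch that integrates the standard normal density with `scipy.integrate.quad`; the helper name `std_normal_pdf` and the values of `mu` and `sigma` are illustrative assumptions, not part of the text.

```python
import numpy as np
from scipy.integrate import quad


def std_normal_pdf(z):
    """Density of the standard normal distribution."""
    return np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)


# First and second moments of Z by numerical integration over the real line
mean_z, _ = quad(lambda z: z * std_normal_pdf(z), -np.inf, np.inf)
second_moment_z, _ = quad(lambda z: z**2 * std_normal_pdf(z), -np.inf, np.inf)
var_z = second_moment_z - mean_z**2

# Moments of X = mu + sigma * Z (mu and sigma are illustrative values)
mu, sigma = 10.0, 25.0
mean_x = mu + sigma * mean_z
var_x = sigma**2 * var_z
print(mean_x, var_x)
```

Up to numerical integration error, the printed values should be close to \(\mu\) and \(\sigma^{2}\).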
As with any real-valued random variable \(X,\) in order to compute the probability that \(X \leq x\) we need to integrate the density function from \(-\infty\) to \(x \colon\)
\[
\prob(X \leq x) = \int_{-\infty}^{x} f(u) \, du.
\]
The function \(F(x) = \prob(X \leq x)\) is called the cumulative distribution function of \(X\). The Leibniz integral rule implies that \(F^{\prime}(x) = f(x).\)
11.1.1. The Standard Normal Distribution
An important case of normally distributed random variables is when \(\mu = 0\) and \(\sigma = 1\). In this case we say that \(Z \sim \Normal(0, 1)\) has the standard normal distribution and its cumulative distribution function is usually denoted by the capital Greek letter \(\Phi\) (phi), and is defined by the integral:
\[
\cdf(z) = \prob(Z \leq z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2} u^{2}} \, du.
\]
Since the integral cannot be solved in closed-form, the probability must then be obtained from a table or using a computer. For example, in Python we can compute \(\cdf(-0.4)\) by typing the following:
```python
from scipy.stats import norm

# Standard normal CDF evaluated at z = -0.4
print(norm.cdf(-0.4))  # 0.3445782583896758
```
If you prefer to use Excel, type =norm.s.dist(-0.4,TRUE) in a cell, which yields the same answer.
11.1.2. Left-Tail Probability
Knowing how to compute or approximate \(\cdf(z)\) allows us to compute \(\prob(X \leq x)\) when \(X \sim \Normal(\mu, \sigma^{2}) \colon\)
\[
\prob(X \leq x) = \prob\left( \frac{X - \mu}{\sigma} \leq \frac{x - \mu}{\sigma} \right) = \cdf\left( \frac{x - \mu}{\sigma} \right),
\]
where \(Z = \dfrac{X - \mu}{\sigma} \sim \Normal(0, 1)\) is called a Z-score.
Suppose that \(X \sim \Normal(\mu, \sigma^{2})\) with \(\mu = 10\) and \(\sigma = 25.\) What is the probability that \(X \leq 0\)?
\[
\prob(X \leq 0) = \cdf\left( \frac{0 - 10}{25} \right) = \cdf(-0.40) = 0.3446.
\]
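The left-tail probability in this example can be computed in Python in two equivalent ways. This is a small sketch: one call standardizes by hand and the other passes the mean and standard deviation through the `loc` and `scale` arguments of `scipy.stats.norm`. The variable names are chosen here for illustration.

```python
from scipy.stats import norm

mu, sigma = 10.0, 25.0
x = 0.0

# Route 1: standardize first, then evaluate the standard normal CDF
z = (x - mu) / sigma
p_via_z = norm.cdf(z)

# Route 2: let scipy standardize through loc (mean) and scale (std. deviation)
p_direct = norm.cdf(x, loc=mu, scale=sigma)

print(round(p_via_z, 4), round(p_direct, 4))
```

Both routes return the same number, so the choice between them is mostly a matter of readability.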
11.1.3. Right-Tail Probability
For a random variable \(X,\) the right-tail probability is defined as \(\prob(X > x).\) Since \(\prob(X \leq x) + \prob(X > x) = 1,\) we have that:
\[
\prob(X > x) = 1 - \prob(X \leq x).
\]
Suppose that \(X \sim \Normal(\mu, \sigma^{2})\) with \(\mu = 10\) and \(\sigma = 25\). What is the probability that \(X > 12\)?
\[
\prob(X \leq 12) = \cdf\left( \frac{12 - 10}{25} \right) = \cdf(0.08) = 0.5319.
\]
Therefore, \(\prob(X > 12) = 1 - 0.5319 = 0.4681.\)
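As a quick check of the right-tail calculation, the sketch below computes the same probability as one minus the CDF and with the survival function `norm.sf`, which `scipy.stats` provides for exactly this purpose. Variable names are illustrative.

```python
from scipy.stats import norm

mu, sigma = 10.0, 25.0
x = 12.0

# Right tail as the complement of the left tail
p_right = 1.0 - norm.cdf(x, loc=mu, scale=sigma)

# Same quantity through the survival function sf(x) = 1 - cdf(x)
p_right_sf = norm.sf(x, loc=mu, scale=sigma)

print(round(p_right, 4), round(p_right_sf, 4))
```

For thresholds far in the right tail, `norm.sf` tends to be the numerically safer choice because it avoids subtracting a number very close to one.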
11.1.4. Interval Probability
The probability that a random variable \(X\) falls within an interval \((x_{1}, x_{2}]\) is given by \(\prob(x_{1} < X \leq x_{2}) = \prob(X \leq x_{2}) - \prob(X \leq x_{1}).\)
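Suppose that \(X \sim \Normal(\mu, \sigma^{2})\) with \(\mu = 10\) and \(\sigma = 25\). What is the probability that \(2 < X \leq 14\)?
\[
\begin{aligned}
\prob(X \leq 14) &= \cdf\left( \frac{14 - 10}{25} \right) = \cdf(0.16) = 0.5636, \\
\prob(X \leq 2) &= \cdf\left( \frac{2 - 10}{25} \right) = \cdf(-0.32) = 0.3745.
\end{aligned}
\]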
Therefore, \(\prob(2 < X \leq 14) = 0.5636 - 0.3745 = 0.1891\).
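The interval probability is just a difference of two CDF values, which the following sketch reproduces in Python. The variable names are illustrative.

```python
from scipy.stats import norm

mu, sigma = 10.0, 25.0
x1, x2 = 2.0, 14.0

# CDF at each endpoint of the interval (x1, x2]
p_upper = norm.cdf(x2, loc=mu, scale=sigma)
p_lower = norm.cdf(x1, loc=mu, scale=sigma)

# Probability of landing inside the interval
p_interval = p_upper - p_lower
print(round(p_upper, 4), round(p_lower, 4), round(p_interval, 4))
```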
11.1.5. Percentiles
For a standard normal variable \(Z\), a right-tail percentile is the value \(z_{\alpha}\) above which we obtain a certain probability \(\alpha.\) Mathematically, this means finding \(z_{\alpha}\) such that:
\[
\prob(Z > z_{\alpha}) = \alpha.
\]
This implies that \(\cdf(z_{\alpha}) = 1 - \alpha\), or \(z_{\alpha} = \cdf^{-1}(1 - \alpha)\), where \(\cdf^{-1}(\cdot)\) denotes the inverse function of \(\cdf(\cdot)\). Again, there is no closed-form expression for this function and we need a computer to obtain the values. For example, say that \(\alpha = 0.025\). In Python we could compute \(z_{\alpha} = \cdf^{-1}(0.975)\) by using the function ppf included in scipy.stats.norm as follows:
```python
from scipy.stats import norm

# Inverse of the standard normal CDF (percent point function) at 0.975
print(norm.ppf(0.975))  # 1.959963984540054
```
In Excel the function =norm.s.inv(0.975) provides the same result.
The following table shows common values for \(z_{\alpha}\):
| \(\boldsymbol{\alpha}\) | \(\boldsymbol{z_{\alpha}}\) |
|---|---|
| 0.050 | 1.64 |
| 0.025 | 1.96 |
| 0.010 | 2.33 |
| 0.005 | 2.58 |
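The entries of this table can be regenerated with `norm.ppf`. The sketch below loops over the same values of \(\alpha\); the dictionary name `z_values` is an illustrative choice.

```python
from scipy.stats import norm

alphas = [0.050, 0.025, 0.010, 0.005]

# Right-tail percentile: z_alpha solves P(Z > z_alpha) = alpha
z_values = {alpha: norm.ppf(1.0 - alpha) for alpha in alphas}

for alpha, z_alpha in z_values.items():
    print(f"{alpha:.3f}  {z_alpha:.4f}")
```

Printing four decimals shows how much rounding is hidden in the two-decimal values of the table.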
A \((1 - \alpha)\) two-sided confidence interval (CI) defines left and right percentiles such that the probability on each side is \(\alpha/2\). For a standard normal variable \(Z\), the symmetry of its PDF implies:
\[
\prob(-z_{\alpha/2} < Z \leq z_{\alpha/2}) = 1 - \alpha.
\]
Since \(z_{2.5\%} = 1.96\), the 95% confidence interval of \(Z\) is \([-1.96, 1.96]\). This means that if we randomly sample this variable 100,000 times, approximately 95,000 observations will fall inside this interval.
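The sampling statement above can be illustrated with a short simulation. This is a sketch under stated assumptions: the seed is arbitrary, and the exact count inside the interval will vary from run to run around 95,000.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(seed=0)  # arbitrary seed, used only for reproducibility
n = 100_000
z = rng.standard_normal(n)

# Critical value for a 95% two-sided interval
z_crit = norm.ppf(0.975)

# Count the draws that fall inside (-z_crit, z_crit]
inside = int(np.sum((z > -z_crit) & (z <= z_crit)))
share_inside = inside / n
print(inside, share_inside)
```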
If \(X \sim \Normal(\mu, \sigma^{2})\), its confidence interval is determined by \(\xi\) and \(\zeta\) such that:
\[
\prob(\xi < X \leq \zeta) = \prob\left( \frac{\xi - \mu}{\sigma} < Z \leq \frac{\zeta - \mu}{\sigma} \right) = 1 - \alpha,
\]
which implies that \(-z_{\alpha/2} = \tfrac{\xi - \mu}{\sigma}\) and \(z_{\alpha/2} = \tfrac{\zeta - \mu}{\sigma}\). The \((1 - \alpha)\) confidence interval for \(X\) is then \([\mu - z_{\alpha/2}\sigma, \mu + z_{\alpha/2}\sigma]\).
Suppose that \(X \sim \Normal(\mu, \sigma^{2})\) with \(\mu = 10\) and \(\sigma = 25\). Since \(z_{2.5\%} = 1.96\), the 95% confidence interval of \(X\) is:
\[
[10 - 1.96 \times 25, \; 10 + 1.96 \times 25] = [-39.00, 59.00].
\]
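The same interval can be obtained in Python, either from the formula \([\mu - z_{\alpha/2}\sigma, \mu + z_{\alpha/2}\sigma]\) or from `norm.interval`, which returns the central interval for a given probability. The sketch below does both; variable names are illustrative.

```python
from scipy.stats import norm

mu, sigma = 10.0, 25.0
alpha = 0.05

# Formula-based interval using the unrounded critical value
z_crit = norm.ppf(1.0 - alpha / 2.0)
lower = mu - z_crit * sigma
upper = mu + z_crit * sigma

# Equivalent built-in: central interval containing 1 - alpha of the probability
lower_builtin, upper_builtin = norm.interval(1.0 - alpha, loc=mu, scale=sigma)

print(round(lower, 2), round(upper, 2))
```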
11.2. The Lognormal Distribution
If \(X \sim \Normal(\mu, \sigma^{2})\), then \(Y = e^{X}\) is said to be lognormally distributed with the same parameters. The PDF of a lognormally distributed random variable \(Y\) can be obtained from the PDF of \(X\).
(Lognormal Density)
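If \(Y\) is lognormally distributed with parameters \(\mu\) and \(\sigma^{2}\), the PDF of \(Y\) is given by:
\[
f(y) = \frac{1}{y \sqrt{2 \pi \sigma^{2}}} e^{-\frac{(\ln(y) - \mu)^{2}}{2 \sigma^{2}}}, \quad y > 0.
\]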
Proof
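Let \(Y = e^{X}\) where \(X = \mu + \sigma Z\) and \(Z \sim \Normal(0, 1)\). Then,
\[
\prob(Y \leq y) = \prob(e^{X} \leq y) = \prob(X \leq \ln(y)) = \int_{-\infty}^{\ln(y)} \frac{1}{\sqrt{2 \pi \sigma^{2}}} e^{-\frac{(x - \mu)^{2}}{2 \sigma^{2}}} \, dx.
\]
Let's define \(z = e^{x}\). This implies that \(x = \ln(z)\), which in turn implies that \(dx = (1 / z) dz\). Therefore,
\[
\prob(Y \leq y) = \int_{0}^{y} \frac{1}{z \sqrt{2 \pi \sigma^{2}}} e^{-\frac{(\ln(z) - \mu)^{2}}{2 \sigma^{2}}} \, dz.
\]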
Thus, the integrand of the previous expression is the probability density function of \(Y\).
Unlike the normal density, the lognormal density function is not symmetric around its mean. Normally distributed variables can take values in \((-\infty, \infty)\), whereas lognormally distributed variables are always positive.
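To make the lognormal density concrete, the sketch below codes the standard closed-form lognormal PDF by hand and compares it with `scipy.stats.lognorm`. Two things here are assumptions for illustration: the helper name `lognormal_pdf` and the parameter values. Note that scipy parameterizes the distribution with `s` equal to \(\sigma\) and `scale` equal to \(e^{\mu}\).

```python
import numpy as np
from scipy.stats import lognorm


def lognormal_pdf(y, mu, sigma):
    """Closed-form lognormal density, valid for y > 0."""
    y = np.asarray(y, dtype=float)
    exponent = -((np.log(y) - mu) ** 2) / (2.0 * sigma**2)
    return np.exp(exponent) / (y * sigma * np.sqrt(2.0 * np.pi))


mu, sigma = 4.0, 1.5  # illustrative parameters
y = np.array([0.5, 10.0, 100.0, 1000.0])

pdf_formula = lognormal_pdf(y, mu, sigma)
pdf_scipy = lognorm.pdf(y, s=sigma, scale=np.exp(mu))

print(np.allclose(pdf_formula, pdf_scipy))
```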
11.2.1. Computing Probabilities
We can use the fact that the logarithm of a lognormal random variable is normally distributed to compute cumulative probabilities.
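Let \(Y = e^{4 + 1.5 Z}\) where \(Z \sim \Normal(0, 1)\). What is the probability that \(Y \leq 100\)?
\[
\prob(Y \leq 100) = \prob(4 + 1.5 Z \leq \ln(100)) = \cdf\left( \frac{\ln(100) - 4}{1.5} \right) = \cdf(0.4034) = 0.6567.
\]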
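This probability can be computed in Python by taking logs and using the standard normal CDF. The sketch below also cross-checks the result against `scipy.stats.lognorm`, where `s` is \(\sigma\) and `scale` is \(e^{\mu}\); variable names are illustrative.

```python
import math
from scipy.stats import norm, lognorm

mu, sigma = 4.0, 1.5
y = 100.0

# Take logs: P(Y <= y) = P(X <= ln y) with X ~ Normal(mu, sigma^2)
z = (math.log(y) - mu) / sigma
p_y = norm.cdf(z)

# Cross-check with scipy's lognormal distribution
p_check = lognorm.cdf(y, s=sigma, scale=math.exp(mu))

print(round(z, 4), round(p_y, 4))
```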
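11.2.2. Confidence Interval
Let \(Y = e^{\mu + \sigma Z}\) where \(Z \sim \Normal(0, 1)\). We have that:
\[
\prob\left( e^{\mu - \sigma z_{\alpha/2}} < Y \leq e^{\mu + \sigma z_{\alpha/2}} \right) = \prob(-z_{\alpha/2} < Z \leq z_{\alpha/2}) = 1 - \alpha.
\]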
The \((1 - \alpha)\) confidence interval for \(Y\) is \([e^{\mu - \sigma z_{\alpha/2}}, e^{\mu + \sigma z_{\alpha/2}}]\).
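Let \(Y = e^{4 + 1.5 Z}\) where \(Z \sim \Normal(0, 1)\). The 95% confidence interval for \(Y\) is:
\[
[e^{4 - 1.5 \times 1.96}, \; e^{4 + 1.5 \times 1.96}] = [e^{1.06}, e^{6.94}] \approx [2.89, 1032.77].
\]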
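A short sketch of the same calculation in Python follows. It exponentiates the endpoints of the normal interval and then verifies that the resulting interval covers 95% of the lognormal probability. Variable names are illustrative.

```python
import numpy as np
from scipy.stats import norm, lognorm

mu, sigma = 4.0, 1.5
alpha = 0.05

# Exponentiate the endpoints of the interval for X = ln(Y)
z_crit = norm.ppf(1.0 - alpha / 2.0)
lower = np.exp(mu - sigma * z_crit)
upper = np.exp(mu + sigma * z_crit)

# Check the coverage under the lognormal distribution of Y
dist = lognorm(s=sigma, scale=np.exp(mu))
coverage = dist.cdf(upper) - dist.cdf(lower)

print(round(lower, 2), round(upper, 2), round(coverage, 4))
```

Because the upper endpoint is an exponential of a large number, it is sensitive to rounding in the critical value: using the unrounded \(z_{\alpha/2}\) rather than 1.96 shifts it slightly.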
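11.2.3. Moments
(Moments of a Lognormal Distribution)
Let \(Y = e^{\mu + \sigma Z}\) where \(Z \sim \Normal(0, 1)\). We have that:
\[
\ev(Y) = e^{\mu + \frac{1}{2} \sigma^{2}}.
\]
Proof
\[
\begin{aligned}
\ev(Y) &= \int_{-\infty}^{\infty} e^{\mu + \sigma z} \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2} z^{2}} \, dz \\
&= e^{\mu + \frac{1}{2} \sigma^{2}} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2} (z - \sigma)^{2}} \, dz \\
&= e^{\mu + \frac{1}{2} \sigma^{2}},
\end{aligned}
\]
since the last integrand is the density of a normal random variable with mean \(\sigma\) and unit variance, which integrates to one.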
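Writing \(X = \ln(Y) \sim \Normal(\mu, \sigma^{2})\) and using the fact that \(\alpha X \sim \Normal(\alpha \mu, (\alpha \sigma)^{2})\), it is also possible to compute the expectation of powers of lognormally distributed variables:
\[
\ev(Y^{\alpha}) = \ev(e^{\alpha X}) = e^{\alpha \mu + \frac{1}{2} \alpha^{2} \sigma^{2}}.
\]
This is useful to compute the variance and standard deviation of \(Y\):
\[
\begin{aligned}
\var(Y) &= \ev(Y^{2}) - \ev(Y)^{2} = e^{2 \mu + 2 \sigma^{2}} - e^{2 \mu + \sigma^{2}} = e^{2 \mu + \sigma^{2}} \left( e^{\sigma^{2}} - 1 \right), \\
\stdev(Y) &= e^{\mu + \frac{1}{2} \sigma^{2}} \sqrt{e^{\sigma^{2}} - 1}.
\end{aligned}
\]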
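Let \(Y = e^{4 + 1.5 Z}\) where \(Z \sim \Normal(0, 1)\). The expectation and standard deviation of \(Y\) are:
\[
\begin{aligned}
\ev(Y) &= e^{4 + \frac{1}{2} (1.5)^{2}} = e^{5.125} = 168.17, \\
\stdev(Y) &= \sqrt{\ev(Y^{2}) - \ev(Y)^{2}} = \sqrt{e^{12.5} - e^{10.25}} = 489.95.
\end{aligned}
\]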
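The moments in this example can be computed from the formula \(\ev(Y^{\alpha}) = e^{\alpha \mu + \frac{1}{2} \alpha^{2} \sigma^{2}}\) and compared with the values reported by `scipy.stats.lognorm`. The helper name `lognormal_power_mean` is a hypothetical name introduced for this sketch.

```python
import numpy as np
from scipy.stats import lognorm


def lognormal_power_mean(power, mu, sigma):
    """E(Y**power) for Y = exp(mu + sigma * Z) with Z standard normal."""
    return np.exp(power * mu + 0.5 * power**2 * sigma**2)


mu, sigma = 4.0, 1.5

mean_y = lognormal_power_mean(1, mu, sigma)
second_moment_y = lognormal_power_mean(2, mu, sigma)
sd_y = np.sqrt(second_moment_y - mean_y**2)

# Cross-check against scipy (s = sigma, scale = exp(mu))
mean_scipy = lognorm.mean(s=sigma, scale=np.exp(mu))
sd_scipy = lognorm.std(s=sigma, scale=np.exp(mu))

print(round(mean_y, 2), round(sd_y, 2))
```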
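11.2.4. Partial Expectations
When pricing a call option, the payoff is positive if the option is in-the-money and zero otherwise. We usually use an indicator function to quantify this behavior. For a threshold \(K > 0,\) we define:
\[
\mathbf{1}_{\{Y > K\}} =
\begin{cases}
1 & \text{if } Y > K, \\
0 & \text{otherwise.}
\end{cases}
\]
(Partial Expectations)
Let \(Y = e^{X}\) where \(X \sim \Normal(\mu, \sigma^{2})\). Then we have that:
\[
\begin{aligned}
\ev\left( \mathbf{1}_{\{Y > K\}} \right) &= \cdf\left( \frac{\mu - \ln(K)}{\sigma} \right), \\
\ev\left( Y \mathbf{1}_{\{Y > K\}} \right) &= e^{\mu + \frac{1}{2} \sigma^{2}} \cdf\left( \frac{\mu + \sigma^{2} - \ln(K)}{\sigma} \right).
\end{aligned}
\]
Proof
The first expectation can be computed as:
\[
\ev\left( \mathbf{1}_{\{Y > K\}} \right) = \prob(Y > K) = \prob(X > \ln(K)) = \prob\left( Z > \frac{\ln(K) - \mu}{\sigma} \right) = \cdf\left( \frac{\mu - \ln(K)}{\sigma} \right),
\]
where the last step uses the symmetry of the standard normal distribution.
The second expectation yields:
\[
\begin{aligned}
\ev\left( Y \mathbf{1}_{\{Y > K\}} \right) &= \int_{\frac{\ln(K) - \mu}{\sigma}}^{\infty} e^{\mu + \sigma z} \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2} z^{2}} \, dz \\
&= e^{\mu + \frac{1}{2} \sigma^{2}} \int_{\frac{\ln(K) - \mu}{\sigma}}^{\infty} \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2} (z - \sigma)^{2}} \, dz \\
&= e^{\mu + \frac{1}{2} \sigma^{2}} \int_{\frac{\ln(K) - \mu}{\sigma} - \sigma}^{\infty} \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2} u^{2}} \, du \\
&= e^{\mu + \frac{1}{2} \sigma^{2}} \cdf\left( \frac{\mu + \sigma^{2} - \ln(K)}{\sigma} \right).
\end{aligned}
\]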
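As a numerical illustration, the sketch below assumes the partial expectations take their standard closed form for a threshold \(K\), namely \(\prob(Y > K) = \cdf\left(\frac{\mu - \ln(K)}{\sigma}\right)\) and \(\ev(Y \cdot 1_{Y > K}) = e^{\mu + \frac{1}{2}\sigma^{2}} \cdf\left(\frac{\mu + \sigma^{2} - \ln(K)}{\sigma}\right)\), and checks both against direct numerical integration. The symbol `K`, its value, and the parameter values are assumptions made for this example.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

mu, sigma = 4.0, 1.5  # illustrative parameters
K = 100.0             # hypothetical threshold (e.g. a strike price)

# Assumed closed-form expressions
prob_above = norm.cdf((mu - np.log(K)) / sigma)
partial_exp = np.exp(mu + 0.5 * sigma**2) * norm.cdf((mu + sigma**2 - np.log(K)) / sigma)

# Direct integration over z, where Y = exp(mu + sigma * z) exceeds K for z > z_k
z_k = (np.log(K) - mu) / sigma
prob_numeric, _ = quad(norm.pdf, z_k, np.inf)


def weighted_density(z):
    """exp(mu + sigma*z) times the standard normal density, in one exponent."""
    return np.exp(mu + sigma * z - 0.5 * z**2) / np.sqrt(2.0 * np.pi)


partial_numeric, _ = quad(weighted_density, z_k, np.inf)

print(prob_above, prob_numeric)
print(partial_exp, partial_numeric)
```

Combining the payoff and the density into a single exponent inside `weighted_density` avoids multiplying a very large number by a very small one when the integrator probes large values of `z`.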
11.3. Practice Problems
Suppose that \(X\) is a normally distributed random variable with mean \(\mu=12\) and standard deviation \(\sigma=20\).
What is the probability that \(X \leq 0\)?
What is the probability that \(X \leq -4\)?
What is the probability that \(X > 8\)?
What is the probability that \(4 < X \leq 10\)?
Solution to Exercise 11.1
\(\prob(X \leq 0) = \cdf(\frac{0-12}{20}) = \cdf(-0.60) = 0.2743\).
\(\prob(X \leq -4) = \cdf(\frac{-4-12}{20}) = \cdf(-0.80) = 0.2119\).
\(\prob(X > 8) = 1 - \prob(X \leq 8) = 1 - \cdf(\frac{8-12}{20}) = 1 - \cdf(-0.20) = 0.5793\).
\(\prob(4 < X \leq 10) = \prob(X \leq 10) - \prob(X \leq 4) = \cdf(\frac{10-12}{20}) - \cdf(\frac{4-12}{20}) = \cdf(-0.10) - \cdf(-0.40) = 0.1156\).
Suppose that \(X\) is a normally distributed random variable with mean \(\mu=10\) and standard deviation \(\sigma=20\). Compute the 90%, 95%, and 99% confidence intervals for \(X\).
Solution to Exercise 11.2
The \((1-\alpha)\) confidence interval (CI) for \(X\) is given by \([\mu - z_{\alpha/2} \sigma, \mu + z_{\alpha/2} \sigma]\) where \(z_{\alpha/2} = \cdf^{-1}(1-\alpha/2)\). For example, if you want to compute the \(z\)-level corresponding to the 90% confidence interval, then \(\alpha = 0.10\) and \(\alpha/2 = 0.05\), so to compute \(z_{0.05}\) you need to type in Excel =norm.s.inv(0.95). The intervals below are computed with the unrounded values of \(z_{\alpha/2}\).
\(z_{0.05} = \cdf^{-1}(0.95) = 1.64\) so the 90% CI for \(X\) is \([-22.90, 42.90]\).
\(z_{0.025} = \cdf^{-1}(0.975) = 1.96\) so the 95% CI for \(X\) is \([-29.20, 49.20]\).
\(z_{0.005} = \cdf^{-1}(0.995) = 2.58\) so the 99% CI for \(X\) is \([-41.52, 61.52]\).
Suppose that \(X=\ln(Y)\) is a normally distributed random variable with mean \(\mu=3.9\) and standard deviation \(\sigma=15\).
What is the probability that \(Y \leq 6\)?
What is the probability that \(Y > 4\)?
What is the probability that \(3 < Y \leq 12\)?
What is the probability that \(Y \leq 0\)?
Solution to Exercise 11.3
\(\prob(Y \leq 6) = \prob(X \leq \ln(6)) = \cdf(\frac{\ln(6)-3.9}{15}) = \cdf(-0.1405) = 0.4441\).
\(\prob(Y > 4) = 1 - \prob(Y \leq 4) = 1 - \prob(X \leq \ln(4)) = 1 - \cdf(\frac{\ln(4)-3.9}{15}) = 1 - \cdf(-0.1676) = 0.5665\).
\(\prob(3 < Y \leq 12) = \prob(Y \leq 12) - \prob(Y \leq 3) = \cdf(\frac{\ln(12)-3.9}{15}) - \cdf(\frac{\ln(3)-3.9}{15}) = \cdf(-0.0943) - \cdf(-0.1868) = 0.4624 - 0.4259 = 0.0365\).
\(\prob(Y \leq 0) = \prob(X \leq -\infty) = 0\).
Suppose that \(X=\ln(Y)\) is a normally distributed random variable with mean \(\mu=2.7\) and standard deviation \(\sigma=1\). Compute the 90%, 95%, and 99% confidence intervals for \(X\) and report the corresponding values for \(Y\).
Solution to Exercise 11.4
The \((1 - \alpha)\) confidence interval (CI) for \(X\) is given by \([\mu - z_{\alpha/2} \sigma, \mu + z_{\alpha/2} \sigma]\). Remember that to compute \(z_{\alpha/2}\) we use in Excel =norm.s.inv(1-alpha/2). The corresponding interval for \(Y\) is then \([e^{\mu - z_{\alpha/2} \sigma}, e^{\mu + z_{\alpha/2} \sigma}]\).
\(z_{0.05} = 1.64\) so the 90% CI for \(X\) is \([1.06, 4.34]\), and the corresponding values for \(Y\) are \([2.87, 77.08]\).
\(z_{0.025} = 1.96\) so the 95% CI for \(X\) is \([0.74, 4.66]\), and the corresponding values for \(Y\) are \([2.10, 105.63]\).
\(z_{0.005} = 2.58\) so the 99% CI for \(X\) is \([0.12, 5.28]\), and the corresponding values for \(Y\) are \([1.13, 195.55]\).
Let \(Y = e^{\mu + \sigma Z}\) where \(\mu = 1\), \(\sigma = 2\) and \(Z \sim \Normal(0, 1)\). Compute:
\(\ev(Y)\)
\(\stdev(Y) = \sqrt{\ev(Y^{2}) - \ev(Y)^{2}}\)
\(\ev(Y^{0.3})\)
\(\ev(Y^{-1})\)
Solution to Exercise 11.5
In some of the questions we use the fact that if \(X \sim \Normal(\mu, \sigma^{2})\), then \(\alpha X \sim \Normal(\alpha\mu, \alpha^{2}\sigma^{2})\), which implies that \(\ev(Y^{\alpha}) = \ev(e^{\alpha X}) = e^{\alpha\mu+\frac{1}{2}\alpha^{2}\sigma^{2}}\).
\(\ev(Y) = e^{1+\frac{1}{2}2^{2}} = 20.09\).
\(\ev(Y^{2}) = e^{(2)(1)+\frac{1}{2}(2)^{2}2^{2}} = 22026.47\) and \(\left(\ev(Y)\right)^{2} = \left(e^{3}\right)^{2} = 403.43\), so \(\stdev(Y) = \sqrt{22026.47-403.43} = 147.05\).
\(\ev(Y^{0.3}) = e^{(0.3)(1)+\frac{1}{2}(0.3)^{2}2^{2}} = 1.62\).
\(\ev(Y^{-1}) = e^{(-1)(1)+\frac{1}{2}(-1)^{2}2^{2}} = 2.72\).