11. Statistics Preliminaries
11.1. The Normal Distribution
Fig. 11.1 The figure shows the density function of a normally distributed random variable with mean \(\mu\) and standard deviation \(\sigma.\)
We say that a real-valued random variable (RV) \(X\) is normally distributed with mean \(\mu\) and standard deviation \(\sigma\) if its probability density function (PDF) is:
\[f(x) = \frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{(x - \mu)^{2}}{2 \sigma^{2}}},\]
and we usually write \(X \sim \Normal(\mu, \sigma^{2}).\) The parameters \(\mu\) and \(\sigma\) are related to the first and second moments of \(X.\)
Property 11.1 (Moments of the Normal Distribution)
The parameter \(\mu\) is the mean or expectation of \(X\), while \(\sigma\) denotes its standard deviation. The variance of \(X\) is given by \(\sigma^{2}.\)
Proof
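Let \(X = \mu + \sigma Z\) where \(Z \sim \Normal(0, 1)\). Start by defining \(f(z) = e^{-\frac{1}{2} z^{2}},\) which implies that \(f^{\prime}(z) = -z e^{-\frac{1}{2} z^{2}}\) and \(f^{\prime \prime}(z) = z^{2} e^{-\frac{1}{2} z^{2}} - e^{-\frac{1}{2} z^{2}}.\) We can then write:
\[\ev(Z) = \frac{1}{\sqrt{2 \pi}} \int_{-\infty}^{\infty} z e^{-\frac{1}{2} z^{2}} \, dz = -\frac{1}{\sqrt{2 \pi}} \int_{-\infty}^{\infty} f^{\prime}(z) \, dz = -\frac{1}{\sqrt{2 \pi}} \Big[ f(z) \Big]_{-\infty}^{\infty} = 0.\]
Also, note that:
\[\int_{-\infty}^{\infty} f^{\prime \prime}(z) \, dz = \Big[ f^{\prime}(z) \Big]_{-\infty}^{\infty} = 0 \quad \text{and} \quad \int_{-\infty}^{\infty} f(z) \, dz = \sqrt{2 \pi}.\]
Then,
\[\ev(Z^{2}) = \frac{1}{\sqrt{2 \pi}} \int_{-\infty}^{\infty} z^{2} e^{-\frac{1}{2} z^{2}} \, dz = \frac{1}{\sqrt{2 \pi}} \left( \int_{-\infty}^{\infty} f^{\prime \prime}(z) \, dz + \int_{-\infty}^{\infty} f(z) \, dz \right) = 1,\]
so that \(\var(Z) = \ev(Z^{2}) - \ev(Z)^{2} = 1.\)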
We can now compute \(\ev(X) = \mu + \sigma \ev(Z) = \mu\) and \(\var(X) = \sigma^{2} \var(Z) = \sigma^{2}\).
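As a numerical sanity check of Property 11.1, the following sketch integrates the standard normal density to recover \(\ev(Z)\) and \(\ev(Z^{2})\), and then rescales to \(X = \mu + \sigma Z\). The values \(\mu = 10\) and \(\sigma = 25\) are illustrative assumptions, and the integration relies on `scipy.integrate.quad`.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Moments of the standard normal Z, obtained by numerical integration
ez, _ = quad(lambda z: z * norm.pdf(z), -np.inf, np.inf)       # E(Z)
ez2, _ = quad(lambda z: z**2 * norm.pdf(z), -np.inf, np.inf)   # E(Z^2)
var_z = ez2 - ez**2

# Illustrative parameters (assumed here, not prescribed by the property)
mu, sigma = 10.0, 25.0
mean_x = mu + sigma * ez      # E(X) = mu + sigma * E(Z)
var_x = sigma**2 * var_z      # var(X) = sigma^2 * var(Z)

print(mean_x, var_x)  # close to 10 and 625
```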
As with any real-valued random variable \(X,\) in order to compute the probability that \(X \leq x\) we need to integrate the density function \(f\) from \(-\infty\) to \(x \colon\)
\[\prob(X \leq x) = \int_{-\infty}^{x} f(u) \, du.\]
The function \(F(x) = \prob(X \leq x)\) is called the cumulative distribution function of \(X\). The Leibniz integral rule implies that \(F^{\prime}(x) = f(x).\)
11.1.1. The Standard Normal Distribution
Fig. 11.2 The blue shaded area represents \(\cdf(z).\)
An important case of normally distributed random variables is when \(\mu = 0\) and \(\sigma = 1\). In this case we say that \(Z \sim \Normal(0, 1)\) has the standard normal distribution, and its cumulative distribution function is usually denoted by the capital Greek letter \(\Phi\) (phi), defined by the integral:
\[\cdf(z) = \frac{1}{\sqrt{2 \pi}} \int_{-\infty}^{z} e^{-\frac{1}{2} u^{2}} \, du.\]
Since this integral has no closed-form solution, the probability must be obtained from a table or computed numerically. For example, in Python we can compute \(\cdf(-0.4)\) by typing the following:
from scipy.stats import norm
norm.cdf(-0.4)
0.3445782583896758
If you prefer to use Excel, type =norm.s.dist(-0.4,TRUE) in a cell, which yields the same answer.
11.1.2. Left-Tail Probability
Knowing how to compute or approximate \(\cdf(z)\) allows us to compute \(\prob(X \leq x)\) when \(X \sim \Normal(\mu, \sigma^{2}) \colon\)
\[\prob(X \leq x) = \prob\left(\frac{X - \mu}{\sigma} \leq \frac{x - \mu}{\sigma}\right) = \cdf\left(\frac{x - \mu}{\sigma}\right),\]
where \(Z = \dfrac{X - \mu}{\sigma} \sim \Normal(0, 1)\) is called a Z-score.
Example 11.1
Suppose that \(X \sim \Normal(\mu, \sigma^{2})\) with \(\mu = 10\) and \(\sigma = 25.\) What is the probability that \(X \leq 0\)?
\[\prob(X \leq 0) = \cdf\left(\frac{0 - 10}{25}\right) = \cdf(-0.40) = 0.3446.\]
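A short sketch of how Example 11.1 could be checked in Python. It computes the left-tail probability twice: once by standardizing to a Z-score, and once by passing the mean and standard deviation through the `loc` and `scale` arguments of `scipy.stats.norm.cdf`. The variable names are illustrative.

```python
from scipy.stats import norm

mu, sigma = 10, 25

# Route 1: standardize, then use the standard normal CDF
z = (0 - mu) / sigma
p_standardized = norm.cdf(z)

# Route 2: let SciPy standardize via loc (mean) and scale (standard deviation)
p_direct = norm.cdf(0, loc=mu, scale=sigma)

print(round(p_standardized, 4), round(p_direct, 4))  # 0.3446 0.3446
```

Both routes agree because `loc` and `scale` apply exactly the Z-score transformation before evaluating \(\cdf(\cdot)\).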
11.1.3. Right-Tail Probability
Fig. 11.3 The right-tail probability is the probability of the whole distribution, which is one, minus the left-tail probability.
For a random variable \(X,\) the right-tail probability is defined as \(\prob(X > x).\) Since \(\prob(X \leq x) + \prob(X > x) = 1,\) we have that:
\[\prob(X > x) = 1 - \prob(X \leq x).\]
Example 11.2
Suppose that \(X \sim \Normal(\mu, \sigma^{2})\) with \(\mu = 10\) and \(\sigma = 25\). What is the probability that \(X > 12\)?
\[\prob(X \leq 12) = \cdf\left(\frac{12 - 10}{25}\right) = \cdf(0.08) = 0.5319.\]
Therefore, \(\prob(X > 12) = 1 - 0.5319 = 0.4681.\)
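The right-tail calculation in Example 11.2 can be sketched in Python as follows. Besides the complement of the CDF, SciPy exposes the survival function `norm.sf`, which returns \(1 - F(x)\) directly and tends to be more accurate far out in the tail.

```python
from scipy.stats import norm

mu, sigma = 10, 25

# Complement of the left-tail probability
p_complement = 1 - norm.cdf(12, loc=mu, scale=sigma)

# Survival function: P(X > x) computed directly
p_sf = norm.sf(12, loc=mu, scale=sigma)

print(round(p_complement, 4), round(p_sf, 4))  # 0.4681 0.4681
```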
11.1.4. Interval Probability
Fig. 11.4 If you subtract the area to the left of \(x_{1}\) from the area to the left of \(x_{2}\), you obtain the probability of \(x_{1} < X \leq x_{2}.\)
The probability that a random variable \(X\) falls within an interval \((x_{1}, x_{2}]\) is given by \(\prob(x_{1} < X \leq x_{2}) = \prob(X \leq x_{2}) - \prob(X \leq x_{1}).\)
Example 11.3
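Suppose that \(X \sim \Normal(\mu, \sigma^{2})\) with \(\mu = 10\) and \(\sigma = 25\). What is the probability that \(2 < X \leq 14\)?
\[\prob(X \leq 14) = \cdf\left(\frac{14 - 10}{25}\right) = \cdf(0.16) = 0.5636, \qquad \prob(X \leq 2) = \cdf\left(\frac{2 - 10}{25}\right) = \cdf(-0.32) = 0.3745.\]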
Therefore, \(\prob(2 < X \leq 14) = 0.5636 - 0.3745 = 0.1891\).
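The interval probability of Example 11.3 is the difference of two CDF evaluations, which the following sketch reproduces with `scipy.stats.norm.cdf`. Variable names are illustrative.

```python
from scipy.stats import norm

mu, sigma = 10, 25
x1, x2 = 2, 14

# P(x1 < X <= x2) = F(x2) - F(x1)
p_upper = norm.cdf(x2, loc=mu, scale=sigma)
p_lower = norm.cdf(x1, loc=mu, scale=sigma)
p_interval = p_upper - p_lower

print(round(p_interval, 4))  # 0.1891
```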
11.1.5. Percentiles
Fig. 11.5 The right-tail percentile is the value \(z_{\alpha}\) that gives an area to the right equal to \(\alpha\).
For a standard normal variable \(Z\), a right-tail percentile is the value \(z_{\alpha}\) above which we obtain a certain probability \(\alpha.\) Mathematically, this means finding \(z_{\alpha}\) such that:
\[\prob(Z > z_{\alpha}) = \alpha.\]
This implies that \(\cdf(z_{\alpha}) = 1 - \alpha\), or \(z_{\alpha} = \cdf^{-1}(1 - \alpha)\), where \(\cdf^{-1}(\cdot)\) denotes the inverse function of \(\cdf(\cdot)\). Again, there is no closed-form expression for this function and we need a computer to obtain the values. For example, say that \(\alpha = 0.025\). In Python we could compute \(z_{\alpha} = \cdf^{-1}(0.975)\) by using the function ppf included in scipy.stats.norm as follows:
from scipy.stats import norm
norm.ppf(0.975)
1.959963984540054
In Excel the function =norm.s.inv(0.975) provides the same result.
The following table shows common values for \(z_{\alpha}\):
| \(\boldsymbol{\alpha}\) | \(\boldsymbol{z_{\alpha}}\) |
|---|---|
| 0.050 | 1.64 |
| 0.025 | 1.96 |
| 0.010 | 2.33 |
| 0.005 | 2.58 |
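The table of common percentiles can be regenerated with a short loop over \(\alpha\), using \(z_{\alpha} = \cdf^{-1}(1 - \alpha)\). This is only a sketch; the dictionary name `z_table` is an illustrative choice.

```python
from scipy.stats import norm

# Right-tail percentiles z_alpha = Phi^{-1}(1 - alpha), rounded to two decimals
z_table = {alpha: round(norm.ppf(1 - alpha), 2) for alpha in (0.050, 0.025, 0.010, 0.005)}

for alpha, z_alpha in z_table.items():
    print(f"{alpha:.3f}  {z_alpha:.2f}")
```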
Fig. 11.6 The areas on each side are both equal to \(\alpha/2.\)
A \((1 - \alpha)\) two-sided confidence interval (CI) defines left and right percentiles such that the probability on each side is \(\alpha/2\). For a standard normal variable \(Z\), the symmetry of its PDF implies:
\[\prob(-z_{\alpha/2} < Z \leq z_{\alpha/2}) = 1 - \alpha.\]
Example 11.4
Since \(z_{2.5\%} = 1.96\), the 95% confidence interval of \(Z\) is \([-1.96, 1.96]\). This means that if we randomly sample this variable 100,000 times, approximately 95,000 observations will fall inside this interval.
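The sampling statement in Example 11.4 can be illustrated with a small simulation. The sketch below draws 100,000 standard normal observations with NumPy and measures the share that falls inside the interval; the seed is an arbitrary choice made only for reproducibility.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(seed=0)   # arbitrary seed, assumed for reproducibility
draws = rng.standard_normal(100_000)

z = norm.ppf(0.975)                   # approximately 1.96
inside = np.mean(np.abs(draws) <= z)  # share of draws within [-z, z]

print(inside)  # close to 0.95, up to sampling noise
```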
If \(X \sim \Normal(\mu, \sigma^{2})\), its confidence interval is determined by \(\xi\) and \(\zeta\) such that:
\[\prob(\xi < X \leq \zeta) = \prob\left(\frac{\xi - \mu}{\sigma} < Z \leq \frac{\zeta - \mu}{\sigma}\right) = 1 - \alpha,\]
which implies that \(-z_{\alpha/2} = \tfrac{\xi - \mu}{\sigma}\) and \(z_{\alpha/2} = \tfrac{\zeta - \mu}{\sigma}\). The \((1 - \alpha)\) confidence interval for \(X\) is then \([\mu - z_{\alpha/2}\sigma, \mu + z_{\alpha/2}\sigma]\).
Example 11.5
Suppose that \(X \sim \Normal(\mu, \sigma^{2})\) with \(\mu = 10\) and \(\sigma = 25\). Since \(z_{2.5\%} = 1.96\), the 95% confidence interval of \(X\) is:
\[[10 - 1.96 \times 25, \; 10 + 1.96 \times 25] = [-39.00, 59.00].\]
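A sketch of the same interval in Python, computed from the formula \([\mu - z_{\alpha/2}\sigma, \mu + z_{\alpha/2}\sigma]\) and, as a cross-check, with SciPy's `norm.interval`, which returns the central interval for a given coverage level. Because the unrounded percentile is used, the endpoints come out close to \(-39\) and \(59\) rather than exactly equal to them.

```python
from scipy.stats import norm

mu, sigma = 10, 25
alpha = 0.05

# Formula from the text, with the unrounded percentile
z = norm.ppf(1 - alpha / 2)
lower, upper = mu - z * sigma, mu + z * sigma

# Cross-check: central interval with 95% coverage
lower_scipy, upper_scipy = norm.interval(0.95, loc=mu, scale=sigma)

print(round(lower, 2), round(upper, 2))  # -39.0 59.0
```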
11.2. The Lognormal Distribution
If \(X \sim \Normal(\mu, \sigma^{2})\), then \(Y = e^{X}\) is said to be lognormally distributed with the same parameters. The PDF of a lognormally distributed random variable \(Y\) can be obtained from the PDF of \(X\).
Fig. 11.7 The figure shows the difference between a normal and a lognormal PDF with the same parameters.
Property 11.2 (Lognormal Density)
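If \(Y\) is lognormally distributed with parameters \(\mu\) and \(\sigma^{2}\), the PDF of \(Y\) is given by:
\[f_{Y}(y) = \frac{1}{y \sigma \sqrt{2 \pi}} e^{-\frac{(\ln(y) - \mu)^{2}}{2 \sigma^{2}}}, \quad y > 0.\]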
Proof
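Let \(Y = e^{X}\) where \(X = \mu + \sigma Z\) and \(Z \sim \Normal(0, 1)\). Then, for \(y > 0\),
\[\prob(Y \leq y) = \prob(X \leq \ln(y)) = \int_{-\infty}^{\ln(y)} \frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{(x - \mu)^{2}}{2 \sigma^{2}}} \, dx.\]
Let’s define \(z = e^{x}\). This implies that \(x = \ln(z)\), which in turn implies that \(dx = (1 / z) dz\). Therefore,
\[\prob(Y \leq y) = \int_{0}^{y} \frac{1}{z \sigma \sqrt{2 \pi}} e^{-\frac{(\ln(z) - \mu)^{2}}{2 \sigma^{2}}} \, dz.\]
Thus, the integrand of the previous expression is the probability density function of \(Y\).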
Unlike the normal density, the lognormal density function is not symmetric around its mean. Normally distributed variables can take values in \((-\infty, \infty)\), whereas lognormally distributed variables are always positive.
11.2.1. Computing Probabilities
We can use the fact that the logarithm of a lognormal random variable is normally distributed to compute cumulative probabilities.
Example 11.6
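Let \(Y = e^{4 + 1.5 Z}\) where \(Z \sim \Normal(0, 1)\). What is the probability that \(Y \leq 100\)?
\[\prob(Y \leq 100) = \prob(4 + 1.5 Z \leq \ln(100)) = \cdf\left(\frac{\ln(100) - 4}{1.5}\right) = \cdf(0.4034) = 0.6567.\]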
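The probability asked for in Example 11.6 can be sketched in Python in two ways: by taking logs and using the normal CDF, or with `scipy.stats.lognorm`. Note that SciPy parameterizes the lognormal by the shape `s` \(= \sigma\) and `scale` \(= e^{\mu}\), which is easy to get wrong.

```python
import numpy as np
from scipy.stats import lognorm, norm

mu, sigma = 4.0, 1.5
y = 100.0

# Route 1: ln(Y) is normal, so P(Y <= y) = Phi((ln(y) - mu) / sigma)
p_norm = norm.cdf((np.log(y) - mu) / sigma)

# Route 2: SciPy's lognormal with s = sigma and scale = exp(mu)
p_lognorm = lognorm.cdf(y, s=sigma, scale=np.exp(mu))

print(round(p_norm, 4), round(p_lognorm, 4))  # 0.6567 0.6567
```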
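11.2.2. Confidence Interval
Let \(Y = e^{\mu + \sigma Z}\) where \(Z \sim \Normal(0, 1)\). Since the exponential function is increasing, we have that:
\[1 - \alpha = \prob(-z_{\alpha/2} < Z \leq z_{\alpha/2}) = \prob\left(e^{\mu - \sigma z_{\alpha/2}} < Y \leq e^{\mu + \sigma z_{\alpha/2}}\right).\]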
The \((1 - \alpha)\) confidence interval for \(Y\) is \([e^{\mu - \sigma z_{\alpha/2}}, e^{\mu + \sigma z_{\alpha/2}}]\).
Example 11.7
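Let \(Y = e^{4 + 1.5 Z}\) where \(Z \sim \Normal(0, 1)\). The 95% confidence interval for \(Y\) is:
\[[e^{4 - 1.5 \times 1.96}, \; e^{4 + 1.5 \times 1.96}] = [e^{1.06}, e^{6.94}] = [2.89, 1032.77].\]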
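A sketch of the lognormal interval from Example 11.7 in Python. It exponentiates the endpoints of the normal interval and cross-checks them against `lognorm.interval`. Since the unrounded percentile is used instead of 1.96, the upper endpoint differs slightly from a hand calculation.

```python
import numpy as np
from scipy.stats import lognorm, norm

mu, sigma = 4.0, 1.5
alpha = 0.05

# Exponentiate the endpoints of the interval for ln(Y)
z = norm.ppf(1 - alpha / 2)
lower = np.exp(mu - sigma * z)
upper = np.exp(mu + sigma * z)

# Cross-check with SciPy's lognormal (s = sigma, scale = exp(mu))
lower_scipy, upper_scipy = lognorm.interval(0.95, s=sigma, scale=np.exp(mu))

print(round(lower, 2), round(upper, 1))  # roughly 2.89 and 1032.7
```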
11.2.3. Moments
Property 11.3 (Moments of a Lognormal Distribution)
Let \(Y = e^{\mu + \sigma Z}\) where \(Z \sim \Normal(0, 1)\). We have that:
\[\ev(Y) = e^{\mu + \frac{1}{2} \sigma^{2}}.\]
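Proof
Completing the square in the exponent, \(\sigma z - \frac{1}{2} z^{2} = \frac{1}{2} \sigma^{2} - \frac{1}{2} (z - \sigma)^{2}\), we obtain:
\[\ev(Y) = e^{\mu} \int_{-\infty}^{\infty} e^{\sigma z} \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2} z^{2}} \, dz = e^{\mu + \frac{1}{2} \sigma^{2}} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2} (z - \sigma)^{2}} \, dz = e^{\mu + \frac{1}{2} \sigma^{2}},\]
since the last integrand is the density of a normal random variable with mean \(\sigma\) and unit variance.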
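Using the fact that \(\alpha X \sim \Normal(\alpha \mu, (\alpha \sigma)^{2})\), where \(X = \ln(Y) = \mu + \sigma Z\), it is also possible to compute the expectation of powers of lognormally distributed variables:
\[\ev(Y^{\alpha}) = \ev(e^{\alpha X}) = e^{\alpha \mu + \frac{1}{2} \alpha^{2} \sigma^{2}}.\]
This is useful to compute the variance and standard deviation of \(Y\):
\[\var(Y) = \ev(Y^{2}) - \ev(Y)^{2} = e^{2 \mu + 2 \sigma^{2}} - e^{2 \mu + \sigma^{2}} = e^{2 \mu + \sigma^{2}} \left(e^{\sigma^{2}} - 1\right), \qquad \stdev(Y) = e^{\mu + \frac{1}{2} \sigma^{2}} \sqrt{e^{\sigma^{2}} - 1}.\]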
Example 11.8
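Let \(Y = e^{4 + 1.5 Z}\) where \(Z \sim \Normal(0, 1)\). The expectation and standard deviation of \(Y\) are:
\[\ev(Y) = e^{4 + \frac{1}{2} (1.5)^{2}} = e^{5.125} = 168.17, \qquad \stdev(Y) = e^{5.125} \sqrt{e^{2.25} - 1} = 489.95.\]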
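The moments in Example 11.8 can be checked numerically. The sketch below evaluates the standard closed-form expressions \(\ev(Y) = e^{\mu + \sigma^{2}/2}\) and \(\stdev(Y) = \sqrt{\ev(Y^{2}) - \ev(Y)^{2}}\), and compares them with the `mean` and `std` methods of `scipy.stats.lognorm` (with `s` \(= \sigma\) and `scale` \(= e^{\mu}\)).

```python
import numpy as np
from scipy.stats import lognorm

mu, sigma = 4.0, 1.5

# Closed-form moments of Y = exp(mu + sigma * Z)
mean_y = np.exp(mu + 0.5 * sigma**2)
second_moment = np.exp(2 * mu + 2 * sigma**2)   # E(Y^2)
sd_y = np.sqrt(second_moment - mean_y**2)

# Cross-check against SciPy's lognormal
mean_scipy = lognorm.mean(s=sigma, scale=np.exp(mu))
sd_scipy = lognorm.std(s=sigma, scale=np.exp(mu))

print(round(mean_y, 2), round(sd_y, 2))  # 168.17 489.95
```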
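11.2.4. Partial Expectations
When pricing a call option, the payoff is positive if the option is in-the-money and zero otherwise. We usually use an indicator function to quantify this behavior. For a threshold \(K > 0\), such as the strike price of the option:
\[\mathbf{1}_{\{Y > K\}} = \begin{cases} 1 & \text{if } Y > K, \\ 0 & \text{otherwise.} \end{cases}\]
Property 11.4 (Partial Expectations)
Let \(Y = e^{X}\) where \(X \sim \Normal(\mu, \sigma^{2})\). Then we have that:
\[\ev\left(\mathbf{1}_{\{Y > K\}}\right) = \cdf\left(\frac{\mu - \ln(K)}{\sigma}\right), \qquad \ev\left(Y \mathbf{1}_{\{Y > K\}}\right) = e^{\mu + \frac{1}{2} \sigma^{2}} \cdf\left(\frac{\mu + \sigma^{2} - \ln(K)}{\sigma}\right).\]
Proof
The first expectation can be computed as:
\[\ev\left(\mathbf{1}_{\{Y > K\}}\right) = \prob(Y > K) = \prob(X > \ln(K)) = 1 - \cdf\left(\frac{\ln(K) - \mu}{\sigma}\right) = \cdf\left(\frac{\mu - \ln(K)}{\sigma}\right).\]
The second expectation yields:
\[\ev\left(Y \mathbf{1}_{\{Y > K\}}\right) = \int_{\ln(K)}^{\infty} e^{x} \frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{(x - \mu)^{2}}{2 \sigma^{2}}} \, dx = e^{\mu + \frac{1}{2} \sigma^{2}} \int_{\ln(K)}^{\infty} \frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{(x - \mu - \sigma^{2})^{2}}{2 \sigma^{2}}} \, dx = e^{\mu + \frac{1}{2} \sigma^{2}} \cdf\left(\frac{\mu + \sigma^{2} - \ln(K)}{\sigma}\right),\]
where the second equality follows from completing the square, \(x - \frac{(x - \mu)^{2}}{2 \sigma^{2}} = \mu + \frac{1}{2} \sigma^{2} - \frac{(x - \mu - \sigma^{2})^{2}}{2 \sigma^{2}}.\)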
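The partial expectations can be checked numerically under the assumption that they take the standard form \(\ev(\mathbf{1}_{\{Y > K\}}) = \cdf\left(\frac{\mu - \ln(K)}{\sigma}\right)\) and \(\ev(Y \mathbf{1}_{\{Y > K\}}) = e^{\mu + \sigma^{2}/2} \cdf\left(\frac{\mu + \sigma^{2} - \ln(K)}{\sigma}\right)\). The threshold \(K = 100\) and the parameters \(\mu = 4\), \(\sigma = 1.5\) are hypothetical values chosen for illustration. The sketch integrates over the standard normal variable \(Z\) with `scipy.integrate.quad`.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

mu, sigma = 4.0, 1.5   # illustrative parameters
K = 100.0              # hypothetical threshold (e.g. a strike price)

# Y > K is equivalent to Z > z_star
z_star = (np.log(K) - mu) / sigma

# E(1{Y > K}): numerical integral versus closed form
prob_numeric, _ = quad(norm.pdf, z_star, np.inf)
prob_closed = norm.cdf((mu - np.log(K)) / sigma)

# E(Y 1{Y > K}): the exponents are combined to avoid overflow for large z
def integrand(z):
    return np.exp(mu + sigma * z - 0.5 * z**2) / np.sqrt(2 * np.pi)

partial_numeric, _ = quad(integrand, z_star, np.inf)
partial_closed = np.exp(mu + 0.5 * sigma**2) * norm.cdf((mu + sigma**2 - np.log(K)) / sigma)

print(prob_numeric, prob_closed)
print(partial_numeric, partial_closed)
```

Integrating in terms of \(Z\) rather than \(Y\) keeps the integrand well behaved, since the heavy right tail of the lognormal density is replaced by a Gaussian-shaped integrand.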
11.3. Practice Problems
Exercise 11.1
Suppose that \(X\) is a normally distributed random variable with mean \(\mu=12\) and standard deviation \(\sigma=20\).
What is the probability that \(X \leq 0\)?
What is the probability that \(X \leq -4\)?
What is the probability that \(X > 8\)?
What is the probability that \(4 < X \leq 10\)?
Solution to Exercise 11.1
\(\prob(X \leq 0) = \cdf(\frac{0-12}{20})\)
\(\hphantom{\prob(X \leq 0)} = \cdf(-0.60)\)
\(\hphantom{\prob(X \leq 0)} = 0.2743\).
\(\prob(X \leq -4) = \cdf(\frac{-4-12}{20})\)
\(\hphantom{\prob(X \leq -4)} = \cdf(-0.80)\)
\(\hphantom{\prob(X \leq -4)} = 0.2119\).
\(\prob(X > 8) = 1 - \prob(X \leq 8)\)
\(\hphantom{\prob(X > 8)} = 1 - \cdf(\frac{8-12}{20})\)
\(\hphantom{\prob(X > 8)} = 1 - \cdf(-0.20)\)
\(\hphantom{\prob(X > 8)} = 0.5793\).
\(\prob(4 < X \leq 10) = \prob(X \leq 10) - \prob(X \leq 4)\)
\(\hphantom{\prob(4 < X \leq 10)} = \cdf(\frac{10-12}{20}) - \cdf(\frac{4-12}{20})\)
\(\hphantom{\prob(4 < X \leq 10)} = \cdf(-0.10) - \cdf(-0.40)\)
\(\hphantom{\prob(4 < X \leq 10)} = 0.1156\).
Exercise 11.2
Suppose that \(X\) is a normally distributed random variable with mean \(\mu=10\) and standard deviation \(\sigma=20\). Compute the 90%, 95%, and 99% confidence intervals for \(X\).
Solution to Exercise 11.2
The \((1-\alpha)\) confidence interval (CI) for \(X\) is given by \([\mu - z_{\alpha/2} \sigma, \mu + z_{\alpha/2} \sigma]\) where \(z_{\alpha/2} = \cdf^{-1}(1-\alpha/2)\). For example, if you want to compute the \(z\)-level corresponding to the 90% confidence interval, then \(\alpha = 0.10\) and \(\alpha/2 = 0.05\), so to compute \(z_{0.05}\) you need to type =norm.s.inv(0.95) in Excel.
\(z_{0.05} = \cdf^{-1}(0.95) = 1.645\) so the 90% CI for \(X\) is \([-22.90, 42.90]\).
\(z_{0.025} = \cdf^{-1}(0.975) = 1.960\) so the 95% CI for \(X\) is \([-29.20, 49.20]\).
\(z_{0.005} = \cdf^{-1}(0.995) = 2.576\) so the 99% CI for \(X\) is \([-41.52, 61.52]\).
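The three intervals of Exercise 11.2 can be reproduced with a short loop, sketched below. It uses the unrounded percentiles from `norm.ppf`, which is why the endpoints match the solution to two decimals. The dictionary name `intervals` is an illustrative choice.

```python
from scipy.stats import norm

mu, sigma = 10, 20

intervals = {}
for level in (0.90, 0.95, 0.99):
    alpha = 1 - level
    z = norm.ppf(1 - alpha / 2)                 # z_{alpha/2}
    intervals[level] = (mu - z * sigma, mu + z * sigma)
    lower, upper = intervals[level]
    print(f"{level:.0%}: [{lower:.2f}, {upper:.2f}]")
```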
Exercise 11.3
Suppose that \(X=\ln(Y)\) is a normally distributed random variable with mean \(\mu=3.9\) and standard deviation \(\sigma=15\).
What is the probability that \(Y \leq 6\)?
What is the probability that \(Y > 4\)?
What is the probability that \(3 < Y \leq 12\)?
What is the probability that \(Y \leq 0\)?
Solution to Exercise 11.3
\(\prob(Y \leq 6) = \prob(X \leq \ln(6))\)
\(\hphantom{\prob(Y \leq 6)} = \cdf(\frac{\ln(6)-3.9}{15})\)
\(\hphantom{\prob(Y \leq 6)} = \cdf(-0.1405)\)
\(\hphantom{\prob(Y \leq 6)} = 0.4441\).
\(\prob(Y > 4) = 1 - \prob(Y \leq 4)\)
\(\hphantom{\prob(Y > 4)} = 1 - \prob(X \leq \ln(4))\)
\(\hphantom{\prob(Y > 4)} = 1 - \cdf(\frac{\ln(4)-3.9}{15})\)
\(\hphantom{\prob(Y > 4)} = 1 - \cdf(-0.1676)\)
\(\hphantom{\prob(Y > 4)} = 0.5665\).
\(\prob(3 < Y \leq 12) = \prob(Y \leq 12) - \prob(Y \leq 3)\)
\(\hphantom{\prob(3 < Y \leq 12)} = \cdf(\frac{\ln(12)-3.9}{15}) - \cdf(\frac{\ln(3)-3.9}{15})\)
\(\hphantom{\prob(3 < Y \leq 12)} = \cdf(-0.0943) - \cdf(-0.1868)\)
\(\hphantom{\prob(3 < Y \leq 12)} = 0.4624 - 0.4259\)
\(\hphantom{\prob(3 < Y \leq 12)} = 0.0365\).
\(\prob(Y \leq 0) = \prob(X \leq -\infty) = 0\).
Exercise 11.4
Suppose that \(X=\ln(Y)\) is a normally distributed random variable with mean \(\mu=2.7\) and standard deviation \(\sigma=1\). Compute the 90%, 95%, and 99% confidence intervals for \(X\) and report the corresponding values for \(Y\).
Solution to Exercise 11.4
The \((1 - \alpha)\) confidence interval (CI) for \(X\) is given by \([\mu - z_{\alpha/2} \sigma, \mu + z_{\alpha/2} \sigma]\). Remember that to compute \(z_{\alpha/2}\) in Excel we use =norm.s.inv(1-alpha/2). The corresponding interval for \(Y\) is then \([e^{\mu - z_{\alpha/2} \sigma}, e^{\mu + z_{\alpha/2} \sigma}]\).
\(z_{0.05} = 1.645\) so the 90% CI for \(X\) is \([1.06, 4.34]\), and the corresponding values for \(Y\) are \([2.87, 77.08]\).
\(z_{0.025} = 1.960\) so the 95% CI for \(X\) is \([0.74, 4.66]\), and the corresponding values for \(Y\) are \([2.10, 105.63]\).
\(z_{0.005} = 2.576\) so the 99% CI for \(X\) is \([0.12, 5.28]\), and the corresponding values for \(Y\) are \([1.13, 195.55]\).
Exercise 11.5
Let \(Y = e^{\mu + \sigma Z}\) where \(\mu = 1\), \(\sigma = 2\) and \(Z \sim \Normal(0, 1)\). Compute:
\(\ev(Y)\)
\(\stdev(Y) = \sqrt{\ev(Y^{2}) - \ev(Y)^{2}}\)
\(\ev(Y^{0.3})\)
\(\ev(Y^{-1})\)
Solution to Exercise 11.5
In some of the questions we use the fact that if \(X \sim \Normal(\mu, \sigma^{2})\), then \(\alpha X \sim \Normal(\alpha\mu, \alpha^{2}\sigma^{2})\), which implies that \(\ev(Y^{\alpha}) = \ev(e^{\alpha X}) = e^{\alpha\mu+\frac{1}{2}\alpha^{2}\sigma^{2}}\).
\(\ev(Y) = e^{1+\frac{1}{2}2^{2}} = 20.09\)
\(\ev(Y^{2}) = e^{(2)(1)+\frac{1}{2}(2)^{2}2^{2}}\)
\(\hphantom{\ev(Y^{2})} = 22026.47\),
\(\left(\ev(Y)\right)^{2} = (20.09)^{2}\)
\(\hphantom{\left(\ev(Y)\right)^{2}} = 403.43\),
\(\stdev(Y) = \sqrt{22026.47-403.43}\)
\(\hphantom{\stdev(Y)} = 147.05\).
\(\ev(Y^{0.3}) = e^{(0.3)(1)+\frac{1}{2}(0.3)^{2}2^{2}} = 1.62\)
\(\ev(Y^{-1}) = e^{(-1)(1)+\frac{1}{2}(-1)^{2}2^{2}} = 2.72\)
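The answers to Exercise 11.5 all follow from the power formula \(\ev(Y^{\alpha}) = e^{\alpha\mu+\frac{1}{2}\alpha^{2}\sigma^{2}}\), so they can be checked with one small helper. The function name `lognormal_power_mean` is a hypothetical name introduced here for illustration and is not part of any library.

```python
import numpy as np

def lognormal_power_mean(alpha, mu, sigma):
    """E(Y^alpha) for Y = exp(mu + sigma * Z), with Z standard normal."""
    return np.exp(alpha * mu + 0.5 * alpha**2 * sigma**2)

mu, sigma = 1.0, 2.0

mean_y = lognormal_power_mean(1, mu, sigma)
second_moment = lognormal_power_mean(2, mu, sigma)
sd_y = np.sqrt(second_moment - mean_y**2)
mean_y_03 = lognormal_power_mean(0.3, mu, sigma)
mean_y_inv = lognormal_power_mean(-1, mu, sigma)

print(round(mean_y, 2), round(sd_y, 2), round(mean_y_03, 2), round(mean_y_inv, 2))
# 20.09 147.05 1.62 2.72
```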