Probability Basics

This document is a brief introduction to probability concepts that will be used throughout the course. For clarity, I present the main results using a discrete probability space. However, all results extend to the general case, where (\Omega, \mathcal{F}, \operatorname{P}) consists of an arbitrary sample space \Omega, a \sigma-algebra \mathcal{F} of subsets of \Omega, and a probability measure \operatorname{P} defined on \mathcal{F}.

For a more complete treatment, see introductory textbooks such as Ross (2019) and Blitzstein and Hwang (2019). For more advanced treatments, see Grimmett and Stirzaker (2001) and Durrett (2019). For a classic treatment, see Feller (1968). For a computational treatment with Python, see Unpingco (2019).

Ross, Sheldon M. 2019. A First Course in Probability. 11th ed. Pearson.
Blitzstein, Joseph K., and Jessica Hwang. 2019. Introduction to Probability. 2nd ed. Chapman & Hall/CRC.
Grimmett, Geoffrey, and David Stirzaker. 2001. Probability and Random Processes. 3rd ed. Oxford University Press.
Durrett, Rick. 2019. Probability: Theory and Examples. 5th ed. Cambridge University Press.
Feller, William. 1968. An Introduction to Probability Theory and Its Applications. Volume 1. 3rd ed. Wiley.
Unpingco, Jose C. 2019. Python for Probability, Statistics, and Machine Learning. 2nd ed. Springer.

Sets

A set is a collection of objects. The objects of a set can be anything you want. For example, a set may contain numbers, letters, cars, or pictures. In our case, we will be concerned with sets that contain future possibilities or outcomes that can occur.

Sets are fundamental in probability theory because events are sets of outcomes. Once outcomes are organized as sets, operations such as unions, intersections, and complements translate directly into statements like “or”, “and”, and “not”, which allows us to define probabilities consistently and prove core probability rules.

One way to define a set is to enumerate its elements. For example, the set of all integers from 1 to 10 is A = \{1, 2, 3, 4, 5, 6, 7, 8, 9, 10\}. Once we have defined a set, we can answer if an object is an element of the set or not. For example, the number 3 is an element of A whereas the number 20 is not. We use the symbol \in to denote membership of a set and \notin to denote non-membership. Therefore, we have that 3 \in A and 20 \notin A.

Some sets can have an infinite number of elements. For example, the natural numbers are defined as \mathbb{N} = \{0, 1, 2, 3, \ldots\}, where the triple dots mean that if n is in \mathbb{N}, then n+1 is also in \mathbb{N}.

Since all elements of A are also members of \mathbb{N}, we say that A is a subset of \mathbb{N} and write it as A \subset \mathbb{N}. Using this terminology, we can redefine the set A above using set-builder notation, much like a Python set comprehension: A = \{ n \in \mathbb{N} : n < 11 \}. If we are studying sets of natural numbers, it makes sense to define the universe to be \mathbb{N}, so that the sets under study are subsets of that universe.

Now, define the set B as B = \{6, 7, 8, 9, 10, 11, 12, 13, 14, 15\}.

The intersection of A and B is the set denoted A \cap B whose members are in both A and B. Using the sets defined above, we have that A \cap B = \{6, 7, 8, 9, 10\}. The union of A and B is the set denoted A \cup B whose members are in A, in B, or in both. Thus, using our previously defined sets, we have that A \cup B = \{1, 2, 3, \ldots , 14, 15\}.

The set difference of A and B is the set denoted A \setminus B whose members are in A but not in B. Thus, A \setminus B = \{1, 2, 3, 4, 5\} and B \setminus A = \{11, 12, 13, 14, 15\}.

The complement of A is the set denoted by A^{C} whose members are not in A. Of course, this statement only makes sense once we define a universe in which the elements not in A can live. If the universe is \mathbb{N}, then A^{C} = \mathbb{N} \setminus A = \{11, 12, 13, \ldots\}. Similarly, B^{C} = \{0, 1, 2, 3, 4, 5\} \cup \{16, 17, 18, \ldots\}.

Note that if you take all the elements of A out of A, you end up with an empty set, that is, A \setminus A = \{\}. We typically denote the empty set by \emptyset, but it is good to keep in mind that \emptyset = \{\}. In our universe of natural numbers, no natural number is a member of the empty set. We can write this formally as n \notin \emptyset, \forall n \in \mathbb{N}. Thus, the empty set is a subset of every subset of \mathbb{N}.
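
These set operations map directly onto Python's built-in `set` type. A quick sketch using the sets A and B from this section (the variable names are just for illustration):

```python
# The sets A and B defined above.
A = set(range(1, 11))   # {1, ..., 10}
B = set(range(6, 16))   # {6, ..., 15}

print(A & B)  # intersection: {6, 7, 8, 9, 10}
print(A | B)  # union: {1, ..., 15}
print(A - B)  # set difference: {1, 2, 3, 4, 5}
print(B - A)  # {11, 12, 13, 14, 15}

# A complement only makes sense relative to a universe; the natural
# numbers are infinite, so we truncate the universe to {0, ..., 20}.
universe = set(range(0, 21))
print(universe - A)  # complement of A within the truncated universe
```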

The cardinality of the set A, denoted by |A|, counts the number of elements in A. We then have that |A| = |B| = 10. The empty set has cardinality 0 whereas the cardinality of \mathbb{N} is denoted \aleph_{0}.

The power set of a set C, denoted by \mathcal{P}(C), is the set containing all possible subsets of C. For example, if C = \{1, 2, 3\}, then \mathcal{P}(C) = \{\{\}, \{1\}, \{2\}, \{3\}, \{1, 2\}, \{2, 3\}, \{1, 3\}, \{1, 2, 3\}\}. Clearly, the power sets of A and B are much bigger. In general, for a finite set S, the cardinality of its power set is 2^{|S|}. Therefore, \mathcal{P}(A) and \mathcal{P}(B) each contain 2^{10} = 1024 subsets.
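
The power set can be enumerated with the standard library's `itertools`; a small sketch confirming the 2^{|S|} count (the helper name `power_set` is illustrative):

```python
from itertools import chain, combinations

def power_set(s):
    """Return all subsets of s, as a list of frozensets."""
    s = list(s)
    return [frozenset(c)
            for c in chain.from_iterable(combinations(s, r)
                                         for r in range(len(s) + 1))]

C = {1, 2, 3}
print(len(power_set(C)))             # 8 = 2 ** 3
print(len(power_set(range(1, 11))))  # 1024 = 2 ** 10
```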

Finally, the Cartesian product of A and B is the set denoted by A \times B whose members are all the pairwise combinations of the elements of A and B. \begin{array}{c|cccc} A \times B & 6 & 7 & \dots & 15 \\ \hline 1 & (1, 6) & (1, 7) & \dots & (1, 15) \\ 2 & (2, 6) & (2, 7) & \dots & (2, 15) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 10 & (10, 6) & (10, 7) & \dots & (10, 15) \end{array}

The cardinality of A \times B is equal to the product of the cardinalities of A and B, i.e., |A \times B| = |A| \times |B|.
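
Cartesian products can likewise be enumerated with `itertools.product`, and the rule |A \times B| = |A| \times |B| follows by direct counting:

```python
from itertools import product

A = range(1, 11)  # {1, ..., 10}
B = range(6, 16)  # {6, ..., 15}

pairs = list(product(A, B))  # all pairwise combinations (a, b)
print(pairs[0], pairs[-1])   # (1, 6) (10, 15)
print(len(pairs))            # 100 = |A| * |B| = 10 * 10
```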

Outcomes and Events

In probability theory, a finite sample space is a non-empty finite set denoted by \Omega. The sample space includes all possible outcomes that can occur. A probability measure is a function that assigns to each element \omega of \Omega a number in [0, 1] so that \sum_{\omega \in \Omega} \operatorname{P}(\omega) = 1. An event A is a subset of \Omega, and we define the probability of that event occurring as \operatorname{P}(A) = \sum_{\omega \in A} \operatorname{P}(\omega). \tag{1} We usually denote the set of all events by \mathcal{F}, which for convenience we will take to be the power set of \Omega, i.e., \mathcal{F} = \mathcal{P}(\Omega). Thus, we have that \mathcal{F} contains all possible subsets of \Omega and \operatorname{P} is defined for all those subsets.

The expression (\Omega, \mathcal{F}, \operatorname{P}) then defines a finite probability space, where \Omega is the sample space, \mathcal{F} is the set of events, and \operatorname{P} is the probability measure. Since \mathcal{F} = \mathcal{P}(\Omega), we can simply write (\Omega, \operatorname{P}) to denote the same probability space.

An immediate consequence of (1) is that \operatorname{P}(\Omega) = 1. Furthermore, if A and B are disjoint subsets of \Omega we have that \begin{aligned} \operatorname{P}(A \cup B) & = \sum_{\omega \in A \cup B} \operatorname{P}(\omega) \\ & = \sum_{\omega \in A} \operatorname{P}(\omega) + \sum_{\omega \in B} \operatorname{P}(\omega) \\ & = \operatorname{P}(A) + \operatorname{P}(B). \end{aligned} If we denote by A^{C} the complement of A in \Omega, the last expression implies that \operatorname{P}(A) + \operatorname{P}(A^{C}) = 1. Because \Omega^{C} = \emptyset, we also have that \operatorname{P}(\Omega) + \operatorname{P}(\emptyset) = 1, or \operatorname{P}(\emptyset) = 0.
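
On a finite sample space, a probability measure is nothing more than a mapping from outcomes to non-negative weights that sum to one, and equation (1) gives the probability of any event. A minimal sketch (the outcome labels and the helper `prob` are illustrative):

```python
# A finite probability space: a dict mapping each outcome to P(omega).
P = {"w1": 0.5, "w2": 0.25, "w3": 0.25}
assert abs(sum(P.values()) - 1.0) < 1e-12  # the weights sum to one

def prob(event):
    """P(A) = sum of P(omega) over omega in A, as in equation (1)."""
    return sum(P[w] for w in event)

A = {"w1"}
A_c = set(P) - A            # complement of A in Omega
print(prob(A) + prob(A_c))  # 1.0, i.e. P(A) + P(A^C) = 1
print(prob(set()))          # 0, i.e. P(empty set) = 0
```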

Example 1 If \Omega = \{ \omega_{1}, \omega_{2}, \omega_{3} \}, then \begin{aligned} \mathcal{P}(\Omega) & = \{ \emptyset, \{\omega_{1}\}, \{\omega_{2}\}, \{\omega_{3}\}, \{\omega_{1}, \omega_{2}\}, \{\omega_{2}, \omega_{3}\}, \{\omega_{1}, \omega_{3}\}, \{\omega_{1}, \omega_{2}, \omega_{3}\}\} \end{aligned} defines the collection of all possible events that we can measure. As we saw previously, the cardinality of \mathcal{P}(\Omega) grows exponentially with the size of \Omega.

The function \operatorname{P} such that \operatorname{P}(\omega_{1}) = 1/2, \operatorname{P}(\omega_{2}) = 1/4, and \operatorname{P}(\omega_{3}) = 1/4 defines a probability measure on \Omega. For example, we have that \operatorname{P}(\{\omega_{1}, \omega_{3}\}) = 1/2 + 1/4 = 3/4.

More generally, if \{A_{i} : i \in I\} is a collection of pairwise disjoint subsets of \Omega, then no outcome \omega belongs to more than one A_{i}. In this case, the probability of their union is simply the sum of their probabilities: \operatorname{P}\left(\bigcup_{i \in I} A_{i}\right) = \sum_{i \in I} \operatorname{P}(A_{i}). When I is finite this property is called finite additivity; when I is countably infinite, it is called countable additivity.

Random Variables

Definition

A random variable X is a function that assigns a real value to each outcome: X(\omega) for \omega \in \Omega. Several outcomes may have the same value of X.

Example 2 Consider a sample space with four possible outcomes \Omega = \{ \omega_{1}, \omega_{2}, \omega_{3}, \omega_{4} \}. The table below describes the possible values of three random variables denoted by X, Y and Z.

Outcome X Y Z
\omega_{1} -10 20 15
\omega_{2} -5 10 -10
\omega_{3} 5 0 15
\omega_{4} 10 0 -10

Observing the value of X provides perfect information about which outcome occurred. For example, if X = 5 then we know that \omega_{3} occurred.

Knowing the values of Y or Z, on the other hand, does not provide the same amount of information. If we learn that Y = 0 we only know that either \omega_{3} or \omega_{4} occurred. If we denote by \mathcal{F}_{Y} the set of events that can be generated by Y, we have that \mathcal{F}_{Y} = \{ \emptyset, \{\omega_{1}\}, \{\omega_{2}\}, \{\omega_{1}, \omega_{2}\}, \{\omega_{3}, \omega_{4}\}, \{\omega_{1}, \omega_{3}, \omega_{4}\}, \{\omega_{2}, \omega_{3}, \omega_{4}\}, \Omega\}. The information set provided by Z is even smaller, since \mathcal{F}_{Z} = \{ \emptyset, \{\omega_{1}, \omega_{3}\}, \{\omega_{2}, \omega_{4}\}, \Omega\}. Thus, a random variable does not necessarily provide all the information generated by the probability space \Omega.

Expectation and Variance

If X is a random variable defined on a finite probability space (\Omega, \operatorname{P}), the expectation or expected value of X is defined to be \operatorname{E}X = \sum_{\omega \in \Omega} X(\omega) \operatorname{P}(\omega), whereas the variance of X is \operatorname{V}(X) = \operatorname{E}(X - \operatorname{E}X)^{2}. The standard deviation is the square root of the variance, i.e., \sigma_{X} = \sqrt{\operatorname{V}(X)}.

Example 3 Consider the sample space \Omega = \{ \omega_{1}, \omega_{2}, \omega_{3} \} in which we define the probability measure \operatorname{P} such that \operatorname{P}(\omega_{1}) = 1/2, \operatorname{P}(\omega_{2}) = 1/4, and \operatorname{P}(\omega_{3}) = 1/4. There are two random variables X and Y defined on \Omega that take values according to the table below.

Outcome Probability X Y
\omega_{1} 1/2 10 2
\omega_{2} 1/4 8 40
\omega_{3} 1/4 4 20

Using this information, we can compute the expectation of each random variable.

\begin{aligned} \operatorname{E}X & = \frac{1}{2} \times 10 + \frac{1}{4} \times 8 + \frac{1}{4} \times 4 = 8, \\ \operatorname{E}Y & = \frac{1}{2} \times 2 + \frac{1}{4} \times 40 + \frac{1}{4} \times 20 = 16. \end{aligned} Having computed the expectations of X and Y, we can compute their variances as \begin{aligned} \operatorname{V}(X) & = \frac{1}{2} \times (10 - 8)^{2} + \frac{1}{4} \times (8 - 8)^{2} + \frac{1}{4} \times (4 - 8)^{2} = 6, \\ \operatorname{V}(Y) & = \frac{1}{2} \times (2 - 16)^{2} + \frac{1}{4} \times (40 - 16)^2 + \frac{1}{4} \times (20 - 16)^2 = 246. \end{aligned} Finally, the standard deviations of X and Y are \sigma_{X} = \sqrt{6} \approx 2.45 and \sigma_{Y} = \sqrt{246} \approx 15.68, respectively.
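
These computations are easy to verify numerically. A sketch using the probabilities and values from Example 3 (the helper names are illustrative):

```python
# Probabilities and random-variable values from Example 3.
probs = [0.5, 0.25, 0.25]
X = [10, 8, 4]
Y = [2, 40, 20]

def expectation(values, probs):
    """E(X) = sum of X(omega) * P(omega) over all outcomes."""
    return sum(v * p for v, p in zip(values, probs))

def variance(values, probs):
    """V(X) = E(X - EX)^2."""
    mu = expectation(values, probs)
    return sum((v - mu) ** 2 * p for v, p in zip(values, probs))

print(expectation(X, probs), variance(X, probs))  # 8.0 6.0
print(expectation(Y, probs), variance(Y, probs))  # 16.0 246.0
```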

Covariance

The covariance between two random variables X and Y defined on a probability space (\Omega, \operatorname{P}) is defined as \operatorname{Cov}(X, Y) = \operatorname{E}(X - \operatorname{E}X) (Y - \operatorname{E}Y), and their correlation is \rho_{X, Y} = \frac{\operatorname{Cov}(X, Y)}{\sigma_{X} \sigma_{Y}}. The correlation between any two random variables is always between -1 and 1.

Proof

Let \sigma_{X} and \sigma_{Y} denote the standard deviations of X and Y, respectively, and assume both are strictly positive (otherwise the correlation is not defined). We can then compute \begin{aligned} \operatorname{E}((X - \operatorname{E}X) \sigma_{Y} + (Y - \operatorname{E}Y) \sigma_{X})^{2} & = (\sigma_{X}^{2} \sigma_{Y}^{2} + 2 \sigma_{X} \sigma_{Y} \operatorname{Cov}(X, Y) + \sigma_{Y}^{2} \sigma_{X}^{2}) \\ & = 2 \sigma_{X} \sigma_{Y} (\sigma_{X} \sigma_{Y} + \operatorname{Cov}(X, Y)), \end{aligned} which implies \sigma_{X} \sigma_{Y} + \operatorname{Cov}(X, Y) \geq 0 or - \sigma_{X} \sigma_{Y} \leq \operatorname{Cov}(X, Y).

Similarly, \begin{aligned} \operatorname{E}((X - \operatorname{E}X) \sigma_{Y} - (Y - \operatorname{E}Y) \sigma_{X})^{2} & = (\sigma_{X}^{2} \sigma_{Y}^{2} - 2 \sigma_{X} \sigma_{Y} \operatorname{Cov}(X, Y) + \sigma_{Y}^{2} \sigma_{X}^{2}) \\ & = 2 \sigma_{X} \sigma_{Y} (\sigma_{X} \sigma_{Y} - \operatorname{Cov}(X, Y)), \end{aligned} which implies \sigma_{X} \sigma_{Y} - \operatorname{Cov}(X, Y) \geq 0 or \operatorname{Cov}(X, Y) \leq \sigma_{X} \sigma_{Y}.

Thus, we conclude that -1 \leq \frac{\operatorname{Cov}(X, Y)}{\sigma_{X} \sigma_{Y}} \leq 1, or equivalently -1 \leq \rho_{X,Y} \leq 1.

Example 4 Continuing with Example 3, we have that \operatorname{Cov}(X, Y) = \frac{1}{2} \times (10 - 8)(2 - 16) + \frac{1}{4} (8 - 8)(40 - 16) + \frac{1}{4} (4 - 8)(20 - 16) = -18. Thus, \rho_{X, Y} \approx -0.47.
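
The covariance and correlation of Example 4 can be checked the same way, reusing the probabilities and values of Example 3 (helper names illustrative):

```python
import math

# Probabilities and random-variable values from Example 3.
probs = [0.5, 0.25, 0.25]
X = [10, 8, 4]
Y = [2, 40, 20]

mu_x = sum(x * p for x, p in zip(X, probs))  # E(X) = 8
mu_y = sum(y * p for y, p in zip(Y, probs))  # E(Y) = 16

cov = sum((x - mu_x) * (y - mu_y) * p for x, y, p in zip(X, Y, probs))
var_x = sum((x - mu_x) ** 2 * p for x, p in zip(X, probs))
var_y = sum((y - mu_y) ** 2 * p for y, p in zip(Y, probs))
rho = cov / math.sqrt(var_x * var_y)

print(cov)            # -18.0
print(round(rho, 2))  # -0.47
```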

The covariance of X and Y can also be expressed as \operatorname{Cov}(X, Y) = \operatorname{E}(X Y) - \operatorname{E}(X) \operatorname{E}(Y).

Proof \begin{aligned} \operatorname{Cov}(X, Y) & = \operatorname{E}(X - \operatorname{E}(X))(Y - \operatorname{E}(Y)) \\ & = \operatorname{E}[X (Y - \operatorname{E}(Y))] - \operatorname{E}[\operatorname{E}(X) (Y - \operatorname{E}(Y))] \\ & = \operatorname{E}(XY) - \operatorname{E}[X \operatorname{E}(Y)] - \operatorname{E}(X) \operatorname{E}(Y - \operatorname{E}(Y)) \\ & = \operatorname{E}(XY) - \operatorname{E}(X) \operatorname{E}(Y). \\ \end{aligned}

Probability Mass Function

For discrete random variables, the probability mass function (or pmf) is a real-valued function that specifies the probability that the random variable X is equal to a certain value x, i.e., p_{X}(x) = \operatorname{P}(\{\omega \in \Omega : X(\omega) = x\}).

Example 5 Suppose we define a probability measure \operatorname{P} on the sample space of Example 2, so that the random variables X and Y are distributed according to the table below.

Outcome \operatorname{P} X Y
\omega_{1} 0.10 -10 20
\omega_{2} 0.30 -5 10
\omega_{3} 0.40 5 0
\omega_{4} 0.20 10 0

We have that the probability mass function of X is p_{X}(x) = \begin{cases} 0.10 & \text{if } x = -10, \\ 0.30 & \text{if } x = -5, \\ 0.40 & \text{if } x = 5, \\ 0.20 & \text{if } x = 10. \end{cases}

The probability mass function of Y takes positive values only at three points. p_{Y}(y) = \begin{cases} 0.60 & \text{if } y = 0, \\ 0.30 & \text{if } y = 10, \\ 0.10 & \text{if } y = 20. \end{cases}
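
A pmf can be computed mechanically from the outcome table by aggregating the probabilities of all outcomes that map to the same value. A sketch using Example 5 (the helper `pmf` is illustrative):

```python
from collections import defaultdict

# Outcome probabilities and random-variable values from Example 5.
P = {"w1": 0.10, "w2": 0.30, "w3": 0.40, "w4": 0.20}
X = {"w1": -10, "w2": -5, "w3": 5, "w4": 10}
Y = {"w1": 20, "w2": 10, "w3": 0, "w4": 0}

def pmf(rv):
    """p(v) = sum of P(omega) over all outcomes with rv(omega) = v."""
    p = defaultdict(float)
    for w, prob in P.items():
        p[rv[w]] += prob
    return dict(p)

print(pmf(X))  # every value of X comes from a single outcome
print(pmf(Y))  # Y = 0 aggregates omega_3 and omega_4
```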

It is sometimes easier to visualize the probability mass function by plotting the probability of different values of the random variable.

Two side-by-side stem plots of the probability mass functions. The left panel shows X taking values -10, -5, 5, and 10 with probabilities 0.10, 0.30, 0.40, and 0.20; the right panel shows Y taking values 0, 10, and 20 with probabilities 0.60, 0.30, and 0.10.
(a) The function p_{X}(x) gives the probability of X being equal to x \in \{-10, -5, 5, 10\}.
(b) The function p_{Y}(y) gives the probability of Y being equal to y \in \{0, 10, 20\}.
Figure 1: The figure plots the probability mass functions of the random variables X and Y.

It is apparent from the pictures that p_{X}(x) = 0 if x \notin \{-10, -5, 5, 10\}. Indeed, the set \{\omega \in \Omega : X(\omega) = x \} is empty for all x not equal to -10, -5, 5, or 10. Similarly, p_{Y}(y) = 0 if y \notin \{0, 10, 20\}.

To simplify notation, we will often write \{X = x\} to denote the set \{\omega \in \Omega : X(\omega) = x\}. Using this notation, we have that p_{X}(x) = \operatorname{P}(X = x).

The support of X is the set R_{X} = \{x \in \mathbb{R} : p_{X}(x) > 0\}, which is countable because \Omega is countable. Similarly, the support of Y is R_{Y} = \{y \in \mathbb{R} : p_{Y}(y) > 0\}. Using this notation, we can rewrite the expectation of X as \operatorname{E}(X) = \sum_{x \in R_{X}} x\, p_{X}(x), \tag{2} which is commonly used in statistics. The variance of X then becomes \operatorname{V}(X) = \sum_{x \in R_{X}} (x - \operatorname{E}(X))^{2}\, p_{X}(x). Note that we have \begin{aligned} \operatorname{V}(X) & = \sum_{x \in R_{X}} (x - \operatorname{E}(X))^{2}\, p_{X}(x) \\ & = \sum_{x \in R_{X}} (x^{2} - 2 x \operatorname{E}(X) + \operatorname{E}(X)^{2})\, p_{X}(x) \\ & = \sum_{x \in R_{X}} x^{2}\, p_{X}(x) - 2 \operatorname{E}(X) \sum_{x \in R_{X}} x\, p_{X}(x) + \operatorname{E}(X)^{2} \\ & = \operatorname{E}(X^{2}) - 2 \operatorname{E}(X)^{2} + \operatorname{E}(X)^{2} \\ & = \operatorname{E}(X^{2}) - \operatorname{E}(X)^{2}. \end{aligned}
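
The shortcut \operatorname{V}(X) = \operatorname{E}(X^{2}) - \operatorname{E}(X)^{2} is easy to confirm numerically; a sketch using the pmf of X from Example 5:

```python
# pmf of X from Example 5.
p_X = {-10: 0.10, -5: 0.30, 5: 0.40, 10: 0.20}

EX = sum(x * p for x, p in p_X.items())        # E(X) via equation (2)
EX2 = sum(x ** 2 * p for x, p in p_X.items())  # E(X^2)

var_direct = sum((x - EX) ** 2 * p for x, p in p_X.items())
var_shortcut = EX2 - EX ** 2

print(EX)            # 1.5
print(var_shortcut)  # 45.25
print(abs(var_direct - var_shortcut) < 1e-9)  # True: both formulas agree
```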

For two random variables X and Y defined in (\Omega, \operatorname{P}), the set \{X = x, Y = y\} denotes all outcomes in \Omega that satisfy \{X = x\} and \{Y = y\}. Therefore, we have that \{X = x, Y = y\} = \{X = x\} \cap \{Y = y\}. The function p_{X, Y}(x, y) = \operatorname{P}(X = x, Y = y) is called the joint probability mass function of X and Y.

Example 6 The joint pmf of the random variables defined in Example 5 is given in the table below.

\begin{array}{c|cccc} X \setminus Y & 0 & 10 & 20 \\ \hline -10 & 0 & 0 & 0.1 \\ -5 & 0 & 0.3 & 0 \\ 5 & 0.4 & 0 & 0 \\ 10 & 0.2 & 0 & 0 \end{array} The function p_{X, Y}(x, y) has many zeros since in Example 5 there are only four outcomes. Any other outcome then has probability zero of occurring.

Example 7 We can generate any joint pmf for two random variables as long as the sum of all probabilities is equal to one. The table below reports the joint probabilities of a random variable X taking values in \{-1, 0, 1\} and a random variable Y taking values in \{0, 1, 2, 3\}. \begin{array}{c|cccc} X \setminus Y & 0 & 1 & 2 & 3 \\ \hline -1 & 0.12500 & 0.09375 & 0.06250 & 0.03125 \\ 0 & 0.06250 & 0.12500 & 0.12500 & 0.06250 \\ 1 & 0.03125 & 0.06250 & 0.09375 & 0.12500 \end{array}

In this case the underlying probability space has at least 3 \times 4 = 12 possible outcomes. The figure below plots the joint pmf of X and Y.

To plot the joint pmf of two random variables we need a three-dimensional graph.

A three-dimensional stem plot of the joint probability mass function for X and Y, with X values minus 1, 0, and 1 on one axis and Y values 0 through 3 on the other. Stem heights represent joint probabilities, showing how probability mass is distributed across the twelve possible X-Y combinations.
Figure 2: The figure plots the joint probability mass function of X and Y in Example 7.

We can use the joint pmf to compute the expectation of a function of two random variables. Indeed, we have that \operatorname{E}(f(X, Y)) = \sum_{x \in R_{X}} \sum_{y \in R_{Y}} f(x, y)\, p_{X, Y}(x, y). \tag{3} If we write \mu_{X} = \operatorname{E}(X) and \mu_{Y} = \operatorname{E}(Y), equation (3) implies that the covariance of X and Y can be computed as \operatorname{Cov}(X, Y) = \sum_{x \in R_{X}} \sum_{y \in R_{Y}} (x - \mu_{X})(y - \mu_{Y}) p_{X, Y}(x, y). The joint pmf contains all the information of X and Y since we can recover the individual pmfs of X and Y from it. To find the probability that Y equals a specific value y, we sum over all possible values of X: p_{Y}(y) = \sum_{x \in R_{X}} p_{X, Y}(x, y). This works because the events \{X = x, Y = y\} for different x are disjoint and together cover all ways Y can be y. Similarly, p_{X}(x) = \sum_{y \in R_{Y}} p_{X, Y}(x, y). Therefore, we can marginalize out Y to obtain the probability mass function of X, in the same way that we can marginalize out X to obtain the probability mass function of Y. It is important to note that the joint pmf not only contains the individual information of two random variables but also captures their mutual dependence.
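
Marginalization is a mechanical operation on the joint pmf. A sketch using the joint pmf of Example 7 (variable names illustrative):

```python
# Joint pmf of Example 7: p[(x, y)] = P(X = x, Y = y).
xs = [-1, 0, 1]
ys = [0, 1, 2, 3]
table = [
    [0.12500, 0.09375, 0.06250, 0.03125],
    [0.06250, 0.12500, 0.12500, 0.06250],
    [0.03125, 0.06250, 0.09375, 0.12500],
]
p = {(x, y): table[i][j] for i, x in enumerate(xs) for j, y in enumerate(ys)}

# Marginal pmfs: sum out the other variable.
p_X = {x: sum(p[(x, y)] for y in ys) for x in xs}
p_Y = {y: sum(p[(x, y)] for x in xs) for y in ys}
print(p_X)  # {-1: 0.3125, 0: 0.375, 1: 0.3125}

# Covariance via equation (3) with f(x, y) = (x - mu_X)(y - mu_Y).
mu_X = sum(x * px for x, px in p_X.items())
mu_Y = sum(y * py for y, py in p_Y.items())
cov = sum((x - mu_X) * (y - mu_Y) * p[(x, y)] for (x, y) in p)
print(mu_X, mu_Y, cov)  # 0.0 1.5 0.3125
```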

Independence

We say that two events A and B are independent if \operatorname{P}(A \cap B) = \operatorname{P}(A) \operatorname{P}(B).

Example 8 Suppose that the weather tomorrow can be either sunny, fair or rainy. In addition, a certain stock tomorrow can either go up or down in price.

We can define W = \{\text{sunny}, \text{fair}, \text{rainy}\} and S = \{\text{up}, \text{down}\}. The set of outcomes can be described as all possible pairwise combinations of weather tomorrow and the stock price movement. The sample space \Omega is then the Cartesian product of W and S, i.e., \Omega = W \times S.

We can then define the weather events \begin{aligned} \text{Sunny} & = \{(\text{sunny}, \text{up}), (\text{sunny}, \text{down})\}, \\ \text{Fair} & = \{(\text{fair}, \text{up}), (\text{fair}, \text{down})\}, \\ \text{Rainy} & = \{(\text{rainy}, \text{up}), (\text{rainy}, \text{down})\}. \end{aligned}

The table below describes the probabilities for tomorrow’s weather.

Weather Sunny Fair Rainy
Probability 0.3 0.5 0.2

Similarly, the stock events can be defined as \begin{aligned} \text{Up} & = \{(\text{sunny}, \text{up}), (\text{fair}, \text{up}), (\text{rainy}, \text{up})\}, \\ \text{Down} & = \{(\text{sunny}, \text{down}), (\text{fair}, \text{down}), (\text{rainy}, \text{down})\}. \\ \end{aligned}

The probabilities of the stock price going up or down are described in the table below.

Stock Up Down
Probability 0.6 0.4

If the weather does not affect the likelihood of the stock going up or down, then on sunny days we should expect the stock to go up 60% of the time and to go down 40% of the time.

That is, if the weather tomorrow and the stock price movement are independent events, we should expect \operatorname{P}(\text{Stock} \cap \text{Weather}) = \operatorname{P}(\text{Stock}) \operatorname{P}(\text{Weather}), where \text{Stock} is either \text{Up} or \text{Down}, and \text{Weather} is either \text{Sunny}, \text{Fair}, or \text{Rainy}.

The table below describes the combined probabilities of the stock price movement and the weather tomorrow that are consistent with the independence of those events.

Stock\Weather Sunny Fair Rainy
Up 0.18 0.30 0.12
Down 0.12 0.20 0.08

In the table, the weather does not change the relative proportions of the probabilities for the stock price.
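
The table above can be reproduced by multiplying the marginal probabilities, which is exactly what independence requires. A minimal sketch:

```python
from itertools import product

# Marginal probabilities for tomorrow's weather and the stock move.
weather = {"sunny": 0.3, "fair": 0.5, "rainy": 0.2}
stock = {"up": 0.6, "down": 0.4}

# Product measure on Omega = W x S: P(w, s) = P_W(w) * P_S(s).
P = {(w, s): weather[w] * stock[s] for w, s in product(weather, stock)}

print(abs(sum(P.values()) - 1.0) < 1e-12)  # True: P is a probability measure
print(round(P[("sunny", "up")], 2))        # 0.18
print(round(P[("rainy", "down")], 2))      # 0.08
```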

The previous example shows how to generate independent events out of two finite probability spaces (\Omega_{1}, \operatorname{P}_{1}) and (\Omega_{2}, \operatorname{P}_{2}). If we define \Omega = \Omega_{1} \times \Omega_{2} and let \operatorname{P}(\omega_{1}, \omega_{2}) = \operatorname{P}_{1}(\omega_{1}) \operatorname{P}_{2}(\omega_{2}) for each \omega_{1} \in \Omega_{1} and \omega_{2} \in \Omega_{2}, the pair (\Omega, \operatorname{P}) is a well-defined finite probability space. In this new probability space, the events A = \{\omega_{1}\} \times \Omega_{2} and B = \Omega_{1} \times \{\omega_{2}\} are independent for any \omega_{1} \in \Omega_{1} and \omega_{2} \in \Omega_{2}.

Proof We have that \begin{aligned} \operatorname{P}(A) & = \sum_{\omega_{2} \in \Omega_{2}} \operatorname{P}(\omega_{1}, \omega_{2}) \\ & = \sum_{\omega_{2} \in \Omega_{2}} \operatorname{P}_{1}(\omega_{1}) \operatorname{P}_{2}(\omega_{2}) \\ & = \operatorname{P}_{1}(\omega_{1}) \sum_{\omega_{2} \in \Omega_{2}} \operatorname{P}_{2}(\omega_{2}) \\ & = \operatorname{P}_{1}(\omega_{1}). \end{aligned} Similarly, \operatorname{P}(B) = \operatorname{P}_{2}(\omega_{2}). Since A \cap B = \{(\omega_{1}, \omega_{2})\}, we have that \operatorname{P}(A \cap B) = \operatorname{P}(A) \operatorname{P}(B), proving that A and B are independent.

Example 9 The sample space \Omega is always independent of any event A \subset \Omega since \operatorname{P}(A \cap \Omega) = \operatorname{P}(A) = \operatorname{P}(A) \operatorname{P}(\Omega). Intuitively, knowing that some outcome in \Omega occurred does not change the probability of A.

Two random variables X and Y are independent if the events \{X = x\} and \{Y = y\} are independent for all values x and y. Thus, if X and Y are independent we have that \operatorname{P}(X = x, Y = y) = \operatorname{P}(X = x) \operatorname{P}(Y = y), or equivalently p_{X, Y}(x, y) = p_{X}(x) p_{Y}(y).

An important consequence of independence is that if X and Y are two independent random variables, then \operatorname{E}(XY) = \operatorname{E}(X) \operatorname{E}(Y). \tag{4}

Proof \begin{aligned} \operatorname{E}(X Y) & = \sum_{x \in R_{X}} \sum_{y \in R_{Y}} x y \, p_{X, Y}(x, y) \\ & = \sum_{x \in R_{X}} \sum_{y \in R_{Y}} x y \, p_{X}(x) p_{Y}(y) \\ & = \sum_{y \in R_{Y}} y \, p_{Y}(y) \sum_{x \in R_{X}} x \, p_{X}(x) \\ & = \sum_{y \in R_{Y}} y \, p_{Y}(y) \operatorname{E}(X) \\ & = \operatorname{E}(X) \sum_{y \in R_{Y}} y \, p_{Y}(y) \\ & = \operatorname{E}(X) \operatorname{E}(Y). \end{aligned}

Equation (4) implies that if X and Y are independent, their covariance is equal to zero. Indeed, \begin{aligned} \operatorname{Cov}(X, Y) & = \operatorname{E}(XY) - \operatorname{E}(X) \operatorname{E}(Y) \\ & = \operatorname{E}(X) \operatorname{E}(Y) - \operatorname{E}(X) \operatorname{E}(Y) \\ & = 0. \end{aligned} However, the converse is not true: zero covariance does not imply independence.

Example 10 Consider a random variable X that takes the values \{-1, 0, 1\}, each with probability 1/3. We compute: \operatorname{E}(X) = \frac{1}{3}(1) + \frac{1}{3}(0) + \frac{1}{3}(-1) = 0, and \operatorname{E}(X^{3}) = \frac{1}{3}(1^{3}) + \frac{1}{3}(0^{3}) + \frac{1}{3}((-1)^{3}) = 0. Now, define Y = X^{2}. The covariance between X and Y is \operatorname{Cov}(X, Y) = \operatorname{E}(XY) - \operatorname{E}(X)\operatorname{E}(Y) = \operatorname{E}(X^{3}) - \operatorname{E}(X)\operatorname{E}(X^{2}) = 0. This example shows that X and Y are uncorrelated (zero covariance), but they are not independent: Y is completely determined by X.
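
Example 10 can be verified directly from the pmf of X (names illustrative):

```python
# X is uniform on {-1, 0, 1}; Y = X^2 is a deterministic function of X.
p_X = {-1: 1 / 3, 0: 1 / 3, 1: 1 / 3}

EX = sum(x * p for x, p in p_X.items())        # E(X) = 0
EY = sum(x ** 2 * p for x, p in p_X.items())   # E(Y) = E(X^2)
EXY = sum(x ** 3 * p for x, p in p_X.items())  # E(XY) = E(X^3) = 0

cov = EXY - EX * EY
print(cov)  # 0.0: X and Y are uncorrelated

# Yet they are not independent: P(X = 1, Y = 0) = 0, while
# P(X = 1) * P(Y = 0) = (1/3) * (1/3) > 0.
print(p_X[1] * p_X[0])  # positive, but the joint probability is zero
```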

Linear Combinations

In portfolio theory, we usually study linear combinations of random variables of the form Z = \alpha X + \beta Y. The expectation of Z is just a linear combination of the expectations of X and Y, \operatorname{E}Z = \alpha \operatorname{E}X + \beta \operatorname{E}Y. \tag{5} The variance of Z, though, includes not only the variances of X and Y but also their covariances, \operatorname{V}(Z) = \alpha^{2} \operatorname{V}(X) + \beta^{2} \operatorname{V}(Y) + 2 \alpha \beta \operatorname{Cov}(X, Y). \tag{6} This is an important result which is at the heart of portfolio diversification.

Proof

The expectation of Z is computed as \begin{aligned} \operatorname{E}Z & = \operatorname{E}(\alpha X + \beta Y) \\ & = \sum_{\omega \in \Omega} (\alpha X(\omega) + \beta Y(\omega)) \operatorname{P}(\omega) \\ & = \alpha \sum_{\omega \in \Omega} X(\omega) \operatorname{P}(\omega) + \beta \sum_{\omega \in \Omega} Y(\omega) \operatorname{P}(\omega) \\ & = \alpha \operatorname{E}X + \beta \operatorname{E}Y. \end{aligned}

The variance of Z is computed as \begin{aligned} \operatorname{V}(Z) & = \operatorname{V}(\alpha X + \beta Y) \\ & = \operatorname{E}(\alpha X + \beta Y - \operatorname{E}(\alpha X + \beta Y))^{2} \\ & = \operatorname{E}(\alpha (X - \operatorname{E}X) + \beta (Y - \operatorname{E}Y))^{2} \\ & = \operatorname{E}(\alpha^{2} (X - \operatorname{E}X)^{2} + \beta^{2} (Y - \operatorname{E}Y)^{2} + 2 \alpha \beta (X - \operatorname{E}X)(Y - \operatorname{E}Y)) \\ & = \alpha^{2} \operatorname{E}(X - \operatorname{E}X)^{2} + \beta^{2} \operatorname{E}(Y - \operatorname{E}Y)^{2} + 2 \alpha \beta \operatorname{E}(X - \operatorname{E}X)(Y - \operatorname{E}Y) \\ & = \alpha^{2} \operatorname{V}(X) + \beta^{2} \operatorname{V}(Y) + 2 \alpha \beta \operatorname{Cov}(X, Y). \end{aligned}
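
Equation (6) can be checked against the direct definition using the numbers from Example 3; with \alpha = \beta = 1, it predicts \operatorname{V}(X + Y) = 6 + 246 + 2(-18) = 216. A sketch (helper names illustrative):

```python
# Probabilities and random-variable values from Example 3.
probs = [0.5, 0.25, 0.25]
X = [10, 8, 4]
Y = [2, 40, 20]
Z = [x + y for x, y in zip(X, Y)]  # Z = X + Y, outcome by outcome

def var(vals, probs):
    mu = sum(v * p for v, p in zip(vals, probs))
    return sum((v - mu) ** 2 * p for v, p in zip(vals, probs))

def cov(a, b, probs):
    mu_a = sum(v * p for v, p in zip(a, probs))
    mu_b = sum(v * p for v, p in zip(b, probs))
    return sum((x - mu_a) * (y - mu_b) * p for x, y, p in zip(a, b, probs))

direct = var(Z, probs)  # V(Z) from the definition
formula = var(X, probs) + var(Y, probs) + 2 * cov(X, Y, probs)  # equation (6)
print(direct, formula)  # 216.0 216.0
```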

More generally, consider the random variables X_{1}, X_{2}, \ldots, X_{n}, and form a new random variable X such that X = \alpha_{1} X_{1} + \alpha_{2} X_{2} + \ldots + \alpha_{n} X_{n}, where \alpha_{i} \in \mathbb{R} for all i \in \{1, 2, \ldots, n\}.

The expectation of X is a linear combination of the expectations of X_{1}, X_{2}, \ldots, X_{n}. The variance of X, though, takes into account all covariances between X_{i} and X_{j}, for i, j = 1, 2, \ldots, n. Indeed, we have that \operatorname{V}(X) = \sum_{i = 1}^{n} \sum_{j = 1}^{n} \alpha_{i} \alpha_{j} \operatorname{Cov}(X_{i}, X_{j}). \tag{7} The previous expression can be simplified if the random variables X_{1}, X_{2}, \ldots, X_{n} are independent from each other. In such case, we have that \operatorname{Cov}(X_{i}, X_{j}) = 0 for all i \neq j. Recognizing that \operatorname{Cov}(X_{i}, X_{i}) = \operatorname{V}(X_{i}), equation (7) implies that \operatorname{V}(X) = \sum_{i = 1}^{n} \alpha_{i}^{2} \operatorname{V}(X_{i}). \tag{8}

Example 11 Suppose that X_{1}, X_{2}, \ldots, X_{n} are independent random variables with the same variance denoted by \sigma^{2}. Define X to be the sum of these random variables so that X = X_{1} + X_{2} + \ldots + X_{n}. Equation (8) implies that \operatorname{V}(X) = \sum_{i = 1}^{n} \operatorname{V}(X_{i}) = n \sigma^{2}.

Conditional Probability

For two events A and B with \operatorname{P}(B) > 0, the conditional probability of A given B is \operatorname{P}(A \mid B) = \frac{\operatorname{P}(A \cap B)}{\operatorname{P}(B)}. Similarly, for x such that p_{X}(x) > 0, the conditional probability mass function of Y given X = x is p_{Y \mid X}(y \mid x) = \frac{p_{X, Y}(x, y)}{p_{X}(x)}.

Example 12 Suppose we roll a fair six-sided die and define the events A = \{2, 4, 6\} \quad \text{and} \quad B = \{4, 5, 6\}. Then A \cap B = \{4, 6\}, \qquad \operatorname{P}(B) = \frac{3}{6}, \qquad \operatorname{P}(A \cap B) = \frac{2}{6}. Therefore, the conditional probability of A given B is \operatorname{P}(A \mid B) = \frac{\operatorname{P}(A \cap B)}{\operatorname{P}(B)} = \frac{2/6}{3/6} = \frac{2}{3}.
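
The die computation in Example 12 can be reproduced exactly with `fractions.Fraction`, avoiding floating-point round-off:

```python
from fractions import Fraction

# A fair six-sided die: each face has probability 1/6.
P = {face: Fraction(1, 6) for face in range(1, 7)}

def prob(event):
    """P(A) as the sum of the outcome probabilities in A."""
    return sum(P[w] for w in event)

A = {2, 4, 6}  # even roll
B = {4, 5, 6}  # roll of at least 4

cond = prob(A & B) / prob(B)  # P(A | B) = P(A and B) / P(B)
print(cond)  # 2/3
```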

The conditional expectation of Y given X = x is \operatorname{E}(Y \mid X = x) = \sum_{y \in R_{Y}} y\, p_{Y \mid X}(y \mid x). This is a function of x, and we can define the random variable \operatorname{E}(Y \mid X) by assigning to each outcome \omega the value \operatorname{E}(Y \mid X = X(\omega)).

A key result in probability theory is the law of iterated expectations: \begin{aligned} \operatorname{E}(\operatorname{E}(Y \mid X)) & = \sum_{x \in R_{X}} \operatorname{E}(Y \mid X = x)\, p_{X}(x) \\ & = \sum_{x \in R_{X}} \sum_{y \in R_{Y}} y\, p_{Y \mid X}(y \mid x)\, p_{X}(x) \\ & = \sum_{x \in R_{X}} \sum_{y \in R_{Y}} y\, p_{X, Y}(x, y) \\ & = \sum_{y \in R_{Y}} y\, p_{Y}(y) \\ & = \operatorname{E}(Y). \end{aligned} This means that the expected value of the conditional expectation equals the expected value of Y itself.

Example 13 Using the joint pmf in Example 7, we compute the conditional expectation of Y given X = 1. From the row X = 1, we have p_{X, Y}(1,0) = 0.03125,\; p_{X, Y}(1,1) = 0.06250,\; p_{X, Y}(1,2) = 0.09375,\; p_{X, Y}(1,3) = 0.12500. Hence p_{X}(1) = 0.03125 + 0.06250 + 0.09375 + 0.12500 = 0.3125. The conditional pmf is therefore p_{Y \mid X}(0 \mid 1) = 0.1, \; p_{Y \mid X}(1 \mid 1) = 0.2, \; p_{Y \mid X}(2 \mid 1) = 0.3, \; p_{Y \mid X}(3 \mid 1) = 0.4. Therefore, \operatorname{E}(Y \mid X = 1) = \sum_{y \in R_{Y}} y\, p_{Y \mid X}(y \mid 1) = 0(0.1) + 1(0.2) + 2(0.3) + 3(0.4) = 2.
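
Both the conditional expectation of Example 13 and the law of iterated expectations can be checked numerically with the joint pmf of Example 7 (helper names illustrative):

```python
# Joint pmf of Example 7, keyed by the value of X.
xs = [-1, 0, 1]
ys = [0, 1, 2, 3]
rows = {
    -1: [0.12500, 0.09375, 0.06250, 0.03125],
     0: [0.06250, 0.12500, 0.12500, 0.06250],
     1: [0.03125, 0.06250, 0.09375, 0.12500],
}
p = {(x, y): rows[x][j] for x in xs for j, y in enumerate(ys)}

p_X = {x: sum(p[(x, y)] for y in ys) for x in xs}  # marginal pmf of X

def cond_exp_Y(x):
    """E(Y | X = x) using the conditional pmf p(x, y) / p_X(x)."""
    return sum(y * p[(x, y)] for y in ys) / p_X[x]

print(cond_exp_Y(1))  # 2.0, as computed in Example 13

# Law of iterated expectations: E(E(Y | X)) = E(Y).
iterated = sum(cond_exp_Y(x) * p_X[x] for x in xs)
EY = sum(y * p[(x, y)] for x in xs for y in ys)
print(iterated, EY)  # 1.5 1.5
```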