An Introduction to Probability and Simulation

In this section we will introduce certain transformations of random variables for which the expected value of the transformation is the transformation of the expected value. We will also study variance of certain transformations of random variables.

5.6.1 Linear rescaling

A linear rescaling is a transformation of the form \(g(u) = a + bu\). Recall that in Section 3.8.1 we observed, via simulation, how a linear rescaling affects the expected value, standard deviation, and variance of a random variable.

Formally, if \(X\) is a random variable and \(a, b\) are non-random constants then

\[\begin{aligned} \textrm{E}(aX + b) & = a\textrm{E}(X) + b\\ \textrm{SD}(aX + b) & = |a|\textrm{SD}(X)\\ \textrm{Var}(aX + b) & = a^2\textrm{Var}(X) \end{aligned}\]
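As a quick check, here is a minimal simulation sketch in Python with NumPy. The Uniform(1, 4) spinner and the constants \(a = 2\), \(b = 5\) are arbitrary choices for illustration, not part of the text.

```python
import numpy as np

rng = np.random.default_rng(12345)

# Simulate many values of X; here X is an arbitrary Uniform(1, 4) spinner
x = rng.uniform(1, 4, size=1_000_000)

a, b = 2, 5                              # arbitrary non-random constants
y = a * x + b                            # linear rescaling of X

print(y.mean(), a * x.mean() + b)        # E(aX + b) vs a E(X) + b
print(y.std(),  abs(a) * x.std())        # SD(aX + b) vs |a| SD(X)
print(y.var(),  a**2 * x.var())          # Var(aX + b) vs a^2 Var(X)
```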

5.6.2 Linearity of expected value

Example 5.28 Spin the Uniform(1, 4) spinner twice and let \(U_1\) be the first spin, \(U_2\) the second, and \(X = U_1 + U_2\) the sum.

  1. Find \(\textrm{E}(U_1)\) and \(\textrm{E}(U_2)\).
  2. Find \(\textrm{E}(X)\).
  3. How does \(\textrm{E}(X)\) relate to \(\textrm{E}(U_1)\) and \(\textrm{E}(U_2)\)? Suggest a simpler way of finding \(\textrm{E}(U_1 + U_2)\).
  1. \(\textrm{E}(U_1) = \frac{1+4}{2} = 2.5 = \textrm{E}(U_2)\).
  2. We found the pdf of \(X\) in Example 4.15. Since the pdf of \(X\) is symmetric about \(5\) we should have \(\textrm{E}(X)=5\), which integrating confirms. \[ \textrm{E}(X) = \int_2^5 x \left((x-2)/9\right)dx + \int_5^8 x \left((8-x)/9\right) dx = 5. \]
  3. We see that \(\textrm{E}(U_1+U_2) = 5 = 2.5 + 2.5 = \textrm{E}(U_1) + \textrm{E}(U_2)\). Finding the expected value of each of \(U_1\) and \(U_2\) and adding these two numbers is much easier than finding the pdf of \(U_1+U_2\) and then using the definition of expected value.
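A simulation sketch of this example (the number of repetitions is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(12345)
n_rep = 1_000_000

u1 = rng.uniform(1, 4, size=n_rep)   # first spin of the Uniform(1, 4) spinner
u2 = rng.uniform(1, 4, size=n_rep)   # second spin
x = u1 + u2                          # the sum

# E(U1) and E(U2) should each be about 2.5, and E(X) about 5
print(u1.mean(), u2.mean(), x.mean())
```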

In the previous example, the values \(U_1\) and \(U_2\) came from separate spins so they were unrelated. What about the expected value of \(X+Y\) when \(X\) and \(Y\) are correlated?

Example 5.29 Recall the Colab activity where you simulated pairs of SAT Math ( \(X\) ) and Reading ( \(Y\) ) scores from Bivariate Normal distributions with different correlations. You considered the distribution of the sum \(T=X+Y\) and difference \(D= X - Y\). Did changing the correlation affect the distribution of \(T\)? Of \(D\)? Did changing the correlation affect the expected value of \(T\)? Of \(D\)?

You should have observed that, yes, changing the correlation affected the distribution of \(T\) and \(D\) mainly by changing the degree of variability. However, you should have also observed that the expected value of \(T\) did not change as the correlation changed (after accounting for simulation margin of error). Similarly, the expected value of \(D\) did not change as the correlation changed.

Linearity of expected value. For any two random variables \(X\) and \(Y\), \[\begin{aligned} \textrm{E}(X + Y) & = \textrm{E}(X) + \textrm{E}(Y) \end{aligned}\] That is, the expected value of the sum is the sum of expected values, regardless of how the random variables are related. Therefore, you only need to know the marginal distributions of \(X\) and \(Y\) to find the expected value of their sum. (But keep in mind that the distribution of \(X+Y\) will depend on the joint distribution of \(X\) and \(Y\).)

Linearity of expected value follows from simple arithmetic properties of numbers. Whether in the short run or the long run, \[\begin{aligned} \text{Average of } X + Y & = \text{Average of } X + \text{Average of } Y \end{aligned}\] regardless of the joint distribution of \(X\) and \(Y\). For example, for the two \((X, Y)\) pairs (4, 3) and (2, 1), \[ \text{Average of } X + Y = \frac{(4 + 3) + (2 + 1)}{2} = \frac{4 + 2}{2} + \frac{3 + 1}{2} = \text{Average of } X + \text{Average of } Y. \]
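In code, using the two pairs above:

```python
import numpy as np

x = np.array([4, 2])   # the two X values
y = np.array([3, 1])   # the two Y values

# The average of the sums equals the sum of the averages,
# no matter how the x and y values are paired up.
print((x + y).mean())            # 5.0
print(x.mean() + y.mean())       # 5.0
```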

A linear combination of two random variables \(X\) and \(Y\) is of the form \(aX + bY\), where \(a\) and \(b\) are non-random constants. Combining properties of linear rescaling with linearity of expected value yields the expected value of a linear combination \[ \textrm{E}(aX + bY) = a\textrm{E}(X)+b\textrm{E}(Y) \] For example, \(\textrm{E}(X - Y) = \textrm{E}(X) - \textrm{E}(Y)\). The left side above represents the “long way”: find the distribution of \(aX + bY\), which will depend on the joint distribution of \(X\) and \(Y\), and then use the definition of expected value. The right side is the “short way”: find the expected values of \(X\) and \(Y\), which only requires their marginal distributions, and plug those numbers into the transformation formula. Similar to LOTUS, linearity of expected value provides a way to find the expected value of certain random variables without first finding the distribution of the random variables.

Linearity of expected value extends naturally to more than two random variables.

Example 5.30 Recall the matching problem in Example 5.1. We showed that the expected value of the number of matches \(Y\) is \(\textrm{E}(Y)=1\) when \(n=4\). Now consider a general \(n\): there are \(n\) rocks that are shuffled and placed uniformly at random in \(n\) spots with one rock per spot. Let \(Y\) be the number of matches. Can you find a general formula for \(\textrm{E}(Y)\)?

  1. How do you think \(\textrm{E}(Y)\) depends on \(n\)?
  2. Recall the indicator random variables from Example 2.29. Let \(\textrm{I}_1\) be the indicator that rock 1 is placed correctly in spot 1. Find \(\textrm{E}(\textrm{I}_1)\).
  3. Let \(\textrm{I}_i\) be the indicator that rock \(i\) is placed correctly in spot \(i\), \(i=1, \ldots, n\). Find \(\textrm{E}(\textrm{I}_i)\).
  4. What is the relationship between \(Y\) and \(\textrm{I}_1, \ldots, \textrm{I}_n\)?
  5. Find \(\textrm{E}(Y)\). Be amazed.
  1. There are two common guesses. (1) As \(n\) increases, there are more chances for a match, so maybe \(\textrm{E}(Y)\) increases with \(n\). (2) But as \(n\) increases the chance that any particular rock goes in the correct spot decreases, so maybe \(\textrm{E}(Y)\) decreases with \(n\). These considerations move \(\textrm{E}(Y)\) in opposite directions; how do they balance?
  2. Recall that the expected value of an indicator random variable is just the probability of the corresponding event. There are \(n\) rocks which are equally likely to be placed in spot 1, only 1 of which is correct. The probability that rock 1 is correctly placed in spot 1 is \(1/n\). That is, \(\textrm{E}(\textrm{I}_1) = 1/n\).
  3. If the rocks are placed uniformly at random then no rock is more or less likely than any other to be placed in its correct spot, so the probability and expected value should be the same for all \(i\). Given any spot \(i\), any of the \(n\) rocks is equally likely to be placed in spot \(i\), and only one of those is the correct rock, so \(\textrm{P}(\textrm{I}_i = 1) = 1/n\), and \(\textrm{E}(\textrm{I}_i) = 1/n\).
  4. Recall Section 2.3.4. We can count the total number of matches by incrementally adding 1 to our counter each time rock \(i\) matches spot \(i\) for \(i=1, \ldots, n\). That is, the total number of matches is the sum of the indicator random variables: \(Y=\textrm{I}_1 + \cdots + \textrm{I}_n\).
  5. Use linearity of expected value: \[\begin{aligned} \textrm{E}(Y) & = \textrm{E}(\textrm{I}_1 + \textrm{I}_2 + \cdots + \textrm{I}_n)\\ & = \textrm{E}(\textrm{I}_1) + \textrm{E}(\textrm{I}_2) + \cdots + \textrm{E}(\textrm{I}_n)\\ & = \frac{1}{n} + \frac{1}{n} + \cdots + \frac{1}{n}\\ & = n\left(\frac{1}{n}\right) = 1 \end{aligned}\]

The answer to the previous problem is not an approximation: the expected value of the number of matches is equal to 1 for any \(n\) . We think that’s pretty amazing. (We’ll see some even more amazing results for this problem LATER.) Notice that we computed the expected value without first finding the distribution of \(Y\) .
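A simulation sketch of the matching problem (the values of \(n\) and the number of repetitions below are arbitrary choices) shows the long run average number of matches hovering near 1 for every \(n\):

```python
import numpy as np

rng = np.random.default_rng(12345)

def average_matches(n, reps=10_000):
    """Average number of rocks that land in their own spot over many random shuffles."""
    matches = np.empty(reps)
    for r in range(reps):
        placement = rng.permutation(n)                   # placement[i] = spot where rock i ends up
        matches[r] = np.sum(placement == np.arange(n))   # rock i matches when placement[i] == i
    return matches.mean()

for n in [4, 10, 52, 100]:
    print(n, average_matches(n))   # each average should be close to 1
```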

Intuitively, if the rocks are placed in the spots uniformly at random, then the probability that rock \(i\) is placed in the correct spot should be the same for all the rocks, \(1/n\). But you might have said: “but if rock 1 goes in spot 1, there are only \(n-1\) rocks that can go in spot 2, so the probability that rock 2 goes in spot 2 is \(1/(n-1)\)”. That is true if rock 1 goes in spot 1. However, when computing the marginal probability that rock 2 goes in spot 2, we don’t know whether rock 1 went in spot 1 or not, so the probability needs to account for both cases. There is a difference between marginal/unconditional probability and conditional probability, which we will discuss in more detail LATER.

When a problem asks “find the expected number of…” it’s a good idea to try using indicator random variables and linearity of expected value.

Let \(A_1, A_2, \ldots, A_n\) be a collection of \(n\) events. Suppose event \(A_i\) occurs with marginal probability \(p_i\). Let \(N = \textrm{I}_{A_1} + \textrm{I}_{A_2} + \cdots + \textrm{I}_{A_n}\) be the random variable which counts the number of the events in the collection which occur. Then the expected number of events that occur is the sum of the event probabilities. \[ \textrm{E}(N) = \sum_{i=1}^n p_i. \] If each event has the same probability, \(p_i \equiv p\), then \(\textrm{E}(N)\) is equal to \(np\). These formulas for the expected number of events are true regardless of whether there is any association between the events (that is, regardless of whether the events are independent).

Example 5.31 Kids wake up during the night. On any given night, the five kids wake up with marginal probabilities 1/14, 2/7, 1/30, 1/2, and 6/7.

If any kid wakes up they’re likely to wake other kids up too. Find the expected number of kids that wake up on any given night.

Simply add the probabilities: \(1/14 + 2/7 + 1/30 + 1/2 + 6/7 \approx 1.75\). The expected number of kids to wake up in a night is about 1.75. Over many nights, on average about 1.75 kids wake up per night.

The fact that kids wake each other up implies that the events are not independent, but this is irrelevant here. Because of linearity of expected value, we only need to know the marginal probability81 of each event (provided) in order to determine the expected number of events that occur. (The distribution of the number of kids that wake up would depend on the relationships between the events, but not the long run average value.)
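To see that dependence between the events does not change the expected count, here is a simulation sketch. The particular dependence structure is hypothetical (invented for illustration): one kid's waking makes another more likely to wake, but every kid keeps the stated marginal probability.

```python
import numpy as np

rng = np.random.default_rng(12345)
reps = 1_000_000

# Kid A wakes with probability 1/2.
kid_a = rng.random(reps) < 1/2

# Kid B is more likely to wake when kid A is awake, but the marginal
# probability is still (1/2)(1/8) + (1/2)(1/56) = 1/14.
kid_b = np.where(kid_a, rng.random(reps) < 1/8, rng.random(reps) < 1/56)

# The remaining kids wake independently with the other stated probabilities.
others = [rng.random(reps) < p for p in [2/7, 1/30, 6/7]]

n_awake = kid_a.astype(int) + kid_b.astype(int) + sum(o.astype(int) for o in others)
print(sum([1/14, 2/7, 1/30, 1/2, 6/7]))   # exact expected value, about 1.75
print(n_awake.mean())                     # simulated average, close to the same value
```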

5.6.3 Variance of linear combinations of random variables

Example 5.32 Consider a random variable \(X\) with \(\textrm{Var}(X)=1\). What is \(\textrm{Var}(2X)\)? Walt says the answer is \(2^2\textrm{Var}(X) = 4\); Jesse says the answer is \(\textrm{Var}(X) + \textrm{Var}(X) = 2\). Who is correct?

Walt is correctly using properties of linear rescaling. Jesse is assuming that the variance of a sum is the sum of the variances (treating \(2X\) as \(X + X\)), which is not true in general. We’ll see why below.

When two random variables are correlated, the degree of the association will affect the variability of linear combinations of the two variables.

Example 5.33 Recall the Colab activity where you simulated pairs of SAT Math ( \(X\) ) and Reading ( \(Y\) ) scores from Bivariate Normal distributions with different correlations. (See also Section 3.9.) You considered the distribution of the sum \(T=X+Y\) and difference \(D= X - Y\). Did changing the correlation affect the variance of \(T\)? Of \(D\)?

Variance of sums and differences of random variables. \[\begin{aligned} \textrm{Var}(X + Y) & = \textrm{Var}(X) + \textrm{Var}(Y) + 2\textrm{Cov}(X, Y)\\ \textrm{Var}(X - Y) & = \textrm{Var}(X) + \textrm{Var}(Y) - 2\textrm{Cov}(X, Y) \end{aligned}\]
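A simulation sketch checking these formulas (the means, standard deviations, and correlation below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(12345)

sd_x, sd_y, corr = 3, 4, 0.5
cov = corr * sd_x * sd_y
cov_matrix = [[sd_x**2, cov], [cov, sd_y**2]]

# Simulate (X, Y) pairs from a Bivariate Normal distribution
xy = rng.multivariate_normal([0, 0], cov_matrix, size=1_000_000)
x, y = xy[:, 0], xy[:, 1]

print((x + y).var(), sd_x**2 + sd_y**2 + 2 * cov)   # Var(X + Y)
print((x - y).var(), sd_x**2 + sd_y**2 - 2 * cov)   # Var(X - Y)
```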

Example 5.34 Assume that SAT Math ( \(X\) ) and Reading ( \(Y\) ) follow a Bivariate Normal distribution, Math scores have mean 527 and standard deviation 107, and Reading scores have mean 533 and standard deviation 100. Compute \(\textrm{E}(X + Y)\) and \(\textrm{SD}(X+Y)\) for each of the following correlations.

  1. \(\textrm{Corr}(X, Y) = 0.77\)
  2. \(\textrm{Corr}(X, Y) = 0.40\)
  3. \(\textrm{Corr}(X, Y) = 0\)
  4. \(\textrm{Corr}(X, Y) = -0.77\)
  1. \(\textrm{Cov}(X, Y) = \textrm{Corr}(X, Y)\textrm{SD}(X)\textrm{SD}(Y) = 0.77(107)(100) = 8239\). By linearity, \(\textrm{E}(X+Y) = 527 + 533 = 1060\). Then \(\textrm{Var}(X+Y) = 107^2 + 100^2 + 2(8239) = 37927\), so \(\textrm{SD}(X+Y) = \sqrt{37927} \approx 194.7\).
  2. \(\textrm{Cov}(X, Y) = 0.40(107)(100) = 4280\). Again \(\textrm{E}(X+Y) = 1060\). \(\textrm{Var}(X+Y) = 107^2 + 100^2 + 2(4280) = 30009\), so \(\textrm{SD}(X+Y) \approx 173.2\).
  3. \(\textrm{Cov}(X, Y) = 0\). Again \(\textrm{E}(X+Y) = 1060\). \(\textrm{Var}(X+Y) = 107^2 + 100^2 = 21449\), so \(\textrm{SD}(X+Y) \approx 146.5\).
  4. \(\textrm{Cov}(X, Y) = -0.77(107)(100) = -8239\). Again \(\textrm{E}(X+Y) = 1060\). \(\textrm{Var}(X+Y) = 107^2 + 100^2 - 2(8239) = 4971\), so \(\textrm{SD}(X+Y) \approx 70.5\). In every case the expected value of the sum is the same; only the variability changes with the correlation.
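The four cases can also be organized in a few lines of code (a sketch of the calculations above):

```python
import numpy as np

mean_x, sd_x = 527, 107   # SAT Math
mean_y, sd_y = 533, 100   # SAT Reading

for corr in [0.77, 0.40, 0, -0.77]:
    cov = corr * sd_x * sd_y
    var_sum = sd_x**2 + sd_y**2 + 2 * cov
    # E(X + Y) is 1060 in every case; only the variability changes
    print(corr, mean_x + mean_y, var_sum, round(np.sqrt(var_sum), 1))
```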

If \(X\) and \(Y\) have a positive correlation, then \(\textrm{Var}(X+Y) > \textrm{Var}(X) + \textrm{Var}(Y)\) and \(\textrm{Var}(X-Y) < \textrm{Var}(X) + \textrm{Var}(Y)\): positive correlation makes the sum more variable and the difference less variable than they would be without correlation.

If \(X\) and \(Y\) have a negative correlation, then \(\textrm{Var}(X+Y) < \textrm{Var}(X) + \textrm{Var}(Y)\) and \(\textrm{Var}(X-Y) > \textrm{Var}(X) + \textrm{Var}(Y)\): negative correlation makes the sum less variable and the difference more variable.

The variance of the sum is the sum of the variances if and only if \(X\) and \(Y\) are uncorrelated. \[\begin{aligned} \textrm{Var}(X+Y) & = \textrm{Var}(X) + \textrm{Var}(Y)\qquad \text{if } X, Y \text{ are uncorrelated}\\ \textrm{Var}(X-Y) & = \textrm{Var}(X) + \textrm{Var}(Y)\qquad \text{if } X, Y \text{ are uncorrelated} \end{aligned}\]

5.6.4 Bilinearity of covariance

The formulas for variance of sums and differences are applications of several more general properties of covariance. Let \(X,Y,U,V\) be random variables and \(a,b,c,d\) be non-random constants. Then

  1. \(\textrm{Cov}(X, Y) = \textrm{Cov}(Y, X)\)
  2. \(\textrm{Cov}(X, X) = \textrm{Var}(X)\)
  3. \(\textrm{Cov}(aX + b, cY + d) = ac\,\textrm{Cov}(X, Y)\)
  4. \(\textrm{Cov}(X + U, Y + V) = \textrm{Cov}(X, Y) + \textrm{Cov}(X, V) + \textrm{Cov}(U, Y) + \textrm{Cov}(U, V)\)

The last two properties together are called bilinearity of covariance. These properties extend naturally to sums involving more than two random variables. To compute the covariance between two sums of random variables, compute the covariance between each component random variable in the first sum and each component random variable in the second sum, and sum these covariances.
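A simulation sketch of the expansion rule (the four random variables below are an arbitrary construction, chosen only so that they are dependent):

```python
import numpy as np

rng = np.random.default_rng(12345)

# Build four dependent random variables from shared standard normal ingredients
z = rng.normal(size=(1_000_000, 4))
x, u = z[:, 0] + z[:, 1], z[:, 1] + z[:, 2]
y, v = z[:, 2] + z[:, 3], z[:, 0] - z[:, 3]

def cov(a, b):
    return np.cov(a, b)[0, 1]

# Cov(X + U, Y + V) expands into the four pairwise covariances
print(cov(x + u, y + v))
print(cov(x, y) + cov(x, v) + cov(u, y) + cov(u, v))
```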

Example 5.35 Let \(X\) be the number of two-point field goals a basketball player makes in a game, \(Y\) the number of three-point field goals made, and \(Z\) the number of free throws made (worth one point each). Assume \(X\), \(Y\), \(Z\) have standard deviations of 2.5, 3.7, 1.8, respectively, and \(\textrm{Corr}(X,Y) = 0.1\), \(\textrm{Corr}(X, Z) = 0.3\), \(\textrm{Corr}(Y,Z) = -0.5\).

  1. Find the standard deviation of the number of field goals in a game (not including free throws).
  2. Find the standard deviation of the total points scored on field goals in a game (not including free throws).
  3. Find the standard deviation of the total points scored in a game.
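One way to set these up computationally is sketched below, assuming (as stated above) that the given values are correlations; the helper functions are an organizational choice for this sketch, not from the text.

```python
import numpy as np

sd = {'X': 2.5, 'Y': 3.7, 'Z': 1.8}
corr = {('X', 'Y'): 0.1, ('X', 'Z'): 0.3, ('Y', 'Z'): -0.5}

def cov(a, b):
    """Covariance between the named variables, from SDs and correlations."""
    if a == b:
        return sd[a] ** 2
    return corr.get((a, b), corr.get((b, a))) * sd[a] * sd[b]

def sd_of_combination(coefs):
    """SD of the linear combination sum of coefs[v] * v, using bilinearity of covariance."""
    var = sum(ca * cb * cov(a, b)
              for a, ca in coefs.items() for b, cb in coefs.items())
    return np.sqrt(var)

print(sd_of_combination({'X': 1, 'Y': 1}))          # 1. number of field goals, X + Y
print(sd_of_combination({'X': 2, 'Y': 3}))          # 2. points on field goals, 2X + 3Y
print(sd_of_combination({'X': 2, 'Y': 3, 'Z': 1}))  # 3. total points, 2X + 3Y + Z
```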
81. If there were too much dependence, then the provided marginal probabilities might not be possible. For example, if Slim always wakes up all the other kids, then the other marginal probabilities would have to be at least 6/7. So a specified set of marginal probabilities puts some limits on how much dependence there can be. This idea is similar to Example 1.10.↩︎