##### Chapter 1: Introduction to Probability

Introduction to the first two Chapters:

You are the product of a random universe. From the Big Bang to your own conception and birth, random events have determined who we are as a species, who you are as a person, and much of your experience to date. It is ironic, therefore, that we are not well tuned to understanding the randomness around us, perhaps because millions of years of evolution have cultivated our ability to see regularity, certainty and deterministic cause-and-effect in the events and environment about us. We are good at finding patterns in numbers and symbols, or at relating the eating of certain plants with illness and others with a healthy meal. In many areas, such as mathematics or logic, we assume we know the results of certain processes with certainty (e.g., 2+3=5), though even these are often subject to assumed axioms. Most of the real world, however, from the biological sciences to quantum physics, involves variability and uncertainty. For example, it is uncertain whether it will rain tomorrow; the price of a given stock a week from today is uncertain; the number of claims that a car insurance policy holder will make over a one-year period is uncertain. Uncertainty or "randomness" (i.e., variability of results) is usually due to some mixture of at least two factors: (1) variability in populations consisting of animate or inanimate objects (e.g., people vary in size, weight, blood type, etc.), and (2) variability in processes or phenomena (e.g., the random selection of 6 numbers from 49 in a lottery draw can lead to a very large number of different outcomes). Which of these would you use to describe the fluctuations in stock prices or currency exchange rates?

Variability and uncertainty in a system make it more difficult to plan or to make decisions without suitable tools. We cannot eliminate uncertainty, but it is usually possible to describe, quantify and deal with variability and uncertainty using the theory of probability. This course develops both the mathematical theory and some of the applications of probability. The applications of this methodology are far-reaching, from finance to the life sciences, from the analysis of computer algorithms to the simulation of queues and networks or the spread of epidemics. Of course, we do not have the time in this course to develop these applications in detail, but some of the end-of-chapter problems will give a hint of the extraordinary range of application of the mathematical theory of probability and statistics.

It seems logical to begin by defining probability. People have attempted to do this by giving definitions that reflect the uncertainty about whether some specified outcome or "event" will occur in a given setting. The setting is often termed an "experiment" or "process" for the sake of discussion. We often consider simple "toy" examples: it is uncertain whether the number 2 will turn up when a 6-sided die is rolled. It is similarly uncertain whether the Canadian dollar will be higher tomorrow, relative to the U.S. dollar, than it is today. So one step in defining probability requires envisioning a random experiment with a number of possible outcomes. We refer to the set of all possible distinct outcomes of a random experiment as the sample space (usually denoted by S). Groups or sets of outcomes of possible interest, that is, subsets of the sample space, we will call events. Then we might define probability in three different ways:

1. The classical definition: The probability of some event is

(number of ways the event can occur) / (number of outcomes in S),

provided all points in the sample space S are equally likely. For example, when a die is rolled the probability of getting a 2 is (1/6) because one of the six faces is a 2.

2. The relative frequency definition: The probability of an event is the (limiting) proportion (or fraction) of times the event occurs in a very long series of repetitions of an experiment or process. For example, this definition could be used to argue that the probability of getting a 2 from a rolled die is (1/6).

3. The subjective probability definition: The probability of an event is a measure of how sure the person making the statement is that the event will happen. For example, after considering all available data, a weather forecaster might say that the probability of rain today is 30% or 0.3.
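The classical and relative frequency definitions can be compared numerically for the die example. The following is a minimal sketch, assuming that Python's pseudo-random generator is an acceptable stand-in for a fair die; the function name `relative_frequency`, the seed and the numbers of rolls are illustrative choices, not part of these notes.

```python
import random

def relative_frequency(target, n_rolls, seed=0):
    """Proportion of n_rolls simulated die rolls that show `target`."""
    rng = random.Random(seed)  # fixed seed so that a run can be reproduced
    hits = sum(1 for _ in range(n_rolls) if rng.randint(1, 6) == target)
    return hits / n_rolls

classical = 1 / 6  # one favourable face out of six equally likely faces

# The observed proportion of 2s should settle near 1/6 as the series gets longer.
for n in (100, 10_000, 200_000):
    print(n, relative_frequency(2, n), classical)
```

No finite run produces the limiting proportion itself; the simulation only suggests what that limit might be.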

Unfortunately, all three of these definitions have serious limitations.

Classical Definition: What does "equally likely" mean? This appears to use the concept of probability while trying to define it! We could remove the phrase "provided all outcomes are equally likely", but then the definition would clearly be unusable in many settings where the outcomes in S did not tend to occur equally often.

Relative Frequency Definition: Since we can never repeat an experiment or process indefinitely, we can never know the probability of any event from the relative frequency definition. In many cases we can't even obtain a long series of repetitions due to time, cost, or other limitations. For example, the probability of rain today can't really be obtained by the relative frequency definition since today can't be repeated again under identical conditions. Intuitively, however, if a probability is correct, we expect it to be close to the relative frequency when the experiment is repeated many times.

Subjective Probability: This definition gives no rational basis for people to agree on a right answer, and thus would disqualify probability as an objective science. Are everyone's opinions equally valid, or should we only consult "experts"? There is some controversy about when, if ever, to use subjective probability except for personal decision-making, but it does play a part in a branch of Statistics that is often called "Bayesian Statistics". This will not be discussed in Stat 230, but it is a common and useful method for updating subjective probabilities with objective experimental results.

The difficulties in producing a satisfactory definition can be overcome by treating probability as a mathematical system defined by a set of axioms. We do not worry about the numerical values of probabilities until we consider a specific application. This is consistent with the way that other branches of mathematics are defined and then used in specific applications (e.g., the way calculus and real-valued functions are used to model and describe the physics of gravity and motion).

The mathematical approach that we will develop and use in the remaining chapters is based on the following description of a probability model:

•  a sample space of all possible outcomes of a random experiment is defined

•  a set of events, subsets of the sample space to which we can assign probabilities, is defined

•  a mechanism for assigning probabilities (numbers between 0 and 1) to events is specified.

Of course in a given run of the random experiment, a particular event may or may not occur.

In order to understand the material in these notes, you may need to review your understanding of basic counting arguments and elementary set theory, as well as some of the important series that you have encountered in Calculus, which provide a basis for some of the distributions discussed in these notes. In the next chapter, we begin a more mathematical description of probability theory.

Consider some phenomenon or process which is repeatable, at least in theory, and suppose that certain events or outcomes A₁,A₂,A₃,… are defined. We will often term the phenomenon or process an "experiment" and refer to a single repetition of the experiment as a "trial". The probability of an event A, denoted P(A), is a number between 0 and 1. For probability to be a useful mathematical concept, it should possess some other properties. For example, if our "experiment" consists of tossing a coin with two sides, Head and Tail, then we might wish to consider the two events A₁="Head turns up" and A₂="Tail turns up". It does not make much sense to allow P(A₁)=0.6 and P(A₂)=0.6, so that P(A₁)+P(A₂)>1. (Why is this so? Is there a fundamental reason or have we simply adopted 1 as a convenient scale?) To avoid this sort of thing we begin with the following definition.

Definition:  A sample space S is a set of distinct outcomes for an experiment or process, with the property that in a single trial, one and only one of these outcomes occurs.

The outcomes that make up the sample space are sometimes called "sample points" or just "points". A sample space is defined as part of the probability model in a given setting, but it is not necessarily uniquely defined, as the following example shows.

Example: Roll a 6-sided die, and define the events aᵢ = "top face is i", for i=1,2,3,4,5,6.

Then we could take the sample space as S={a₁,a₂,a₃,a₄,a₅,a₆}. (Note that we use curly brackets "{...}" to indicate the elements of a set.) Instead of using this definition of the sample space, we could define events with words:

E is the event that an even number turns up

O is the event that an odd number turns up

and take S={E,O}. Both sample spaces satisfy the definition. Which one we use depends on what we want to use the probability model for. If we expect never to have to consider events like "a number less than 3 turns up" then the space S={E,O} will suffice, but in most cases, if possible, we choose sample points that are the smallest possible or "indivisible". Thus the first sample space is likely preferred in this example.

Sample spaces may be either discrete or non-discrete; S is discrete if it consists of a finite or countably infinite set of simple events. Recall that a countably infinite set is one that can be put in one-to-one correspondence with the positive integers, so for example {(1/2),(1/3),(1/4),(1/5),...} is countably infinite, as is the set of all rational numbers. The two sample spaces in the preceding example are discrete. A sample space S={1,2,3,…} consisting of all the positive integers is discrete, but a sample space S={x:x>0} consisting of all positive real numbers is not. For the next few chapters we consider only discrete sample spaces. For discrete sample spaces it is much easier to specify the class of events to which we may wish to assign probabilities; we will allow all possible subsets of the sample space. For example, if S={a₁,a₂,a₃,a₄,a₅,a₆} is the sample space, then A={a₁,a₂,a₃,a₄}, B={a₆} and S itself are all examples of events.
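To make "all possible subsets of the sample space" concrete, the events of the six-point die sample space can be listed by machine. This is a minimal sketch; the helper name `all_events` and the string labels "a1" to "a6" are assumptions made for illustration.

```python
from itertools import chain, combinations

def all_events(sample_space):
    """Return every subset of a finite sample space as a frozenset."""
    points = list(sample_space)
    subsets = chain.from_iterable(
        combinations(points, r) for r in range(len(points) + 1)
    )
    return [frozenset(s) for s in subsets]

S = ["a1", "a2", "a3", "a4", "a5", "a6"]
events = all_events(S)

print(len(events))                    # 2**6 = 64, counting the empty set and S itself
print(frozenset({"a6"}) in events)    # B = {a6} is one of the events
print(frozenset(S) in events)         # so is S
```

Listing events in this way is only practical for small finite sample spaces, since the number of subsets doubles with each added point.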

Definition:  An event in a discrete sample space is a subset A⊂S. If the event is indivisible, so that it contains only one point, e.g. A₁={a₁}, we call it a simple event. An event A made up of two or more simple events, such as A={a₁,a₂}, is called a compound event.

Our notation will often not distinguish between the point a₁ and the simple event A₁={a₁} which has this point as its only element, although they differ as mathematical objects. When we mean the probability of the event A₁={a₁}, we should write P(A₁) or P({a₁}) but the latter is often shortened to P(a₁). In the case of a discrete sample space it is easy to specify probabilities of events since they are determined by the probabilities of simple events.

Definition:  Let S={a₁,a₂,a₃,…} be a discrete sample space. Then probabilities P(aᵢ) are numbers attached to the aᵢ's (i=1,2,3,…) such that the following two conditions hold:

(1) 0≤P(aᵢ)≤1

(2) ∑P(aᵢ)=1, where the sum is over all i
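The two conditions can be checked mechanically for a finite sample space. The sketch below assumes the probabilities are stored in a dictionary keyed by sample point; the function name `is_probability_distribution` and the tolerance `tol` are illustrative choices, not part of the definition.

```python
import math

def is_probability_distribution(probs, tol=1e-9):
    """Check conditions (1) and (2) for a finite discrete sample space.

    `probs` maps each sample point to its assigned probability.
    """
    in_range = all(0 <= p <= 1 for p in probs.values())                 # condition (1)
    sums_to_one = math.isclose(sum(probs.values()), 1.0, abs_tol=tol)   # condition (2)
    return in_range and sums_to_one

fair_coin = {"Head": 0.5, "Tail": 0.5}
bad_coin = {"Head": 0.6, "Tail": 0.6}  # the assignment rejected in the coin discussion

print(is_probability_distribution(fair_coin))  # True
print(is_probability_distribution(bad_coin))   # False
```

A tolerance is used because floating-point sums are rarely exactly 1; exact arithmetic with fractions would avoid the need for it.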

The above function P(∗) on S, which describes the set of probabilities {P(aᵢ), i=1,2,…}, is called a probability distribution on S. The condition ∑P(aᵢ)=1 above reflects the idea that when the process or experiment happens, one or other of the simple events {aᵢ} in S must occur (recall that the sample space includes all possible outcomes). The probability of a more general event A (not necessarily a simple event) is then defined as follows:

Definition:  The probability P(A) of an event A is the sum of the probabilities of all the simple events that make up A, that is, P(A)=∑P(a), where the sum is over all a∈A.

For example, the probability of the compound event A={a₁,a₂,a₃} is P(a₁)+P(a₂)+P(a₃). Probability theory does not say what numbers to assign to the simple events for a given application, only those properties guaranteeing mathematical consistency. In an actual application of a probability model, we try to specify numerical values of the probabilities that are more or less consistent with the frequencies of events when the experiment is repeated. In other words we try to specify probabilities that are consistent with the real world. There is nothing mathematically wrong with a probability model for a toss of a coin that specifies that the probability of heads is zero, except that it likely won't agree with the frequencies we obtain when the experiment is repeated.
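The definition of P(A) as a sum over simple events translates directly into a short function. In this sketch the distribution on four points is hypothetical, chosen only so that the arithmetic is easy to follow, and the name `event_probability` is an assumption; exact fractions are used to avoid rounding.

```python
from fractions import Fraction

def event_probability(event, probs):
    """P(A): sum of the probabilities of the simple events that make up A."""
    return sum((probs[a] for a in event), Fraction(0))

# Hypothetical distribution on S = {a1, a2, a3, a4}; the values are illustrative only.
probs = {
    "a1": Fraction(1, 2),
    "a2": Fraction(1, 4),
    "a3": Fraction(1, 8),
    "a4": Fraction(1, 8),
}

A = {"a1", "a2", "a3"}
print(event_probability(A, probs))  # 7/8, i.e. P(a1) + P(a2) + P(a3)
```

The function accepts any assignment of numbers, which mirrors the point made above: the theory supplies the rule for combining probabilities, not the values themselves.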

Example: Suppose a 6-sided die is rolled, and let the sample space be S={1,2,3,4,5,6}, where 1 means the top face is 1, and so on. If the die is an ordinary one (a fair die), we would likely define probabilities as

P(i)=1/6 for i=1,2,3,4,5,6,

because if the die were tossed repeatedly by a fair roller (as in some games or gambling situations) then each number would occur close to 1/6 of the time. However, if the die were weighted in some way, or if the roller were able to manipulate the die so that 1 is more likely, these numerical values would not be so useful. To have a useful mathematical model, some degree of compromise or approximation is usually required. Is it likely that the die or the roller are perfectly "fair"? Given P(i), if we wish to consider some compound event, the probability is easily obtained. For example, if A="an even number is obtained" then because A={2,4,6} we get P(A)=P(2)+P(4)+P(6)=1/2.
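The effect of the choice of model on P(A) can be seen by computing the same event under two assignments. This is a sketch only: the fair model is the one given above, while the weighted model (face 1 given probability 1/4 and the other faces 3/20 each) is a made-up assumption used purely for contrast.

```python
from fractions import Fraction

def prob(event, model):
    """Add the probabilities of the simple events in `event` under `model`."""
    return sum((model[i] for i in event), Fraction(0))

fair = {i: Fraction(1, 6) for i in range(1, 7)}

# Hypothetical weighted die: face 1 is favoured. These numbers are assumptions.
weighted = {1: Fraction(1, 4), **{i: Fraction(3, 20) for i in range(2, 7)}}

A = {2, 4, 6}  # "an even number is obtained"
for name, model in (("fair", fair), ("weighted", weighted)):
    assert sum(model.values()) == 1  # each model is a valid probability distribution
    print(name, prob(A, model))      # fair 1/2, then weighted 9/20
```

Both models are mathematically consistent; which one is more useful depends on how well it matches the die and the roller in question.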

We now consider some additional examples, starting with some simple "toy" problems involving cards, coins and dice. Once again, to calculate probabilities for discrete sample spaces, we usually approach a given problem using three steps:

(1) Specify a sample space S.

(2) Assign numerical probabilities to the simple events in S.

(3) For any compound event A, find P(A) by adding the probabilities of all the simple events that make up A.
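The three steps above can be sketched for a small coin problem. The experiment (two tosses of a coin), the assumption that the four outcomes are equally likely, and the event "exactly one head" are illustrative choices and are not taken from these notes.

```python
from fractions import Fraction
from itertools import product

# Step 1: specify a sample space for two tosses of a coin.
S = ["".join(t) for t in product("HT", repeat=2)]  # ['HH', 'HT', 'TH', 'TT']

# Step 2: assign probabilities to the simple events (assumed equally likely here).
P = {s: Fraction(1, len(S)) for s in S}

# Step 3: add the probabilities of the simple events that make up
# the compound event A = "exactly one head".
A = [s for s in S if s.count("H") == 1]
p_A = sum(P[s] for s in A)

print(A, p_A)  # ['HT', 'TH'] 1/2
```

Only step (2) involves a modelling judgment; steps (1) and (3) are bookkeeping once the outcomes have been chosen.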

Later we will discover that having a detailed specification or list of the elements of the sample space may be difficult. Indeed in many cases the sample space is so large that at best we can describe it in words. For the present we will solve problems that are stated as "Find the probability that ..." by carrying out step (2) above, assigning probabilities that we expect should reflect the long run relative frequencies of the simple events in repeated trials, and then summing these probabilities to obtain P(A).
