COMP11120 Mathematical Techniques for Computer Science
Chapters 0–2, 4–5
Andrea Schalk
A.Schalk@manchester.ac.uk
30th August 2021

Contents

0 Basics
  0.1 Numbers
  0.2 Sets
  0.3 Functions
  0.4 Relations
1 Complex Numbers
  1.1 Basic definitions
  1.2 Operations
  1.3 Properties
  1.4 Applications
2 Statements and Proofs
  2.1 Motivation
  2.2 Precision
  2.3 Properties of numbers
  2.4 Properties of Sets and their Operations
  2.5 Properties of Operations
  2.6 Properties of functions
3 Formal Logic Systems
4 Probability Theory
  4.1 Analysing probability questions
  4.2 Axioms for probability
  4.3 Conditional probabilities and independence
  4.4 Random variables
  4.5 Averages for algorithms
  4.6 Some selected well-studied distributions
5 Comparing sets and functions
  5.1 Comparing functions
  5.2 Comparing sets
6 Recursion and Induction
  6.1 Lists
  6.2 Trees
  6.3 Syntax
  6.4 The natural numbers
  6.5 Further properties with inductive proofs
  6.6 More on induction
7 Relations
  7.1 General relations
  7.2 Partial functions
  7.3 Binary relations
  7.4 Equivalence relations
  7.5 Partial orders
8 Applications to Computer Science
Glossary
Exercise Sheets

Chapter 0
Basics

This chapter explains some concepts, most of which you should have encountered before coming to university; but you may not have been given formal descriptions previously. These notions give us a starting point so that we have examples for the formal development that follows from Chapter 1 onwards, but note that some of the concepts and properties that appear in this chapter are put on a formal footing subsequently. Whenever you find concepts used in the notes that have familiar names, you should check this chapter to ensure that you only use the facts provided here. There will be no lectures about the material in this chapter, but the examples classes in Week 1 are there to make sure you understand the ideas and the notation used here. Note that the language described here is universally accepted, and you will also encounter it in other course units.
Note that we here assume that certain collections of numbers, with various operations, have already been defined. You will see formal definitions of most of these (real numbers being the exception) in Chapter 6, which we will study in Semester 2. The purpose of assuming they are present at the start is to allow us to use them as examples.

0.1 Numbers

Naively speaking, numbers are entities we often use when we wish to calculate something. Mathematically speaking, there is typically rather more going on: numbers are sets with operations, and these operations have particular properties. Many of these properties are named and studied in Chapter 2.

0.1.1 Natural numbers

The natural numbers are often also referred to as counting numbers, and the collection of all of them is typically written as N. For the time being we assume that you know what these numbers are; a formal definition appears as Definition 50 in Chapter 6. Foreshadowing the formal definition, we point out that the simplest way of formally describing the natural numbers is to say that

• there is a natural number 0 and
• given a natural number n there is another natural number Sn, the successor of n, more usually written as n + 1.

Every natural number can be generated in this way, although to reach 123456, for example, one has to apply the successor operation quite a few times! This also means that given a natural number n, we know that one of the following is the case:

• either n = 0 or
• there exists a natural number m with n = Sm (or, if you prefer, n = m + 1).

This might seem like a trivial observation, but it is the basis of using the concept of recursion to define properties or functions for the natural numbers, and also for being able to prove properties by induction. This is described in detail in Section 6.4 of these notes. Here we look at the informal notions you have met at school.

With the natural numbers come some operations we use; their properties are given below.

• Given natural numbers m and n we can add[1] these to get m + n.
• Given natural numbers m and n we can multiply[2] these to get m · n.

You are allowed to use the following facts about natural numbers, except in Section 6.4 where we prove many of these facts formally.

Fact 1
Given k, m, and n in N we have[3]

m + n = n + m                        commutativity of +
(k + m) + n = k + (m + n)            associativity of +
n + 0 = n = 0 + n                    0 unit for +.

For the same variables we also have[4]

m · n = n · m                        commutativity of ·
(k · m) · n = k · (m · n)            associativity of ·
n · 1 = n = 1 · n                    1 unit for ·.

For the same variables we also have the property

k · (m + n) = k · m + k · n          · distributes over +.

For the same variables we also have[5]

k + m = k + n implies m = n.

[1] A formal definition of addition appears in Example 6.31.
[2] A formal definition of this operation appears in Example 6.36.

A mathematician might say that the natural numbers form a commutative monoid with unit 0 when looking at the addition operation, and a commutative monoid with unit 1 when looking at multiplication. In Section 2.5 we look formally at the properties given by these equalities.

There is one additional property we require. The following is used in Euclid's algorithm, see Example 6.42, but also to define integer division, see below, which appears in Chapter 2.

Fact 2
Given n in N and m in N with m ≠ 0 there exist unique numbers k and l in N such that

• 0 ≤ l < m and
• n = km + l.

We use this fact to define a division operation on natural numbers, known as integer division[6]. We define[7] n div m to be the unique number k in N from Fact 2. This is the number of times m divides n (leaving a remainder). We define the remainder for integer division by setting n mod m to be the unique l from Fact 2. This is the remainder n leaves when divided by m. See Code Examples 0.1 and 0.2 to see how these operations are implemented in Python and Java.

Example 0.1. For example, we have that

5 div 2 = 2     and     5 mod 2 = 1
7 div 3 = 2     and     7 mod 3 = 1
9 div 3 = 3     and     9 mod 3 = 0
11 div 4 = 2    and     11 mod 4 = 3.

Example 0.2. We look at two particular cases to see the patterns which develop.
n        0  1  2  3  4  5  6  7  8  9 10 11 12 13
n mod 3  0  1  2  0  1  2  0  1  2  0  1  2  0  1
n div 3  0  0  0  1  1  1  2  2  2  3  3  3  4  4

n        0  1  2  3  4  5  6  7  8  9 10 11 12 13
n mod 7  0  1  2  3  4  5  6  0  1  2  3  4  5  6
n div 7  0  0  0  0  0  0  0  1  1  1  1  1  1  1

[3] Formal proofs of these properties appear in Example 6.35 as well as Exercise 133.
[4] Formal proofs of these as well as the final property are given in Exercise 135.
[5] Note that we cannot subtract within the natural numbers, so this property gives us the strongest statement we have. Mathematicians would say that addition is right (and also left) cancellable.
[6] Sometimes also called Euclidean division.
[7] See the following section to see that this idea can be extended to the integers.

Example 0.3. Note that it is not necessarily the case that m · (n div m) = n, for example

2 · (3 div 2) = 2 · 1 = 2 ≠ 3.

This is different from the way of dividing numbers you may be used to[8] and that is the reason that this kind of division has a different name, and a different symbol.

Lemma 0.1
For all natural numbers n and m, where m ≠ 0, we have

n = m · (n div m) + (n mod m).

Exercise 1. Give an argument that Lemma 0.1 is valid using Fact 2.

Definition 1: divisible
Given natural numbers m ≠ 0 and n, we say that n is divisible by m, or that m divides n, if and only if there exists a natural number k such that k · m = n.

Note that m divides n if and only if it is the case that n mod m = 0.

Definition 2: even/odd
A natural number n is even if and only if n is divisible by 2. Such a number is odd if and only if it is not divisible by 2.

This means that n is even if and only if n mod 2 = 0, and that n is odd if and only if n mod 2 = 1. Note in particular that 0 is an even number.

We might also want to think about which equations we can solve in the natural numbers. Assume that k and n are elements of N. For example, we can solve

k + x = n

within N, provided that[9] k is less than or equal to n, which we write as k ≤ n. We can also solve

k · x = n

within N provided that n mod k = 0.

[8] See for example the discussion in Section 0.1.3 on rational numbers below.
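The unique k and l of Fact 2 can be found by repeatedly taking m away from n, counting the steps. The following minimal Python sketch illustrates this (the function name nat_divmod is ours, not from the notes; Python's built-in divmod agrees with it for natural numbers):

```python
def nat_divmod(n, m):
    """Return (k, l) with n = k*m + l and 0 <= l < m, as in Fact 2.

    Works by repeatedly taking m away from n, counting the steps.
    Assumes n >= 0 and m > 0 (natural numbers, m != 0).
    """
    if m == 0:
        raise ValueError("m must be non-zero")
    k, l = 0, n
    while l >= m:
        l -= m
        k += 1
    return k, l

# The values from Example 0.1:
assert nat_divmod(5, 2) == (2, 1)
assert nat_divmod(7, 3) == (2, 1)
assert nat_divmod(9, 3) == (3, 0)
assert nat_divmod(11, 4) == (2, 3)
```

This is slow for large n, but it mirrors the existence part of Fact 2 directly: the loop invariant n = k·m + l holds at every step, and the loop stops exactly when 0 ≤ l < m.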
Because of the side conditions required, we see that a lot of equations we can write down using the available operations do not have a solution.

We can use the natural numbers to count something, for example the number of instructions in a computer program, or the number of times a program will carry out the body of a loop. This is important to do when we are trying to estimate how long it may take a program to run on a large-size problem.

There are a lot of natural numbers, namely infinitely many. But by mathematical standards the natural numbers are the smallest infinite set, and there are substantially larger ones. Sets of this size are said to be countably infinite. This is formally defined in Section 5.2.

Computer languages typically do not implement the natural numbers; instead, a programming language will have support for all natural numbers up to a particular maximum. Nothing truly infinite can be implemented in any real-world computer (but there are theoretical computation devices which have infinite storage). Quite often programming languages have a built-in type for integers instead of natural numbers, as is the case with Python and Java.

0.1.2 Integers

A simple way of explaining the integers is that one wants to expand the natural numbers in order to make it possible for every number to have an inverse with respect to addition, that is, for every number n there is a number m, usually written as −n, with the property that

n + m = 0 = m + n.

Defining the integers formally in a way that supports the above idea is quite tricky. Such a description is given in Chapter 7, see Definition 58. It's fairly easy to describe the elements of this set, called[10] Z, once one has the natural numbers, since one can[11] say

Z = N ∪ {−n | n in N, n ≠ 0},

but this does not tell us anything about how to calculate with these numbers. So this does not, mathematically speaking, define the integers with all the operations we customarily use for them.
The absolute value, |n|, of an integer n is defined to be[12]

• n if n is greater than or equal to 0 and
• −n if n is less than 0.

[9] The solution to such an equation would have to satisfy x = n − k, and this is not always defined.
[10] The notation Z for the set of integers is very common within mathematics, the letter coming from the German word 'Zahlen', or numbers. You may know this set under a different name, but that should not worry you.
[11] The following expression uses symbols explained in detail in Section 0.2.
[12] See Example 0.32 for a definition of this as a function, although that definition is for real numbers.

We (very rarely) use Z+ to refer to those integers[13] which are greater than or equal to 0.

Fact 3
The equalities from Fact 1 also hold[14] if the variables are elements of Z. We have an additional property, namely, for every n in Z there exists a unique m in Z with

n + m = 0 = m + n.

We say that this number is the additive inverse for n with respect to addition. The number −n is defined to be the additive inverse of n.

A mathematician would say that Z forms a commutative ring with multiplicative unit 1.

Many people use subtraction as an operation. However, it is much preferable to think of this not as an operation, but of m − n as a shortcut for adding the additive inverse of n, that is −n, to m; in other words, this is merely a shortcut for m + (−n). Please do not talk about subtraction on this course unit, but about adding additive inverses. There are many situations in mathematics where not all inverses exist,[15] and so you should pause to think whether the operation you wish to carry out is legal.

Fact 2 changes a bit when we use it for integers.

Fact 4
Given n in Z and m in Z with m ≠ 0 there exist unique numbers k and l in Z such that

• 0 ≤ l < |m| and
• n = km + l.

Hence we may extend the definitions of the operations mod and div, which come from integer division for natural numbers as defined above, to the integers. In other words, for integers n and m,

• n div m is the unique k, and
• n mod m is the unique l,

from the above fact.
[13] This set of numbers is, of course, equivalent to N.
[14] The formal proof that addition satisfies these properties appears in Section 7.4.7, and Exercise 168 provides proof that multiplication satisfies them.
[15] For example, for the rational, real and complex (see Chapter 1) numbers, the number 0 has no multiplicative inverse. When you study matrices you will see that very few matrices have multiplicative inverses.

Example 0.4. We have that

−5 div 2 = −3     and     −5 mod 2 = 1
7 div −3 = −2     and     7 mod −3 = 1
9 div −3 = −3     and     9 mod −3 = 0
−11 div 4 = −3    and     −11 mod 4 = 1.

Example 0.5. Once again we look at two particular cases to see the patterns which develop.

n        −5 −4 −3 −2 −1  0  1  2  3  4  5  6  7
n mod 3   1  2  0  1  2  0  1  2  0  1  2  0  1
n div 3  −2 −2 −1 −1 −1  0  0  0  1  1  1  2  2

n        −5 −4 −3 −2 −1  0  1  2  3  4  5  6  7
n mod 7   2  3  4  5  6  0  1  2  3  4  5  6  0
n div 7  −1 −1 −1 −1 −1  0  0  0  0  0  0  0  1

Lemma 0.2
For all integers n, and all integers m ≠ 0, we have

n = m · (n div m) + (n mod m).

The notions of evenness and oddness transfer with the same definitions as for natural numbers. Indeed, the definitions given below can be applied to natural numbers viewed as integers, and they will give the same result as the corresponding definition from the previous section.

Definition 3: divisible
Given integers m ≠ 0 and n we say that n is divisible by m, or that m divides n, if and only if there exists an integer k such that k · m = n.

Note that m divides n if and only if n mod m = 0.

Exercise 2. Use Fact 4 and the formal definition of divisibility and mod to argue that the previous sentence is correct.

Definition 4: even/odd
An integer n is even if and only if n is divisible by 2. Such a number is odd if and only if it is not divisible by 2.

Exercise 3. Use Fact 4 and the formal definition of mod to argue that a natural number n is even if and only if n mod 2 = 0, and odd if and only if n mod 2 = 1. How do the even numbers relate to those numbers which are a multiple of 2? Can you make your answer formal? Do your answers change if n is an integer?
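The k and l of Fact 4 can be recovered in Python even though the built-in // and % operators do not always agree with the mathematical definition (see Code Example 0.1). A sketch, using only the fact that Python's n % abs(m) is always in the range 0 ≤ l < |m| (the function name int_divmod is ours):

```python
def int_divmod(n, m):
    """Return (k, l) with n = k*m + l and 0 <= l < |m|, as in Fact 4."""
    if m == 0:
        raise ValueError("m must be non-zero")
    l = n % abs(m)      # Python guarantees 0 <= n % abs(m) < |m|
    k = (n - l) // m    # exact division, since m divides n - l
    return k, l

# The values from Example 0.4:
assert int_divmod(-5, 2) == (-3, 1)
assert int_divmod(7, -3) == (-2, 1)
assert int_divmod(9, -3) == (-3, 0)
assert int_divmod(-11, 4) == (-3, 1)
```

Note that for a negative divisor this differs from Python's own divmod: divmod(7, -3) gives (-3, -2), whereas Fact 4 requires the remainder to be non-negative.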
The fact that every number has an additive inverse means that for k and n in Z we can solve all equations of the form

k + x = n

within Z without reservations. Indeed, every equation in one variable which involves addition and additive inverses has a unique solution. On the other hand, equations of the form

k · x = n

are still not all[16] solvable within Z.

If we accept that there are infinitely many natural numbers then it is clear that there are also infinitely many integers. Because the natural numbers are embedded inside the integers one might assume that there are more of the latter, but actually, this is not a sensible notion of size for sets. Mathematically speaking, N and Z have the same size, see Section 5.2 for details of what that statement means.

Many programming languages support a data type for the integers. However, only finitely many of them are represented. In Python or Java, for example, integers are given by the primitive type int, and in Java they range from −2^31 to 2^31 − 1. In Python there is a type long of long integers, which are integers of unlimited size.

Code Example 0.1. In Python there is an implementation of integer division. However, it does not implement our definition when faced with negative numbers. There are Python expressions n // m and n % m with the property that

m * (n // m) + (n % m) = n.

However, the implementation does not force n % m to be non-negative, and so if you use the Python commands to play with integer division you will see results that are misleading as far as the underlying mathematics is concerned. Also see the following example for Java showing that programmers prefer to implement something different from the mathematicians' definition.

Code Example 0.2. In Java integer division is also implemented. Here is a procedure that returns the result of dividing n by m (as integers).

public static int intdiv (int n, int m) {
    return n/m;
}

Similarly there is an implementation of the remainder of dividing n by m.
public static int intmod (int n, int m) {
    return n % m;
}

[16] The solution would have to satisfy x = n/k, and this is not defined for all n and k.

Note, however, that this does not return the number l that appears in our definition: if n is negative then n % m is a negative number. The way Java implements the two operations ensures that they satisfy Lemma 0.2, that is

n = m*(n/m) + n%m.

The result of the Java expression n%m is 'equivalent modulo m' to the result of n mod m, see Section 7.4.5. This means that for negative n you can get n mod m by adding m to n%m.

In the programming language C the language specification does not state what the smallest and greatest possible integers are; different compilers have different implementations here. You have to work out what is safe to use in your system.

0.1.3 Rational numbers

One can view the rational numbers, usually written[17] as Q, as the numbers required if one wants to have a multiplicative inverse for every number other than 0. But again, giving a formal definition of these numbers is not straightforward if one wants to ensure that all the previous operations are available.

One way of talking about the rational numbers is to introduce the notion of a fraction, written as k/m, where k and m are integers. But we cannot define the rational numbers to be the collection of all fractions, since several fractions may describe the same rational number: we expect 2/4 to describe the same number as 1/2. Formally we have to define a notion of equality (or equivalence) on fractions, whereby

k/m = k′/m′ if and only if k · m′ = k′ · m.

There is a formal definition of the rational numbers, and their addition and multiplication, in Chapter 7, see Definition 59.

We have quite a bit of structure on Q. All the facts for integers still hold, but we get a new property.[18]

Fact 5
The statements from Fact 3 remain true if all variables are taken to be elements of Q. In addition, for all r in Q with r ≠ 0 there exists s in Q such that

r · s = 1 = s · r.

We say that s is the multiplicative inverse for r.
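Python's standard library happens to provide a rational type, fractions.Fraction, which behaves exactly according to this notion of equivalence: fractions are normalised, so equivalent fractions compare equal, and every non-zero value has a multiplicative inverse. A small sketch:

```python
from fractions import Fraction

# 2/4 and 1/2 describe the same rational number:
assert Fraction(2, 4) == Fraction(1, 2)

# The equivalence  k/m = k'/m'  iff  k*m' = k'*m, on an example:
k, m, k2, m2 = 3, 6, 1, 2
assert (Fraction(k, m) == Fraction(k2, m2)) == (k * m2 == k2 * m)

# Every non-zero rational has a multiplicative inverse (Fact 5):
r = Fraction(-7, 3)
assert r * (1 / r) == 1
```

This does not contradict the remark below that mainstream languages lack a rational datatype: Fraction is a library class built on top of integers, not a primitive type of the language.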
Every element r ≠ 0 has a multiplicative inverse, and the standard notation for this element is r^(−1). A mathematician would say that Q with addition and multiplication is a field.

[17] The name comes from the Italian 'quoziente', quotient. We look at why this is in Semester 2, see Section 7.4.
[18] Exercises 168 and 169 provide formal proofs of most of these properties.

In my experience many students do not worry sufficiently about potentially dividing by 0; Fact 5 makes it clear that we are only allowed to divide by numbers unequal to 0. In a recent exam paper a number of students reasoned that k · m = k′ · m′ and m = m′ imply that k = k′, but in making this claim they neglected the case where m = m′ = 0, which makes that conclusion false.

Many people speak of division as an operation on rational (and real) numbers, but again, this is merely a shortcut: writing k/m is an instruction to multiply k with the multiplicative inverse of m, that is, it is a shortcut for k · m^(−1). The number 0 does not have a multiplicative inverse, and that is why division by 0 is not allowed. In this course unit, please try not to refer to division as an operation, and when you multiply with inverses, always check to ensure these exist.

Exercise 4. What properties would a multiplicative inverse for 0 have to satisfy? Argue that one such cannot exist.

The notion of the absolute value can be extended to cover the rationals, using the same definition.

Given r and r′ in Q we can now solve all equations of the form

r + x = r′ and r · x = r′

within Q, provided that,[19] for the second equation, r ≠ 0. And indeed, every equation with one unknown involving addition, multiplication and inverses for these operations is solvable, provided that it is not equivalent to one of the form r · x = r′ where r = 0 and r′ ≠ 0.

The rational numbers are sufficient for a number of practical purposes; for example, to measure the length, area, and volume of something to any given precision, and also to do calculations with such quantities.
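The prohibition on dividing by 0 is enforced by programming languages as well: asking a rational-arithmetic library for a multiplicative inverse of 0 is an error. A quick Python check, using the standard fractions module:

```python
from fractions import Fraction

# 0 has no multiplicative inverse, so dividing by it is an error:
try:
    Fraction(1, 1) / Fraction(0, 1)
except ZeroDivisionError:
    inverse_of_zero_exists = False
else:
    inverse_of_zero_exists = True

assert not inverse_of_zero_exists
```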
There are infinitely many rational numbers, but mathematically speaking, Q has the same size as N. See Section 5.2 for how to compare the size of sets.

Most mainstream programming languages do not have a datatype for the rationals (or for fractions), but those aimed at algebraic computations (such as Mathematica and Matlab) do.

[19] Note how the restrictions we have to make on equations to ensure they are solvable connect with where the operations involved are defined (or not) for the various sets of numbers discussed here.

0.1.4 Real numbers

The rational numbers allow us to measure anything up to arbitrary precision, we may add and subtract them, and there are additive and multiplicative inverses (the latter with the exception of 0), which allows us to solve many equations. Why do we need a larger set of numbers? There are several approaches to this question. Here we give two.

If we look at the rational numbers drawn on a line then there are a lot of gaps. Mathematically speaking we may define a sequence (of rational numbers), that is a list of numbers a_n in Q, one for each n ∈ N. Sometimes a sequence can be said to converge to a number, that is, the sequence gets arbitrarily close to the given number and never moves away from it.[20] If such a number exists it is called the limit of the sequence. For example, the limit of

1, 1/2, 1/4, 1/8, . . . , that is 1/2^n for n in N,

is 0. Let us consider the sequence defined as follows:

a_0 = 1        a_(n+1) = (a_n^2 + 2) / (2 a_n).

We may calculate the first few members of the sequence to get

1, 3/2, 17/12, 577/408, . . .

and, if expressed in decimal notation,

1, 1.5, 1.41666 . . . , 1.41421568627451 . . . .

One may show that a_n^2 gets closer and closer to 2, so we may think of the above as approximating a number a with the property that a^2 = 2.

Optional Exercise 1. Show that there is no rational number a with the property that a^2 = 2. Hint: Assume that you have a = n/m for some natural numbers n and m and derive a contradiction.
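The recurrence above can be run exactly in rational arithmetic, which makes the pattern visible. A short Python sketch using fractions.Fraction (the helper name next_term is ours):

```python
from fractions import Fraction

def next_term(a):
    """One step of the recurrence a_(n+1) = (a_n**2 + 2) / (2 * a_n)."""
    return (a * a + 2) / (2 * a)

a = Fraction(1)
terms = [a]
for _ in range(3):
    a = next_term(a)
    terms.append(a)

# The first members computed in the text: 1, 3/2, 17/12, 577/408
assert terms == [Fraction(1), Fraction(3, 2),
                 Fraction(17, 12), Fraction(577, 408)]

# The squares approach 2, but no term is itself a square root of 2 in Q:
assert all(t * t != 2 for t in terms)
assert abs(float(terms[-1]) ** 2 - 2) < 1e-5
```

Every term is a rational number, yet as Optional Exercise 1 shows, the value being approximated cannot be one.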
Hence there are numbers that are approximated by sequences of rational numbers which are not themselves rational. Or, if we draw the rational numbers as a line then it has a lot of gaps. One can define the notion of a Cauchy sequence. One may think of this as a sequence that should have a limit (because the sequence contracts to a smaller and smaller part of the rational numbers), but where there is no suitable rational number for it to converge to. One can define the real numbers R as being all the limits of all the Cauchy sequences one can build from the rationals. This gives a 'complete' set of numbers in the sense that every Cauchy sequence built from elements of R has a limit in R.

[20] This can be defined mathematically but would take up more space than we want to give it here.

We use R+ to refer to those real numbers which are greater than or equal to 0. The numbers in R which are not in Q are known as the irrational numbers.

We do not give a formal definition of the real numbers in these notes; the above outline should convince you that this is reasonably complicated to do rigorously. We may think of the rational numbers as being included in R.

The real numbers again come with the operations of addition and multiplication, and inverses for these (but 0 still does not have a multiplicative inverse), and we again have the previous distributivity law for these operations. Just like the rational numbers, the reals with these operations form a field, see Fact 6.

Fact 6
All statements from Fact 5 remain true if the variables are taken to be elements of R. A mathematician would say that the real numbers, with addition and multiplication, also form a field.

All the sets of numbers discussed so far are ordered, that is, given two numbers we may compare them. See Section 7.5.1 on how one generally talks about this idea. Here we are concerned with giving additional facts you may want to use in solving exercises.
The definition of the absolute value again transfers to this larger set of numbers.

Fact 7
Let r, r′, s and s′ be elements of R. Then the following hold:

• For all r, r′ in R we have r ≤ r′ or r′ ≤ r.
• If r ≤ r′ and s ≤ s′ then r + s ≤ r′ + s′.
• If r ≤ r′ and s ≥ 0 then r · s ≤ r′ · s.
• If r ≤ r′ and s ≤ 0 then r · s ≥ r′ · s.
• If r ≤ s then −r ≥ −s.
• If r ≤ s < 0 or 0 < r ≤ s then r^(−1) ≥ s^(−1).
• If r ≥ 1 and s ≤ s′ then r^s ≤ r^(s′).
• If r > 1 and s, s′ > 0 and s ≤ s′ then log_r s ≤ log_r s′.

An alternative approach to introducing numbers beyond the rationals is as follows. Within the rational numbers we are able to solve all 'sensible' equations in one variable involving addition, multiplication and their inverses with rational numbers. We may even add multiples of that variable with each other.[21] But we may not multiply the unknown with itself: equations of the form

x · x = c or x^2 = c

are not all solvable within Q. By moving from Q to R we add a lot of solutions to such equations to our set of numbers. For example, all equations of the form

x^n = c

are solvable for n in N and c in R with c ≥ 0. Indeed, we may replace n in N by q in Q and we still have solutions.[22]

The situation becomes quite complicated. First of all we define a new symbol: we write

∑_(i=0)^n a_i x^i = a_n x^n + a_(n−1) x^(n−1) + · · · + a_1 x + a_0

for a sum of a finite number of elements.[23] Given a polynomial equation, that is one of the form

∑_(i=0)^n a_i x^i = a_n x^n + a_(n−1) x^(n−1) + · · · + a_1 x + a_0 = 0,

where a_i is in Q for 0 ≤ i ≤ n, there may be up to n different solutions, or there may be none at all. Those real numbers that are solutions to such polynomial equations are known as algebraic numbers. Examples are √2, the fifth root of 17, and the cube root of 3/2. But not all elements of R can be written as solutions to such equations. Those that cannot are the transcendental numbers; famous examples are π and e, and less well-known ones are e^π and 2^√2. We can therefore not[24] use the idea that R arises from Q by adding solutions to equations over Q to formally define R.

[21] These equations are called linear.
[22] And we may even replace q in Q by r′ in R, and use the idea of the continuity of a function to define the operation of raising a number to the power of r′, and we can still find solutions.
[23] This idea is formally introduced in Chapter 6 (see 6.45), but we use the ∑ symbol in Chapter 4 as well.
[24] There is a way of algebraically defining the real numbers, but that requires a lot of mathematical theory to be set up that is fairly advanced.

Real numbers are often referred to using decimal expansions.
Such an expansion is given by an integer together with a sequence of digits (one digit for each natural number). For the integer 0 for example one gets numbers typically written 0.123 . . . , where for in N we know that in {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}. It is common not to write trailing 0s (that is 0s where there is no dierent digit occurring to the right), so we write 3.14 instead of 3.140 = 3.14000000 . . .. Note, however, that a number may have more than one decimal expansion, and 0.9 = 0.9999999 . . . refers to the same number as 1.0000000 . . . = 1.0 = 1. Exercise 5. If we change base from 10 we can still express numbers using pre- and post-decimal digits. This question asks you to think a little bit about this. (a) Translate the number 1.1 in base 2 to base 10. (b) Translate the number 1.75 in base 10 to base 2. (c) Give an alternative representation for the number 1.0 in base 2. We don’t really need real numbers in the ‘real world’, but a lot of what we might want to describe becomes a lot smoother if we are allowed to use them (the trajectory of a ball is much easier thought of as a line than a sequence of points with rational coordinates), and they allow us to be precise when referring to the circumference of a circle, for example. 22And we may even replace in by ′ in R and use the idea of the continuity of a function to dene the operation of forming to the power of ′, and we can still nd solutions. 23This idea is formally introduced on page 6.45 in Chapter 6 but we use the ∑︁ symbol in Chapter 4 as well. 24There is a way of algebraically dening the real numbers, but that requires a lot of mathematical theory to be set up that is fairly advanced. 16 The set R is innite in size—but mathematically speaking, it is strictly larger than Q. It is uncountably innite. See Section 5.2 for more details. No real-world computer can implement all the real numbers. This is no sur- prise given that there are innitely many of them. 
But more importantly every implementation of (some of) the real numbers will only allow limited precision.25 Programming languages typically have some kind of oating point number type to approximate real (and so also) rational numbers, such as float in Python or Java. It’s not unusual for there to be a more precise type, such as double in Java. Note that operations on such numbers typically incur rounding errors (for these operations to be precise it would be necessary to change the range of numbers which are representable by adding more digits—for example, .5/2.0 = .25, and we need to go from 1 digit after the decimal point to 2). In Java there is also bignum which allows for arbitrary precision (since the maximal allowable length of the number can be extended), provided the number has a nite decimal expansion, but these come at a price in memory and time performance (and a program that keeps adding digits will eventually run out of memory). Floating point numbers are given by a signicand and an exponent (because this increases the range of numbers that can be represented), where for a given base, the number described is signicand× baseexponent. 0.1.5 Numbers We typically think of the sets of numbers introduced here as being subsets of each other, with N ⊆ Z ⊆ Q ⊆ R. Mathematically speaking, this is not strictly correct, but instead we have a function that embeds the integers, say, in the rationals, in such a way that carrying out operations from the integers also works if we think of the numbers as rationals. See Section 7.4.7 for a formal denition of the integers and the rational numbers. Sometimes in these notes we do not want to specify which set of numbers we mean, and then we assume there is a set with N ⊆ ⊆ R with an addition and a multiplication operation satisfying Fact 1. Note that we can use the equalities given in the various Facts about sets of numbers to show general properties without knowing which set of numbers we are referring to. Example 0.6. 
In this example we show that it is possible to establish facts about numbers just from the general properties given in the various Facts above. Let S be a set of numbers from Z, Q or R. Then by one of Facts 3, 5 or 6 we have, for all r, s and t in S, the distributivity law

r · (s + t) = r · s + r · t.

Further, by the same fact we know that there exists a number 0 in S with the property that for all s in S we have

0 + s = s = s + 0,

and for all s in S we have an additive inverse for s, −s in S, with

s + (−s) = 0 = (−s) + s,

and the associativity law. Together these tell us that for all r in S we have

r · 0 = r · (0 + 0)          0 unit for +
      = r · 0 + r · 0        distributivity law,

and if we set s to be the additive inverse of r · 0 we may conclude from the previous equality, by adding s on both sides, that

0 = r · 0 + s                additive inverse for r · 0
  = (r · 0 + r · 0) + s      previous equality
  = r · 0 + (r · 0 + s)      associativity law
  = r · 0 + 0                additive inverse for r · 0
  = r · 0                    0 unit for +.

Of course you have known for a very long time that for all those sets of numbers, multiplying 0 with any other number gives 0 once again. But have you ever wondered whether there is a good mathematical reason for that fact? The answer is that addition and multiplication have general properties that force this equality upon us. More powerfully, if we have any set with operations we may call + and · which satisfy the given equalities, we can show that multiplying any element with the unit for addition has to again be the unit for addition. We look at the general properties of operations in Section 2.5.

0.2 Sets

Sets are very important in mathematics; indeed, modern mathematics is built entirely around the notion of sets. A set is a collection of items.
Collections are required in order to

• make it clear what one is talking about (ruling some things in and others out);

• precisely define various collections of numbers, and in general much of algebra is concerned with structures given by
  – an underlying set (for examples see Section 0.1 for various sets of numbers which, however, aren't formally defined here),
  – operations on the set (such as addition and multiplication, for various collections of numbers) and
  – the properties of these operations (see for example Facts 1, 3, 5 and 6);26

• define functions (see the following section), that is, instructions for turning entities of one kind into entities of another.

Footnote 26: These properties are studied in more detail in Section 2.5.

Sets have members, and indeed a set is given by describing all the members it contains. We write x ∈ X if x is a member of the set X, for example π ∈ R or a ∈ {a, b, c}. Members are often also referred to as elements. There is a set that contains no elements at all, the empty set, ∅.

0.2.1 Sets

Deciding which collections of entities may be considered sets is not as easy as it might sound. Originally mathematicians thought that there would not be any problems in allowing any collection to be considered a set, but very early in the 20th century Bertrand Russell described the paradox named after him: If we are allowed to form the set of all sets which do not contain themselves as members then we have a contradiction.27 Theories that contain contradictions are called inconsistent, and they are not very useful since (at least according to classical logic) every statement may be deduced in an inconsistent theory. But if every statement is valid then the theory is of no use. This caused something of a crisis, and prompted the creation of set theory as a field within mathematics. Set theory is concerned with the question of how sets may be built in a way that does not lead to contradictions.
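The membership relation and the empty set can be tried out directly in Python, whose built-in set type models finite sets; a small sketch (the particular elements are made up for illustration):

```python
# Finite sets in Python: membership and the empty set.
X = {"a", "b", "c"}

print("a" in X)    # membership test, a ∈ X: True
print("d" in X)    # d is not a member of X: False

empty = set()      # the empty set ∅ (note: {} would create a dict, not a set)
print(len(empty))  # the empty set has 0 elements
```

Python sets are always finite, so they can model the small examples of this section but not sets such as N or R.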
Mathematicians need to build fairly complicated sets, and making sure that all their constructions are allowed in the underlying set theory is not easy. The sets we require on this course unit are nothing like as complicated, and so we do not have to worry about proper set theory here (and you should not refer to what is described in this section as 'set theory').

0.2.2 Operations on sets

The most fundamental operation on sets we may use is to compare28 them.

Definition 5: subset, superset
A set S is a subset of the set T, written S ⊆ T, if and only if every member of S is also a member of T. In this situation we also say that T is a superset of S.

In this situation we have x ∈ S implies x ∈ T, or: for all x ∈ S, x ∈ T.

Footnote 27: Ask yourself whether the given 'set' contains itself.
Footnote 28: A more general notion of comparisons between sets is studied in Section 5.2.

Note that the usage of key phrases such as 'implies', 'there exists', 'for all' is described in detail in Chapter 2.2.1. If S ⊆ T and T ⊆ S then S = T because they contain precisely the same members. We often define subsets of sets we already know by identifying some particular property. The notation used for this is

    {x ∈ X | x has property P}.

This notation is explained in more detail in Section 0.2.3.

Definition 6: proper subset
We say that a set S is a proper subset of the set T if and only if

• S is a subset of T, that is S ⊆ T, and
• there exists a member t ∈ T with t ∉ S.

Sometimes the notations S ⊂ T or S ⊊ T are used in this situation.

When we have sets we may build new sets by putting their combined members into one set, or by considering only those members contained in both sets. Because constructing new sets is non-trivial and may lead to problems it is usually better first to find an 'ambient' set that contains both the given sets. Given a set X, for S and T subsets of X, we define

• their union, S ∪ T, to be {x ∈ X | x ∈ S or x ∈ T}, which means that x ∈ S ∪ T if and only if x ∈ S or x ∈ T;

• their intersection, S ∩ T, to be {x ∈ X | x ∈ S and x ∈ T}, which means that x ∈ S ∩ T if and only if x ∈ S and x ∈ T.
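Python's set type provides these comparisons and operations directly; a quick sketch with made-up sets:

```python
S = {1, 2}
T = {1, 2, 3}

print(S <= T)  # S ⊆ T, the subset test: True
print(S < T)   # S ⊊ T, the proper subset test: True
print(S | T)   # S ∪ T, the union of S and T
print(S & T)   # S ∩ T, the intersection of S and T
```

The operator choices mirror the mathematics: <= and < behave like ⊆ and ⊊, while | ("or") and & ("and") match the 'or'/'and' in the definitions of union and intersection above.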
Note that we may now define the union or intersection of finitely many subsets of X by applying the operation to two sets at a time, that is, for example,

    S_1 ∪ S_2 ∪ S_3 ∪ · · · ∪ S_n = (· · · ((S_1 ∪ S_2) ∪ S_3) ∪ · · · ∪ S_n).

Again we have used · · · here, and to be more precise we should adopt the mathematical notation

    ⋃_{i=1}^{n} S_i

instead, which spells out that we are forming the union of all the sets from S_1 to S_n. But in fact, given an arbitrary collection of subsets of X we may define their union and their intersection to obtain another subset of X. Let S_i be a subset of X for each i ∈ I, where I is an arbitrary index set. Then

    ⋃_{i ∈ I} S_i = {x ∈ X | there is i ∈ I with x ∈ S_i}

and

    ⋂_{i ∈ I} S_i = {x ∈ X | for all i ∈ I we have x ∈ S_i}.

We say that the union of a family of sets is disjoint if and only if the sets whose union we are forming do not overlap.

It is sometimes useful to draw such constructions in the form of a Venn diagram. This is a picture of a generic set, and the union of two generic sets S and T can then be drawn by shading the region covered by both. But this is a bit imprecise if we do not draw the boundaries of the sets, so it is more common to draw the boundaries of all the sets involved. We assume we have a set S, here shown in red,29 and a set T, here shown in blue, for which we form the union S ∪ T (here in purple).

[Venn diagram: S ∪ T]

Footnote 29: You will see the colours only in the electronic but not in the printed version.

The picture for the intersection, again drawn in purple.

[Venn diagram: S ∩ T]

Sometimes we care about the fact that two sets do not overlap.

Definition 7: disjoint
We say that two sets S and T are disjoint if and only if it is the case that S ∩ T = ∅.

There is one further important operation on sets.

Definition 8: complement relative to X
Let S be a subset of a set X. The complement of S relative to X, X ∖ S, is given by {x ∈ X | x ∉ S}.

Some people write X − S for this set, and some people write S′ or S̄. The latter two require that it is clearly understood which ambient set (here X) is meant. It has the advantage that some properties can be formulated very concisely in that notation.
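Python has no absolute complement (there is no set of all things, much as in proper set theory), but the relative complement of Definition 8 is just set difference against an explicitly chosen ambient set; a small sketch:

```python
X = set(range(10))   # ambient set X = {0, 1, ..., 9}
S = {0, 2, 4, 6, 8}  # a subset of X

complement = X - S   # X ∖ S, the complement of S relative to X
print(complement)    # contains exactly 1, 3, 5, 7, 9

# A set and its relative complement are always disjoint (Definition 7).
print(S.isdisjoint(complement))  # True
```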
We do not use the primed version for complement in these notes; instead, we use primes to give us variable names (so S, S′ and S″ might be names for different sets). Some of you have been taught that it is safe to write S̄ for the complement of a set because we somehow know in which set we are taking the complement. Always make it clear where you are taking your complements.

If we want to draw the complement then we have to draw the ambient set X. (We didn't have to do this for the examples so far.30) We do this by drawing a square, with S living inside the square.

[Venn diagram: X ∖ S]

Footnote 30: We could have drawn a box around the diagrams given above, but this doesn't really add anything.

These are all the operations required to build new sets from given ones. It is now possible, for example, to define the set difference, S ∖ T, of all members of S that do not belong to T,

    S ∖ T = {x ∈ S | x ∉ T} = S ∩ (X ∖ T),

drawn in purple below.

[Venn diagram: S ∖ T]

Example 0.7. We can use these operations to give names relative to S and T to all the regions in the following picture. This means we know how to determine the elements of all these regions, provided we know when an element is in S, and when it is in T.

[Venn diagram with regions labelled X ∖ (S ∪ T), S ∖ T, S ∩ T and T ∖ S]

If we have more than two sets to start with then there are many more sets one could describe, but we now have the tools to do so for all of them.

Exercise 6. Identify all regions in the above picture and give their description based on operations applied to X, S and T.

Proofs involving sets are often quite simple. We give an example below.

Proposition 0.3
Let S, T and U be subsets of a set X. Then31

    S ∩ (T ∪ U) = (S ∩ T) ∪ (S ∩ U).

Footnote 31: This is known as a distributivity law, compare the last statement of Fact 1.

Proof. To show that two sets are equal we have to establish that all the elements of the first set occur in the second, and vice versa. Sometimes it is easier to give this as two separate proofs, and sometimes it can be done all in one go.
    S ∩ (T ∪ U)
    = {x ∈ X | x ∈ S and x ∈ T ∪ U}                       def ∩
    = {x ∈ X | x ∈ S and (x ∈ T or x ∈ U)}                def ∪
    = {x ∈ X | (x ∈ S and x ∈ T) or (x ∈ S and x ∈ U)}    see below
    = {x ∈ X | x ∈ S ∩ T or x ∈ S ∩ U}                    def ∩
    = (S ∩ T) ∪ (S ∩ U)                                   def ∪.

The key step in the proof is the statement that, for x ∈ X, we have

    x ∈ S and (x ∈ T or x ∈ U)

if and only if

    (x ∈ S and x ∈ T) or (x ∈ S and x ∈ U).

We can make a case distinction: If x ∈ S, and x ∈ T or x ∈ U, then at least one of 'x ∈ S and x ∈ T' and 'x ∈ S and x ∈ U' must hold, which justifies our original argument. Effectively we are applying here rules of logic which are explained in more detail in Chapter 3.

Alternatively we can show that the two sets are included in each other. We first show that S ∩ (T ∪ U) is a subset of (S ∩ T) ∪ (S ∩ U).

    x ∈ S ∩ (T ∪ U)
    implies x ∈ S and x ∈ T ∪ U                           def ∩
    implies x ∈ S and (x ∈ T or x ∈ U)                    def ∪
    implies (x ∈ S and x ∈ T) or (x ∈ S and x ∈ U)        see above
    implies x ∈ S ∩ T or x ∈ S ∩ U                        def ∩
    implies x ∈ (S ∩ T) ∪ (S ∩ U)                         def ∪.

Next we show that (S ∩ T) ∪ (S ∩ U) is a subset of S ∩ (T ∪ U).

    x ∈ (S ∩ T) ∪ (S ∩ U)
    implies x ∈ S ∩ T or x ∈ S ∩ U                        def ∪
    implies (x ∈ S and x ∈ T) or (x ∈ S and x ∈ U)        def ∩
    implies in either case we have x ∈ S, and we must also have at least one of x ∈ T or x ∈ U
    implies x ∈ S and x ∈ T ∪ U                           def ∪
    implies x ∈ S ∩ (T ∪ U)                               def ∩.

EExercise 7. Assume that S and T are subsets of a set X.

(a) Show that the complement relative to X of the union of S and T is the intersection of the complements (relative to X) of S and T. Hint: Turn the sentence into an equality of sets. Look at the proof of Proposition 0.3 for an example of how to prove that two sets are equal.

(b) Show that the union of two sets may be written using only the complement and the intersection operations. Hint: Use your equality from the previous part.

(c) Give an argument that we may describe precisely the same sets using ∪, ∩ and ∖ as using ∩ and ∖.

A useful operation assigns to a finite set S the number of elements in that set, which is written as32,33 |S|. For all sets of numbers we have a useful operation that allows us to extract the smallest/largest number from a set, provided it exists, which is always the case if the set is finite and non-empty.
Given a set S of numbers we write min S for the smallest number in S if it exists, and max S for the largest number in S if it exists.

Example 0.8. We have that min{1, 2, 3, 4} = 1 and max{1, 2, 3, 4} = 4, and min[0, 1] = 0, while max[0, 1] = 1. Note, however, that min(0, 1) and min R are not defined, and that the same is true for max(0, 1) and max R.

Footnote 32: Some texts may use #S instead.
Footnote 33: If you are not familiar with this notation to describe a function, come back to this once you have read Section 0.3.

0.2.3 Describing sets

Describing sets precisely is harder than it may sound. If a set has finitely many elements then, in principle, we could list them all. But if there are a lot of them this is rather tedious and time-consuming. People often resort to using . . . to indicate that there are members that are not explicitly named, and they hope that it is clear from the context what those members are. Take for example

    {0, 1, 2, . . . , 100,000}.

But whenever this notation is used there is room for confusion. It is much better to give a more precise description such as

    {n ∈ N | n ≤ 100,000}.

The idea behind this kind of description is that one describes the set in question as a subset of a known set (here N), consisting of all those members satisfying a particular property (here being less than or equal to 100,000). In logic such a property is known as a predicate. It is almost inevitably the case that any set we might want to describe is a subset of a set already known, so this technique works remarkably often. In general the format is to have a known set X and to define

    {x ∈ X | x has property P}.

Example 0.9. Let's assume we want to describe the set of even natural numbers. We could write {0, 2, 4, 6, . . .}, but this leaves it to the reader to make precise which elements belong to the set and which ones don't. This is strongly discouraged. Instead we could write the preferable {n ∈ N | n even}, but that assumes that the reader knows how the even property is defined.
If we want to leave no room for doubt we could apply the definition (see Definition 4) and write

    {n ∈ N | n mod 2 = 0}.

This makes it precise which members belong to our set; indeed, it gives us a test that we can apply to some given natural number to see whether it belongs to our set.

Example 0.10. Because we may form intersections and unions of sets we may also specify sets consisting of all those elements which have more than one property. All even numbers up to 100,000 could be described as an intersection, namely

    {n ∈ N | n mod 2 = 0} ∩ {n ∈ N | n ≤ 100,000},

but it is more customary instead to combine both properties by using 'and', that is

    {n ∈ N | n mod 2 = 0 and n ≤ 100,000}.

When looking at the real numbers there is a standard way of defining subsets which give a contiguous part of the real line:

    [a, b] = {r ∈ R | a ≤ r ≤ b}    and    (a, b) = {r ∈ R | a < r < b},

or

    [a, b) = {r ∈ R | a ≤ r < b},    but also    (−∞, b] = {r ∈ R | r ≤ b}.

Sets of this form are known as 'real intervals'. Note that we use the notation R+ = [0, ∞) for the non-negative real numbers.

We may also use the idea of defining sets using properties to describe all those elements of a given set which satisfy at least one of several properties.

Example 0.11. An example of this idea is given by

    {n ∈ N | n mod 2 = 0 or n mod 2 = 1},

which is the union of two sets, namely

    {n ∈ N | n mod 2 = 0} ∪ {n ∈ N | n mod 2 = 1},

and this set is equal to N.

Example 0.12. It is also possible to use this idea to specify the elements that do not have a particular property. The odd natural numbers are those that are not even.

    {n ∈ N | n mod 2 ≠ 0} = N ∖ {n ∈ N | n mod 2 = 0} = {n ∈ N | n mod 2 = 1}.

Example 0.13. If we want to describe the rational numbers as a subset of R we may use

    {r ∈ R | there exist m in Z and n in Z ∖ {0} such that r = m/n}.

Example 0.14. Nothing stops us from specifying

    {n ∈ N | n mod 2 = 0 and n mod 2 = 1},

which is a rather complicated description of the empty set.

It is possible to use infinitely many restricting properties.

Example 0.15.
Given a natural number k, the multiples of k can be written as

    {n ∈ N | n mod k = 0}.

So the set of natural numbers which are not multiples of k is

    {n ∈ N | n mod k ≠ 0}.

Example 0.16. A more complicated question is how to describe the set of all prime numbers. For that it helps to consider the set of elements which are not a multiple of any number other than one and themselves, which is equivalent to saying that they are not a multiple of any number with a factor of at least 2. The set of all multiples of k in N with a factor of two or greater is given by

    {n ∈ N | n mod k = 0 and n div k ≥ 2},

and the set of all numbers which are not such a multiple is

    N ∖ {n ∈ N | n mod k = 0 and n div k ≥ 2},

which is the same as

    {n ∈ N | n mod k ≠ 0 or (n mod k = 0 and n div k = 1)},

which is the same as

    {n ∈ N | n mod k ≠ 0 or n = k}.

Note that all these sets contain the number 1, which we would like to exclude from the set of prime numbers. This suggests that we can use the intersection of all these sets of non-multiples of k, where k ∈ N ∖ {0, 1}, to express the prime numbers. This set is given by

    ⋂_{k ∈ N ∖ {0,1}} {n ∈ N ∖ {1} | n mod k ≠ 0 or n = k}.

Instead of restricting the elements of a known set to describe a new set it is sometimes possible instead to provide instructions for constructing the elements of the new set. This is the second important technique for describing sets.

Example 0.17. An alternative way of describing the even numbers is to recognize that they are exactly the multiples of 2, and to write

    {2n | n ∈ N}.

The odd numbers may then be described as

    {2n + 1 | n ∈ N}.

But for a better answer, we should add something here. Read on to find out what. We can think of this as constructing a new set, but usually this only makes sense when describing a subset of a previously known set. Certainly the notation assumes that we know what we mean by 2n, or 2n + 1; this implies we know where the addition and multiplication operations that appear in these expressions are to be carried out.
In this example it is in N, so it would be better to write

    {2n + 1 ∈ N | n ∈ N}.

This may seem obvious, since N is explicitly named as the set 2n + 1 belongs to, but assume that in order to describe the rational numbers we wrote

    {m/n | (m, n) ∈ Z × (Z ∖ {0})}.

But what does this mean? Where do we take m/n? This is not defined in the collection Z of numbers where m and n live, and so it is not clear what we mean here. Maybe this is a set of formal fractions? We could clarify this by writing

    {m/n ∈ R | (m, n) ∈ Z × (Z ∖ {0})},

from which it is clear that we mean to collect all the results of calculating m · n⁻¹ within the real numbers.

Example 0.18. To describe the integer multiples of π (for example if we want to have all the points on the real line for which sin takes the value 0) we might write

    {kπ | k ∈ Z}.

Again we have to deduce from the context where kπ is meant to be carried out. If we write

    {kπ ∈ R | k ∈ Z},

then everything is made explicit.

In general what we have done here is to assume that we have two known sets, say S and T, and a way of producing elements of the second set from the first, using a function f : S → T. We then write

    {fs ∈ T | s ∈ S}

for the set of all elements of T which are 'generated' by elements of S using the function f.

We are using the notation {. . . | . . .} in two ways that look different, but we can think of the statement fs ∈ T as a property so this notation is not inconsistent. One could even combine the two ideas. You should think of the vertical line as saying 'such that', so

    {n ∈ Z | n is even}

can be pronounced as the set of all n in Z such that n is even, and

    {2n ∈ Z | n ∈ N}

can be pronounced as the set of all those 2n in Z for which n is in N. Note that these two sets are not equal!

CExercise 8. For the sets given below, give a description using a predicate (as in Example 0.9), and also give a description where you generate the set (as in Example 0.17).

(a) Describe the set of all integers that are divisible by 3.

(b) Describe the set of all integers that are divisible by both 2 and 3.
(c) Describe the set of all integers that are divisible by 2 or by 3. To generate this set you need to use the union operation.

(d) Describe the set of all integers that are divisible by 2 or by 3 but not by 6. To generate this set you need to use the union, and the relative complement, operations.

(e) Describe the set of all real numbers r for which cos r = 0.

0.2.4 Constructions for sets

There is one other fairly common construction for sets.

Definition 9: product of two sets
Given sets S and T their34 product, S × T, is the set

    {(s, t) | s ∈ S and t ∈ T}.

This means that the elements of the product are pairs whose first component is an element of S and whose second component is an element of T. Products of sets appear in many places, and the examples we give below barely scratch the surface.

Footnote 34: This is also known as their Cartesian product.

Example 0.19. The product of the set {0, 1} with itself is the set with the elements (0, 0), (0, 1), (1, 0), (1, 1), so

    {0, 1} × {0, 1} = {(0, 0), (0, 1), (1, 0), (1, 1)}.

Example 0.20. A more familiar example is a deck of cards: You have four suits, clubs ♣, spades ♠, hearts ♡ and diamonds ♢, and you have standard playing cards, say 7, 8, 9, 10, J, Q, K, A in a 32-card deck. Each of those cards appears in each of the suits, so you have four Queens, one each for clubs, spades, hearts and diamonds. In other words, your 32 card deck can be thought of as the product

    {♣, ♠, ♡, ♢} × {7, 8, 9, 10, J, Q, K, A}.

We can picture the result as all combinations of elements from the first set with elements from the second set. The accepted standard for describing cards is to first give the value and then the suit, so in the table below 9♢ is the notation used for the element (♢, 9) of our product set.
        7    8    9    10    J    Q    K    A
    ♣   7♣   8♣   9♣   10♣   J♣   Q♣   K♣   A♣
    ♠   7♠   8♠   9♠   10♠   J♠   Q♠   K♠   A♠
    ♡   7♡   8♡   9♡   10♡   J♡   Q♡   K♡   A♡
    ♢   7♢   8♢   9♢   10♢   J♢   Q♢   K♢   A♢

Whenever you draw the graph of a function from R to R you do so in the product of the set R with itself: You use the x-axis to give the source of the function, and the y-axis for the target, and you then plot points with coordinates (x, fx), where x varies through the source set.

Example 0.21. A very important set that is a product is the real plane

    R × R = R² = {(r, r′) | r, r′ ∈ R}.

This is the set we use when we draw the graph of a function from real numbers to real numbers (see Section 0.3.4), where we use the first coordinate to give the argument, and the second coordinate to give the corresponding value.

Example 0.22. Similarly, the n-dimensional vector space based on R has as its underlying set the n-fold product of R with itself,

    R × R × · · · × R (n times) = Rⁿ = {(r_1, r_2, . . . , r_n) | r_1, r_2, . . . , r_n ∈ R}.

Note that it is possible to recover the components of an element of a product: We have two functions,35

    π_1 : S × T → S    and    π_2 : S × T → T,

known as the projection functions, with the behaviour that for all (s, t) ∈ S × T we have

    π_1(s, t) = s    and    π_2(s, t) = t.

In general, if S is a set, people often write S² for S × S and, more generally, Sⁿ for the n-fold product of S with itself. The elements of this set can be described as n-tuples of elements of S, that is

    Sⁿ = {(s_1, s_2, . . . , s_n) | s_i ∈ S for 1 ≤ i ≤ n}.

Note that here we construct a new set, and we define what the elements of the set are (namely pairs of elements of the given sets) and we do not have to identify an ambient set. In Section 2.5 we describe operations on sets as functions (see Section 0.3) and for that we require the product construction. A binary operation36 is one that takes two elements from a set, and returns one element of the same set.

Footnote 35: You may want to come back to this after reading Section 0.3.
Footnote 36: That is, for example, an operation which takes two numbers and returns a number.

Example 0.23. Addition for the natural numbers N is a binary operation on the set N. As a function (see Section 0.3) it takes two elements, say m and n, of N, that is an element (m, n) of N × N, and returns an element m + n of N.
The type of this operation is N × N → N. But, of course, we may also consider addition for different sets of numbers, giving operations, for example

    Z × Z → Z,    Q × Q → Q,    R × R → R.

Another general operation sometimes applied to sets is the disjoint union, but we do not describe this here.

Definition 10: powerset
Given a set X, its powerset PX is given by the set of all subsets of X,

    PX = {S | S ⊆ X}.

All our operations on sets were defined for elements of such a powerset. For example, given an element S of PX, which is nothing but a subset of X, we have X ∖ S, the complement of S with respect to X, which is another element of PX.

Example 0.24. We may think of the union operation as taking two elements of PX, and returning37 an element of PX, so we would write that as

    ∪ : PX × PX → PX,

with the assignment given by (S, T) ↦ S ∪ T, that is, given the arguments S and T in PX the function returns their union, S ∪ T.

Footnote 37: You may want to come back to this example after reading Section 0.3.

Because there are so many operations on the powerset it turns out to be a useful model for various situations. In the material on logic we see how to use it as a model for a formal system in logic. Sometimes we care only about the finite subsets of a set, that is

    {S ⊆ X | S is finite}.

People sometimes call this 'the finite powerset', but that is a bit problematic since this often isn't itself a finite set.

0.3 Functions

One could argue that sets are merely there to allow us to talk about functions, and while this is exaggerated, sets wouldn't be much use without the ability to move between them.

0.3.1 Function, source, target, range

A function is a way of turning entities of one kind into those of another. Formally a function f : S → T is given by

• a source set S,

• a target set T, and

• an instruction that turns every element s of S into an element fs of T, often38 written as s ↦ fs.

Footnote 38: It is quite often standard to write f(s) but as long as the argument is not a complicated expression this is unnecessary.
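In a program the 'instruction' part is what we usually write down, while source and target correspond roughly to argument and return types. A sketch in Python, where the annotations merely record the intended type (Python does not enforce them, and int only approximates N):

```python
def square(n: int) -> int:
    """The instruction n ↦ n * n, with source and target both recorded as int."""
    return n * n

print(square(7))  # 49
```

The annotation int -> int plays the role of the type N → N; the same instruction could equally be given other types, e.g. float -> float.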
Many people allow giving functions without specifying the source and target sets, but this is sloppy. Every function has a type, and for our example here the type is S → T. Some instructions can be used with multiple source and target sets. For example, n ↦ 2n may be used to define a function

• N → N,
• Z → Z,
• Q → Q,
• R → R,

and n ↦ n² could have the types (among others)

• N → N,
• Q → Q or Q → Q+,
• R → R or R → R+.

0.3.2 Composition and identity functions

Which functions we can define from one set to another depends on the structure of the sets, and on any known operations on the sets. Only one (somewhat boring) function is guaranteed to exist for every set S.

Definition 11: identity function
The identity function id_S on a set S is given by the assignment

    id_S : S → S,    s ↦ s.

An important operation on functions is given by carrying out one function after another.

Definition 12: composite of two functions
Given two functions f : S → T and g : T → U, where the target of f is the source of g, the composite of f and g,

    g ∘ f : S → U,

is defined by first applying f to the argument and then g to the result, that is, overall we map s ∈ S to g(fs) ∈ U.

Composition allows us to build more complicated functions from simple ones.

Example 0.25. One may think of a linear function on R, of the form x ↦ ax + b, as the result of composing the following two functions R → R:

    f : x ↦ ax    and    g : x ↦ x + b,

since this amounts to calculating

    x ↦ ax ↦ ax + b    along    R → R → R.

In other words, if we apply the function f to x, and the function g to the result, we find that overall x is mapped to ax + b.

Example 0.26. You probably have used the notion of a composite already. You may find it easier to realize this by looking at the assignment

    x ↦ √|sin x|

from R to R. This tells you to first apply the sine function to x, and then to apply the square root function to the absolute value of the result.
The idea of composing functions just makes this explicit, and it also forces you to ensure that the output of the first function is always a valid input to the second function. Hence we may express the given function as the composite of the following functions:

    f : R → R,     x ↦ sin x
    g : R → R+,    x ↦ |x|
    h : R+ → R,    x ↦ √x

in the sense that the given function is

    h ∘ g ∘ f,

which means that the assignment given is the same as x ↦ h(g(fx)). Note that we could have specified a different target for the sine function, such as the real interval [−1, 1], and made that the source of the following function.

In order to define a function you have to specify its source and target. Don't forget to do this.

CExercise 9. Define three functions such that their composite is a function R → R which maps an input to the logarithm (for base 2) of the result of adding 2 to the negative of the square of the sine of the input. Hint: To define a function you need to give its source and target. You need to make sure that your functions can be composed.

0.3.3 Basic notions for functions

It can sometimes be useful to determine which part of the target set is reached by a function.

Definition 13: image, range of a function
Given a function f : S → T, for s ∈ S we say that fs is the image of s under f, and the set

    {fs ∈ T | s ∈ S}

is the range of f. It is also known as the image of the set S, and written f[S].

Note that we may also write the range of f, which can also be thought of as the image of the set S under the function f, by using a property of elements of T as

    {t ∈ T | there exists s ∈ S with t = fs}.

Example 0.27. For the sine function sin : R → R, the image of 0 under sin is 0 = sin 0, and the range of sin is the set [−1, 1].

Example 0.28. If we formally want to define the notion of a sequence of, say, real numbers, then we should do so as a function f from N to R. The nth member of the sequence is given by fn. In such cases fn is often written f_n.
For example, the sequence given on page 14 would have the first few values

    argument   0   1     2     3     4
    value      1   1/2   1/4   1/8   1/16,

and the formal definition of this function is

    N → R,    n ↦ 1/2^n.

We may also think of a function as translating from one setting to another. In Java casting allows us to take an integer, int, and cast it as a floating point number, float. This is effectively a function which takes an int (which amounts to a number of bits) and translates it into what we think of as the same number, but now expressed in a different format. Similarly in Python it is possible to 'convert' numbers of one type into another, for example float(x) takes a number in a different format, for example an integer, and converts it into a floating point number.

For a mathematical example, we note that we have functions connecting all our sets of numbers since N is embedded in Z which is embedded in Q which is embedded in R. All these embeddings are functions, but they are so boring that we don't usually bother to even name them. For a slightly more interesting example take the set of all fractions. From there we have a function that maps a fraction to the corresponding rational number (and so 1/2 and 2/4 are mapped to the same number), allowing us to translate from the presentation as fraction to the numbers we are really interested in.

If you have a customer database you could print a list of all of your customers. You have effectively constructed a function that takes an entry in your database and maps it to the name field. Note that if you have two customers called John Smith then that name will be printed twice, so thinking of a 'set of names' is not entirely appropriate here.

If we have small finite sets then one can define a function in a graphical way, by showing which element of the source set is mapped to which element of the target set. We give an example of this below.

Example 0.29. We draw a function {a, b, c} → {1, 2, 3, 4}.

[Diagram: a, b, c on the left, 1, 2, 3, 4 on the right, with arrows a ↦ 2, b ↦ 3 and c ↦ 3]

This function maps a to 2 and b and c to 3.
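A function between small finite sets, such as the one just drawn, can be modelled in Python as a dictionary: the keys form the source set and every key is mapped to precisely one value (the names and values here follow Example 0.29):

```python
f = {"a": 2, "b": 3, "c": 3}  # a ↦ 2, b ↦ 3, c ↦ 3

print(f["b"])                 # the image of b under f: 3
print(sorted(f))              # the source set: ['a', 'b', 'c']
print(set(f.values()))        # the range {2, 3}, a subset of the target {1, 2, 3, 4}
```

Note that a dict automatically satisfies the condition for being a function: each key occurs once, so every element of the source is mapped to precisely one value.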
Note that in order for such a diagram to describe a function, every element of the source set must be mapped to precisely one element of the target set.

0.3.4 The graph of a function

It can be useful to think of a function via its graph.

Definition 14: graph of a function
The graph of a function is the set of pairs consisting of an element of the source set with its image under the function,39 that is, given f : S → T its graph is the set

    {(s, fs) ∈ S × T | s ∈ S}.

Footnote 39: And indeed, a standard way of defining functions in set theory is via their graphs.

We can see what this definition means by assuming we are given a function f : S → T and noting that this definition tells us that its graph is the set

    {(s, fs) ∈ S × T | s ∈ S},

which is a subset of the product of S and T. See Proposition 2.1 for a characterization of those subsets of S × T which are the graph of a function from S to T. When we have functions between sets of numbers we can draw a picture of their graph.

Example 0.30. Let's return to the function from Example 0.28, which is given by

    N → R,    n ↦ 1/2^n.

Its graph can be drawn as follows.

[Plot: the points (0, 1), (1, 1/2), (2, 1/4), (3, 1/8), (4, 1/16)]

More examples appear in the following section. For functions between finite sets drawing the graph in this way is usually not particularly useful. The graph of the finite example above is

    {(a, 2), (b, 3), (c, 3)},

and one might draw it as follows:

[Plot: the pairs (a, 2), (b, 3), (c, 3) plotted against the target values 1 to 4]

This does not really show anything that is not visible in the previous diagram.

0.3.5 Important functions

When we are interested in judging how long a computer program will take we typically count the number of instructions that will have to be carried out. How many instructions these are will, of course, depend on the program, but also on the particular input we are interested in. Often the inputs to a program can be thought of as having a particular size: For example, sorting five variables of type int will be quite different from doing so for one million such variables.
Typically the number of instructions a program has to carry out depends on the size of the input rather than the actual input, and so we can think of this as defining a function from N to N which takes the size of the input to the number of instructions that are carried out. There are a number of functions that typically appear in such considerations.40 In computer science it would be sufficient for these purposes to consider these functions as going from N to N, but it is often more convenient to draw their graphs as functions from R+ to R+. In what follows we consider functions that commonly appear in that setting, and where possible we draw their graph as functions from R to R.

There are linear functions, which are of the form

    R → R,    x ↦ ax + b,

and their graphs look like this.

[Graph: a straight line]

Footnote 40: You will meet them again when you look at this in more detail in COMP11212 and COMP26120.

A typical quadratic function is given by

    R → R,    x ↦ ax² + bx + c,

and (for some values of a, b and c) its graph looks like this:

[Graph: a parabola]

Other polynomial functions, that is functions of the form

    x ↦ Σ_{i=0}^{n} a_i x^i,

may also feature. Sometimes we wish to consider functions which involve the argument being taken to a power other than a natural number, for example

    R+ → R+,    x ↦ x^{1/2} = √x.

Some of these functions are defined for non-negative numbers only, so their source is R+, rather than all of R. Note that for fixed x ∈ R+ this function only gives the positive solution y of the equation x = y². If you want to refer to both solutions41 you have to write ±√x.

Apart from these polynomial functions, important examples that come up in computer science are concerned with logarithmic functions. In computer science one typically wishes to use logarithms to base 2. They are typically written as

    [1, ∞) → R+,    x ↦ log x,

and look like this.

[Graph: the logarithm curve, crossing the x-axis at 1]

Footnote 41: If you have been taught otherwise then this is at odds with notation used at university level and beyond.

And then there are exponential functions.
Because of the speed with which these grow, having a program whose number of instructions is exponential in the size of the problem is a serious issue, since it means that it is not feasible to calculate solutions for larger problem sizes using this program. [Footnote 41: If you have been taught otherwise then this is at odds with notation used at university level and beyond.] It is fairly usual to use 2 as a base once again. The function in question is

R → R, x ↦ 2ˣ,

and its graph is even steeper than that of the quadratic curve above.

[Graph of the exponential function.]

In all these cases typically the shape of the curve is more important than any parameters involved in defining it, so knowing that we have a quadratic function is very useful, whereas there is little added benefit in knowing a, b and c in ax² + bx + c. If one has a problem size of 1,000,000, for example, then it is important to know how fast the function grows to see how many instructions will have to be carried out for that size (and so how long it will take for the program to finish, or whether it is possible for this program to finish at all).

If we draw the functions from above in the same grid (note that we have compressed the vertical axis here) we can compare them.

[Graph comparing the exponential, quadratic, linear, root and logarithm functions in one grid.]

The issue of how to compare functions when we are only interested in how they do for large inputs is discussed in Section 5.1, and this is relevant for calculating the complexity of a program or algorithm. Note that Fact 7 gives us a lot of material when it comes to comparing numbers that we can use to also compare functions:

Example 0.31. Let us consider the following functions:

f : [1, ∞) → R, x ↦ x²
g : [1, ∞) → R, x ↦ x³.

We can show that for all x ∈ [1, ∞) we have that x² ≤ x³, and so f ≤ g: given such an x we have that

x² = 1 · x²     1 unit for mult
   ≤ x · x²     1 ≤ x, Fact 7
   = x³.

When we come to comparing functions in Section 5.1 you will find the following comparison for functions helpful.

Fact 8
We have for all n ∈ N that 2ⁿ ≥ n + 1 as well as n ≥ log(n + 1).

This statement is formally shown in Exercise 145.
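The growth rates discussed above are easy to explore numerically. The following sketch is ours, not part of the notes; it tabulates the functions at a few sample input sizes (chosen arbitrarily) and checks the inequalities of Fact 8 for small n, with log taken to base 2 as in the text.

```python
import math

# Sketch (ours): logarithmic, linear, quadratic and exponential growth
# evaluated at a few input sizes, showing how quickly they separate.
for n in (1, 4, 16):
    print(n, math.log2(n), n, n * n, 2 ** n)

# Fact 8 checked for small n: 2**n >= n + 1 and n >= log2(n + 1).
print(all(2 ** n >= n + 1 for n in range(32)))        # True
print(all(n >= math.log2(n + 1) for n in range(32)))  # True
```

Such a check is no substitute for the proof asked for in Exercise 145, but it gives a quick feel for why the inequalities are plausible.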
There are two functions which are useful when we need to convert results which are real or rational into integers. The floor function

R → Z, x ↦ ⌊x⌋,

maps a real number to the greatest integer less than or equal to it. See Example 4.71 if you want to find out how to draw a graph for a function like this. The ceiling function

R → Z, x ↦ ⌈x⌉,

maps a real number to the smallest integer greater than or equal to it.

0.3.6 Functions with several variables

You may have been taught about functions with several variables as being somehow more general than functions with one variable. However, this is not really the case. If we have a function whose source set is a product set, for example

f : R² → R,

then every argument for this function is a pair, because every element of R² is a pair. It may be useful to have access to the two components of the argument, and so it is fairly common to write something like

f(x, y) = x + y

to describe the behaviour of the function f. If we had insisted on using p ∈ R² to describe the argument of f then we would have to write [Footnote 42: Recall the projection functions π₁ and π₂ from Section 0.2.4.]

f(p) = π₁(p) + π₂(p),

which is much less clear. So, a function with several arguments is a function whose source set is a product, and where we have written the argument to have as many components as the product set has factors. Examples 0.23 and 0.24 talk about functions with two arguments and you should go back to them and look at them once more now.

0.3.7 Constructions for functions

An important way of constructing new functions from old ones is what is known as definition by cases. What this means is that one pieces together different functions to give a new one.

Example 0.32. Assume we want to give a proper definition of the 'absolute' function |·| : R → R⁺ for real numbers. The value it returns depends on whether the input is negative or not. The graph of this function is depicted here.

[Graph of the absolute function.]

We can write the corresponding assignment as

x ↦ x     if x ≥ 0
    −x    else.

Example 0.33.
If you want to give an alternative description of the function

Z → Z, n ↦ n mod 2,

which maps even numbers to 0 and odd numbers to 1, you could instead write

n ↦ 0    if n mod 2 = 0
    1    else

or, if you don't want to put the mod function into the definition, you could write

n ↦ 0    if n is even
    1    else.

What is important is that

• you give a value for each element of the source and
• you don't give more than one value for any element of the source.

In other words, on the right you must split your source set into disjoint parts, and say what the function does for each of those parts.

Example 0.34. You might need this when you are trying to describe the behaviour of an entity which changes. For example, assume you are given the following graph:

[Graph of a function defined piecewise on R⁺.]

This function R⁺ → R⁺ is given by the assignment

x ↦ x                 if x ∈ [0, 1]
    (1/8)x² + 7/8     if x ∈ (1, 4]
    (1/2)x + 7/8      else.

CExercise 10. Write down formal definitions for the following functions.

(a) The function which takes two integers and returns the negative of their product.
(b) The function from R × R to R which returns its first argument.
(c) The function from Z × Z to {0, 1} which is equal to 1 if and only if both arguments are even.
(d) The function from R to R which behaves like the sine function for negative arguments, and like the exponential function for base 2 for non-negative arguments.
(e) Draw a picture of the set { , , , } × { , , }. Define a function from that set to {0, 1} which is 1 if and only if its first argument has more letters than its second.

Apart from this, the constructions we have for sets are also meaningful for functions. Assume we have functions f : S → S′ and g : T → T′. Then we can define a function S × T → S′ × T′, which we refer to as f × g, by setting

(s, t) ↦ (f(s), g(t)).

Optional Exercise 2. Can you think of something that would allow you to extend the powerset construction to functions?

The following exercise draws on functions, as well as on the definition of the powerset from the previous section.

EExercise 11. Given a set S, define the following functions.
Don't forget to write down their source and target.

(a) A function from S to its powerset with the property that for every s ∈ S, the element s is a member of the set that s is mapped to.
(b) A function from the product of the powerset of S with itself to the powerset of S, with the property that a pair of sets is mapped to the set consisting of all those elements of S which are in either the first set or the second set, but not in both.
(c) Define a function from the product of S with its powerset to the set {0, 1} which returns 1 if and only if the first component of the argument is an element of the second component.
(d) Define a function from the set of finite subsets of N to N which adds up all the elements of the given set.

0.4 Relations

We study relations in detail in Chapter 7. Prior to that chapter, however, relations play a (minor) role in Chapter 3, and we give the basic ideas here for that reason.

Sometimes we have connections between two sets S and T which do not take the form of a function. We might have some set of pairs of the form (s, t), where s ∈ S and t ∈ T. Such a set is known as a binary relation. Note that relations of other arities exist, but it is customary to drop the 'binary' part and just speak of a relation.

Example 0.35. Consider the set S of all the first year students in the School of Computer Science, and the set U of all course units on offer in the university. We may then define a relation as

{(s, u) ∈ S × U | s is enrolled on u}.

This set is encoded in a database somewhere in the student system.

Relations are very flexible when it comes to capturing connections between various entities. A number of examples are given in Chapter 7 [Footnote 43: Note that this chapter is studied in Semester 2.], but here are some ideas for the kind of thing that one can do:

• Sometimes a set of interest may contain a number of elements one wishes to consider 'the same', for example when using fractions to describe the rational numbers.
One may use an equivalence relation (between the set and itself) to partition the set into equivalence classes and use those instead of the original elements. An example of this is the relation which connects two students if and only if they are in the same lab group.

• One may wish to compare the elements of a set with each other, indicating that one is below another. This is done using a relation between the set and itself known as a (partial) order. Examples of these are the usual orders on N, Z, Q and R, but more interesting options exist.

How does one describe a relation? The most common description is that of a subset of the product, as in the example above, similar to the graph of a function. This is a set, so the usual suggestions for describing sets apply. Quite often it is possible to describe a relation using a predicate.

Example 0.36. The relation which connects the integers m and n if m divides n is

{(m, n) ∈ Z × Z | n mod m = 0}.

We may apply more complicated conditions to pick out the set of pairs we want to describe.

Example 0.37. The equality of fractions as rational numbers provides another example. This relation is defined as

{(m/n, m′/n′) | m, m′ ∈ Z, n, n′ ∈ Z ∖ {0} and mn′ = m′n}.

It is less often the case that one can use the idea of generating the relation as a set. This typically only works if there is a way of expressing one element of the pair in terms of the other.

Example 0.38. The relation consisting of those pairs (x, y) in R² with x² = y can be generated as

{(x, x²) | x ∈ R}.

Note that in this particular case it is also possible to describe the same relation as the union of two sets, namely as

{(√y, y) ∈ R × R | y ∈ R⁺} ∪ {(−√y, y) ∈ R × R | y ∈ R⁺}.

In a case like this it is easy to show a picture of the set in question. If the relation is finite (and small) then it may be possible to list all the elements it contains. In this case it is also possible to draw a graph to indicate which elements are related.

Example 0.39. Here is the kind of graph one might draw for a small relation.
[Graph of a small relation between a three-element set and {1, 2, 3, 4}.]

This is the relation which relates a to 4, 3 and 1, relates b to 3, and relates c to nothing at all. Its set description is

{(a, 1), (a, 3), (a, 4), (b, 3)}.

Alternatively one could draw those pairs in the product set that belong to the relation in this way, similar to the graph of a function (see Section 0.3.4):

[Grid of {a, b, c} against {1, 2, 3, 4}, with a dot for each pair in the relation.]

You may think of the grid as giving all the possible combinations when picking one element from {a, b, c} and one from {1, 2, 3, 4}. The dot tells you whether the corresponding pair belongs to the relation or not.

Note that every function defines a relation between its source and its target via its graph. These are very special relations, described in more detail in Section 7.1.

If we have a binary relation from one set to itself then we can picture this by drawing connections between the elements of the given set. Typically we would say that we have 'a (binary) relation on the set S' instead of 'a (binary) relation from S to S'. This is a picture of a relation on a five-element set, given by six pairs:

[Picture of the six pairs of the relation drawn as connections between the five elements.]

Note that relations do not have to be binary; they can have a higher arity. A ternary relation for sets S, T and U, for example, is a subset of S × T × U. This kind of relation is difficult to picture in two dimensions, so typically no pictures are drawn for these.

Chapter 1
Complex Numbers

The real numbers allow us to solve many equations, but equations such as x² = −1 have no solutions in R. One way of looking at the complex numbers is that they remedy this problem. But assuming this is all they do would sell them far short. We here give a short introduction to the set of complex numbers, addition and multiplication operations for them, and their basic properties. Note that in order to solve exercises in this chapter you should only use properties given by Facts 1 to 7 in Chapter 0.

1.1 Basic definitions

We begin by giving some basic definitions.
Definition 15: complex numbers
The set of complex numbers C consists of numbers of the form a + bi, where a and b are in R. Here a is known as the real part and b as the imaginary part of the number.

At first sight it is not entirely clear what exactly we have just defined. One may view a + bi as an expression in a new language. If one of a or b is 0 it is customary not to write it, so the complex number a is equal to a + 0i and the complex number bi is equal to 0 + bi. Similarly, if b = 1 then it is customary to write a + i instead of a + 1i.

We may think [Footnote 1: Compare this with casting a value of one datatype to another in Java.] of a real number as being a complex number whose imaginary part is 0, so it has the form a + 0i. In that way the complex numbers can be thought to include the real numbers (just as we like to think of the real numbers as including the rational numbers). This gives a function from R to C defined by

a ↦ a + 0i.

Complex numbers are usually drawn as points within the plane, using the horizontal axis for the real and the vertical axis for the imaginary part.

[Picture: the point a + bi in the plane, with its real part marked on the horizontal axis and its imaginary part on the vertical axis.]

Above we have added labels for orientation, but usually this is done a bit differently. Instead of marking the real and the imaginary part on the axes it is more common to mark the 'imaginary axis' with i, giving a picture in the complex plane.

[Picture of a + bi in the complex plane.]

Example 1.1. We show how to draw the numbers 2 + 3i, −3 and −i in the complex plane.

[Picture of 2 + 3i, −3 and −i in the complex plane.]

The complex plane is naturally divided into four quadrants.

1.2 Operations

There are quite a few operations one defines for complex numbers.

1.2.1 The absolute

The absolute |a + bi| [Footnote 2: This is also known as the modulus of a complex number.] of a complex number a + bi is given by

√(a² + b²).

We may think of this as the length of the line that connects the point 0 with the point a + bi:

[Picture of the line from 0 to a + bi, of length √(a² + b²).]

Example 1.2. The absolute of the complex number 1 + 2i is calculated as follows.

|1 + 2i| = √(1² + 2²) = √5.
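Readers who want to experiment can use Python's built-in complex numbers, which match Definition 15 directly (Python writes j where the notes write i). A minimal sketch of ours, not part of the notes:

```python
import math

# Sketch (ours): complex(a, b) is the number a + bi from Definition 15.
z = complex(2, 3)
print(z.real, z.imag)      # 2.0 3.0

# The inclusion of R into C sends a to a + 0i.
print(complex(5, 0) == 5)  # True

# The absolute of Section 1.2.1 is abs(): |1 + 2i| = sqrt(5), as in
# Example 1.2.
print(math.isclose(abs(complex(1, 2)), math.sqrt(5)))  # True
```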
Note that this extends the notion of absolute for real numbers in the sense that

|a + 0i| = √(a² + 0) = √(a²) = |a|,

where we use the absolute function for real numbers on the right. [Footnote 3: And indeed note that we could use √(a²) as a definition of the absolute for a real number a.]

One can calculate with the complex numbers based on the following operations.

1.2.2 Addition

Addition of two complex numbers is defined as follows. We set

(a + bi) + (a′ + b′i) = (a + a′) + (b + b′)i.

[Picture of the two summands and their sum in the complex plane.]

To understand addition it is useful to think of the numbers in the complex plane as vectors [Footnote 4: Vectors will be taught in detail in the second half of Semester 2.]; then addition is just the same as the addition of vectors:

[Picture of the addition as vector addition.]

If you prefer, you may think of this as taking the vector for a′ + b′i and shifting it so that its origin coincides with the end point of the vector for a + bi. [Footnote 5: Note that you may just as well think of shifting the vector for a + bi such that its origin coincides with the end point of the vector for a′ + b′i.]

Example 1.3. We calculate the sum of 1 + 2i and −1 + i as follows.

(1 + 2i) + (−1 + i) = (1 − 1) + (2 + 1)i = 3i.

We note that if we have two complex numbers whose imaginary part is 0, say a and a′, then their sum as complex numbers is a + a′, that is, their sum as real numbers. Important properties of this operation are established in Exercises 27 and 28, which establish two equalities from Fact 6 for the complex numbers.

Note that 0 is the unit for addition [Footnote 6: Look at the unit for addition given by Facts 1, 3, 5 and 6.], that is, adding 0 to a complex number (on either side) has no effect. [Footnote 7: For a formal definition of the unit of an operation see Definition 20.] In other words we have

0 + (a + bi) = (0 + a) + (0 + b)i    def addition
             = a + bi                Fact 6
             = (a + 0) + (b + 0)i    Fact 6
             = (a + bi) + 0.         def addition

For the real numbers every element r has an inverse for addition in the form of −r: this is the unique number [Footnote 8: Again, compare Facts 3, 5 and 6 from the previous chapter.] which, [Footnote 9: For a formal definition of the inverse for a given element with respect to a given operation see Definition 21 in the following chapter.] if added to r on either side, gives the unit for addition, 0. For addition of complex numbers we can find an inverse by making use of the inverse for addition for the reals.
The following lemma explains how to calculate the additive inverse, and it also establishes that such an inverse exists for all complex numbers.

Lemma 1.1
The additive inverse of the complex number a + bi is

−a − bi,

which we often write as −(a + bi).

This establishes that every element of C has an additive inverse, and that means we may define subtraction for complex numbers by setting

(a + bi) − (a′ + b′i) = (a + bi) + −(a′ + b′i),

so as usual this is a shortcut for adding the additive inverse of the second argument to the first argument.

Exercise 12. Prove Lemma 1.1. Hint: The paragraph above this exercise tells you what you need to check, or look ahead to Definition 21.

Note that there is no easy connection between the absolute and addition; the best we may establish for complex numbers z and z′ is that |z + z′| ≤ |z| + |z′|.

Note that (a + bi) + (a + bi) = 2a + 2bi, and that we may think of this as stretching a + bi to twice its original length, and write it as 2(a + bi):

[Picture of a + bi and 2(a + bi) = 2a + 2bi.]

In general, given a real number r and a complex number a + bi we may define

r(a + bi) = ra + rbi.

Note that this means that our definition of the negative of a complex number works in the expected way in that

−(a + bi) = (−1)(a + bi) = (−1)a + (−1)bi = −a − bi.

Example 1.4. We see that 3(5 + i) = 15 + 3i, and we further calculate −√2(√2 − √2 i) = −2 + 2i.

CExercise 13. Draw the following numbers in the complex plane: 2, −2, 2i, −2i, 3 + i, −(3 + 4i), (−1 + 2i) + (3 + i), (1 + 2i) + (3 − i), (1 + 2i) − (3 + i).
For each quadrant of the complex plane pick one of these numbers (you may pick at most two numbers lying on an axis, and they have to be on different axes), and calculate its absolute.

Assume your friend has drawn a complex number z on a sheet that you cannot see. Instruct them how to draw the following.

(a) −z,
(b) 2z,
(c) 3z,
(d) rz, where r is an arbitrary real number.

Assume they have a ruler. They are supposed to draw these numbers without referring to coordinates or carrying out calculations on the side.

Exercise 14. Consider the function f from R² to C which is defined as follows: (a, b) ↦ a + bi. Show that

f(a, b) + f(a′, b′) = f((a, b) + (a′, b′))

for all (a, b), (a′, b′) ∈ R². Here we use the componentwise addition for elements of R², that is,

(a, b) + (a′, b′) = (a + a′, b + b′)

for all a, a′, b and b′ in R.

1.2.3 Multiplication

We define the multiplication operation on complex numbers by setting

(a + bi)(a′ + b′i) = aa′ − bb′ + (ab′ + a′b)i.

Example 1.5. We calculate

(1 + 2i)(2 − 3i) = (2 + 6) + (4 − 3)i = 8 + i.

Exercise 15. Show that 1 is the unit for multiplication. Hint: Check the calculation carried out above which shows that 0 is the unit for addition. Also look at Fact 1 which tells you what it means for 1 to be the unit for multiplication of natural numbers.

Note that if one of the numbers has imaginary part 0 then we retain the multiplication with a real number defined above, that is,

a(a′ + b′i) = aa′ + ab′i.

There is a geometric interpretation of multiplication, but it is a bit more complicated than that for addition. We here only give a sketch of this. Instead of giving the coordinates a and b to describe a point in the complex plane one could also give an angle and a length.

Definition 16: polar coordinates
The description in polar coordinates of a complex number, or its polar form, consists of a non-negative real number known as the absolute, and an angle in [0, 360°) (or in [0, 2π)) known as the argument.

[Picture of a + bi with its absolute and argument marked.]

Note that the absolute of a complex number in this sense is nothing but the absolute |a + bi| from above.

Example 1.6.
The complex number 1 + i has the absolute √2 and the argument 45° or, if you prefer, π/4. One might use a notation such as (√2, 45°) (or (√2, π/4)) for complex numbers given in this way.

This means there are two ways of describing a complex number: via

• the real part a and
• the imaginary part b,

or via

• the absolute r and
• the argument θ.

To move from polar coordinates to the standard form there is a simple formula: the complex number given by r and θ is

r(cos θ + (sin θ)i).

In the other direction one can use the arctangent function arctan, the partial inverse of the tangent function, to calculate θ given a and b, but a few case distinctions are required.

Optional Exercise 3. Write out the definition of the function that gives the argument for a complex number a + bi. Then prove that, starting from a complex number, calculating the argument and the absolute, and then calculating the real and imaginary part from the result, gives back the number one started with. Use these calculations to show that the argument of zz′ is the argument of z plus the argument of z′.

Describing multiplication is much easier when we do it with respect to these polar coordinates:

Fact 9
Assume that (r, θ) and (r′, θ′) are two complex numbers whose first component defines the absolute, and whose second component gives the argument. Their product (in the same format) is given by the number (rr′, θ + θ′).

[Picture of the two factors and their product in the complex plane.]

Note that there is a nice connection between the absolute and multiplication.

Lemma 1.2
For complex numbers z and z′ we have |zz′| = |z||z′|.

Exercise 16. Prove Lemma 1.2.

We note that according to the definition of multiplication we have

i · i = (0 + 1i)(0 + 1i) = 0 · 0 − 1 · 1 + (0 · 1 + 1 · 0)i = −1,

so in the complex numbers we may solve the equation z² = −1, which has no solution in R.

CExercise 17. Pick four numbers in at least three different quadrants of the complex plane. Calculate, and then draw, the product of each of those numbers with the number i.
Your friend has drawn the number z on the complex plane, but you can't see what they are doing. Instruct them how to draw iz without referring to any coordinates.

Optional Exercise 4. What happens if we keep multiplying i with itself? What does that tell you about solutions to the equation z⁴ = 1? What about solutions for zⁿ = 1 more generally?

Exercise 18. Consider the function f from R² to C which is defined as follows: (a, b) ↦ a + bi. Define addition and multiplication on R² based on these operations for complex numbers. Hint: You may want to consult Exercise 14.

We have seen that with regard to addition, every complex number z has an inverse in the form of −z. What about inverses for multiplication?

Lemma 1.3
For every complex number a + bi ≠ 0 the multiplicative inverse is given by [Footnote 10: Note that our condition means that a² + b² ≠ 0 and therefore we may form the fractions given here.]

a/(a² + b²) − (b/(a² + b²))i.

Sometimes the notation (a + bi)⁻¹ is used for this number. More generally, if we have a complex number z then its inverse, if it exists, is written as z⁻¹. Just as for real numbers, the expression z/z′ is a shortcut for z(z′)⁻¹.

Do not divide by complex numbers in your work. The correct operation is to multiply with the multiplicative inverse, and for full marks you have to include an argument that this exists in the case you are concerned with. In particular, if you would like to multiply with the multiplicative inverse of a variable you have to explicitly consider the case where that variable happens to be equal to 0. Whenever you use the number z⁻¹ you have to include an argument that this exists, that is, that z ≠ 0 (and indeed you should do this whenever you use r⁻¹ for real or rational numbers). Recall that we avoid talking about division as an operation on this unit, so if you want to remove a factor from an equation please try to talk about multiplying with the multiplicative inverse, and think about whether this exists!

EExercise 19. Prove Lemma 1.3. Hint: Check Fact 5 from the previous chapter to see what you have to prove, or look ahead to Definition 21.
Note: We have not defined 1/z for a complex number z and you should not use this expression. Calculate the inverse of the complex number z = a + 0i = a. How does that compare with the multiplicative inverse of a when viewed as an element of R?

EExercise 20. Assume you have a complex number in polar form (r, θ). What is the polar form of its multiplicative inverse? Hint: What does multiplication with the inverse have to give? Look at the picture on page 58 which explains multiplication in terms of absolute and argument to find a number that satisfies the requirement for an arbitrary absolute r and argument θ.

In summary, we have seen that just as for the real numbers, we may define addition and multiplication for complex numbers, and in such a way that if we treat real numbers as particular complex ones, then the operations agree. Indeed, it is also possible to define exponentiation and logarithms for complex numbers, but this idea leads us too far afield.

1.2.4 Conjugation

There is a further operation that you may find in texts that deal with complex numbers, namely conjugation. The conjugate of a complex number z = a + bi is given by

z̄ = a − bi.

Example 1.7. We give sample calculations: the conjugate of −2 + i is −2 − i, the conjugate of 3 is 3, and the conjugate of i is −i.

Exercise 21. Assume you have a complex number given by its absolute and argument. What are the absolute and argument of its conjugate? Hint: If you find this difficult draw a few examples in the complex plane.

CExercise 22. Show that zz̄ = |z|².

1.3 Properties

The complex numbers have various properties which make them a nice collection of numbers to work with. You are allowed to use the following in subsequent exercises on complex numbers. [Footnote 11: You are not allowed to use them in exercises where you are specifically asked to prove them!]

Fact 10
Addition and multiplication of the complex numbers have all the properties of the real numbers as given in Fact 6.

Optional Exercise 5. Prove the statements of Fact 10.
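As a quick numerical sanity check of the operations defined in this section, here is a sketch of ours (not part of the notes, and no substitute for the proofs the exercises ask for) using Python's complex type, whose .conjugate() method is the conjugation of Section 1.2.4.

```python
import math

# Sketch (ours): conjugation, and a numerical check that
# z * conj(z) = |z|**2 (the identity of CExercise 22) for one z.
z = complex(2, -1)
print(z.conjugate())  # (2+1j)

w = z * z.conjugate()
print(w.imag == 0.0)                      # True
print(math.isclose(w.real, abs(z) ** 2))  # True

# Examples 1.3 and 1.5 recomputed: addition and multiplication.
print((1 + 2j) + (-1 + 1j))  # 3j
print((1 + 2j) * (2 - 3j))   # (8+1j)
```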
Note that when it comes to solving equations, the complex numbers are even better behaved than the real ones: every polynomial equation, that is, an equation of the form

aₙzⁿ + aₙ₋₁zⁿ⁻¹ + · · · + a₁z + a₀ = 0,

where the aᵢ are complex numbers, has at least one solution [Footnote 12: In fact, one can show that there are n solutions, but this requires counting some solutions more than once.] in C, whereas this is not true in R even if the aᵢ are all elements of R. This means that the complex numbers are particularly suitable for various constructions that depend on having solutions to polynomials.

Note that it does not make any sense to use the square root operation for complex numbers. While for a positive real number r we use the symbol √r to refer to the positive of the two solutions to the equation r = z², for a complex number there is no sensible way of picking out one of the possible solutions of z² = z′.

Example 1.8. Consider the equation z² = i. We may check that there are two solutions,

(1/√2)(1 + i) and (1/√2)(−1 − i).

Which of those should be the number we mean by √i? You may think there's still a sensible choice, namely the one where both real and imaginary part are positive. Now consider the equation z² = −i. It has the solutions

(1/√2)(−1 + i) and (1/√2)(1 − i),

and picking one over the other does not make sense.

If we go to equations involving powers higher than two then the number of solutions increases.

Example 1.9. Let us consider the following equation:

z⁴ = 1.

If z is supposed to be a real number then there are two solutions, namely 1 and −1. If, on the other hand, we are allowed to pick solutions from C we note that there are at least four: 1, −1, i, −i. Certainly there is no good way of picking out one of these solutions to determine which number we might mean by ⁴√1. For this reason there are no root operations on the complex numbers.
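The multiplicity of solutions in Examples 1.8 and 1.9 can be confirmed numerically. A sketch of ours, not part of the notes:

```python
import cmath

# Sketch (ours): Example 1.8 has two square roots of i, so no canonical
# choice of 'sqrt(i)' exists.
s = (1 / 2 ** 0.5) * (1 + 1j)
print(cmath.isclose(s * s, 1j))        # True
print(cmath.isclose((-s) * (-s), 1j))  # True

# Example 1.9: the four solutions of z**4 = 1.
print(all(z ** 4 == 1 for z in (1, -1, 1j, -1j)))  # True
```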
Do not use square root symbols if you are interested in a solution of the equation z² = z′, where z′ is given. This is not a valid operation for complex numbers.

An important difference between the complex numbers and the other sets of numbers discussed in Chapter 0 is that we are used to thinking of the latter as being ordered, which means we can compare two elements. There is no way of turning C into an ordered set so that the statements in Fact 7 are true for that order.

In analysis, which includes the study of functions, their derivatives and their integrals, the theory of functions of complex numbers is much smoother than its counterpart for the reals. In order to calculate various (improper) integrals for functions of real variables one may apply methods that require functions of complex variables.

The fact that complex numbers may be thought of as having two parts, and that we have various operations for these, means they are particularly suited to a number of application areas where these operations may be interpreted.

1.4 Applications

Complex numbers may appear artificial, and having numbers with an 'imaginary' part may suggest that these are merely figments of some mathematicians' imagination. It turns out, however, that they are not merely some artefact whose only use is to deliver a number which is a square root of −1. Because complex numbers can effectively be thought of as vectors, but vectors which allow multiplication as well as addition, they are very useful when it comes to talking about quantities that need a more complex structure to express them than just one number.

In physics and various areas of engineering, quantities which may be described by just one number are known as 'scalars'. Examples are distance, speed (although to describe movement one might want to include direction with speed) and energy. In a direct-current circuit, voltage, resistance and current are treated as scalars without problem.
In alternating-current circuits, however, there are notions of frequency and phase shift which have to be taken into account, and it turns out that using complex numbers to describe such circuits results in a very useful depiction. Moreover, some calculations become much simpler when one exploits the possibilities given by modelling the circuit with complex numbers.

Signal analysis is another area where complex numbers are often employed. Again the issue here is periodically varying quantities. Instead of describing these using a sine or cosine function of some real variable, employing the extensions of these functions to the complex numbers makes it possible to describe the amplitude and phase at the same time. There you will see, for example, that by using complex numbers for a Fourier transform, calculations that look complicated can be carried out via matrix multiplication. [Footnote 13: The latter will be treated towards the end of Semester 2 of this course unit.] When you meet this material you should remind yourself of what you know about complex numbers from these notes.

There are other areas where applications arise, such as fluid dynamics, control theory and quantum mechanics.

Chapter 2
Statements and Proofs

Mathematics is a discipline that relies on rigorous definitions and formal proofs. As a consequence, in mathematics statements hold, or they do not (but we may not know which it is). [Footnote 1: There are also issues to do with whether a given formal system allows us to construct a proof or a counterexample.] This is very different from the situation in the natural sciences, for example. There a theory may be falsified by observations that contradict it, but there is no way of formally verifying it.

How does a system that seeks to provide such certainty work? In principle the thought is that it is possible to define a theory strictly from first principles (typically starting with a formal theory of sets), with rules for deriving statements from existing ones.
Such a system is very rigid and syntactic [Footnote 2: This means concerned with symbols put together according to some rules, without any concern for what they might mean.] in nature, much like a computer language (and indeed there are computer programs that implement at least aspects of this). Statements that may be formally derived in the system are known as theorems. In principle it should be possible to fit all of mathematics into a formal system like this. [Footnote 3: But there is a famous result by the logician Kurt Gödel, his Incompleteness Theorem, which tells us that any system sufficiently powerful for most of mathematics cannot prove its own consistency.]

But in practice this is not what mathematicians do. There are two reasons for this. Starting from first principles it takes a very long time to build up the apparatus required to get to where one may even talk about entities such as the real numbers with complete rigour. Secondly, the resulting statements are very unwieldy and not human-readable. Hence mathematicians carry out their work in some kind of meta-language which in principle can be translated into a formal system. Increasingly there are computer-verified proofs in some formal system in various areas, in particular in theoretical computer science.

In this and the following chapter of the notes we look at both these ideas: proofs as they are customarily carried out by mathematicians, and a formal system.

2.1 Motivation

You are here to study computer science rather than mathematics, so why should you worry about proving statements? There are two reasons one might give here. For one, there is the area of theoretical computer science, which arguably is also an area of mathematics. The aim of this part of computer science is to make formal statements and to prove them. Here are some examples of the kind of statements that are of concern in this area.
• This abstract computational device has the same computational power as another.
• This computation is equivalent to another.⁴
• This abstract computational system behaves in a particular manner over time.
• This problem cannot be solved by a computer, or, equivalently, there is no algorithm (or decision procedure) for it (see COMP11212).
• The best possible algorithm for this problem requires a number of steps that is a quadratic function in the size of the problem (see COMP26120).
• This program will terminate, and after it has done so its result will satisfy a particular condition.
• This circuit implements a particular specification.

You can see that while the first few statements sound fairly abstract, the latter two look as if they might be closer to real-world applications. Secondly, under certain circumstances it is important to make absolute statements about the behaviour of a computational device (a chip or a computer program, for example). Formally proving that programs behave in a particular way is labour-intensive (and creating a formal model of the real world in which the device lives is potentially error-prone). In safety-critical systems, however, the benefits are usually thought to outweigh the cost. For example, in an aircraft it is vital that the on-board computer behaves in a particular way. Emergency course corrections have to be made promptly and correctly, or the result may be fatal for those on board. When NASA sends an explorer to Mars, or the Voyager spacecraft to fly through the solar system (and to eventually leave it), then it is vital that a number of computer-controlled manoeuvres are correctly implemented. Losing such a craft, or rendering it incapable of sending back the desired data, costs large amounts of money and results in a major setback. But even outside such applications computing is full of statements that are at least in part mathematical. Here are some examples.
• The worst case complexity of this algorithm is 𝑛 log 𝑛 (the kind of statement that you will see in COMP26120).
• This recursive procedure leads to exponential blow-up.
• A simple classification rule is to choose the class with the highest posterior probability (in artificial intelligence or machine learning).
• Time-domain samples can be converted to the frequency domain using Fourier transforms, which are a standard way of representing a complex signal as a linear sum of basic functions (from COMP28512).

⁴What it might mean for two computations to be equivalent is a whole branch of theoretical computer science.

The aim of this course unit is to prepare you for both of these: studying areas of theoretical computer science and making sense of mathematical statements that appear in other parts of the field.

2.2 Precision

Something the language of mathematics gives us is precision. You need to become familiar with some aspects of this. In particular, there are some key phrases which sound as if they might be part of every-day language, but which have a precise meaning in a mathematical context.

2.2.1 Key phrases

Vocabulary that helps us with this consists of phrases such as

• ‘and’,
• ‘or’,
• ‘implies’ (and the related ‘if and only if’),
• ‘there exists’ and
• ‘for all’.

The aim of this section is to introduce you to what these phrases mean, and how that is reflected by the way we prove statements involving them.

Keyword And

Formally we use this word to connect several statements, or properties, and we demand that all of them hold. When you enter several words into the Google search box you ask it to return pages which contain all the listed words: you are demanding pages that contain word 1 and word 2 and . . . If you are running database queries you are often interested in all entries that combine several characteristics; for example, you might want all your customers from a particular country for whom you have an email address, so that you can make a special offer to them.
These are all informal usages, but they have fundamentally the same meaning as the more mathematical ones.

Example 2.1. A very simple example is the definition of the intersection of two subsets S and T of a set X:

S ∩ T = {x ∈ X | x ∈ S and x ∈ T}.

In order to prove that an element x is in this intersection we have to prove both, that x is in S and that x is in T. It is a good idea to structure proofs so that it is clear that these steps are carried out. Here is an example of this idea.

Example 2.2. To show that 6 is an element of

{n ∈ N | n mod 2 = 0 and n mod 3 = 0} = {n ∈ N | n mod 2 = 0} ∩ {n ∈ N | n mod 3 = 0}

one splits the requirement into the two parts connected by ‘and’.

• To show that 6 is in the first set we have to show that 6 mod 2 = 0, and we may conclude this from 6 = 3 · 2 + 0, Fact 2 and the definition of mod.
• To show that 6 is in the second set we have to show 6 mod 3 = 0, and we may conclude this in the same way from 6 = 2 · 3 + 0.

Overall this means that 6 is in the intersection as required.

This usage of ‘and’ may also be observed in every-day language: if I state ‘it is cloudy and it is raining’ then I am claiming that both of the following statements are true:

It is cloudy. It is raining.

Example 2.3. In order to check that a first year student in the department satisfies the degree requirement of being enrolled on COMP10120 as well as being enrolled on COMP16321, I have to do both,

• check that the student is enrolled on COMP10120 and
• check that the student is enrolled on COMP16321.

Example 2.4. In order to establish x ∉ S ∩ T it is sufficient to show one of

x ∉ S,    x ∉ T.

In general, in order to argue that a statement of the form

(Clause 1 and Clause 2)

does not hold it is sufficient to show that one of the two clauses fails to hold.

Example 2.5. In order to argue that it is not the case that

3 is a prime number and 3 is even

it is sufficient to be able to state that, since 3 leaves the remainder of 1 when divided by 2, it is not even by Definition 4.
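Membership checks like the one in Example 2.2 are entirely mechanical, so the ‘and’ structure of the proof can be mirrored directly in code. A minimal Python sketch (the function name is my own, not from the notes):

```python
def in_intersection(n):
    # n lies in the intersection exactly when BOTH defining
    # conditions hold, mirroring the two parts of the proof
    return n % 2 == 0 and n % 3 == 0

print(in_intersection(6))   # True: 6 passes both checks
print(in_intersection(4))   # False: 4 fails the mod 3 check
```

As with the written proof, refuting membership only requires one of the two checks to fail.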
Keyword Or

We connect two statements or properties with ‘or’ if at least one (but possibly both) of them hold. To use the Google search box as an example once again, if you type two entries separated by ‘OR’ it will look for pages which contain one of the two words. This is also a fairly standard database query: you might be interested in all the customers for whom you have a landline or a mobile phone number, or all the ones who have ordered one product or another, because you have an accessory to offer to them.

Example 2.6. Again a simple example is given by sets, namely by the definition of the union of two subsets S and T of a set X, which is given by

S ∪ T = {x ∈ X | x ∈ S or x ∈ T}.

For a concrete version of this see the following example.

Example 2.7. In order to show that 6 is an element of

{n ∈ N | n mod 2 = 0 or n mod 3 = 0}

it is sufficient to prove one of the two parts. It is therefore sufficient to state that

• since 6 = 3 · 2 + 0 we have that 6 mod 2 = 0 by the definition of mod.

It is not necessary to check the other clause. This suggests a proof strategy: look at both cases separately, and stop when one of them has been established.

Again this usage is well established in informal language (although usage tends to be less strict than with ‘and’). If I say ‘Tomorrow I will go for a walk or a bicycle ride’ then I expect (at least) one of the following two sentences to be true:

Tomorrow I will go for a walk. Tomorrow I will go for a bicycle ride.

There is no information regarding which one will occur, and indeed I may find the time to do both!

Example 2.8. If the degree rules state that a student on the Computer Science with Mathematics programme must take one of COMP11212, COMP13212 and COMP15212, then I can stop checking once I have seen that the student is enrolled on COMP11212.

Note that in informal language ‘or’ often connects incompatible statements, which means that at most one of them is true.
In mathematics two statements connected with ‘or’ may be incompatible, but they may well not be, and typically they aren’t. If we wanted to express the idea of two statements being incompatible we would have to say ‘Exactly one of Statement 1 and Statement 2 holds’.

In order to show that a statement of the form

(Clause 1 or Clause 2)

does not hold we have to show that neither of the clauses holds.

Example 2.9. In order to show that x ∉ S ∪ T we have to establish both,

x ∉ S and x ∉ T.

Example 2.10. In order to show that 7 is not an element of

{n ∈ N | n mod 2 = 0 or n mod 3 = 0}

I have to argue in two parts:

• Since 7 mod 2 = 1 ≠ 0 we see that 7 does not satisfy the first condition.
• Since 7 mod 3 = 1 ≠ 0 we see that 7 does not satisfy the second condition.

Hence 7 is not in the union of the two sets.

Keyword Implies

This is a phrase that tells us that if the first statement holds then so does the second (but if the first statement fails to hold we cannot infer anything about the second). For this notion it is harder to find examples outside of mathematics, but you might want to ensure that in your database the existence of an address entry for a customer implies the existence of a post code. Again a simple formal example can be found by looking at sets.

Example 2.11. If S and T are subsets of some set X then

S ⊆ T

means that, given x ∈ X,

x ∈ S implies x ∈ T.

In order to establish that S ⊆ T, given x ∈ X one only has to show something in the case that x ∈ S, so usually proofs of that kind are given by assuming that x ∈ S, and then establishing that x ∈ T also holds. We look at a concrete version of this.

Example 2.12. To show that

{n ∈ N | n mod 6 = 0} ⊆ {n ∈ N | n mod 3 = 0}

we pick an arbitrary element n of the former set. Then n mod 6 = 0, and by the definition of mod this means that there is k ∈ N with

n = 6k + 0.

But that means that

n = 6k = (2 · 3)k = (2k) · 3

using Fact 1, and so, picking l = 2k, we can see that

n = 3l + 0,

which means that n mod 3 = 0.
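Before proving an inclusion like the one in Example 2.12 one can sanity-check it on an initial segment of N. A finite check like the Python sketch below can expose a false implication via a counterexample, but it never proves a ‘for all’ statement (the bound 1000 is arbitrary):

```python
# every n up to the bound that satisfies the premise (n mod 6 = 0)
# also satisfies the conclusion (n mod 3 = 0)
assert all(n % 3 == 0 for n in range(1000) if n % 6 == 0)

# the converse inclusion fails, and one witness refutes it:
# 3 satisfies 3 mod 3 = 0 but not 3 mod 6 = 0
assert 3 % 3 == 0 and 3 % 6 != 0
```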
Typically we do not use ‘implies’ in informal language, but we have constructions that have a similar meaning. I might state, for example, ‘if it rains I will stay at home’. So if on the day it is raining you should not expect to meet me; you should expect me to be at home. Note that this does not allow you to draw any conclusions in the case where it is not raining, although many people tend to do so. (‘But you said you wouldn’t be coming only if it was raining . . . ’.) If I say ‘if it is raining I will definitely stay at home’ I’ve made it clearer that I reserve the right to stay at home even if it is not raining. Note that in the formal usage of ‘implies’ the meaning is completely precise.

Example 2.13. One might use implication to state that if a student is enrolled on the first year of a particular degree programme, for example Computer Science, then they must be enrolled on a particular course unit, for example COMP11120. In order to show that this is true one has to check that the statement holds for every student on that programme.

In order to show that a claimed implication does not hold one has to find an instance where the first clause holds while the second does not. So in order to catch me out as having said an untruth in the above example, you’ve got to find me out of the house when it is raining at the appointed time.

Example 2.14. In order to refute the claim that, for a natural number n,

n divisible by 2 implies n divisible by 6,

we need to find a number n that

• fulfils the first part, that is, it must be divisible by 2, but
• does not fulfil the second part, that is, it is not divisible by 6.

Since 4 = 2 · 2 the number 4 is divisible by 2 according to Definition 3. Since there is no number k ∈ N with 4 = 6k, the number 4 is not divisible by 6, which establishes that the claimed implication is false.

Key phrase If and only if

This phrase is merely a short-cut.
When we say that

Statement 1 (holds) if and only if Statement 2 (holds)

then we mean by this that both,

Statement 1 implies Statement 2

and

Statement 2 implies Statement 1.

To prove for two sets S, T ⊆ X that S = T is equivalent to showing that for all x ∈ X we have x ∈ S if and only if x ∈ T, which is equivalent to showing that S ⊆ T and T ⊆ S.

Example 2.15. To show that

S = {n ∈ N | n mod 6 = 0}

is equal to

T = {n ∈ N | n mod 2 = 0} ∩ {n ∈ N | n mod 3 = 0}

we show that S ⊆ T and that T ⊆ S. It is a good idea to optically structure the proof accordingly.

S ⊆ T. Given n ∈ S we know that n mod 6 = 0, which means that we can find k ∈ N with n = 6k. This means that both

• n = (3k)2 and so n mod 2 = 0 and
• n = (2k)3 and so n mod 3 = 0,

and so n ∈ T.

T ⊆ S. Given n ∈ T we know that both

• n mod 2 = 0, which means that there is k ∈ N with n = 2k, and
• n mod 3 = 0, which means that there is l ∈ N with n = 3l.

This means that 2, which is a prime number, divides n = 3l. By Definition 17 this means that

• 2 divides 3 (which clearly does not hold) or
• 2 divides l, which means that there exists m ∈ N with l = 2m.

Altogether this means that

n = 3l = 3(2m) = 6m,

and so n is divisible by 6.

Quite often when proving an ‘if and only if’ statement the best strategy is to prove the two directions separately. The only exception is when one can find steps that turn one side into the other, and every single step is reversible. In order to show that an ‘if and only if’ statement does not hold it is sufficient to establish that one of the two implications fails to hold.

Key phrase For all

Again this is a phrase that is very common in mathematical definitions or arguments, but there are other uses. For example you might want to ensure that you have an email address for every customer in your database.

Example 2.16. Consider the following statement.

For all elements n of {4k | k ∈ N} we have that n is divisible by 2.

In order to show that this is true I have to assume that I have an arbitrary element n of the given set. In order for n to be in that set it must be the case that there exists k ∈ N such that n = 4k.
But now

n = 4k = 2(2k)

and according to Definition 3 this means that n is divisible by 2. Since an arbitrary element of the given set satisfies the given condition, they must all satisfy it.

A ‘for all’ statement should have two parts.

• For which elements are we making the claim? There should be a set associated with this part of the statement (this is N in the previous example).
• What property or properties do these elements have to satisfy? There should be a statement which specifies this. In the previous example it is the statement that the number is divisible by 2.

Typically when proving a statement beginning with ‘for all’ one assumes that one has an unspecified element of the given set, and then establishes the desired property.

Example 2.17. Looking back at the statement above that one set is a subset of another, we have, strictly speaking, suppressed a ‘for all’ statement. Given subsets S and T of a set X, the statement

S ⊆ T

is equivalent to

for all x ∈ X, x ∈ S implies x ∈ T.

In the proof in Example 2.12 we did indeed pick an arbitrary element of the first set, and then showed that it is an element of the second set.

Example 2.18. A nice example of a ‘for all’ statement is that of the equality of two functions with the same source and target. Let f : S → T and g : S → T be two functions. Then

f = g

if and only if

for all s ∈ S we have f(s) = g(s).

We look at a concrete version of this idea.

Example 2.19. Consider the following two functions.

f : N → N,  n ↦ 2(n div 2)

g : N → N,  n ↦ { n if n is even, n − 1 else.

To show that the two functions are equal, assume we have an arbitrary element n of the source set N. The second function is given in a definition by cases, and usually it is easier to also split the proof into these two cases.

• Assume that n is even, which means that n mod 2 = 0 by Definition 4. In this case

f(n) = 2(n div 2)                definition of f
     = 2(n div 2) + 0            0 unit for addition
     = 2(n div 2) + n mod 2      n even
     = n                         Lemma 0.1
     = g(n)                      definition of g.

• Assume that n is not even, which means that n mod 2 = 1 by Definition 4.
In this case

f(n) = 2(n div 2)                definition of f
     = 2(n div 2) + 0            0 unit for addition
     = 2(n div 2) + 1 − 1        1 − 1 = 0
     = 2(n div 2) + n mod 2 − 1  n not even
     = n − 1                     Lemma 0.1
     = g(n)                      definition of g.

In every-day language you are more likely to find the phrase ‘every’ instead of ‘for all’. Mathematicians like to use phrases that are a little bit different from what is common elsewhere, to draw attention to the fact that they mean their statement in a formal sense.

Example 2.20. ‘Every first year computer science student takes COMP11120’ is a claim that is a ‘for all’ statement. In order to check whether it is true you have to go through all the first year students in the department and check whether they are enrolled on this unit.

In order to show that a statement beginning ‘for all’ does not hold it is sufficient to find one element of the given set for which it fails to hold.

Example 2.21. In the previous example, if you can find one student in computer science who is not enrolled on COMP11120 then you have shown that the statement given above does not hold.

Example 2.22. In order to refute the claim that

for all natural numbers m and n it is the case that m − n = n − m,

it is sufficient to find one counterexample, so by merely writing

2 − 1 = 1 ≠ −1 = 1 − 2,

we have proved that the claim does not hold.

Key phrase There exists

This is a phrase that is frequently found both in mathematical definitions and arguments.

Example 2.23. The definition of divisibility (compare Definition 3),

n is divisible by m if and only if there exists k ∈ N such that n = km,

is an example of a ‘there exists’ statement. Whenever a statement is made about existence there should be two parts to it:

• Where does the element exist? There should always be a set associated with the statement. In the above example, k had to be an element of N (and indeed the existence of some k ∈ R with the same property would completely change the definition and make it trivial).
• What are the properties that this element satisfies? There should always be a statement which specifies this.
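A ‘there exists’ statement over a bounded search space, such as the divisibility statement of Example 2.23 for fixed numbers, can be settled by exhaustively looking for a witness. A Python sketch (the function name is my own):

```python
def divisibility_witness(m, n):
    # search for some k with n = k * m (Definition 3 style);
    # return the witness if one exists, else None
    for k in range(n + 1):
        if n == k * m:
            return k
    return None

print(divisibility_witness(9, 27))   # 3, so 9 divides 27
print(divisibility_witness(6, 27))   # None, so 6 does not divide 27
```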
In the example above the property is n = km (for the given n and m). One proves a statement of this form by producing an element which satisfies it; such an element is also known as a witness.

Example 2.24. To show that 27 is divisible by 9, by Definition 3 I have to show that there exists k ∈ N with

27 = 9k.

To show this I offer k = 3 as a witness, and observing that 9 · 3 = 27 verifies that this element has the desired property.

To go back to the database example, you might wonder whether you have a customer in your database who lives in Italy, or whether you have a customer who is paying with cheques (so that you can inform them that you will no longer accept these as a payment method). Again in every-day language the phrase ‘there is’ is more common than ‘there exists’. The latter serves to emphasize that a statement including it should be considered a precise mathematical statement.

Example 2.25. ‘There is a student who is enrolled on both COMP25212 as well as MATH20302’ may have implications for the timetable.

Example 2.26. In order to show that there exists a number which is both even and prime, it is sufficient to supply the witness 2, together with an argument that it is both even and prime.

Often the difficulty with proving a ‘there exists’ statement lies in finding the witness, rather than in proving that it has the required property. Sometimes instead of merely demanding the existence of an element we might demand its unique existence. This is equivalent to a quite complex statement and is discussed below.

In order to show that a statement beginning ‘there exists’ does not hold one has to establish that it fails to hold for every element of the given set. So to demonstrate that the statement above regarding students does not hold you have to check every single second year student.

Key phrase Unique existence

We sometimes demand that there exists a unique element with a particular property. This is in fact a convenient shortcut.
There exists a unique x ∈ S for which property P holds

if and only if

there exists x ∈ S with property P

and

for all x, x′ ∈ S, if x and x′ satisfy property P then x = x′.

Example 2.27. If f : S → T is a function from the set S to the set T then we know that the function assigns to every element of S an element of T, and this means that

for every s ∈ S there exists a unique⁵ t ∈ T with f(s) = t.

Uniqueness is important here: we expect that a function, given an input value, produces precisely one output value for that input. So if we have values t and t′ in T which both satisfy the statement then we have

t = f(s) = t′,

and so t = t′. This idea is used in characterizing graphs of functions, see Definition 14.

⁵Note that this is a different statement from either Definition 22 or Definition 23; in exams students sometimes get confused about this.

Key phrases: Summary

We now have the key ingredients that formal statements are made of, namely the key phrases which allow us to analyse their structure. Analysing the structure of a statement allows us to construct a blueprint for a proof of that statement. The key ideas are given in the text above; we give a summary in the form of Table 2.1. By ‘counterproof’ we mean a proof that the statement does not hold. In the table P, P1 and P2 are statements, possibly containing further key phrases.

statement                 proof                              counterproof
P1 and P2                 proof of P1 and proof of P2        counterproof for P1 or
                                                             counterproof for P2
P1 or P2                  proof of P1 or proof of P2         counterproof for P1 and
                                                             counterproof for P2
if P1 then P2             assume P1 holds and prove P2       find a situation where P1
                                                             holds and P2 does not
for all x, P              assume an arbitrary x is given     give a specific x and show
                          and prove P for that x             P does not hold for that x
there is x such that P    find a specific x and show that    assume you have an arbitrary x
                          P holds for that x                 and show P does not hold for that x

Table 2.1: Key phrases and proofs

Tip

Every statement we might wish to prove, or disprove, is constructed from the key phrases.
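Over a finite collection, the two clauses of unique existence (a witness exists, and any two witnesses are equal) amount to there being exactly one witness. A hypothetical Python helper illustrating this (all names are mine):

```python
def exists_unique(has_property, candidates):
    # unique existence = at least one witness AND any two witnesses
    # coincide, i.e. exactly one candidate has the property
    witnesses = [x for x in candidates if has_property(x)]
    return len(witnesses) == 1

def is_prime(n):
    # naive trial division, fine for small n
    return n > 1 and all(n % d != 0 for d in range(2, n))

# exactly one natural number below 100 is both even and prime: 2
assert exists_unique(lambda n: n % 2 == 0 and is_prime(n), range(100))
# mere existence of an even number is not unique existence
assert not exists_unique(lambda n: n % 2 == 0, range(100))
```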
In order to find a blueprint for a proof, or counterproof, all we have to do is to take the statement apart, and follow the instructions from Table 2.1.

We give a number of additional examples for more complex statements in the following sections. Note in particular the proof of Proposition 2.1 as an example of a lengthy proof of this kind. One shouldn’t think of the above as mere ‘phrases’: they allow us to construct formal statements, and they come with a notion of how to establish proofs for these. This is what mathematics is all about. We look at an even more formal treatment of these ideas in the material on logic, which is taught after we are finished with the current chapter. In the following sections we look at examples of such statements which give definitions that are important in their own right. The aim is for you to become familiar with the logical constructions as well as learning about the given examples.

2.3 Properties of numbers

We begin by giving examples within some sets of numbers. You will need to use the definitions and properties from Chapter 0 here.

In the examples that follow, on the left hand side we give running commentary on how to construct the proof that appears on the right hand side.

Example 2.28. We prove the following statement for integers k, m and n:

If k divides m then k divides m · n.

This is an ‘if . . . then’ statement. Table 2.1 above tells us we should assume the first statement holds. Assume that k, m and n are integers and that k divides m. Sooner or later we have to apply the formal definition of ‘divides’ to work out what this means. By Definition 3 this means that there exists an integer l such that

k · l = m.

It is usually a good idea to write down what we have to prove, again expanding the definition of ‘divides’. We have to show that k divides m · n, and by Definition 3 we have to show that there exists an integer j such that

k · j = m · n.

We have to establish a ‘there exists’ statement, and Table 2.1 tells us we have to find a witness for which the statement is true.
At this point one usually has to stare at the statements already written down to see whether there is an element with the right property hidden among them. We have

m · n = (k · l) · n    assumption
      = k · (l · n)    Fact 1,

and so we have found an integer, namely l · n, with the property that we may multiply it with k to get m · n.

Example 2.29. Assume we are asked to prove

for all n ∈ N, 2n is even.

Table 2.1 says that to show a statement of the form ‘for all . . . ’ we should assume we have a natural number n. Let n be in N. So far so good. What about the statement ‘2n is even’? At this point one should always look up the formal definition of the concepts used in the statements. We have to show that 2n is even; by Definition 4 this means we have to show that 2 divides 2n. So now we have put in the definition of evenness, but that leaves us with divisibility, so we put in that definition. By Definition 3 we have to show that

there is k ∈ N with 2n = 2k.

We pick k = n and so the claim is established.

Sometimes you are not merely asked to prove or disprove a statement, but you first have to work out whether you should do the former or the latter. This changes the workflow a little.

Example 2.30. Assume we are asked to prove or disprove the following statement for integers k, m and n.

If k divides n, and m divides n, then k · m divides n.

Now we first have to work out whether we want to prove the statement, or find a counterproof. Usually it’s a good idea to do some examples. We have that 2 divides 6 and 3 divides 6, and 2 · 3 = 6 divides 6, but 2 divides 2 and 2 divides 2, whereas 2 · 2 = 4 does not divide 2, so this statement is false. But what does a formal argument look like in this case? Table 2.1 tells us that it is sufficient to find one way of picking k, m and n which makes the claim false. We show how to use the counterexample we found informally to formally establish that the statement does not hold. We note that 2 · 1 = 2, and so 2 divides 2 by Definition 3. We pick k = m = n = 2. For those choices, the above establishes that k divides n and that m divides n.
But k · m = 4, and this number does not divide n = 2; hence the statement is false.

Example 2.31. Assume we are given the statement

there is an i ∈ Z ∖ {0, 1} such that i + i = −(i · i)

and are asked to prove or disprove it. Do we believe the statement? If we try 2 + 2 we get 4, but −(2 · 2) = −4, and clearly we get a sign mismatch if we use any positive integer. But what about i = −2? Table 2.1 tells us that all we have to do to give a proof for a ‘there exists’ statement is to find one witness for which the claim is true. If we set i = −2 then we have

i + i = −2 + (−2)    definition of i
      = −4           arithmetic
      = −(2 · 2)     arithmetic
      = −(i · i)     definition of i,

as required.

Exercise 23. Prove or disprove the following statements about divisibility for integers, making sure to use Definition 3. Assume that k, l, m and n are integers. Follow the examples above in style (you don’t have to give the running commentary).

(a) If k divides m and l divides n then k · l divides m · n.
(b) If 2 divides m · n then 2 divides m and 2 divides n.
(c) If k divides m and m divides n then k divides n.
(d) If k divides m and m divides k then k = m.

Here is a definition of a number being prime that will look different from the one you have seen before. The aim of this is to encourage you to follow the given formal definition, and not your idea of what it should mean.

Definition 17: prime

An element p ≠ 1 of N (or p ≠ ±1 in Z) is prime if and only if for all elements m and n of N (or Z) it is the case that

p divides m · n implies p divides m or p divides n.

Example 2.32. Assume that we have the statement

for all n ∈ N ∖ {0, 1}, n is prime or n is a multiple of 2.

Do we believe the statement? Well, 0 and 1 have been excluded, so let’s look at the next few numbers. We have that 2 is prime, 3 is prime, 4 is a multiple of 2, 5 is prime. . . This looks good, but do we really believe this? Are there really no odd numbers which are not prime? The number 9 comes to mind. So we want to give a counterproof. Table 2.1 tells us that we are looking for one n such that the statement does not hold. Let n = 9.
To give a counterproof we have to show that 9 does not satisfy the claim. The two statements are connected with ‘or’, so according to Table 2.1 we have to show that neither holds. We note that 9 is not prime since 9 = 3 · 3, so 9 divides 3 · 3 but 9 does not divide 3, which means that 9 does not satisfy Definition 17. If 9 were even it would have to be divisible by 2 according to Definition 4, but since 9 mod 2 = 1, Definition 3 tells us that this is not the case. Hence this is a counterexample to the claim.

Exercise 24. Establish the following claims for prime numbers using Definition 17, and definitions and facts from Chapter 0.

(a) Show that if an element p of N is prime then for all m ∈ N we have

m divides p implies m = 1 or m = p.

(b) Show that if p and q are prime in N, p ≠ q, and n is any natural number then

p divides n and q divides n implies p · q divides n.

Compare this statement and its proof with Example 2.30.

(c) Show that if p is prime in Z then

m divides p implies m = ±1 or m = ±p.

Note that the converses of (b) and (c) are also true, that is, our definition of primeness is equivalent to the one you are used to. However, the proof requires a lot more knowledge about integers than I want to ask about here.

Example 2.33. Assume we are given the statement

There exists n ∈ Z such that n is a multiple of 3 and n is a power of 2.

Do we believe the statement? A bit of thinking convinces us that powers of 2 are only divisible by powers of 2, so they cannot be divisible by 3, and therefore they cannot be a multiple of 3. But how do we show that such a number cannot exist? This is a situation where what counts as a formal proof very much depends on which properties one may use. The cleanest proof is via the prime factorization of integers (or corollaries thereof), but that is more than I want to cover in these notes. In a situation where you cannot see how to write down a formal proof you should write something along the lines of the first paragraph written here.
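Definition 17 characterises primes by how they divide products, and the failure of 9, as in Example 2.32, can be found by a finite search over products. A Python sketch (the bound and names are mine; such a search can refute the defining property, while a clean pass only covers the tested range):

```python
def prime_property_violation(p, bound=20):
    # look for m, n where p divides m * n yet p divides neither factor,
    # i.e. a counterexample to the Definition 17 property for p
    for m in range(1, bound):
        for n in range(1, bound):
            if (m * n) % p == 0 and m % p != 0 and n % p != 0:
                return (m, n)
    return None

print(prime_property_violation(9))   # (3, 3): 9 divides 3 * 3 but not 3
print(prime_property_violation(7))   # None: 7 satisfies the property
```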
Never be afraid of expressing your thoughts in plain English! If you were to start a formal proof it would look something like this. Assume that n is a power of 2, that is, there exists k ∈ N such that n = 2ᵏ. If n is a multiple of 3 then there is m ∈ Z such that n = 3m. Hence 2ᵏ = n = 3m. This is where you would like to use that 3 cannot divide 2ᵏ, but this requires a fact that is not given in Chapter 0. So the best you can do is to state the observation in plain English: this means that 3 divides 2ᵏ, which is impossible.

CExercise 25. Which of the following statements are valid? Try to give a reason as best you can, following the previous examples. You should use the definitions from Chapter 0 for the notions of evenness and divisibility (and there is a formal definition of primeness above, but for this exercise you may use the one you are familiar with).

(a) For all n ∈ N, n is even or n is odd.
(b) There exists n ∈ N such that n is even and n is a prime number.
(c) There exists a unique n ∈ Z such that n is even and n is a prime number.
(d) For all n ∈ Z, n is divisible by 4 implies n is divisible by 2.
(e) For all n ∈ Z, n is odd implies n mod 4 = 1 or n mod 4 = 3.
(f) There exists n ∈ N such that n is even implies n is odd.
(g) For all n ∈ Z ∖ {−1, 0, 1, 3} there exists m in Z such that n div m = 2.

Examples of treating more complex statements, and giving more formal proofs, are given in the following sections.

2.4 Properties of Sets and their Operations

We use this opportunity to give more sample proofs for sets, but note also in particular the proof of Proposition 0.3 and Examples 2.12 and 2.15.

Example 2.34. Let A, A′ and B be subsets of a set X. We show that

if A ⊆ A′ then A ∪ B ⊆ A′ ∪ B.

In order to show that one set is a subset of another we have to show that every element of the first set is one of the second. So, as suggested by Table 2.1, we begin by assuming we have an arbitrary element of the first set. Let x ∈ A ∪ B. By definition of ∪ this means that

x ∈ A or x ∈ B.
In the first case we know that x ∈ A ⊆ A′, so x ∈ A′, and in the second case we stick with x ∈ B. Hence the statement above implies that

x ∈ A′ or x ∈ B,

which is equivalent to x ∈ A′ ∪ B by the definition of ∪.

Exercise 26. Let A, A′ and B be subsets of a set X. Assume that A ⊆ A′. Show the following statements.

(a) A ∩ B ⊆ A′ ∩ B.
(b) B ∖ A ⊇ B ∖ A′.

More proofs involving sets and their operations are given below; see in particular Examples 2.36 and 2.39.

2.5 Properties of Operations

Functions that appear very frequently are operations on a set. Usually we are interested in binary operations on a set S, that is, functions

S × S → S.

Examples of such functions are

• addition and multiplication for N,
• addition and multiplication for Z, Q, R or C, as well as the derived operation of subtraction,
• union and intersection of subsets of X as functions from P(X) × P(X) to P(X),
• concatenation of strings in Python,
• concatenation of lists (see Section 6.1) over some set.

Note that we cannot define a division operation for rational, real or complex numbers in the way that we define subtraction. We may not divide by 0, and so we can only define division as a function where the source has been adjusted, for example

R × (R ∖ {0}) → R,  (r, s) ↦ r · s⁻¹,

where s⁻¹ is our notation for the multiplicative inverse of s. These are operations we use all the time, and they are deserving of further study. Note that we typically write binary operations in infix notation, that is, we write the operation between its two arguments, such as

s + s′,  s · s′.

For what follows we need an arbitrary binary operation, where we make no assumptions about the kind of operation, or the set it is defined on. For that we use the symbol ~.

Definition 18: associative

A binary operation ~ on a set S is associative if and only if, for all s, s′, s′′ in S, it is the case that

(s ~ s′) ~ s′′ = s ~ (s′ ~ s′′).

Why is this important? We use brackets to identify the order in which the operations should be carried out.
We can think of the two expressions as encoding a tree-like structure (known as a parse tree[6]), which tells us in which order to carry out the operations present in the expression: for (s ~ s′) ~ s″ the root is the outer ~, its left subtree is the tree for s ~ s′ and its right child is s″, whereas for s ~ (s′ ~ s″) the left child is s and the right subtree is the tree for s′ ~ s″.

[6] Parse trees are studied in detail in COMP11212.

Example 2.35. Recall that m − n is a shortcut for calculating m + (−n). Using that derived operation as an example, we illustrate how one can think of this as allowing the filling in of the various steps of the calculation: evaluating the tree for (3 − 4) − 5 fills in first 3 − 4 = −1 and then −1 − 5 = −6, whereas evaluating the tree for 3 − (4 − 5) fills in first 4 − 5 = −1 and then 3 − (−1) = 4.

Knowing that an operation is associative means that both trees evaluate to the same number and therefore we may leave out brackets when using such an operation. It is safe to write s ~ s′ ~ s″ for such an operation. This is important to computer scientists for two main reasons:

• When writing a program, leaving out brackets in this situation makes the code more readable to humans.
• When writing a compiler for a programming language, knowing that an operation is associative may allow significantly faster ways of compiling.

Note that if we write our operation as a binary function f : S × S → S where we use prefix notation then associativity means that the following equality holds:

f(f(s, s′), s″) = f(s, f(s′, s″)).

Example 2.36. Assume we are given a set X. Recall from Section 0.2.4 that we may think of the union operation as a function ∪ : 𝒫X × 𝒫X → 𝒫X. We show that this operation is associative. The statement we wish to show is a ‘for all’ statement. Following Table 2.1 we assume that S, S′ and S″ are (arbitrary) elements of 𝒫X. We calculate

(S ∪ S′) ∪ S″
   = {x ∈ X | x ∈ S or x ∈ S′} ∪ S″           def union
   = {x ∈ X | (x ∈ S or x ∈ S′) or x ∈ S″}    def union
   = {x ∈ X | x ∈ S or x ∈ S′ or x ∈ S″}      common sense
   = {x ∈ X | x ∈ S or (x ∈ S′ or x ∈ S″)}    common sense
   = S ∪ {x ∈ X | x ∈ S′ or x ∈ S″}           def union
   = S ∪ (S′ ∪ S″)                             def union

Note that we have justified each step in the equalities used above—this ensures that we check we only use valid properties, and tells the reader why the steps are valid.
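Associativity can also be explored in code. The sketch below (the helper name is my own, not part of the notes) checks the defining equality on sample values; such a check can expose a counterexample, as it does for subtraction, but passing it on finitely many samples is not a proof of associativity.

```python
def is_associative(op, samples):
    """Check (a op b) op c == a op (b op c) for all sample triples."""
    return all(op(op(a, b), c) == op(a, op(b, c))
               for a in samples for b in samples for c in samples)

# Union of sets (Example 2.36) passes the check on these samples.
sets = [frozenset(), frozenset({1}), frozenset({1, 2}), frozenset({2, 3})]
print(is_associative(frozenset.union, sets))          # True

# Subtraction fails: (3 - 4) - 5 = -6 but 3 - (4 - 5) = 4 (Example 2.35).
print(is_associative(lambda a, b: a - b, [3, 4, 5]))  # False
```

Note that the union check succeeding only tells us no counterexample lies among the samples; the proof in Example 2.36 is what establishes the property for all sets.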
Note that we had to invoke ‘common sense’ in the example—usually this means that we are relying on definitions that are not completely rigorous mathematically speaking. What we have done in the definition of the union of two sets is to rely on the meaning of the English language. Only when we are down to that is it allowable to use ‘common sense’ as a justification (you might also call it ‘the semantics of the English language’). In formal set theory there is formal logic to define the union of two sets, but we do not go to this level of detail here.

Example 2.37. Example 2.35 establishes that the derived operation of subtraction is not associative for the integers since it shows that

(3 − 4) − 5 ≠ 3 − (4 − 5),

and to refute a ‘for all’ claim we merely need to give one counterexample.

Since we so far do not have formal definitions of addition and multiplication for N, Z, Q and R it is impossible to formally prove that these are indeed associative. You may use this as a fact in your work on this unit, apart from when you are asked to formally prove them in Chapter 6.[7]

[7] Note that formal definitions, and proofs, of these properties for the natural numbers are given in Section 6.4.

CExercise 27. Work out whether the following operations are associative.

(a) Intersection for sets.
(b) Addition for complex numbers.
(c) Subtraction for complex numbers.
(d) Multiplication for complex numbers.
(e) Define the average ave of two real numbers r and r′ as ave(r, r′) = (r + r′)/2. Is this operation associative? Would you apply it to calculate the average of three numbers? If not, can you think of a better averaging function?
(f) Multiplication of real numbers where every number is given up to one post-decimal digit, and where rounding takes place every time after a multiplication has been carried out.[8]
(g) The concatenation operator for strings (as, for example, implemented as + in Python).
(h) The and operator for boolean expressions in Python.
(i) Let S be a set and let Fun(S, S) be the set of all functions with source and target S. Show that composition is an associative operation on the set Fun(S, S).

Some operations allow us even greater freedom: Not only is it unnecessary to provide brackets, we may also change the order in which the arguments are supplied.

Definition 19: commutative
A binary operation ~ on a set S is commutative if and only if, for all s and s′ in S we have

s ~ s′ = s′ ~ s.

If an operation is commutative then it does not matter in which order arguments are supplied to it. Hence the tree with root ~ and children s, s′ and the tree with root ~ and children s′, s will evaluate to give the same result.

Example 2.38. We know that when we have natural numbers m and n then the trees for m + n and n + m have the same number at the root of the tree, and so addition is a commutative operation.

Example 2.39. As in Example 2.36 we look at the union operation on the powerset 𝒫X for a given set X. We show that this operation is commutative. Once more this is a statement of the ‘for all . . . ’ kind. Following Table 2.1 once again we assume that we have (arbitrary) elements S and S′ of 𝒫X. The union of S and S′ is defined as follows:

S ∪ S′ = {x ∈ X | x ∈ S or x ∈ S′}

and, once again invoking ‘common sense’, this is the same as

{x ∈ X | x ∈ S′ or x ∈ S} = S′ ∪ S.

Alternatively we can argue with more of an emphasis on the property of elements of the given sets:

x ∈ S ∪ S′ if and only if x ∈ S or x ∈ S′   def ∪
           if and only if x ∈ S′ or x ∈ S   logic
           if and only if x ∈ S′ ∪ S.

[8] When programming there is usually limited precision, and rounding has to take place after each step of the computation. While a computer has more precision, say for floating point numbers, the problems that occur are the same as here.

Example 2.40. Consider the following[9] operation for complex numbers: Given z and z′ in C we set

z ~ z′ = z̄ · z′,

where z̄ is the conjugate of z. The question is whether this operation is commutative. First of all we have to work out whether we think it is true, and should try to prove it, or whether we should aim for a counterproof.

[9] This appeared in a past exam paper.
There are two approaches here: You can write down what this operation does in terms of real and imaginary parts, which approach we follow in the following example, or you can think for a moment about what the conjugate operation does. It affects the imaginary part only, so if we have the product of one number with imaginary part 0, and one with imaginary part other than 0, there should be a difference. This suggests we should try a counterproof, that is, we should find one choice for z, and one for z′, such that the statement becomes false. The simplest numbers fitting the description given above, and which are distinct from 0, are 1 and i. We check

i ~ 1 = ī · 1 = −i · 1 = −i, and 1 ~ i = 1̄ · i = 1 · i = i.

Since −i ≠ i we have established that the given operation is not commutative.

Example 2.41. We give an alternative solution to the previous example. We calculate that

(a + bi) ~ (a′ + b′i) = (a − bi)(a′ + b′i) = (aa′ + bb′) + (ab′ − a′b)i,

whereas

(a′ + b′i) ~ (a + bi) = (a′ − b′i)(a + bi) = (aa′ + bb′) + (a′b − ab′)i = (aa′ + bb′) − (ab′ − a′b)i.

So the two resulting numbers will have the same real part, but their imaginary parts will be the negatives of each other. Now it is important to remember that it is sufficient to find just one counterexample, and it is best to keep that as simple as possible. We pick

a = 0, b = 1, a′ = 1, b′ = 0,

and verify that this means (a + bi) ~ (a′ + b′i) = −i and (a′ + b′i) ~ (a + bi) = i.

CExercise 28. Work out whether the following operations are commutative. If you think the answer is ‘yes’, give a proof, if ‘no’ a counterexample.

(a) Multiplication for complex numbers.
(b) Subtraction for integers.
(c) Division for real numbers different from 0.
(d) Set difference on some powerset.
(e) The ave function from the previous exercise.
(f) The concatenation operator for strings as for example implemented by + in Python.
(g) The and operator for boolean expressions in Python.

Some operations have an element which does not have any effect when combined with any other.
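The counterexample above can be replayed with Python’s built-in complex numbers, where 1j denotes i and z.conjugate() gives the conjugate. This is only an illustration of the check, assuming the operation z ~ z′ = z̄ · z′ as in Example 2.40.

```python
def star(z, zp):
    """The operation z ~ z' = conj(z) * z' of Example 2.40 (name is mine)."""
    return z.conjugate() * zp

# The witnesses 1 and i from the text:
print(star(1j, 1) == -1j)          # True: i ~ 1 = -i
print(star(1, 1j) == 1j)           # True: 1 ~ i = i
print(star(1j, 1) == star(1, 1j))  # False, so ~ is not commutative
```

As in the text, one disagreeing pair of arguments is all that is needed to refute commutativity.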
Definition 20: unit
Let ~ be a binary operation on a set S. An element e of S is a unit[10] for ~ if and only if it is the case that for all elements s of S we have

e ~ s = s = s ~ e.

[10] This is sometimes also known as the identity for the operation, but that terminology might create confusion with the identity function for a set.

If we want to picture this using a tree then it is saying that the trees for e ~ s and s ~ e both become the single node s. This looks odd, but if you think of the first two trees as being part of a larger tree then this becomes a useful simplification rule.

Example 2.42. Knowing that 0 is the unit for addition for the integers we may simplify the tree for (−3) + (0 + 1) on the left to become the tree for (−3) + 1 on the right.

Example 2.43. We have already seen a number of examples of units. The number 0 is the unit for addition on all the sets of numbers we cover in these notes. This is one of the statements from Fact 1 (and corresponding facts about the other sets of numbers), since for all n ∈ N we have n + 0 = n = 0 + n.

Example 2.44. If we look at the intersection operation for subsets of a given set X,

∩ : 𝒫X × 𝒫X → 𝒫X,  (S, S′) ↦ S ∩ S′,

we can show that X is the unit of this operation. For that we have to calculate, given an arbitrary subset S of X,

S ∩ X = {x ∈ X | x ∈ S and x ∈ X}   def ∩
      = {x ∈ X | x ∈ S}
      = S

and

X ∩ S = {x ∈ X | x ∈ X and x ∈ S}   def ∩
      = {x ∈ X | x ∈ S}
      = S.

Hence the claim is true.

Working out whether a unit exists for some operation can be tricky. The existence of a unit is equivalent to the statement

there exists e ∈ S such that for all s ∈ S: e ~ s = s = s ~ e.

By Table 2.1 to refute such a statement we have to show that

for all e ∈ S there exists s ∈ S: (e ~ s ≠ s or s ≠ s ~ e).

Statements like this are quite tricky to prove. The next two examples show how one might argue in such a situation. The strategy is to deduce properties that e would have to have (if it existed), and to then argue that an element with such properties cannot exist.

Example 2.45. Consider subtraction for integers, where m − n is a shortcut for m + (−n). Does this operation have a unit?
Once again, we first have to decide whether we should try to give a proof or a counterproof. The statement in question is of the kind ‘there exists . . . ’. To prove such a statement we have to give an element with the required property. In a situation where we’re not sure what such an element might look like, it is often possible to derive properties it needs to have. This is the strategy we follow here. If the number e we have were a unit for subtraction we would require

n − e = n

for all elements n of Z. The only number which satisfies this is e = 0, but if we calculate

0 − 1 = −1,

we see that this element cannot be the unit since we would require that number to be equal to 1 to satisfy e − n = n for all n ∈ Z. Hence the given operation does not have a unit.

Note that the subtraction operation satisfies none of our properties! For this reason it is quite easy to make mistakes when using this operation, and that is why it is preferable not to consider subtraction a well-behaved operation.

It is usually harder to establish that an operation does not have a unit, so we give another example for this case.

Example 2.46. Let us recall the set difference operation from Section 0.2 on 𝒫X for a given set X which, for S, S′ in 𝒫X, is defined as

S ∖ S′ = {x ∈ S | x ∉ S′}.

Does this operation have a unit? As in the previous example we derive properties that such a unit would have to have. In order for S ∖ S′ = S to hold it must be the case that none of the elements of S occurs in S′. In particular if E were the unit we must have, instantiating S as E,

E ∖ E = {x ∈ E | x ∉ E} = ∅,

which means that E must necessarily be empty. But for the empty set we have

∅ ∖ S = {x ∈ ∅ | x ∉ S} = ∅,

but for ∅ to be the unit this would have to be equal to S. This means that no element of 𝒫X can satisfy the requirements for a unit for this operation.

The following exercise asks you to identify units for a number of operations, if they exist.

EExercise 29.
Identify the unit for the following operations, or argue that there cannot be one:

(a) Union of subsets of a given set X.
(b) Multiplication for integers, rational, real and complex numbers.
(c) The operation ~ from Example 2.40.
(d) The ave operation from the preceding two exercises.
(e) The concatenation operation for strings as, for example, implemented by + in Python.
(f) The and operator for boolean expressions in Python.

Note that mathematicians call a set with an associative binary operation which has a unit a monoid.

Exercise 30. Prove that there is at most one unit for a binary operation ~ on a set S. Hint: Assume you have two elements that satisfy the property defining the unit and show that they must be equal.

Exercise 31. Consider the set Fun(S, S) of all functions from some set S to itself. This has a binary operation in the form of function composition. If you have not already done so in Exercise 27 then show that this operation is associative. Find the unit for the operation. Conclude that we have a monoid. Further show that the operation is not commutative in general.

Definition 21: inverse element
Let ~ be an associative binary operation with unit e on a set S. We say that the element s′ is an inverse for s ∈ S with respect to ~ if and only if we have

s ~ s′ = e = s′ ~ s.

Note that if s⁻¹ is the inverse for s with respect to ~ then s is the inverse of s⁻¹ with respect to ~ since this definition is symmetric. It is standard[11] to write s⁻¹ for the inverse of s, but that convention changes if one uses the symbol + for the operation. In that case one writes −s for the inverse of the element s with respect to the operation +.

[11] This is the usual notation for Q, R and Z.

Example 2.47. For addition on the integers the inverse of an element n is −n, since

n + (−n) = 0 = −n + n,

and 0 is the unit for addition. The same proof works for the rationals and the reals. For the complex numbers we have defined −(a + bi) = −a − bi, and shown that this is the additive inverse for a + bi in Exercise 1.1.

Example 2.48.
For addition on the natural numbers 0 is the unit for addition, but inverses do not exist in general. The number 0 is the only number that has an inverse.[12]

[12] Think about why that is.

Exercise 32. Show that if ~ is a binary operation on the set S with unit e then e is its own inverse.

Example 2.49. For the rational or real numbers the multiplicative inverse of an element r ≠ 0 is r⁻¹ = 1/r. Note that when you use r⁻¹, or divide by r, you must include an argument that r is not 0!

Example 2.50. In Chapter 1 we have proved that inverses exist for both addition and multiplication for complex numbers, and we have shown how to calculate them for a given element. Recall that if you want to use z⁻¹ you must include an argument that this exists,[13] that is, that z ≠ 0.

[13] Students have lost marks in exams for just dividing by some number without comment.

Example 2.51. The proof that inverses for addition exist for integers, rationals, or reals, is very short: Given such a number n, we are so used to the fact that

n + (−n) = 0 = −n + n

that it hardly feels as if this is a proof!

Example 2.52. To show that a given operation does not have inverses for every element one has to produce an element which does not have an inverse. Assume that X is a set. Consider the intersection operation,

∩ : 𝒫X × 𝒫X → 𝒫X,  (S, S′) ↦ S ∩ S′.

The unit for this operation is given by X as established in Example 2.44. We show that the empty set does not have an inverse: If S were an inverse for ∅ with respect to ∩ it would have to be the case that

S ∩ ∅ = X.

But S ∩ ∅ = ∅, and so as long as X is non-empty, an inverse cannot exist.

Exercise 33. For the following operations, give an argument why inverses do not exist.

(a) Union of subsets of a given set.
(b) The ave function from the previous exercises.
(c) The concatenation operation for strings.
(d) The and operation for boolean expressions in Python.

EExercise 34. Let S be a set with an associative binary operation ~, and assume that e ∈ S is the unit for that operation.
(a) Show that if s₁ and s₂ have inverses then the inverse for the element s₁ ~ s₂ is given by s₂⁻¹ ~ s₁⁻¹.
(b) Show that every element has at most one inverse. Hint: Assume that there are two inverses and prove that they have to be the same.

Note that mathematicians call a set with an associative binary operation with a unit, and where every element has an inverse, a group. Groups are very nice mathematical entities, but most of the sets with a binary operation you will see will not have the full structure of a group (typically lacking inverses).

Optional Exercise 6. Assume that S is a set with a binary operation ~ which is associative and has a unit. Consider the set Fun(X, S) of all functions from some set X to S. Given two elements, say f and g of Fun(X, S), we define a new function which we call f ~ g in Fun(X, S) by defining, for x ∈ X,

(f ~ g)x = fx ~ gx

(in other words the result of applying the new function to the argument x is to apply both f and g to x and to combine the results by using the binary operation on S). This is known as defining an operation pointwise on a set of functions. Find the unit for this operation and show that it is one. If the operation on S is commutative, what about the one on Fun(X, S)?

2.6 Properties of functions

Functions allow us to transport elements from one set to another. Section 0.3 gives a reminder of what you should know about functions before reading on.

Recall Definition 14 which says that the graph of a function f : S → T is defined as

{(s, fs) ∈ S × T | s ∈ S}.

This is the set we typically draw when trying to picture what a function looks like, at least for functions from sets of numbers to sets of numbers. The typical case for that is for S and T to be subsets of R. We can characterize all those subsets of S × T which are the graph of a function of the type S → T.

Proposition 2.1
A subset G of S × T is the graph of a function from S to T if and only if for all s ∈ S there exists a unique t ∈ T with (s, t) ∈ G.

This statement requires a proof.
We give one here as another example for how to use the key phrases in the statement to structure the proof. We have an ‘if and only if’ statement, and we split the proof into two parts accordingly.

• Assume that[14] G is the graph of a function. We would like to have a name for that function, so we call it f, and note that if G is its graph then

G = {(s, fs) | s ∈ S}.

We have to show that G has the given property. This is a statement of the form ‘for all . . . ’, so following Table 2.1 we assume that we have an arbitrary s ∈ S. We now have to establish the remainder of the given statement. This is a ‘unique existence’ property, which means we have to show two things:

– Existence. In order to show a ‘there exists’ statement Table 2.1 tells us we must find a witness for the variable, here t, with the desired property. We know that (s, fs) is in the graph of f, and so we have found a witness in the form of t = fs for the existence part.

– Uniqueness. A uniqueness proof always consists of assuming one has two elements with the given property and showing that they must be equal. Assume we have t and t′ in T so that (s, t) and (s, t′) are both elements of G. We can see from the equality for G given above that the only element with first component s in G is the element (s, fs), and so we must have t = fs = t′ and we have established the uniqueness part.

• Assume that[15] G is a subset of S × T satisfying the given condition. We have to show that G is the graph of a function, and the only way of doing this is to

– define a function f and
– show that G is the graph of f.

We carry out those steps in turn.

– We would like to define a function f : S → T by setting

s ↦ t if and only if (s, t) ∈ G.

We have to check that this definition produces a function, that is that there is precisely one output in T for every input from S. By the existence part of the assumed condition we know that for every s ∈ S there is at least one element t of T with (s, t) ∈ G and so there is indeed an output for every input.
But by uniqueness we know that if (s, t) and (s, t′) are in G then t = t′, so there is at most one element t for every s ∈ S. Hence given an input s our function creates the unique output required.

– It remains to check that G is the graph of f, and for that we note that by Definition 14 the graph of f is given as follows:

{(s, fs) ∈ S × T | s ∈ S} = {(s, t) ∈ S × T | (s, t) ∈ G}   def f
                          = G.

This completes the proof.

[14] This direction is sometimes known as the ‘forward’ (in the sense that it shows that the first statement implies the second) or ‘only if’ direction.
[15] This direction is sometimes known as the ‘backwards’ or ‘if’ direction, in that it shows that the second given statement implies the first.

Our concept of function from Chapter 0 says that a function f : S → T produces an output in T for every input from S. The proposition above tells us that this means that for every element s of S we have a unique element of T, namely fs, which is associated with s.

Some functions have particular properties that are important to us.

Definition 22: injective
A function f : S → T is injective[16] if and only if for all s and s′ in S

fs = fs′ implies s = s′.

Under these circumstances we say that f is an injection.

[16] Note that some people call such functions ‘one-on-one’ instead.

One way of paraphrasing[17] this property is to say that two different elements of S are mapped to two different elements of T. This means that knowing the result fs ∈ T of applying f to some element s of S is sufficient to recover s.

Example 2.53. The simplest example of an injective function is the identity function id : S → S for any set S. To prove this formally, note that we have a ‘for all’ statement, so we assume that s and s′ are elements of S. We have to prove an implication, so assume the first part holds, that is, we have id s = id s′. But this implies that

s = id s = id s′ = s′,

and so we have s = s′ as required.

Example 2.54. The function from N to N given by

f : n ↦ 2n

is injective. To show this we have to show a ‘for all’ statement, so according to Table 2.1 we should assume that we have n, n′ in N. To show an implication,
the same table tells us we should assume the first part holds, so we assume that fn = fn′. But by inserting the definition of f this means that

2n = fn = fn′ = 2n′,

and by multiplying both sides with the multiplicative inverse of 2 we may conclude that n = n′.

Example 2.55. On the other hand the function from R to R given by

f : x ↦ 1

is not injective. In order to refute a ‘for all’ statement by Table 2.1 it is sufficient to produce a counterexample. This means we have to find two elements of R, say x and x′, such that the given implication does not hold. The same table tells us that for the implication not to hold we must ensure that the first condition is true, which here means that fx = fx′, while the second condition is false, that is, we must have x ≠ x′. This is quite easy for our function: The numbers x = 0 and x′ = 1 are certainly different, but we have

f0 = 1 = f1,

so we have indeed found a counterexample.

Example 2.56. Typical examples from the real world are unique identifiers, for example, student id numbers. We would expect every student to have a unique id number which is not shared with any other student. This is certainly a desirable property, but to prove formally that it holds we would have to know how exactly the university assigns these numbers, and then we could check that. Nonetheless you should be able to work out whether real world assignments ought to be injective, and you should be able to come up with ways of testing this realistically, or write a program that confirms it.

There are many other situations where we have to ensure this—for example, in a database we often want to have a unique key for every entry (for example the customer number). Also, when casting an element of some datatype to another we expect that if we cast an int to a double in Java that two different int values will be cast to different double values. This operation should be performable without losing any information.

[17] But usually not a good way of attempting a proof.

Example 2.57.
Showing that a real world assignment is not injective has to be done by producing two witnesses. For example, the assignment that maps students to tutorial groups is not injective. To prove that all we have to do is to find two (different) students who are in the same tutorial group. You all know students like that.

We can also think of an injection as a ‘unique relabelling’ function: Every element from the source set S is given a new label from the target set T in such a way that no two elements of S are given the same label.

The graph of a function can be useful when determining whether a function is injective. For the squaring function described above the graph looks like this.

(Graph of the squaring function, omitted.)

Whenever we can draw a horizontal line that intersects the graph of our function in more than one place then the function is not injective:

(The same graph with a horizontal line crossing it above the x-coordinates −1 and 1, omitted.)

The x-coordinates of the two intersection points give us two different elements of R where the function takes the same value, namely here for 1 and −1, see Example 2.58.

Note that one has to be careful when using the graph to determine whether a function is injective: Since most examples have an infinite graph it is impossible to draw all of it, so one has to ensure that there isn’t any unwanted behaviour in the parts not drawn. Further note that a graph cannot provide a proof that a function is injective (or not), but it can help us make the decision whether we want to give a proof or a counterproof.

Example 2.58. We show that the function whose graph is given above is not injective. Consider the function

f : R → R,  x ↦ x².

Injectivity is a ‘for all’ statement. To give a counterproof by Table 2.1 all we have to do is to find witnesses x and x′ such that the implication in the definition of injectivity does not hold. So as in Example 2.55 we are looking for two elements, say x and x′ of R, which are different, but which are mapped by the given function to the same element. As suggested by the graph, let x = −1, and let x′ = 1.
Then

fx = (−1)² = 1, and fx′ = 1² = 1,

and since x = −1 ≠ 1 = x′ we have found a counterexample.

In the following example we show how a failing proof for injectivity can be turned into a counterexample for that property. This is a good strategy to follow if you cannot see from the definition of the given function whether it is injective or not.

Example 2.59. Assume we have the function

f : C → C,  a + bi ↦ 2a − 2abi.

We would like to work out whether or not it is injective. We do this by starting with a proof to see if we can either complete the proof, or whether that leads us to a counterexample. Assume we have a + bi, a′ + b′i in C which are mapped to the same element by f, that is

2a − 2abi = f(a + bi) = f(a′ + b′i) = 2a′ − 2a′b′i.

Since two complex numbers are equal if and only if their real and imaginary parts are equal this implies that

2a = 2a′ and −2ab = −2a′b′.

From the first equality we may deduce that a = a′. However, the second equality says that

−2ab = −2a′b′   from above
     = −2ab′    since a = a′.

This implies that ab = ab′, but that does not allow us to conclude that b = b′ since a might be 0. We can use the reason that this proof fails to help us construct a counterexample: We are unable to show that our two numbers have the same imaginary part if their real parts are 0. So if we use

a + bi = 0 + i and a′ + b′i = 0 − i

we can see that

f(a + bi) = fi = 0 − 2 · 0 · 1 · i = 0 and f(a′ + b′i) = f(−i) = 0 − 2 · 0 · (−1) · i = 0,

and we have established that f is not injective.

Using something other than the original definition of injectivity is often problematic. I sometimes see students paraphrase injectivity as ‘for all elements of the source set there is a unique element of the target set which the function maps to’. This is not the property of injectivity, this is merely the definition of a function (compare Proposition 2.1). Sticking with the given definition is simpler. If you do want to paraphrase injectivity using unique existence, then the valid formulation for a function f : S → T is:

for all t in the range of f there exists a unique s in S such that fs = t.
But this is more complicated than the original definition.

Exercise 35. Show that the statement above is equivalent to f being injective.

If there is an injection from some set S to some set T then we may deduce that T is at least as large as S. See Section 5.2 for more detail, in particular Definition 43.

Exercise 36. Show that the following functions are injective or not injective as indicated.

(a) Injective: The function r ↦ r + 0i from R to C defined on page 51.
(b) Not injective: The function x ↦ 2x² − 4x + 1 from R to R.
(c) Not injective: The function from the set of all students enrolled on this course unit, COMP11120, to a particular examples class.
(d) Injective: The function from C to C which is given by z ↦ z̄.

CExercise 37. Determine which of the following functions are injective. You have to provide an argument with your answer. You should not use advanced concepts such as limits or derivatives, just basic facts about numbers. Where the function is not injective can you restrict the source set to make it injective?

(a) The sin function from R to R.
(b) The log function from [1, ∞) to R⁺.
(c) The function x ↦ x² from N to N, or from R to R, you may choose.
(d) The function used by the department from the set of first year CS students to the set of tutorial groups.
(e) The function used by the University from the set of first year CS students to the set of user ids.
(f) The function z ↦ −z from C to C.
(g) The function from N × N to N which maps (m, n) to 2^m 3^n.
(h) The function x ↦ {x} from a set S to the powerset 𝒫S.

EExercise 38. Establish the following properties.

(a) If S is a one-element set then every function which has it as a source is injective.
(b) The composite of two injective functions is injective.
(c) If f : S → T and g : T → U are two functions such that g ∘ f is injective then f is injective.
(d) Show for the previous statement that g need not be injective by giving an example.[18]
(e) Assume that f : S → S′ and g : T → T′ are both injections. Show that this is also true for

f × g : S × T → S′ × T′,  (s, t) ↦ (fs, gt).

[18] The smallest example concerns sets with at most two elements. You may want to read the next two paragraphs to help with finding one.
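For a function on a finite source set the injectivity condition can be checked mechanically by looking for two different inputs that share an image. The helper below is a sketch for experimentation (its name is my own), not a substitute for the proofs above.

```python
def is_injective(f, source):
    """Return False as soon as two different inputs share an image."""
    seen = {}                 # maps each image f(s) back to the s that produced it
    for s in source:
        t = f(s)
        if t in seen and seen[t] != s:
            return False      # found s != s' with f(s) == f(s')
        seen[t] = s
    return True

print(is_injective(lambda n: 2 * n, range(10)))          # True, cf. Example 2.54
print(is_injective(lambda x: x * x, [-2, -1, 0, 1, 2]))  # False: (-1)**2 == 1**2
```

Of course such a check only works when the whole source set can be enumerated; for infinite sets like N or R it can at best sample, which may find a counterexample but can never prove injectivity.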
In the case where we have a function from one small finite set to another we can draw a picture that makes it very clear whether or not the function is injective.

Example 2.60. Consider the function f defined via the following picture.

(Picture: the four elements of the source set on the left are connected by arrows to the elements 1, 2, 3, 4 of the target set on the right; two of the source elements, call them s and s′, both have an arrow to 3.)

We can see immediately that s and s′ are mapped to the same element, 3, and so this function is not injective. Formally, we have found two elements with fs = fs′, but s ≠ s′.

If a function is given by a picture like this, then all one has to do to check injectivity is to see whether any element in the target set has more than one arrow going into it. Exercise 43 invites you to try this technique for yourself.

Exercise 39. Show that if S is a set with finitely many elements, and f : S → T is an injective function from S to a set T, then the image of S under f has the same number of elements as S.

The connection between injective functions and the sizes of sets is further explored in Section 5.2. Here is a second important property of functions.

Definition 23: surjective
A function f : S → T is surjective[19] if and only if for all t ∈ T there exists s ∈ S with

fs = t.

We also say in this case that f is a surjection.

In other words a function is surjective if its range is the whole target set, or, to put it differently, if its image reaches all of the target set.

We care that a function is surjective if we are using the source set to talk about members of the target set. It means that we can use it to access all the elements of the target set. If you are writing code that has to do something with all the elements of an array, for example, you must make sure that you write a loop that really does go through all the possible indices of the array. If you have programmed a graph, and you want to write an algorithm that visits each element of the graph, you must make sure that your procedure does indeed go to every such node.

Example 2.61.
Once again, when we have a real world example it is impossible to formally prove that a given assignment is a surjective function unless we know how it is defined. However, you should be able to tell whether the assignment ought to be surjective, and you should be able to come up with ways of testing this, and write a program that confirms it. If you construct a mailing list that emails all undergraduate students on a specific course unit you must make sure that your list contains all the students on that course.

Example 2.62. The simplest example of a surjective function is the identity function id : S → S on a set S. We give a formal proof. We have to show a ‘for all’ statement, so let s be in the target of the function, which is S. We have to find a witness in the form of an element of the source set of the function which is mapped to s. For this we can pick s itself, since id s = s.

Example 2.63. Consider the function

f : N → {2n ∈ N | n ∈ N},  n ↦ 2n.

In order to show that this function is surjective we have to show a statement of the ‘for all there is’ kind. By Table 2.1 we may do this by assuming that we have an arbitrary element for the ‘for all’ part, and then we have to find a witness so that the final part of the statement holds. So let

m ∈ {2n ∈ N | n ∈ N}.

By definition this means that there is n ∈ N with m = 2n. This n has the desired property since fn = 2n = m, and so n is the required witness.

[19] Note that some people call such functions ‘onto’ instead.

One can again take the graph of a function to help decide whether a given function is surjective. It can be tricky, however, to determine the answer from looking at the graph. Instead of looking whether there is a horizontal line which intersects the graph in at least two points we now have to worry about whether there is a horizontal line that intersects the graph not at all. For some functions this can be quite difficult to see.

Example 2.64. Consider for example the function from R⁺ to R⁺ given by

x ↦ log₂(x + 1).
It is really difficult to judge whether some horizontal line will have an intersection with this graph or not. The picture above tells us that there is a number (namely 3) whose image is 2. But for the picture below it is far less clear whether there is an intersection between the line and the graph of the function. You might argue that the problem would be solved if we drew a larger part of the graph, but then we could also move the horizontal line higher up (remember that one has to show that one can find an intersection for every horizontal line).

Example 2.65. We show formally that a surjective function is given by the previous example,

f : R⁺ → R⁺, x ↦ log₂(x + 1).

We proceed following the same blueprint as in Example 2.63. Let y be an arbitrary element of the target set R⁺. We have to find an element of the source set which is mapped to y, that is, we are looking for x ∈ R⁺ such that log₂(x + 1) = y. We can solve this as an equation where y is given and x is unknown: this equation is true if and only if

x + 1 = 2^(log₂(x+1)) = 2^y,

which holds if and only if x = 2^y − 1.

Note that it is very easy, in a case like the above, to write something that is not a valid proof. The statement we need is that if we define x = 2^y − 1 then f(x) = y. I have seen many student answers which say log₂(x + 1) = y so x = 2^y − 1. The important thing to note here is that the two statements are connected by an ‘if and only if’, that is, x satisfies the left-hand equality if and only if it also satisfies the right-hand one. But in general, when students start with f(x) = y and perform a number of steps to arrive at some statement for x, they typically have derived a necessary condition for x. Only when all these steps are reversible will defining x in the given way guarantee that it satisfies the original equation. Otherwise it is necessary to take x defined in the given way and to check that it really does give a solution to the original problem.
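Because the steps above are reversible, the witness x = 2^y − 1 can also be confirmed numerically. The following is our own sketch (not part of the notes), checking f(2^y − 1) = y for a few sample values:

```python
import math

def f(x):
    # the function from Example 2.64: f(x) = log2(x + 1) on the positive reals
    return math.log2(x + 1)

def preimage(y):
    # the witness derived above: x = 2**y - 1
    return 2**y - 1

# check that f maps the candidate witness back to y, up to rounding
for y in [0.5, 1.0, 2.0, 10.0]:
    assert math.isclose(f(preimage(y)), y)
```

Such a check is no substitute for the proof, but it catches algebra slips in the derivation of the witness.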
Tip
A correct argument starts with a correct statement, and then applies a number of valid rules to get to the target statement. Implicitly this means that we read such arguments as the current line implying the next one. (The fact that you can derive a valid statement from the given one does not imply anything about the validity of the given statement.) If you are constructing an argument which you intend the reader to interpret ‘backwards’, that is, the current line implies the previous line, you have to indicate this in your text (and make sure your justifications work in the intended direction).

Example 2.66. The function

f : Z → Z, x ↦ x + 1

is surjective. Surjectivity is a statement of the ‘for all . . . ’ kind, so following Table 2.1 we assume we are given y ∈ Z. The remainder of the surjectivity property is a ‘there exists’ statement, so by the same table we have to find a witness, say x ∈ Z. This witness has to satisfy f(x) = y. Inserting the definition of f, this means we need to pick x such that x + 1 = f(x) = y, so we pick x = y − 1 and this has the required property since

f(y − 1) = (y − 1) + 1 = y,

which establishes that f is surjective.

Example 2.67. The function

f : R → R, x ↦ x²

is not surjective. To show this we want to find a counterproof to a statement of the ‘for all . . . ’ kind. According to Table 2.1 this means we have to find a witness in the target R of the function that does not satisfy the remainder of the property. Which property is this? It’s a property of the ‘there exists’ kind, so following the same table we have to show that no x in R satisfies x² = y. Putting it like this should give us the right idea: we choose y = −1, and then no real number x can be squared to give y. Alternatively, looking at the graph of this function, see Example 2.58, we can see that any negative number would work as the required witness.

In the following example we illustrate how a failing proof of surjectivity can be turned into a counterexample for that property.

Example 2.68.
We again use the function from Example 2.59,

f : C → C, x + yi ↦ 2x − 2xyi,

and look at the question of whether it is surjective. Once again we see how far we can get with a proof of that property. For that we assume that we have an element of the target set, say a + bi. We have to find an element of the source set, say x + yi, with the property that f(x + yi) = a + bi. If such an x + yi exists then it must be the case that

2x − 2xyi = f(x + yi) = a + bi.

Since two complex numbers are equal when both their real and their imaginary parts are equal, we know that for this to be valid we must have

2x = a and −2xy = b.

We may think of these as equations in x and y that we are trying to solve. We can solve the first equation by setting x = a/2. However, the second equation then becomes

b = −2xy = −ay,

and −ay = b is an equation that we cannot solve when a = 0. Once again we can use this information to find a counterexample: if a = 0 and b is a number other than 0, say 1, then there is no element of the source set that is mapped to a + bi = i. (The argument given above already establishes that this is the case, but I spell it out here again to make it clearer to see why that is.) Given an element of the source set x + yi, if

2x − 2xyi = f(x + yi) = a + bi = 0 + i

then it must be the case that x = 0 to make the two real parts equal, but in that case we have that

−2xy = −2 · 0 · y = 0,

which is not equal to the given imaginary part 1.

Tip
Proving that a function f : S → T is surjective amounts to solving an equation: given t ∈ T we have to find s ∈ S with f(s) = t. You can think of s as the variable in that equation, and t as a parameter that is unknown but fixed. It may be a good idea to make sure that you give typical variable names, like x, y and z, to the quantity you are trying to find, and typical ‘parameter’ names, like letters earlier in the alphabet, to the quantity which is given (but unknown).

If there is a surjection from a set S to a set T then we may deduce that T is at most as large as S. See Lemma 108 in Section 5.2.

Exercise 40. Show that the following functions are surjective or not surjective as indicated.

(a) Surjective: The function x ↦ |x| from Z to N.
(b) Surjective: The function used by the department from the set of first year CS students to the set of tutorial groups.

(c) Not surjective: The function used by the University from the set of all students currently in the university to the set of valid student id numbers.

(d) Not surjective: The function x ↦ x + 0i from R to C given on page 51.

CExercise 41. For the following functions determine whether they are surjective and support your claim by an argument. You should not use advanced concepts such as limits or derivatives, just basic facts about numbers.

(a) The function from Q to Q given by x ↦ 0 if x = 0, and x ↦ 1/x otherwise.

(b) The function from R to R given by x ↦ x⁴ − 100.

(c) The function from C to C given by ….

(d) The function from C to R given by z ↦ |z|.

(e) The function that maps each student enrolled on COMP11120 to a particular examples class.

(f) The function that maps each member of your tutorial group to one of the values E and W, depending on whether they were born in Europe (E) or in the rest of the world (W).

(g) The function from N × N to N given by (m, n) ↦ ….

(h) The function from the finite powerset of N,

{S ⊆ N | S has finitely many elements},

to N that maps S to the number |S| of elements of S.

Exercise 42. Establish the following statements.

(a) The composite of two surjections is a surjection.

(b) If f : S → T and g : T → U are functions such that g ∘ f is surjective then g is surjective.

(c) Establish that in the previous statement f need not be surjective by giving an example.

(d) Assume that f : S → S′ and g : T → T′ are both surjections. Show that this is also true for f × g.

Again, if we are looking at functions between small finite sets then we can easily work out whether a function is surjective by drawing a picture.

Example 2.69. Consider the function given by the following diagram.
[diagram: a function from a four-element set on the left to the set {1, 2, 3, 4} on the right, where no arrow goes to 4]

This function is not surjective since there is no element of the source set that is mapped to the element 4 of the target set. For a function to be surjective all one has to check is that every element of the target set (on the right) has at least one arrow going into it. This example is not surjective.

CExercise 43. For the following functions draw a picture analogous to the above and determine whether or not it is injective and/or surjective.

(a) The function from {0, 1, 2, 3, 4} to itself which maps the element x to x mod 3.

(b) The function from {0, 1, 2, 3, 4, 5, 6, 7} to {0, 1, 2, 3} which maps x to x mod 4.

(c) The function from {0, 1, 2, 3, 4} to {n ∈ N | n ≤ 9} which maps x to 2x.

(d) The function from the set of members of your tutorial group to the set of letters from A to Z, which maps a member of the group to the first letter of their first name.

(e) The function that maps the members of your tutorial group to the set {M, F} depending on their gender.

We need two further notions for functions. First of all there is a name for functions which are both injective and surjective.

Definition 24: bijective
A function f : S → T is bijective if and only if it is both injective and surjective. We say in this case that it is a bijection.

Example 2.70. The simplest example of a bijective function is the identity function on a set S. Examples 2.53 and 2.62 establish that this function is both injective and surjective.

Example 2.71. Consider the function f from Z to Z given by x ↦ x + 1. It is shown in Example 2.66 that this function is surjective, and so it remains to show that it is also injective.

Following Table 2.1, to show a ‘for all’ statement we have to assume that we have arbitrary elements x and x′ in Z, and that these have the property on the left-hand side of the ‘implies’ statement, that is, f(x) = f(x′). From this we wish to prove x = x′. If we insert the definition of f then the given equality means that

x + 1 = f(x) = f(x′) = x′ + 1,

and by deducting 1 on both sides we deduce x = x′.
This establishes that f is also injective, and so it is bijective.

Exercise 44. Determine which of the following functions are bijections. Justify your answer.

(a) The function from Q to Q given by x ↦ 0 if x = 0, and x ↦ 1/x otherwise.

(b) The function from C to C given by ….

(c) The function from Z to N given by x ↦ |x|.

(d) The function from C to R⁺ given by z ↦ |z|.

Exercise 45. Show that if f : S → T and g : T → U are two functions and g ∘ f is a bijection then f is an injection and g is a surjection.

Recall that we may think of a function f : S → T as attaching to every element of the source set S a label from the target set T. A bijection is a very special such function.

• If the function is injective then we know that the label attached to each element of the source set is unique, that is, no other element of that set gets the same label.

• If the function is surjective then we know that all the labels from the target set are used.

If we have two sets with a bijection from one to the other then these sets have the same size; this idea is developed in Section 5.1, see in particular Exercise 107.

Whenever we have a bijection f there is a companion which undoes the effect of applying f. In other words, we get a function from the target set to the source set which reads the label and gives us back the element it is attached to.

Definition 25: inverse function
A function g : T → S is the inverse of the function f : S → T if and only if

g ∘ f = id_S and f ∘ g = id_T.

Note that if g is the inverse function of f then f is the inverse function of g, since the definition is symmetric.

Example 2.72. Consider the function

f : Z → Z, x ↦ x + 1.

In Example 2.71 it is shown that this function is a bijection. This function has an inverse (and indeed, Theorem 2.4 tells us that every bijection has an inverse). To give this inverse we need to find a function which ‘undoes’ what f does, and the obvious candidate for this is the function g given by x ↦ x − 1. We show that g is indeed the inverse function for f. Assume that x ∈ Z.
We calculate

(g ∘ f)(x) = g(f(x))        Definition 12
           = g(x + 1)       def f
           = (x + 1) − 1    def g
           = x              arithmetic

and so we know that g ∘ f = id_Z. We also have to show that the other composite is the identity, so again assume we have x ∈ Z. We calculate

(f ∘ g)(x) = f(g(x))        Definition 12
           = f(x − 1)       def g
           = (x − 1) + 1    def f
           = x              arithmetic,

so we also have f ∘ g = id_Z, and both equalities together tell us that g is indeed the inverse function for f.

We illustrate how the properties of a function from S to T say something about a function one may construct going from T to S. Note that the following proposition tells us that for an injective function we can find a function which satisfies one of the two equalities required for inverse functions.

Proposition 2.2
The function f : S → T is an injection if and only if either S is empty or we can find a function g : T → S such that g ∘ f = id_S.

Proof. We begin by assuming that the function f is an injection. If S is empty then there is nothing further to prove. If S is non-empty we pick an arbitrary element s• of S. Now given t ∈ T we would like to define g as follows:

g : t ↦ s   if there is s ∈ S with f(s) = t,
    t ↦ s•  else.

First of all we have to worry whether this does indeed define a function: we need to ensure that in the first case, only one such s can exist. But since f is an injection we know that f(s) = f(s′) implies s = s′, and so s is indeed unique. Hence our definition does indeed give us a function g.

Secondly we have to check that the equation for g and f holds as promised. Given s ∈ S we calculate

(g ∘ f)(s) = g(f(s))   def composition
           = s         def g
           = id_S(s)   def identity function,

which completes the proof of this direction.

Now assume that we have S, T and a function g as given. We want to show that f is injective. Assume that we have s and s′ in S such that f(s) = f(s′). We apply g on both sides and obtain

s = id_S(s)       def id
  = (g ∘ f)(s)    g ∘ f = id_S
  = g(f(s))       def ∘
  = g(f(s′))      f(s) = f(s′)
  = (g ∘ f)(s′)   def ∘
  = id_S(s′)      g ∘ f = id_S
  = s′            def id.
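For a function between finite sets, the construction of g in Proposition 2.2 can be carried out by a short program. The following sketch (our own illustration, with our own names, not part of the notes) builds g as a lookup table and falls back to a default element for targets outside the image:

```python
def retraction(f, source, default):
    # Build g with g(f(s)) = s for an injective f on a finite source set,
    # sending elements outside the image to the chosen default element,
    # mirroring the construction in the proof of Proposition 2.2.
    table = {}
    for s in source:
        t = f(s)
        assert t not in table, "f is not injective"
        table[t] = s
    return lambda t: table.get(t, default)

f = lambda x: 2 * x                 # injective on {0, 1, 2, 3}
g = retraction(f, [0, 1, 2, 3], 0)  # 0 plays the role of the element s•

assert all(g(f(s)) == s for s in [0, 1, 2, 3])  # g ∘ f = id on the source
assert g(5) == 0                    # 5 is not in the image, so g returns s•
```

The assertion inside the loop is exactly the uniqueness argument from the proof: if two source elements collided on the same target, f would not be injective.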
If we have a surjective function we get a function that satisfies the other equality for an inverse function:

Proposition 2.3
A function f : S → T is surjective if and only if there exists a function going in the opposite direction, that is g : T → S, such that f ∘ g = id_T.

Proof. We assume first that the function f is surjective. This means that for every t in T we can find s ∈ S such that f(s) = t. Of course there may be many potential choices of such an s, depending on the function f. We define g to be the function which gives us such an s for each input t. Then by definition we have for each t in T that f(g(t)) = t by the construction of g. (Strictly speaking this uses some non-trivial set theory, but we don’t have time to worry about that here. Exactly what is needed depends on the proof that f is surjective.)

The proof that if we have a function g as described then f is surjective is given in the solution to Exercise 47.

Theorem 2.4
A function f : S → T is a bijection if and only if it has an inverse function.

Proof. We carry out the proof in two parts to reflect the two directions of ‘if and only if’.

Assume that f is a bijection. This means that, in particular, it is a surjection, so for every t in T there is s ∈ S with f(s) = t. We would like to define g : T → S by t ↦ s, for this s. A priori it is not clear that this defines a function: how do we know that there exists precisely one such s for each t? Existence follows from surjectivity of f. Uniqueness comes from injectivity of that function: assume we have s and s′ in S with f(s) = t = f(s′). This implies s = s′, and so we have indeed defined a function.

We next show that g is indeed the inverse of f. To show that g ∘ f = id_S, let s ∈ S. Then

(g ∘ f)(s) = g(f(s))   def function comp
           = s         def g
           = id_S(s)   def id.

The last but one step requires further elaboration. Recall that the definition of g is to map t ∈ T to the unique s ∈ S with f(s) = t. But this means that when g is applied to an element of the form f(s) it returns s.

To show that f ∘ g = id_T, let t ∈ T. Then

(f ∘ g)(t) = f(g(t))   def function comp
           = t         def g
           = id_T(t)   def id.

Again the last but one step requires further justification.
We have defined g(t) to be the unique s ∈ S with f(s) = t, so by applying f on both sides we get f(g(t)) = t.

Now assume that we have an inverse function g for f. We have to show that f is both an injection and a surjection. For the former, let s, s′ ∈ S with f(s) = f(s′). Then

s = id_S(s)       def identity function
  = (g ∘ f)(s)    g ∘ f = id_S
  = g(f(s))       def function composition
  = g(f(s′))      f(s) = f(s′)
  = (g ∘ f)(s′)   def function composition
  = id_S(s′)      g ∘ f = id_S
  = s′            def identity function.

To see that f is also surjective, let t ∈ T. Then f(g(t)) = t since f ∘ g = id_T, so g(t) is an element with the property that applying f to it results in t.

Note that the proof given above combines the proofs of Propositions 2.2 and 2.3 with minor alterations.

Note that if we wish to show that a function is bijective we may use this result and instead produce an inverse function.

Example 2.73. Consider the function

f : C → C, z ↦ z + i.

We show that this function has an inverse. We need to find a function that ‘undoes’ the action of f, which takes a complex number and moves it ‘up’ one unit by increasing the imaginary part by 1. To reverse that effect all one has to do is to move it ‘down’ by one unit, so we claim that

g : C → C, z ↦ z − i

is the inverse of f. The formal proof of this is not long. Based on Definition 25 we have to establish that

f ∘ g = id_C and g ∘ f = id_C,

which by definition of the equality of two functions (compare Example 2.18) means establishing the two equalities that follow. Let z ∈ C.

(f ∘ g)(z) = f(g(z))       def funct comp
           = f(z − i)      def g
           = (z − i) + i   def f
           = z.

(g ∘ f)(z) = g(f(z))       def funct comp
           = g(z + i)      def f
           = (z + i) − i   def g
           = z.

Hence we may conclude that the function f is bijective, as is the function g.

We give another example of a function that is injective and surjective, and show how to find its inverse.

Example 2.74. Assume we have the function

f : C → C, x + yi ↦ 2x − y + (x + 2y)i.

We want to know whether it is injective and/or surjective.

Injectivity. Assume we have two elements of the source set, say x + yi and x′ + y′i, which are mapped by f to the same element of the target set, that is,

2x − y + (x + 2y)i = f(x + yi) = f(x′ + y′i) = 2x′ − y′ + (x′ + 2y′)i.
This means that the real and imaginary parts of these two numbers must be equal, so we must have

2x − y = 2x′ − y′ and x + 2y = x′ + 2y′.

The first equality gives us that y′ = 2(x′ − x) + y, and inserting that into the second equality gives

x + 2y = x′ + 2(2(x′ − x) + y) = 5x′ − 4x + 2y.

We add 4x and subtract 2y on both sides to obtain 5x = 5x′, from which we may deduce, by dividing by 5 on both sides, that x = x′. Inserting this back into the equality for y′ we get that

y′ = 2(x′ − x) + y = 2(x − x) + y = y,

and so we have established that overall, x + yi = x′ + y′i, which means that our function is injective.

Surjectivity. Let us assume we have an element a + bi of the target set. We want to find an element x + yi of the source set with the property that

2x − y + (x + 2y)i = f(x + yi) = a + bi,

so we try and find solutions for x and y. Again we know that the real and imaginary parts must be equal, so we may deduce that

2x − y = a and x + 2y = b.

We can see that for the first equation to hold it is sufficient that y = 2x − a, and inserting this into the second equation we get

5x − 2a = x + 2(2x − a) = x + 2y = b,

so if we set

x = (b + 2a)/5

and

y = 2x − a = 2(b + 2a)/5 − a = (2b + 4a − 5a)/5 = (2b − a)/5,

we have found x and y that solve our equation. It’s a good idea to check that we haven’t made a mistake, so we calculate

f((b + 2a)/5 + ((2b − a)/5)i)
  = (2(b + 2a) − (2b − a) + (b + 2a + 2(2b − a))i)/5   def f
  = (2b + 4a − 2b + a + (b + 2a + 4b − 2a)i)/5          calcs in R
  = (5a + 5bi)/5                                        calcs in R
  = a + bi.

Hence our function is indeed surjective.

Inverse function. Since we have established that f is bijective, we know that it has an inverse function. That means that we want to define a function g : C → C with the property that

g ∘ f = id_C and f ∘ g = id_C.

The second equality tells us that g has to undo the effect of f, and we can use the work we did to show that f is surjective to help us. There we answered the question of which element x + yi is mapped by f to a given element a + bi of the target set, which amounts to also answering the question of how to undo the effect f had on its input to give the output a + bi.
In other words we want to write an assignment that maps a + bi to x + yi, where x + yi is the solution we worked out above. The real part of the result has to be equal to (b + 2a)/5, while the imaginary part of the result has to be equal to (2b − a)/5, so we set

g : a + bi ↦ (1/5)(b + 2a + (2b − a)i).

We formally show that this is indeed the inverse function for f. Let x + yi ∈ C. Then

g(f(x + yi)) = g(2x − y + (x + 2y)i)                            def f
  = ((x + 2y) + 2(2x − y) + (2(x + 2y) − (2x − y))i)/5          def g
  = (x + 2y + 4x − 2y + (2x + 4y − 2x + y)i)/5                  calcs in R
  = (5x + 5yi)/5                                                calcs in R
  = x + yi,

while also

f(g(a + bi)) = f((b + 2a)/5 + ((2b − a)/5)i)                    def g
  = (2(b + 2a) − (2b − a) + (b + 2a + 2(2b − a))i)/5            def f
  = (2b + 4a − 2b + a + (b + 2a + 4b − 2a)i)/5                  calcs in R
  = (5a + 5bi)/5                                                calcs in R
  = a + bi.

Note how the second proof is almost identical to the one at the end of the surjectivity argument. So when we do a surjectivity proof, then if our function is also injective we get

• the assignment that gives us the inverse function and

• one of the two proofs that it is indeed the inverse function.

Sometimes giving an inverse function can be easier than doing separate injectivity and surjectivity proofs. If you can give an inverse function to a given function then you may use Theorem 2.4 to argue that the given function is bijective.

Exercise 46. Let f : S → T be a function. Let f[S] be the image of S under f in T (also known as the range of f, see Definition 13). We may define a function f′ as follows:

f′ : S → f[S], s ↦ f(s).

Show that if f is injective then f′ is a bijection.

EExercise 47. Calculate the inverse for the function from C to C given by z ↦ z + 2 − 3i and show it is the required inverse. Without using Theorem 2.4, show how you can use the inverse function to give a surjectivity proof. You can either do that for the function given, or in general, which completes the proof of Proposition 2.3. Use the inverse function to show that the given function is surjective.

Exercise 48. Recall from Exercise 31 the set Fun(S, S) of all functions from a set S to itself. We define a subset of this set,

Bij(S, S) = {f ∈ Fun(S, S) | f is a bijection}.
Show that the composite of two bijections is a bijection. This means that we can use function composition to define a binary operation

Bij(S, S) × Bij(S, S) → Bij(S, S),

which again gives a monoid. Show that the inverse function of an element of Bij(S, S), which is known to exist by Theorem 2.4, is its inverse with respect to the function composition operation. Conclude that Bij(S, S) is a group (under the composition operation).

Chapter 3
Formal Logic Systems

This material is now taught by Renate Schmidt, and you will get her notes for the material. A version of this material from when it was taught by me is available from the course webpage. This is particularly intended for students on the JH Computer Science and Mathematics programme who do not have access to notes on logic otherwise.

Chapter 4
Probability Theory

Probabilities play a significant role in computer science. Here are some examples:

• One mechanism in machine learning is to have estimates for the relative probabilities of something happening, and to adjust those probabilities as the system gets more data. The most popular way of doing this is Bayesian updating, see Section 4.3.4.

• If you are running a server of some kind you need to analyse what the average, and the worst case, load on that server might be to ensure that it can satisfy your requirements. Calculating such averages is one of the techniques you learn in probability theory.

• When trying to analyse data you have to make some assumptions in order to calculate anything from the data. We look at the question of what assumptions have what consequences.

• In order to calculate the average complexity of a program you have to work out how to describe the relative frequency of the inputs, and then calculate the average number of steps taken relative to these frequencies. This means you are effectively calculating the expected value of a random variable (see Section 4.4.6).
• There are sophisticated algorithms that make use of random sampling, such as Monte Carlo methods. In order to understand how to employ these you have to understand probability theory.

4.1 Analysing probability questions

Before we look at what is required formally to place questions of probability on a sound mathematical footing we look at some examples of the kinds of issues that we would like to be able to analyse.

In computer science we are often faced with situations where probabilities play a role, and where we have to make the decision about how to model the situation. Every time we are trying to judge the risk or potential benefits of a given decision we are using probabilistic reasoning, possibly without realizing it. We have to come up with a measure of how big the potential benefit, or the potential disadvantage, is, and temper that judgement by the likelihood of it occurring.

[Footnote 1: You wouldn’t want the student system to go down if all students are trying to access their exam timetable at the same time.]

When somebody buys a lottery ticket, the potential disadvantage is losing their stake money, and the potential advantage is winning something. How many people know exactly what their chances are of doing the latter?

Many games include elements of chance, typically in the form of throwing dice, or dealing cards. When deciding how to play, how many people can realistically assess their chances of being successful?

In machine learning, one technique is to model a situation by assigning probabilities to various potential properties of the studied situation. As more information becomes available, these probabilities are updated (this constitutes ‘learning’ about the situation in question). How should that occur?

When looking at questions of the complexity of algorithms, one often applied measure is the ‘average complexity’, by which we mean the complexity of the ‘average case’ the program will be applied to. How does one form an ‘average’ in a situation like that?
All these questions are addressed in probability theory, but we have to restrict ourselves here to fairly basic situations to study the general principles. The first few problems we look at are particularly simple-minded.

4.1.1 Simple examples

Most people will have been confronted with issues like the following.

Example 4.1. An example much beloved by those teaching probabilities is that of a coin toss. When a fair coin is thrown we expect it to show ‘heads’ with the same probability as ‘tails’. For the chances to be even, we expect each to occur with the probability of 1/2.

What if we throw a coin more than once? We also expect that the outcome of any previous toss has no influence on the next one. This means we expect it to behave along the following lines.

[diagram: a binary tree describing three successive coin tosses, each branching labelled H and T, with every edge carrying the probability 1/2]

In order to work out the probability of throwing, say, HHH, we follow down the unique path in the tree that leads us to that result, and we multiply the probabilities we encounter on the way down, so the probability in question is 1/8. Note that because each probability that occurs in the tree is 1/2, the effect will be that each outcome on the same level will have the same probability, which is as expected.

The tree also allows us to work out what the probability is of having the same symbol three times, that is, having HHH or TTT, which means the event {HHH, TTT} occurring. All we have to do is to add up the probabilities for each of the outcomes in the set, so the probability in question is

1/8 + 1/8 = 1/4.

See Section 4.1.4 for more examples where it is useful to draw trees.

Example 4.2. Whenever we throw a die, we expect each face to come up with equal probability, so that the chance of throwing, say, a 3 at any given time is 1/6. It is quite easy to construct more complicated situations here. What if we throw two dice? What are the chances of throwing two 1s? What about throwing the dice such that the eyes shown add up to 7?
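For two dice these questions are small enough to settle by enumerating all 36 equally likely outcomes. The following sketch (our own illustration, not part of the notes) does exactly that:

```python
from fractions import Fraction
from itertools import product

# all 36 equally likely outcomes of throwing two dice, as ordered pairs
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    # probability of an event = favourable outcomes / all outcomes
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

assert prob(lambda o: o == (1, 1)) == Fraction(1, 36)  # two 1s
assert prob(lambda o: sum(o) == 7) == Fraction(1, 6)   # eyes add up to 7
```

Using exact fractions rather than floats keeps the answers in the same form as the hand calculations.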
See Exercise 54 and Example 4.23 for a detailed discussion of this particular question. There are games where even more dice come into the action (for example Risk and Yahtzee), and while computing all the probabilities that occur there while you’re playing the game may not be feasible, it might be worth estimating whether you are about to bet on something very unlikely to occur.

Example 4.3. A typical source of examples for probability questions is as a measure of uncertainty of something happening. For example, a company might know that the chance of a randomly chosen motherboard failing within a year is some given probability. This allows both the producing company and other manufacturers using the part to make some calculations regarding how many cases of repairs under warranty they are likely to be faced with.

In particular, if you are a manufacturer seeking to buy 100,000 motherboards, then you have to factor in the costs of using a cheaper, less reliable part, compared with a more expensive and more reliable one. If you have a $10 part which has a 5% chance of being faulty within the given period, you would expect to have around 100,000 · 0.05 = 5000 cases. If, on the other hand, you have a $12 part that has a 3% chance of being faulty, then you will have to pay $200,000 more for the parts, and expect to have only 100,000 · 0.03 = 3000 cases of failure under warranty. What is the better choice depends on how expensive it is to deal with each case, how many people you expect to make a claim, and whether you worry about the reputation of your company among consumers. Decisions, decisions. . .

[Footnote 2: The notion of an event is formally defined in Definition 28; for now just think of it as any set of outcomes.]

Example 4.4. When you are writing software you may wonder how well your program performs on the ‘average’ case it will be given. For a toy example, assume that your program takes in an input string, does some calculations, and returns a number.
The number of calculation steps it has to carry out depends on the length of the input string. You would like to know how many calculation steps it will have to carry out on average so that you have an idea how long a typical call to that program will take.

Assume we have a string of length n. There is a function which assigns to each n ∈ N the number of calculation steps performed for a string of that length. It may not be easy to calculate that function, and you will learn more about how one might do that in both COMP112 and COMP261. For the moment let’s assume the function in question is given by the assignment n ↦ n² from N to N.

So now all we need is the average length of an input string to calculate the average number of calculations carried out. But what is that? This will depend on where the strings come from. Here are some possibilities:

• The strings describe the output of another program.

• The strings are addresses for customers.

• The strings encode DNA sequences.

• The strings describe the potential status of a robot (see Example 4.45).

• The strings are last names of customers.

In each situation the average length will be different. You need to know something about where they come from to even start thinking about an ‘average’ case. If we have a probability for each length to occur then we can calculate an average, see Definition 38 for that.

Note that typically the number of instructions that has to be carried out in a computer program depends on more than just the size of the input. With many interesting algorithms (for example searching or sorting ones) what exactly has to be done depends on the precise nature of the input. See Examples 4.97 and 4.99 for a discussion of two such situations.

4.1.2 Counting

When modelling situations using probability we often have to count how many possibilities there are, and how many of those have particular properties. We give some rules here that help with taking care of this.

Selection with return

Assume we are in a situation where there are n options to choose from, and that we may choose the same option as many times as we like. If we choose k many
Selection with return Assume we are in a situation where there are options to choose from, and that we may choose the same option as many times as we like. If we choose many 121 times and we record the choices in the order we made them, then there are possible dierent possibilities. Example 4.5. If we toss a coin then on each toss there are two options, heads and tails. If we toss a coin times then there are 2 many possible combinations. Example 4.6. Let’s assume we have various avours of ice cream, and we put scoops into a tall glass so that they sit one above each other. If you may choose 3 scoops of ice cream from a total of avours then there are 3 many combinations, assuming all avours remain available. Below we show all the combinations of picking two scoops from three avours, say hazelnut, lemon, and raspberry. There are 32 = 9 possible combinations. The reason this is known as ‘selection with return’ is that if we think of the choice being made by pulling dierent coloured balls from an urn (without being able to look into the urn), then one should picture this as drawing a ball, recording its colour before returning it to the urn, drawing a second ball, recording its colour before returning it, and so on. Selection without return If we have a choice of possibilities, and we choose times in a row, but we may not choose the same item twice, then there are (− 1) . . . (− + 1) = ! (− )! dierent combinations, that is listings of choices in the order they were made. 122 Example 4.7. If you have to pick three out of fteen possible runners to nish rst, second and third in that order there are 15 · 14 · 13 = 2730 possibilities. Example 4.8. If you have a program that gives you a design for a webpage, where you have to pick three colours to play specic roles (for example, background, page banner, borders), and there are 10 colours overall, then you have 10 · 9 · 8 = 720 combinations. Example 4.9. 
Returning to the ice cream example: if children are given a tall glass in which they each are allowed two scoops from three flavours, but they may pick every flavour at most once (to make sure popular flavours don't run out), then they have the following choices. There are now 3 · 2 = 6 possibilities.

This is known as selection without return because we can think of it as having an urn with differently coloured balls, from which we choose one ball after the other, without returning them to the urn, and recording the colours in the order they appear. What happens if the balls don't each have a unique colour?

Ordering

If we have n different items then there are n! many ways of ordering them, that is, of writing them one after the other. This is the same as choosing without return n times from n possible options. If the items are not all different then the number of visibly different possibilities is smaller.

Example 4.10. If we have a red, a blue, and three black mugs and we are lining them up in a row then the number of possibilities is 5!/3! = 20. There would be 5! possibilities for lining up 5 different mugs, but in each one of those we wouldn't spot the difference if some of the black mugs were swapped. There are 3! ways of lining up the three black mugs (but if we assume that the mugs are indistinguishable then we cannot tell the difference between the different orderings).

In general, if we have items of k designs, and there are m₁ copies of the first design, m₂ copies of the second, and so on, to mₖ items of the k-th design, then there are

(m₁ + m₂ + ⋯ + mₖ)! / (m₁! · m₂! · ⋯ · mₖ!)

visibly different ways of lining up the items.

Selection without ordering

Sometimes we are confronted with the situation where we have to count how many different selections there are, but where we are not told the order in which this selection arises. A typical example is a lottery draw. One way of counting these is to list all the options as we have done above, but that gets cumbersome if the numbers involved are bigger.
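The mug count in Example 4.10 can be verified mechanically: generate all orderings of the five mugs and collapse those that look the same. This small Python check (an illustration added here, not from the notes) uses only the standard library.

```python
from itertools import permutations
from math import factorial

# one red (R), one blue (B), and three indistinguishable black (K) mugs
mugs = ['R', 'B', 'K', 'K', 'K']

# permutations() produces all 5! orderings; set() collapses the ones
# that are visibly identical because only black mugs were swapped
visibly_different = set(permutations(mugs))

print(len(visibly_different))        # 20
print(factorial(5) // factorial(3))  # the formula 5!/3! = 20
```

The same trick checks the general formula (m₁ + ⋯ + mₖ)!/(m₁! ⋯ mₖ!) for any small multiset of items.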
An alternative way of counting is to count how many selections there are with ordering being taken into account, and then to divide by the number of different orderings there are for each choice.

Example 4.11. If we return to Example 4.9, then we can look at the situation where the children are given a shallow bowl rather than a tall glass with scoops of ice cream. Again they are allowed to choose two scoops from three flavours, and again they may pick every flavour at most once. We know from Example 4.9 that there are 6 possible combinations when the order is taken into account. For each choice of two flavours there are two ways of ordering them, so we now have (3 · 2)/2 = 3 combinations.

In general, when k items are picked from a choice of n different ones, there are

n(n − 1) ⋯ (n − k + 1)/k! = n!/((n − k)! · k!)

different selections.

Summary

The formulae given above for the number of possibilities are summarized in the following table. Here n is the number of items available and k is the number of items that are selected. Note that the assumption is that in the unordered case, all items are different.

                    ordered            unordered

with return         nᵏ
without return      n!/(n − k)!        n!/((n − k)! · k!)

Note that there is no simple formula for the number of possibilities there are when looking at unordered selections of items some of which may be identical. In this case the formula for the number of different orderings may be useful. This says that if there are items of k kinds altogether, with m₁ indistinguishable copies of a particular kind, m₂ copies (also indistinguishable among themselves) of a second kind, and so on, then there are

(m₁ + m₂ + ⋯ + mₖ)! / (m₁! · m₂! · ⋯ · mₖ!)

many visibly different orderings.

Optional Exercise 7. Work out why there is no simple formula as discussed in the previous paragraph by looking at some examples.

Exercise 49. Assume you have 3 red socks and 5 black ones. Answer the following questions.

(a) Assume we put all the socks into a bag.
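The entries of the summary table correspond to standard library functions in Python, which makes it easy to sanity-check them; the sketch below (added for illustration, not part of the notes) does this for the shallow-bowl example and for a 6-from-49 lottery draw.

```python
from itertools import combinations
from math import comb, factorial, perm

# unordered selection without return: the children with the shallow bowl
n, k = 3, 2
print(len(list(combinations(range(n), k))))  # 3

# the table's formulas for n = 49, k = 6 (a lottery draw)
ordered = perm(49, 6)                # n!/(n-k)!, order taken into account
unordered = comb(49, 6)              # n!/((n-k)! k!)
assert ordered == factorial(49) // factorial(43)
assert unordered == ordered // factorial(6)  # divide out the k! orderings
print(unordered)                     # 13983816 possible lottery draws
```

Note how `comb` is exactly `perm` divided by k!, mirroring the derivation in the text.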
Four times we draw a sock from the bag, putting it back each time. How many different draws are there?

(b) Make the same assumption as for the previous part, but now assume we don't put the drawn socks back into the bag. How many draws are there?

(c) Assume we put the socks onto a pile, close our eyes, mix them around, and pick four socks from the pile. How many different combinations do we get?

(d) Can you answer the same questions if you assume we have m red and n black socks? What if we pick k (for k ≤ m + n) many socks on each occasion?

Exercise 50. A researcher in the rain forest has left his laptop unattended and a curious monkey has come to investigate. When the researcher looks up from the plant he is studying he sees the monkey at the keyboard. He makes threatening noises as he runs back. Assume that every time he shouts there's a 50% chance that he will manage to disrupt the monkey before it makes another key stroke, and that he will have reached the laptop before he has shouted six times. Draw a tree similar to that in Example 4.1 for the situation. What do you think is the average number of key strokes the monkey will manage in this situation?

4.1.3 Combinations

Sometimes we have to combine these ideas to correctly count something.

Example 4.12. If we throw a coin three times then there are 2³ many possible outcomes. If we want to know how many of those contain at least two heads we have to think about how best to count the number of possibilities. One possibility is to say that we are interested in
• the situation where there are three heads, of which there is one combination, and
• the situation where there are two heads and one tails. This asks for the number of different ways of ordering H, H, T, and there are 3!/2! = 3 of those (or there are three positions where the unique T can go, and then the two H take up the remaining positions).
But this way of thinking does not scale well.
What if we want to know how many outcomes have at least 10 heads when we toss the coin 20 times? Following the above idea we have to add up the number of combinations with 20, 19, 18, and so on, down to 10 occurrences of H. Or we can argue that there are 2²⁰ possibilities overall; of these 20!/(10! · 10!) contain exactly ten times heads and ten times tails, and of the remaining combinations half will have a higher count of heads, and half will have a higher count of tails. There are

20!/(10! · 10!) = (2 · 19 · 2 · 17 · 2 · 15 · 2 · 13 · 2 · 11)/5! = 19 · 17 · 2 · 13 · 2 · 11 = 184756

ways of ordering ten heads and ten tails. The number of combinations with at least 10 heads is then

(2²⁰ − 184756)/2 + 184756 = 2²⁰/2 + 184756/2 = 616666.

By thinking about how to count in the right way calculations can be shortened significantly.

CExercise 51. Work out how many outcomes there are in the following cases. Please give an expression that explains the number you have calculated.

(a) Four digit personal identification numbers (PINs). How many times do you have to guess to have a 10% chance of finding the correct PIN?

(b) How many passwords are there using lower case letters? How many times do you have to guess now to have a 10% chance of being correct?

(c) What if upper case letters are included?

(d) How many possible lottery draws are there if six numbers are drawn from 49? How many bets do you have to make to have a 1% chance of having all numbers correct?

(e) Assume you have an array consisting of 10 different integers. What is the probability that the array is sorted? What happens if the integers are not all different?

(f) Assume you have an array consisting of 30,000 id numbers. What is the probability that you randomly pick the one you were looking for? What can you say about the case where the array is sorted?

(g) In an examples class there are 60 students and 6 TAs. Each TA marks 10 students.
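The at-least-10-heads calculation above can be cross-checked directly; the short Python sketch below (added for illustration, not part of the notes) computes the answer both by the symmetry argument and by the brute-force sum over 10, 11, …, 20 heads.

```python
from math import comb

# exactly ten heads in twenty tosses: 20!/(10! * 10!)
exactly_ten = comb(20, 10)
print(exactly_ten)   # 184756

# of the remaining 2**20 - 184756 outcomes, exactly half have more
# heads than tails, by the symmetry of swapping heads and tails
at_least_ten = (2**20 - exactly_ten) // 2 + exactly_ten
print(at_least_ten)  # 616666

# cross-check: sum the combinations with 10, 11, ..., 20 heads
assert at_least_ten == sum(comb(20, h) for h in range(10, 21))
```

Both routes agree, which is exactly the point of the symmetry argument: it replaces eleven binomial coefficients with one.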
Assuming the students all have sat down in groups of ten, how many different combinations of TAs and groups are there? What is your chance of having a particular TA this week?

(h) Assume that there are 6 people who want to randomly split into three teams. For this purpose they put two red, two green and two yellow ribbons into a bag, and each person picks one of those out without looking into the bag. What is the probability that Amy will be on the red team? What is the chance that she will be on the same team as Zenia? How many different ways of splitting the six members into teams are there?

(i) Students from CSSOC are wearing their hoodies. Four of them have a purple, two a green, and one a black one. They line up in a queue to leave the room they are in. What is the probability that all the people in the same colour hoodie are next to each other? What is the probability that no two people wearing a purple hoodie are next to each other?

Exercise 52. Work out how many outcomes there are in the following cases. Please give an expression that explains the number you have calculated.

(a) Assume you are at a party. Somebody asks each person when their birthday is. How many people have to be at the party for the probability that two of them share a birthday to be larger than 50%?³

(b) Assume you are composing a phrase of music over two four-beat bars. You may use one octave, and any duration from a quaver (an eighth note) to a semibreve (a whole note). How many melodies are there?

4.1.4 Using trees

Sometimes we can picture what happens in a situation by using trees to provide structure.

Example 4.13 (Drawing socks). We may use trees to gain a better understanding of a particular situation. The name 'decision tree' is slightly misleading here since we do not just model decisions that somebody might make but also random moves.
³This is known as the birthday paradox, although it is not strictly a paradox, merely a question with a surprising answer. It is why computer scientists have to worry about collisions when designing hash tables.

Assume you have a drawer with six individual socks, three red and three black (let's not worry about how you ended up with an odd number of socks in both colours). We may answer the question of how many socks we have to pick in order to be sure to get one matched pair: if we pick three socks then there will be at least two which are the same. But what if we want to know how many socks we have to pick to have a chance of at least 50% of achieving this? We picture our first two draws as follows.

[tree diagram: the first draw is red or black with probability 1/2 each; after the first sock the second draw shows the other colour with probability 3/5 and the same colour with probability 2/5]

What is the chance of having two socks of the same colour after two attempts? Of the four possible outcomes two are of the kind we want, namely RR and BB. In order to find out the probability of these two events we multiply probabilities as we go down the tree.

• RR. The probability for this event is determined by multiplying the probabilities that appear along the path from the root of the tree to that outcome, so it is 1/2 · 2/5 = 1/5.
• The probability of BB is determined in the same way, and also works out to be 1/2 · 2/5 = 1/5.

To calculate the probability of the event of having two socks of the same colour, {RR, BB}, we add the probabilities of the two outcomes contained, and so we have

1/2 · 2/5 + 1/2 · 2/5 = 2/5 = 40%.

Hence in order to guarantee a success rate of at least 50% we have to have (at least) three draws, and in that case we know we will have a 100% success rate.

Let us look at the question of picking at least two black socks. With two draws the chance of succeeding is 2/10 = 1/5. If we add a third draw we get the following.
[tree diagram: three draws without return, with probabilities 1/2 on the first level, 2/5 and 3/5 on the second, and 1/4, 3/4 or 1/2, 1/2 on the third, depending on which socks have already been drawn]

We may now calculate the probability that any of these draws occurs by multiplying the probabilities that occur along the corresponding path; we give these probabilities below each leaf:

[the same tree with the leaf probabilities 2/40, 6/40, 6/40, 6/40, 6/40, 6/40, 6/40, 2/40]

The outcomes where we have at least two black socks are {RBB, BRB, BBR, BBB}. If we add up their probabilities we get

6/40 + 6/40 + 6/40 + 2/40 = 20/40 = 50%.

You might arrive at this result without drawing the tree, but it certainly clarifies matters to have it at hand, and if you have to answer more than one question about some situation you only have to draw it once.

Example 4.14 (Gold and Silver). Assume there are three bags, each with two coins. One has two coins of gold, another two coins of silver, and a third one coin of each kind. Somebody randomly picks a bag, and then draws a coin from the bag without looking inside. We are shown that the selected coin is gold. What is the chance that the remaining coin from that bag is also gold? Again we use a tree to understand what is happening.

[tree diagram: each bag is chosen with probability 1/3, and within each bag either coin is drawn with probability 1/2]

If we know that a gold coin has been drawn we must be seeing the first, second or third outcome from above. All these are equally likely, with a probability of 1/6 each. Two out of the three have a second coin which is also gold, so the desired probability is 2/3.

Instead of explicitly looking at both coins in the bag, as we did in the tree above, we could have a different event, namely the colour of the drawn coin. If those are our chosen outcomes then the corresponding tree looks like this.

[tree diagram: the gold–gold bag yields gold with probability 1/3, the mixed bag yields gold or silver with probability 1/6 each, and the silver–silver bag yields silver with probability 1/3]

Now we argue that knowing the drawn coin is gold tells us that we have either the first or the second outcome. The former occurs with probability 1/3, the second with probability 1/6 overall, so the former is twice as likely as the latter, again giving a probability of 2/3 that the second coin is also gold.

Example 4.15.
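The 50% answer from the sock tree can also be obtained by enumeration: if we treat the six socks as individually distinguishable, every ordered draw of three socks is equally likely, and we just count the favourable ones. The Python sketch below (an added illustration, not part of the notes) does exactly that.

```python
from itertools import permutations
from fractions import Fraction

socks = ['R'] * 3 + ['B'] * 3  # three red, three black

# every ordered draw of three distinct socks is equally likely
draws = list(permutations(socks, 3))  # 6 * 5 * 4 = 120 draws
favourable = sum(1 for d in draws if d.count('B') >= 2)
prob = Fraction(favourable, len(draws))
print(prob)  # 1/2, matching the tree calculation of 20/40
```

Using `Fraction` keeps the arithmetic exact, so the result can be compared directly with the 20/40 read off the tree.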
Bonny⁴ and Clyde are playing a game. They put two yellow and four green ribbons into a bag. Without looking inside, each of them reaches into the bag and draws a ribbon. If the ribbons have the same colour Bonny wins, and if they are different, then Clyde wins. We want to know whether the game is fair, that is, whether they both have an equal chance of winning. This question is much easier to answer if we draw a tree.

[tree diagram: the first ribbon is yellow with probability 1/3 or green with probability 2/3; the second draw has probabilities 1/5, 4/5 after yellow and 2/5, 3/5 after green]

From the tree we can read off that the probability of drawing the same colour, the event {YY, GG}, is

1/3 · 1/5 + 2/3 · 3/5 = 7/15,

while the probability of drawing different colours, the event {YG, GY}, is

1/3 · 4/5 + 2/3 · 2/5 = 8/15.

The two numbers are different and so the game is not fair. Clyde has a higher chance of winning.

Example 4.16 (The Monty Hall problem). A well-known problem that we may use for illustrative purposes is known as the Monty Hall problem. Imagine you are in a game show. There are three closed doors labelled A, B and C, and you know that behind one of them is a valuable prize (in the original story a car) and behind two of them is something not worth having (in the original story a goat). The way the game works is that you pick a door, and then the show master opens one of the remaining doors. You see the booby prize. You are now offered the chance to switch to the other closed door. Should you switch, or stick with your original choice?

This situation has been endlessly discussed among various groups of people, often because somebody knows the solution and somebody else doesn't want to believe it. So how does one model a situation like that reliably? Usually when there are steps in a situation it is worth modelling these steps one by one. What do we know for sure? We know that at the beginning there are three doors, let's assume with two goats and a car. We assume that the probability of the car being behind any one of the doors is the same.
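The fairness question for Bonny and Clyde can be settled by the same enumeration trick as for the socks: treat the six ribbons as distinguishable, so all ordered pairs of draws are equally likely. This check (added for illustration, not part of the notes) confirms the 7/15 versus 8/15 split.

```python
from itertools import permutations
from fractions import Fraction

ribbons = ['Y'] * 2 + ['G'] * 4  # two yellow, four green

# all ordered pairs of two distinct ribbons, 6 * 5 = 30 of them
pairs = list(permutations(ribbons, 2))
same = Fraction(sum(1 for a, b in pairs if a == b), len(pairs))
print(same)      # 7/15, Bonny's chance of winning
print(1 - same)  # 8/15, Clyde's chance of winning
```

Since 7/15 < 8/15, the enumeration agrees with the tree: the game favours Clyde.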
From the point of view of the contestant this is like a random event. The production company picks an actual door, and there is no way of telling how they decide which one to hide the main prize behind, but one might hope that they really do pick any door with probability 1/3, and that's the assumption the contestant should make. The action of the show master afterwards has to depend on the choice made by the contestant, and we make the additional assumption that if the show master has a choice of opening a door he will open them with equal probability.

We can model the choices step by step using a tree. In the first step we model the fact that the car might be behind any one of the doors.

[tree diagram: a root with three branches labelled A, B and C]

We put probabilities in the tree which indicate that the car can be behind each of them with equal probability.

⁴This is a past exam question.

[the same tree with probability 1/3 on each branch]

There are three possibilities for the player to choose a door. But note that the player does not know which of these three positions she is in. In game theory one says that the leaves of the tree are in the same information set. So from the player's point of view there are three choices (pick door A, B or C), and she cannot make that dependent on where the car is since she does not have that information. This is similar to the situation in many card games where the player has to choose what to play without knowing where all the cards are situated. Only in the course of further play does it become clear what situation the players were in. In the tree we denote this by a dashed line connecting the positions which the player cannot distinguish.

[tree diagram: the branches with probability 1/3 each, with a dashed line connecting the positions the player cannot tell apart]

The next step is for the show master to open one of the doors showing the booby prize. In some cases there is only one possible door to open, in others there is a choice between two, and we assume that he picks either one of them with equal probability.

[tree diagram: in each subtree the show master's possible doors are opened with probability 1/2 each where he has a choice, and with probability 1 where he does not]

Now the player has to decide whether she wants to switch or not.
Again we draw the possible options. We first give the door where the player has not switched, and then the one where she has.

[game tree with four layers: the location of the car, the player's first choice, the door the show master opens, and the stick/switch decision]

We note that the three principal subtrees are the same up to renaming of nodes. This is because from the player's point of view, there are only three options (which door to pick), and the first step drawn in the tree (the selection of the door which hides the prize) is completely hidden. The result of that step is not revealed to the player until the end of the game.

We look at the question of what happens if the player switches or sticks with her first choice. In purple we highlight the case where the player's first choice is A. The remaining two possibilities give the same result. We also highlight those positions where the player wins the main prize by giving them in bold. We now look at two strategies:

• Pick A on the first move and then stick with this choice, given in blue (this is the left choice in the fourth layer of the tree).
• Pick A on the first move and then switch when given the chance to do so, given in red (this is the right choice in the fourth layer of the tree).

[game tree with the two strategies and the winning positions highlighted]

If the player picks door A on the first move, and then sticks to that choice, there is a chance of 1/3 that the original choice was correct, and then no matter which door is opened the main prize is achieved. So a player who does not switch will get the main prize with probability 1/3.

A player who picks door A on the first move and then switches had the correct door with probability 1/3 and then switches away, which means that she obtains the main prize with probability 2/3. For this reason a player who picks a door and then switches has a chance of getting the main prize which is twice as high as that of the player who sticks.
Note that in particular we can see from this example that it may be that what looks like one choice to the player (for example 'pick door A') is effectively a choice taken in a number of different situations the player cannot distinguish between (here the player does not know behind which door the prize is hidden)—in that case the choice will be reflected in several subtrees. Further note that the tree drawn here is a game tree, which is slightly different from the usual probability trees we draw. Read on for a slightly different treatment of the situation where we look at a different tree.

Example 4.17 (The Monty Hall problem ctd). Above we gave a game tree to describe this situation because that is a good way of capturing the interactions. How might we describe it using our idea of trees for probabilities? This time we start with the idea that we choose a door.

[tree diagram: a root with three branches labelled A, B and C]

On the following level we model the actual situation. If we look at the doors A, B and C in that order then we write, for example, CGG for the actual situation to indicate that the car is behind door A while there are goats behind doors B and C. We only draw the part of the tree where we have chosen door A—as before, the others will be similar, with just the roles of the doors exchanged. Since we know nothing about the actual locations of the car and the goats we use probabilities to indicate that they are all equally likely as far as we know.

[tree diagram: below the choice of door A, the three situations CGG, GCG and GGC each appear with probability 1/3; the subtrees for B and C are analogous]

The next thing that happens is that Monty opens one of the doors to show us a goat. We can see that when the situation is GCG he has to open door C, and when the situation is GGC he has to open door B. He has a choice regarding whether to open door B or C if the situation is CGG, and since we don't know whether he has any bias we model this by assuming that both options are equally likely. We show the revealed goat in red to indicate which door has been opened.

[tree diagram: from CGG Monty opens door B or C with probability 1/2 each; from GCG and GGC there is only one door he can open]
Again, we have two possible strategies if we originally choose door A:
• stick with our original choice or
• switch to the remaining door.
In the tree below we draw the wins for the two strategies in the colours indicated above.

[the same tree with the winning leaves for 'stick' and 'switch' marked in the respective colours]

As we can now see, the probability that we win if we stick is

1/3 · 1/2 + 1/3 · 1/2 = 1/3,

while the probability that we win if we switch is

1/3 + 1/3 = 2/3.

Note that we can tell if a tree properly describes a probabilistic situation:

Proposition 4.1
A tree with probabilities along some of the edges describes a situation of choices and probabilistic moves if and only if for every node in the tree the probabilities of the edges going down from that node add up to 1.

Tip

Drawing a tree is often very useful when trying to understand probabilities. For a tree to work you have to
• structure the process being described into distinct stages, each of which describes either a probabilistic process or a choice some agent may make (as in the Monty Hall problem);
• for each such probabilistic process find some way of describing its possible outcomes (for example all the possible deals in a card game, or all the possible first cards you might receive in such a deal)—for example, the colour of the sock drawn in Example 4.13;
• annotate each branch that stands for a particular outcome of some probabilistic process with the probability that it occurs—in the same example the probability that we draw a red/black sock, given which socks have already been drawn;⁵
• make sure the leaves of the tree cover all the overall outcomes you are interested in—for example, the various combinations of socks we may obtain having drawn three times in Example 4.13.

Note that there can be several trees that describe the given situation, and which one suits you best will depend on what you are expected to calculate with that tree.

Tip

Once you have a tree it is easy to calculate probabilities of specific outcomes.
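The 1/3 versus 2/3 conclusion is also easy to check empirically. The Python simulation below (an added sketch, not part of the notes) plays the game many times under the same assumptions as the trees: the car and the first pick are uniform, and Monty opens a uniformly chosen door that hides a goat and was not picked.

```python
import random

def play(switch, trials=100_000, rng=random.Random(0)):
    # simulate the Monty Hall game and return the observed win rate
    wins = 0
    for _ in range(trials):
        car = rng.choice('ABC')
        pick = rng.choice('ABC')
        # Monty opens a door that is neither the pick nor the car
        opened = rng.choice([d for d in 'ABC' if d != pick and d != car])
        if switch:
            # move to the one remaining closed door
            pick = next(d for d in 'ABC' if d != pick and d != opened)
        wins += (pick == car)
    return wins / trials

print(play(switch=False))  # close to 1/3
print(play(switch=True))   # close to 2/3
```

With 100,000 trials the observed rates settle within about one percentage point of 1/3 and 2/3, matching both tree analyses.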
• The probability of a given leaf (which corresponds to an overall outcome of the situation we want to describe) can be computed by multiplying all the probabilities that occur on the path from the root of the tree to that leaf.
• The probability that a particular set of outcomes occurs can be calculated by adding the probabilities of all the leaves of the tree which belong to that set.

CExercise 53. Suppose we have a deck of four cards, two aces and two kings, {A♠, K♠, A♡, K♡}. I draw two cards from this pack so that I can see their values, but you cannot. You tell me to drop one of my cards, and I do so. You ask me whether I have the ace of spades A♠ in my hand, and I answer yes. What is the probability that the card I dropped is also an ace? Hint: Draw a tree, but note that if you read the given information carefully you don't have to draw all possibilities. How many different draws are there? You'll make your life more complicated if your tree contains more nodes than needed.

Exercise 54. Assume you are throwing two dice, a red and a blue one.

(a) What is the probability that the sum of the eyes is exactly 4?

(b) What is the probability that the sum of the eyes is at least seven?

(c) What is the probability that there is an even number of eyes visible?

(d) What is the probability that the number on the red die is higher than that on the blue?

⁵Note that by the previous proposition, if we add up the probabilities annotating all the branches that start at one particular location, the result must be 1.

4.1.5 Further examples

In the previous sections it was clear from the context which principles you had to apply to find a solution. The point of the following exercises is that you first have to think about what would make sense in the given situation.

EExercise 55. Assume two teams are playing a 'best out of five' series, which means that the team that wins three matches is the winner of the series.⁶ Note that once it is clear that one side has won, the remaining matches are no longer played.
For example, if one team wins the first three matches the series is over.

(a) Assume that the two teams are equally matched. After what number of matches is the series most likely to end?

(b) How does the answer change if the probability of one team winning is 60%?

Exercise 56. Solve the same problem as for the previous exercise, but with a 'best out of seven' series.

CExercise 57. Imagine you have a die that is loaded in that even numbers are twice as likely to occur as odd numbers. Assume that all even numbers are equally likely, as are all odd numbers.

(a) What is the probability of throwing an even number?

(b) What is the probability that the thrown number is at most 4?

(c) With two dice of this kind what is the probability that the combined number of eyes shown is at most 5?

Exercise 58. Assume you have a coin that shows heads half the time and tails the other half, also known as a fair coin. Assume the coin is thrown 10 times in a row.

(a) What is the probability that no two successive throws show the same side?

(b) What is the probability that we have exactly half heads and half tails?

(c) What is the chance of having at least five subsequent throws showing the same symbol?

Exercise 59. Assume we toss a fair coin until we see the first heads. We want to record the number of tosses it takes. What is the probability that we require more than 10 tosses?

⁶Such series take place, for example, in men's matches in Grand Slam tennis tournaments, where the winner of each match is determined in a 'best out of five' series of sets. In women's matches, and men's matches outside of Grand Slam tournaments, the winner is determined in a 'best out of three' series.

4.2 Axioms for probability

In the examples above we have assumed that we know what we mean by 'probability', and that we have some rules for calculating with such numbers.

4.2.1 Overview

This section puts these intuitive ideas onto a firm mathematical footing. It does so in a very general way which you may find difficult to grasp.
However, by setting this up so generally we give rules that can be applied to any situation. Thinking about these rules also encourages you to think about how to model specific situations you are interested in, and to take care with how you do so.

The idea underlying probability theory is that we often find ourselves in a situation where we can work with
• a sample space S of all possible outcomes,
• a set of events ℰ (which is a subset of the powerset of S) and
• a probability distribution which is given by a function P : ℰ → [0, 1], where [0, 1] is the interval of real numbers from 0 to 1.

Example 4.18. The simplest kind of probability space is one where there are n options, say S = {1, 2, . . . , n}, and all these occur with equal probability. In this case the set of events is the set of all subsets of S, and the probability distribution P : 𝒫S → [0, 1] is given by the assignment

A ↦ |A|/n,

that is, every set is mapped to its number of elements divided by n. We give precise definitions of what we mean by these notions below, but for the moment let's look at a slightly more complicated example.

Example 4.19. In a simple dice game the participants might have two dice which they throw together. If the aim of the game is to score the highest number when adding up the faces of the dice then it makes sense to have the possible outcomes

S = {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}.

We call the set of possible outcomes the sample space. We could now ask what the probability is of throwing at most 5, which is the event {2, 3, 4, 5}. This example is continued below.

Typically when we have a finite set of outcomes we assume that the set of events ℰ is the whole powerset 𝒫S. When we have an infinite set of outcomes this is not always possible. There is a field in mathematics called measure theory which is concerned with which sets of events can be equipped with probability functions, but that goes beyond this course.
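The uniform distribution of Example 4.18 is simple enough to implement directly; the Python sketch below (an added illustration, not part of the notes) uses it to answer the question from Example 4.19. Note that instead of the eleven possible sums, which are not equally likely, it takes the 36 equally likely (red, blue) pairs as the underlying sample space, which is a standard modelling choice.

```python
from fractions import Fraction

# sample space: the 36 equally likely outcomes of throwing two dice
pairs = [(r, b) for r in range(1, 7) for b in range(1, 7)]

def prob(event):
    # the uniform distribution of Example 4.18: P(A) = |A| / n
    return Fraction(len([p for p in pairs if p in event]), len(pairs))

# the event 'the sum of the two dice is at most 5'
at_most_5 = {p for p in pairs if p[0] + p[1] <= 5}
print(prob(at_most_5))  # 5/18
```

The event 'throwing at most 5' contains 10 of the 36 pairs, giving 10/36 = 5/18; working with pairs rather than sums is what makes the simple |A|/n formula applicable.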
Often it is possible to make the sample space finite, and this frequently (but not always) happens for computer science applications.

Example 4.20. The following example is a toy version of a problem that was the basis of a lab in the introductory AI unit. It is concerned with a robot wanting to learn its location in a two-dimensional space. If we think of the location as being given by two coordinates, and the coordinates as real numbers, then there are uncountably many locations in the unit square [0, 1] × [0, 1]. But we cannot measure the location of the robot up to infinite precision (and indeed, we're not interested in the answer to that level of precision), and in the robot exercise a 100 × 100 grid is imposed on the space, and we are only interested in which one of the squares in the grid the robot inhabits. This means the sample space now has only 100 · 100 = 10,000 elements.⁷

Example 4.21. If you are interested in the price of a commodity, it typically makes sense to measure the price only up to a limited precision (typically a few post-decimal digits), and again this has the effect of making the sample space finite.

We consider many finite sample spaces in this chapter, but we do have a look at infinite notions as well, since these also occur in some applications in computer science. For this reason we here give definitions which are general enough to apply to both cases.

Example 4.22. When we consider events that happen over a given time frame it is often more convenient to treat that time frame as a real interval [t, t′], where t is the start time and t′ is the end time. The reason for this is that the entities we would like to compute can typically be computed with the help of integrals. These ideas are pursued in Examples 4.28, 4.29 and 4.85 and Exercises 63 and 88.

⁷See Example 4.45 for a simplified version of this scenario.

Not every function satisfies the requirements of a probability function, and we look at what properties we expect below.
In order to formulate what we expect from a probability function we first have to look at what we expect from the set of events.

4.2.2 Events and probability distributions

The following two definitions are given here for completeness' sake.⁸ Above we did not worry about the properties required of probability distributions, and we also did not wonder whether a given set of outcomes could be an event, or not. When we wish to consider a sample space that is uncountable⁹, for example the real interval [0, 1], it is difficult to find a probability space for this set of outcomes. There are two difficulties:

• If we want to assign the same probability to each element of [0, 1] then this probability has to be 0 (otherwise the probability for the whole interval would be infinite, compare Proposition 4.4). The only way of defining a probability function with this property is to define a function that takes as its input events (that is, sets of outcomes).
• It is not possible to give a probability distribution that assigns a probability to every subset of [0, 1], see Proposition 4.5.

How to define a probability space in this situation is sketched in Proposition 11. It has the property that the probability of any interval [t, t′] in [0, 1] is proportional to t′ − t, that is, its probability is determined by its length. For this reason we describe here which collections of subsets of the sample space are suitable to form a probability space.

Definition 26: σ-algebra
Let S be a set. A subset ℰ of 𝒫S is a σ-algebra provided that
• the set S is in ℰ,
• if A is in ℰ then so is its complement S ∖ A, and
• if Aᵢ is in ℰ for i ∈ N then their union ⋃ᵢ∈N Aᵢ is in ℰ.

We note some consequences of this definition. First of all, since S is in ℰ we may form its complement to get another element of ℰ, and so ∅ = S ∖ S is in ℰ. Further note that the union of a finite number of events must also be an event: if we have events A₀, A₁, . . . , Aₙ then we can set Aᵢ = ∅ for i > n, and then

⋃ᵢ∈N Aᵢ = A₀ ∪ A₁ ∪ · · · ∪ Aₙ.
Note that for every set S the powerset 𝒫S is a σ-algebra.

⁸In particular these two definitions are not part of the examinable material.
⁹See Definition 46; for now stay with the example of the unit interval.

Events which are disjoint play a particular role: If we have two sets of possible outcomes, say A and A′, and these sets are disjoint, then we expect that the probability of A ∪ A′ is the probability of A added to that of A′. But this is not a property of just two sets of outcomes; sometimes we need to apply it to larger collections of sets. This means we have to worry about what the appropriate generalization of 'disjoint' is.

If we have three sets of outcomes, events A, A′ and A″, then in order for P(A ∪ A′ ∪ A″) to be equal to P(A) + P(A′) + P(A″) it must be the case that none of these sets 'overlap', in other words, we need that
A ∩ A′ = ∅, A ∩ A″ = ∅, A′ ∩ A″ = ∅,
as for example in the following picture. (Picture: three non-overlapping sets A, A′ and A″.)

If we want to apply this idea to more than three sets we need to use a general definition.

Definition 27: pairwise disjoint
Let S be a set. Further assume that we have an arbitrary set I, and that for each element i ∈ I we have picked a subset Aᵢ of S. We say that the collection of the Aᵢ, where i ∈ I, is pairwise disjoint if and only if for i, j ∈ I we have that i ≠ j implies Aᵢ ∩ Aⱼ = ∅.

This means that the sets we have picked for different elements of I do not overlap. Recall our definition of disjoint union from Chapter 0 as a union of sets that do not overlap.

EExercise 60. Assume that we have a set S. We are also given two disjoint subsets S₁ and S₂ of S and a collection Aᵢ, for i ∈ ℕ, of pairwise disjoint subsets of S.
(a) Show that for B ⊆ S we have that B ∩ S₁ and B ∩ S₂ are disjoint. If you can do the next part without doing this one you may skip it.
(b) Show that for B ⊆ S we have that the B ∩ Aᵢ form a collection of pairwise disjoint sets.
(c) Show that for B ⊆ S we have that B ∩ (S₁ ∪ S₂) = (B ∩ S₁) ∪ (B ∩ S₂). If you can do the next part without doing this one you may skip it.
(d) Show that for B ⊆ S we have that B ∩ ⋃_{i∈ℕ} Aᵢ = ⋃_{i∈ℕ} (B ∩ Aᵢ).
(e) Show that if B ⊆ S₁ ∪ S₂ then B is the disjoint union of B ∩ S₁ and B ∩ S₂. If you can do the next part without doing this one you may skip it.
(f) Show that if B ⊆ ⋃_{i∈ℕ} Aᵢ then B is the disjoint union of the B ∩ Aᵢ.

Definition 28: probability space
A probability space is given by
• a sample set S;
• a set of events ℰ ⊆ 𝒫S which is a σ-algebra; and
• a probability distribution, that is a function P : ℰ → [0, 1], with the properties that
  – P(S) = 1 and
  – given Aᵢ, for i ∈ ℕ, pairwise disjoint¹⁰, then¹¹
    P(⋃_{i∈ℕ} Aᵢ) = ∑_{i∈ℕ} P(Aᵢ).

These axioms for probability go back to the Russian mathematician Andrey Kolmogorov, who was trying to determine what the rules are that make probabilities work so well when describing phenomena from the real world. His rules date from 1933. What we have done here is translate them into a more modern setting.

These axioms may seem complicated, but they are quite short, and they have a lot of consequences which you may have learned about when studying probability previously. We look at these in the following section.

Tip
You are not expected to fully understand the definition of a probability space, in particular that of a σ-algebra, and in practice it is certainly sufficient to understand the examples given in the text. The formal definition is included to demonstrate that mathematics is built entirely using formal definitions.

¹⁰Note that some authors write a disjoint union using the addition symbol +, and ∑ for infinite such unions, but we do not adopt that practice here in case it causes confusion.
¹¹Note that below appears a potentially infinite sum, that is, a sum which adds infinitely many numbers. We do not discuss these situations in general in this unit. We say a bit more about how to think of this rule in Definition 30 below.

Many students lose marks when asked to give a probability space because they describe the outcomes and their probabilities but neglect to mention the events. Study the examples in Section 4.2 until you are sure you can always identify the set of events.
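As a minimal, concrete finite instance of Definition 28, the following sketch (all names are our own) builds a probability space for a fair die: the sample set, the powerset as the set of events, and a probability distribution that is additive on disjoint events.

```python
from fractions import Fraction
from itertools import chain, combinations

# A finite instance of Definition 28: a fair die with sample set S,
# events = all subsets of S, and P(A) = |A| / 6.
S = frozenset(range(1, 7))
events = {frozenset(c) for c in chain.from_iterable(
    combinations(sorted(S), r) for r in range(len(S) + 1))}

def P(A):
    return Fraction(len(A), len(S))

assert P(S) == 1                       # first condition: P(S) = 1
# second condition, on a pair of disjoint events:
A, B = frozenset({1, 2}), frozenset({5})
assert A & B == frozenset() and P(A | B) == P(A) + P(B)
```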
The following optional exercises invite you to understand more about the formal definition of a probability space.

Optional Exercise 8. In the definition of a probability distribution we can see an infinite sum. Under which circumstances does it make sense to write something like that? Try to find a probability distribution for the natural numbers, with 𝒫ℕ as the set of events. Hint: It is sufficient to give probabilities for events of the form {n}.

Optional Exercise 9. Assume you want to find a probability distribution for the sample space [0, 1] with a σ-algebra which contains all sets of the form {x} as events. What can you say about the probabilities of these sets?

Optional Exercise 10. Assume you are given the sample set [0, 1] and you know that every interval in [0, 1] is an element of the σ-algebra ℰ. Further assume that you are given a probability distribution on ℰ which maps every interval [a, a′] in [0, 1] to a′ − a. Convince yourself that these data satisfy the conditions for a probability space. What do you think should be the probability of the interval (a, a′)?

Example 4.23. We continue Example 4.19.
• S = {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12} and
• ℰ = 𝒫S,
but what is the probability distribution we should use here? Since every subset of S can be written as a disjoint union of sets containing one element each, the second condition for probability distributions tells us that it is sufficient to know the probability for each outcome since, for example,
P({2, 3, 4, 5}) = P({2} ∪ {3} ∪ {4} ∪ {5}) = P({2}) + P({3}) + P({4}) + P({5}).
This still leaves us with the question of what P({2}), P({3}), and so on, should be.

If we look at our sample space more closely we find that it itself can be viewed as a collection of simpler events. If we look at the outcome 'the sum of the eyes shown by the two dice is 4' then we see that this is a complex event: Assume we have a red die and a blue die, then the following combinations will give the sum of four (giving the red die followed by the blue one): 1 and 3, 2 and 2, 3 and 1.
So we might instead decide that our sample space should look different, to make the outcomes as simple as possible and so make it easier to determine their probabilities. We record the result of throwing the two dice simultaneously as a pair (r, b), where the first component tells us the value of the red die and the second the value of the blue die. Then our new sample space becomes
{(r, b) | 1 ≤ r, b ≤ 6}.
If we assume that our two dice are both 'fair', that is, every number appears with equal probability, then the probability of throwing, say, a three with the red die will be 1/6, as will be the probability for all the other possible outcomes from 1 to 6. The same is true for the blue die. If we now assume that throwing the red die has no effect on the blue die¹² then the probability of each possible outcome¹³ (r, b) is
1/6 · 1/6 = 1/36.
The outcomes in our previous sample space are now events in the new space, and the probability that the sum thrown is 4, for example, (the old event {4}) is given by the new event {(1, 3), (2, 2), (3, 1)}, and its probability is the sum of the probabilities for each singleton, that is
P({(1, 3), (2, 2), (3, 1)}) = P({(1, 3)}) + P({(2, 2)}) + P({(3, 1)}) = 1/36 + 1/36 + 1/36 = 3/36 = 1/12.
For completeness' sake we give a full description of both probability spaces. Because the set of events is the powerset of the sample set it is sufficient to give the probability of each outcome. We begin by describing the second probability space in the somewhat boring table below, where the probability for the outcome (r, b) is the entry in the row labelled r and the column labelled b.
r∖b   1     2     3     4     5     6
1    1/36  1/36  1/36  1/36  1/36  1/36
2    1/36  1/36  1/36  1/36  1/36  1/36
3    1/36  1/36  1/36  1/36  1/36  1/36
4    1/36  1/36  1/36  1/36  1/36  1/36
5    1/36  1/36  1/36  1/36  1/36  1/36
6    1/36  1/36  1/36  1/36  1/36  1/36

The outcome s from the original space can be thought of as an event in the new space, namely
{(r, b) ∈ {1, 2, 3, 4, 5, 6}² | r + b = s},
and the probability of outcome s in the original space is equal to the probability of the corresponding event in the new space. Below we give a table that translates the outcomes from our first sample space to events for the second sample space.

2      3      4      5      6      7      8      9      10     11     12
(1,1)  (1,2)  (1,3)  (1,4)  (1,5)  (1,6)  (2,6)  (3,6)  (4,6)  (5,6)  (6,6)
       (2,1)  (2,2)  (2,3)  (2,4)  (2,5)  (3,5)  (4,5)  (5,5)  (6,5)
              (3,1)  (3,2)  (3,3)  (3,4)  (4,4)  (5,4)  (6,4)
                     (4,1)  (4,2)  (4,3)  (5,3)  (6,3)
                            (5,1)  (5,2)  (6,2)
                                   (6,1)

Hence the original probability space has a probability distribution determined by the following table:

s       2     3     4     5     6     7     8     9    10    11    12
P({s}) 1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

There is a third sample space one could use here: As outcomes use an ordered list [l, h] of numbers, to mean 'the die with the lower number shows l and the die with the higher number shows h'. The whole sample space is then
{[l, h] | l, h ∈ {1, 2, 3, 4, 5, 6}, l ≤ h},
and we give the probabilities for those outcomes below. We use the lower number to determine the row and the higher number for the column.

l∖h   1     2     3     4     5     6
1    1/36  2/36  2/36  2/36  2/36  2/36
2          1/36  2/36  2/36  2/36  2/36
3                1/36  2/36  2/36  2/36
4                      1/36  2/36  2/36
5                            1/36  2/36
6                                  1/36

In summary, we have given here three probability spaces that describe the given situation, with different underlying sample spaces. We learn from this example that there may be more than one suitable sample space, and that by making the possible outcomes as simple as possible we may find their probabilities easier to determine. If the sample space is finite then

¹²This property is known as independence, see Definition 30.
¹³Compare Example 4.78.
calculating the probability of any event amounts to adding up the probabilities for the individual outcomes.

Exercise 61. Assume you have two dice, one red, one blue, that show a number from 1 to 3 with equal probability. You wish to calculate the probabilities for the numbers that can occur when subtracting the number shown by the blue die from that shown by the red one. For example, if the red die shows 2 and the blue die shows 3, the number to be calculated is −1. Give two probability spaces that describe this situation and describe how to calculate the probabilities asked for in each case. Hint: If you are finding this difficult then read on to the next section, which contains more worked examples.

So picking a suitable sample space for the problem that one tries to solve is important. It is not unusual to have a number of candidates, but some of them will be easier to describe correctly than others.

4.2.3 Discrete probability distributions

The above example suggests the idea of the following result. It tells you that if you have a finite sample space then describing a probability space for it can be quite easy.

Proposition 4.2
Let S be a finite set.
(i) If for each s ∈ S we have the probability pₛ ∈ [0, 1] that s occurs, and the sum of these probabilities is 1, then a probability space is given by
• the sample space S,
• the set of events 𝒫S, the powerset of S,
• the probability distribution given by
{s₁, s₂, . . . , sₙ} ↦ p_{s₁} + p_{s₂} + · · · + p_{sₙ},
where n ∈ ℕ and s₁, s₂, . . . , sₙ ∈ S, which means that for every subset A of S the probability of A is given by
P(A) = ∑_{s∈A} pₛ.
Moreover this is the only probability space where
• all sets of the form {s} are events and
• the probability of the event {s} occurring is pₛ.
(ii) If (S, ℰ, P) is a probability space with the property that {s} ∈ ℰ for all s ∈ S, then
• ℰ = 𝒫S and
• we may read off the probability that any given outcome s occurs by considering P({s}).

Proof.
(i) We have already stated that the powerset of any set is a σ-algebra, so it is sufficient to check that the probability distribution we selected satisfies the required properties.
• We note that the way we have defined the probability distribution, the probability of S is the sum of the probabilities of all the outcomes, and the assumption explicitly stated is that this adds up to 1, so P(S) = 1.
• If we have pairwise disjoint events Aᵢ for i ∈ ℕ then the probability of ⋃_{i∈ℕ} Aᵢ is the sum of all the probabilities of elements in this set. But if the Aᵢ are pairwise disjoint then each element of ⋃_{i∈ℕ} Aᵢ occurs in exactly one of the Aᵢ, and so
P(⋃_{i∈ℕ} Aᵢ) = ∑_{s ∈ ⋃_{i∈ℕ} Aᵢ} pₛ      (def P)
            = ∑_{i∈ℕ} ∑_{s∈Aᵢ} pₛ        (pairwise disjoint)
            = ∑_{i∈ℕ} P(Aᵢ)              (def P).
(ii) The second statement really has only one property that we need to prove, namely that 𝒫S is the set of events for the given space. But if S is finite, and all sets of the form {s} are events, then for an arbitrary subset S′ of S we can list the elements, for example S′ = {s₁, s₂, . . . , sₙ}, and by setting
Aᵢ = {sᵢ} for 1 ≤ i ≤ n, and Aᵢ = ∅ else,
we have events Aᵢ for i ∈ ℕ with the property that
S′ = ⋃_{i∈ℕ} Aᵢ,
and since ℰ is a σ-algebra we know that S′ ∈ ℰ. Hence every subset of S is an event, and so ℰ = 𝒫S.

Tip
This proposition says that in order to describe a probability space with a finite sample space all we have to do is to
• describe the sample space S;
• say the σ-algebra ℰ is 𝒫S;
• give the probability for each outcome from S and state that the probability for each event is given by the sum of the probabilities of its elements.

Example 4.24. Throwing a single die can be described by the probability space given by
• S = {1, 2, 3, 4, 5, 6};
• ℰ = 𝒫S;
• the probability distribution assigns the probability of 1/6 to each outcome; the probability of an event is given by the sum of the probabilities of its elements.
In practice we often leave out the last statement, or shorten it.

Example 4.25.
Tossing a coin can be modelled by a probability space with
• sample set S = {H, T},
• set of events ℰ = 𝒫S and
• a probability distribution determined by the fact that each outcome occurs with probability 1/2.

Example 4.26. The probability space underlying Exercise 53 has as its underlying sample space the set of two-element subsets of a set of four cards, two hearts (♡) and two spades (♠); there are six such subsets, and the probability for each outcome is 1/6. The probability distribution is derived from this in the usual way.

Proposition 4.3
If we have a sample set S = {sₙ | n ∈ ℕ}, then a probability space is uniquely determined by assigning to each element sₙ of S a probability pₙ in [0, 1] such that
∑_{n∈ℕ} pₙ = 1.

Optional Exercise 11. Can you take the proof of Proposition 4.2 and turn it into one for Proposition 4.3?

Example 4.27. Assume we toss a coin until we see head for the first time, compare Exercise 59. To describe a probability space for this situation we pick the sample set
{1, 2, 3, . . .} = ℕ ∖ {0},
which tells us how many times we tossed the coin until heads appeared. Again we may choose the powerset of this set as the set of events. The probability for each of these outcomes is given in the following table:

n    1    2    3    4     5     6    . . .
pₙ  1/2  1/4  1/8  1/16  1/32  1/64  . . .

which means that the probability of the outcome n is pₙ = 1/2ⁿ. It is the case (but a proof is beyond the scope of this unit) that
∑_{n≥1} pₙ = ∑_{n≥1} 1/2ⁿ = 1,
so this distribution satisfies the requirements from Proposition 4.3.

CExercise 62. Find probability spaces to describe the various situations from Exercise 51 (a)–(l) and Exercises 57 to 59. Note that your space should describe the general situation from the question; the specific probabilities you were asked to calculate in those exercises do not matter now. It is fine to describe these in text where you find it difficult to use set and function notation.

4.2.4 Continuous probability distributions

Sometimes it is more appropriate to have a continuous description of a problem.
This is often the case when we are plotting events over time. Note that we can only talk about 'continuous behaviour' if we may use a sub-interval of the real numbers to describe the outcome of our probability space. See Definition 34 for a formal definition of what we mean by the discrete versus continuous case here.

Example 4.28. The following curve of a function F might describe¹⁴ the probability that a piece of hardware will have failed by time t. (Figure: a curve rising from 0 and approaching 1 as time increases.)

As time progresses the probability of the component having failed approaches 1. But how do we turn this kind of function into a probability space? We need to identify a set of events, and we need to be able to derive the probability of each event. What we know is how to read off the probability that our device will have failed in the time interval from 0 to t: That probability is given by F(t). This is, in fact, known as a cumulative probability distribution: As time progresses the probability becomes higher and higher because the time interval covered becomes bigger and bigger.

In order to give a probability space we need the probability density function f, which tells us the probability of the device failing at time t. For the function above this is given by the function plotted below. (Figure: the density f, a curve starting high and decreasing towards 0 over time.)

The relationship between the two functions is that for all t in ℝ⁺ we have
F(t) = ∫₀ᵗ f(x) dx.
The reason for this becomes clear in Section 4.4.6.

It is possible to give a probability space based on the real numbers, but the precise description is quite complicated. For completeness' sake we note the following two facts.

Fact 11
There is a σ-algebra ℰ on the set of real numbers ℝ, known as the Borel σ-algebra, with the property that
• all intervals [a, a′], where a, a′ ∈ ℝ, are elements of ℰ.
Let I be any interval in ℝ. Then we can restrict the Borel σ-algebra to this interval to obtain another σ-algebra ℰ_I by setting
ℰ_I = {B ∩ I | B ∈ ℰ}.

Fact 12
Let [a, a′] be an interval in ℝ with a < a′.
There is a probability distribution¹⁵ P which gives a probability space ([a, a′], ℰ_{[a,a′]}, P) with the property that for any interval [b, b′] in [a, a′] we have
P([b, b′]) = (b′ − b) / (a′ − a).

The probability space for Example 4.28 is then given by (ℝ⁺, ℰ_{ℝ⁺}, P) where
• ℰ_{ℝ⁺} is the restriction of the Borel σ-algebra from Fact 11 and
• the probability distribution P is determined by the fact that it satisfies, for all t ≤ t′ in ℝ⁺,
P([t, t′]) = ∫_t^{t′} f(x) dx.

Whenever you are asked to define a continuous probability space you may assume that
• you may use the Borel σ-algebra adjusted as in the above example and
• we can calculate a probability distribution for this σ-algebra from any probability density function (see the Definition below).
So it is sufficient for you to give a probability density function in this case.

Definition 29: probability density function
Let I be a sub-interval of the real numbers. A probability density function for I is given by a function f : I → ℝ⁺ with the property that
∫_I f(x) dx = 1,
and such that ∫_b^{b′} f(x) dx exists for all b ≤ b′ in I.

Tip
It might seem odd that intervals play a role in calculating probabilities. Recall¹⁶ that the integral from some b to some b′ over a function f is the area under the curve given by f from b to b′. This is a generalization of adding up all the probabilities of outcomes, but explaining this properly requires too much advanced maths.¹⁷ So my tip is to just treat the integrals as given, and not worry too much about why that makes sense.

Example 4.29. Assume you have travelled to Yellowstone National Park and want to see the famous geyser 'Old Regular' erupt. You know¹⁸ that it does so every ninety minutes. You are pressed for time, and when you arrive you know you can only stay for twenty minutes. What is the probability that you will see the geyser erupt in that time?

¹⁴For an actual piece of hardware one would prefer it if the probability were to rise more slowly at first!
¹⁵This is based on the Borel measure.
We can describe this situation using the fact that we know that as time goes from 0 to 90 minutes the probability of seeing the geyser erupt rises steadily towards 1. The set of events is the Borel σ-algebra restricted to the interval from 0 to 90. (Figure: the cumulative distribution function, a straight line rising from 0 at time 0 to 1 at time 90.)

We are looking for a function f (the probability density function for the cumulative distribution function in the graph above) on the interval from 0 to 90 minutes with the property that for all times t with 0 ≤ t ≤ 90 we have
F(t) = ∫₀ᵗ f(x) dx.
Solving this tells us that f is constant, and it only remains to calculate the constant, which comes from the constraint that the integral over f from 0 to 90 must be 1. Assume that f(x) = c for all 0 ≤ x ≤ 90. Then we need
1 = ∫₀⁹⁰ f(x) dx = c · 90,
so we must have c = 1/90. (Figure: the constant density f(x) = 1/90 on the interval from 0 to 90.)

We can see that it does not matter when exactly you arrive: the distribution is uniform. So if you arrive at time t and stay for 20 minutes then the probability that you will see the geyser erupt is given by the integral from t to t + 20 over the function f, which is given by the shaded area. (Figure: the density with the area between t and t + 20 shaded.)

This means that the desired probability, for the case t = 45, is
∫₄₅⁶⁵ (1/90) dx = (1/90)(65 − 45) = 20/90 ≈ 0.22,
so the probability is just over 22%.

Tip
Whenever you have to describe a probability space whose set of outcomes is an interval in ℝ you should choose the Borel σ-algebra restricted to that interval as your set of events.

¹⁶If you have not covered integrals in school then the following fact should get you through most of the two extended exercises that require integrals.
¹⁷But see Example 4.82!
¹⁸The most famous geyser that actually exists there, known as Old Faithful, does not erupt as regularly as my imaginary example.
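The geyser calculation can be reproduced numerically. The sketch below (the names `f` and `integrate` are our own) approximates the integrals with a midpoint Riemann sum; for the constant density of Example 4.29 the sum is exact up to rounding.

```python
# Example 4.29 numerically: uniform density f(x) = 1/90 on [0, 90].
def f(x):
    return 1 / 90 if 0 <= x <= 90 else 0.0

def integrate(g, a, b, steps=100_000):
    """Midpoint Riemann sum approximating the integral of g over [a, b]."""
    h = (b - a) / steps
    return sum(g(a + (i + 0.5) * h) for i in range(steps)) * h

assert abs(integrate(f, 0, 90) - 1.0) < 1e-9        # a valid density
assert abs(integrate(f, 45, 65) - 20 / 90) < 1e-9   # arriving at t = 45
```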
In the unit on data science, COMP13212, you will see quite a few plots of either a probability density function or a cumulative mass function (called a cumulative distribution function there), for example in the lecture on hypotheses and how to test them.

EExercise 63. Describe probability density functions for the following situations:
(a) It is known that the probability of a component having failed rises from 0 to 1 over the time interval from 0 to 1 unit of time at a constant rate.
(b) A bacterium lives for two hours. It is known that its chance of dying in any 10 minute interval during those two hours is the same. What do you think the probability density function should be?
(c) Assume you have an animal which lives in a one-dimensional space described by the real line ℝ. Assume that its den is at 0, that the probability density function has some value c at that point, and that it falls at a constant rate and reaches 0 when the animal is one unit away from its den. Give the probability density function for this situation. What does the corresponding cumulative probability distribution look like in this case? (If you think an animal living in a one-dimensional space is a bit limiting you can instead think of this as expressing the animal's east/west (or north/south) distance from its den in a space of two dimensions.)
(d) Try to extend the previous part to an animal that lives in a two-dimensional space described by the real plane, ℝ × ℝ.

4.2.5 Consequences from Kolmogorov's axioms

The axioms from Definition 28 have a number of consequences that are useful to know about. We look at them one by one here and summarize them in a table at the end of the section.

The empty set
• The empty set ∅ is an event: Definition 28 says that if A is an event then so is S ∖ A. Since S is an event this means that S ∖ S = ∅ is an event.
• Now that we know that ∅ is an event we may calculate its probability as follows.
1 = P(S)             (Definition 28)
  = P(S ∪ ∅)         (S = S ∪ ∅)
  = P(S) + P(∅)      (S and ∅ disjoint, Def 28)
  = 1 + P(∅),
and so P(∅) = 1 − 1 = 0.

Intersection
If we know that A and B are events, what can we say about A ∩ B? We note that there is nothing in the axioms that talks about intersections. But it turns out that we can use the axioms to argue that the intersection is an event. We calculate¹⁹
A ∩ B = S ∖ ((S ∖ A) ∪ (S ∖ B)).
In the following diagram (S ∖ A) ∪ (S ∖ B) is the coloured area, and the white part is its complement, that is the desired set. (Venn diagram: A and B, with everything outside A ∩ B coloured.)
Since the complement of an event is an event we know that S ∖ A and S ∖ B are events, and we have seen that the union of a finite number of events is another event.²⁰
In general there is no way of calculating the probability of A ∩ B from the probabilities of A and B. When the two events are independent then this situation changes, see Definition 30. We may summarize this as follows:
• If A and B are events then so is their intersection A ∩ B.
• There is no general way of calculating the probability of A ∩ B from those of A and B.

Complement and relative complement
We begin by looking at the complement of a set.
• If A is an event then we know that its complement S ∖ A is also an event.
• We also know that a set and its complement are disjoint sets whose union is S. Hence we know that
1 = P(S)                (Definition 28)
  = P(A ∪ (S ∖ A))      (S = A ∪ (S ∖ A))
  = P(A) + P(S ∖ A)     (A, S ∖ A disjoint, Def 28),
and so²¹
P(S ∖ A) = 1 − P(A).
More generally, assume we have events A and B. The picture shows A in red and S ∖ B in pale blue, with violet giving the overlap. The set whose probability we wish to compute is that overlap, the darkest set in the following picture. (Venn diagram: A and B with A ∖ B as the darkest region.)
We would like to argue that A ∖ B is an event.

¹⁹See Exercise 7.
²⁰Note that we can also show that the countable intersection of events is an event by generalizing this idea.
²¹Some people write this as P(¬A) = 1 − P(A) or P(Aᶜ) = 1 − P(A), but we do not use that notation here.
We note that the definition of a σ-algebra tells us that since S ∖ A and B are events we may form another event in the form of (S ∖ A) ∪ B, and so we get an event when forming (compare Exercise 7 for the trick we employ here)
S ∖ ((S ∖ A) ∪ B) = A ∩ (S ∖ B) = A ∖ B.
After all this preparation we may now split the event A into two disjoint events, namely
A = (A ∖ B) ∪ (A ∩ B),
and so (compare Exercise 60)
P(A) = P((A ∖ B) ∪ (A ∩ B))
     = P(A ∖ B) + P(A ∩ B)     (A ∖ B, A ∩ B disjoint, Def 28),
which gives us
P(A ∖ B) = P(A) − P(A ∩ B).
We may summarize this by saying the following.
• If A and B are events then so is A ∖ B.
• We have P(A ∖ B) = P(A) − P(A ∩ B).

Union
If we want to calculate the probability of the union of two events then in order to apply Kolmogorov's axiom we must write it as the union of disjoint events. The Venn diagram for two non-disjoint sets looks like this: (Venn diagram: A ∪ B with overlapping regions.)
We can see that if we want to write A ∪ B as a disjoint union we have to pick, for example, the red and violet regions, which make up A, and the blue region, which is B ∖ A, and write
A ∪ B = A ∪ (B ∖ A).
With the result for the relative complement we get
P(A ∪ B) = P(A ∪ (B ∖ A))
        = P(A) + P(B ∖ A)           (A, B ∖ A disjoint, Def 28)
        = P(A) + P(B) − P(A ∩ B)    (P(B ∖ A) = P(B) − P(A ∩ B)).
In summary we can say that
• if A and B are events then so is A ∪ B and
• we have P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
Note that if A and B do not overlap then A ∩ B = ∅ and P(A ∪ B) = P(A) + P(B).

Order preservation
Assume we have two events A and B with the property that A is a subset of B. What can we say about the probabilities of A and B? Certainly we can see that B is the disjoint union of A and B ∖ A, and so
P(B) = P(A) + P(B ∖ A).
Since the probability of B ∖ A is greater than or equal to 0 we must have P(A) ≤ P(B).

Summary
We give all the rules derived above. Let (S, ℰ, P) be a probability space, and let A and B be events. Then the following rules hold.
P(S) = 1
P(∅) = 0
P(S ∖ A) = 1 − P(A)
P(A ∖ B) = P(A) − P(A ∩ B)
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
A ⊆ B implies P(A) ≤ P(B).
It may be worth pointing out that these conditions hold for all probability spaces, in particular they also hold for the case where we are given a probability density function.
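The summary rules can be checked exhaustively on a finite probability space, for instance the two-dice space of Example 4.23. The sketch below uses our own names and two illustrative events:

```python
from fractions import Fraction
from itertools import product

# The 36 equally likely outcomes of two fair dice, as in Example 4.23.
S = set(product(range(1, 7), repeat=2))

def P(event):
    return Fraction(len(event), len(S))

A = {(r, b) for (r, b) in S if r + b == 7}   # 'the sum is 7'
B = {(r, b) for (r, b) in S if r == 1}       # 'the red die shows 1'

assert P(S) == 1 and P(set()) == 0
assert P(S - A) == 1 - P(A)                      # complement rule
assert P(A - B) == P(A) - P(A & B)               # relative complement rule
assert P(A | B) == P(A) + P(B) - P(A & B)        # union rule
assert (A & B) <= A and P(A & B) <= P(A)         # order preservation
```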
The first two conditions are trivially true, and the others are standard properties of integrals.

Optional Exercise 12. Convince yourself that the various equalities hold if the probability distribution is given by a probability density function. You may want to draw some pictures for this purpose.

4.2.6 Kolmogorov's axioms revisited

How should we think of the Kolmogorov axioms? The definition of a σ-algebra is something of a formality that ensures that the sets for which we have a probability (namely the events) allow us to carry out operations on them. We may think of the probability distribution as a way of splitting the probability of 1 (which applies to the whole set S) into parts (namely those subsets of S which are events).

If S is finite then we only have to know how the probability of 1 is split among the elements of S, and then we can assign a probability to each subset of S by adding up all the probabilities of its elements. This becomes significantly more complicated if the set S is infinite.

Proposition 4.4
If S is an infinite set then there is no probability distribution which assigns the same probability to each event {s} of 𝒫S.

The simplest infinite set we have met is the set of natural numbers ℕ. If we had a probability distribution on ℕ which assigned a fixed probability p ∈ [0, 1] to each element then it would have to be the case that the sum of all these probabilities is 1, that is
∑_{n∈ℕ} p = ∑_{n∈ℕ} P({n}) = 1,
and there is no real number p with that property.

Note that the probability space defined in Fact 12 is uniform in that it assigns the same probability to intervals of the same length. So it is possible to distribute probability uniformly in two cases:
• the sample set is finite, in which case we may assign the same probability, 1/|S|, to each outcome, or
• the sample set is an interval in ℝ, in which case the probability of any one outcome is 0, but intervals can have non-0 probabilities which are determined by their length.
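The argument under Proposition 4.4 can be illustrated numerically: any fixed positive probability p per natural number makes the partial sums grow past 1, while p = 0 sums to 0. This is an illustration with our own names, not a proof.

```python
from fractions import Fraction

# Assign the same probability p > 0 to every natural number: the partial
# sums n * p grow without bound, so they can never settle at 1.
p = Fraction(1, 1000)
partial = [n * p for n in (1000, 2000, 10_000)]
assert partial == [1, 2, 10]      # already past 1 and still growing

# And with p = 0 the total is 0, not 1.
assert sum(0 for _ in range(10_000)) == 0
```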
However, there is no way of taking all the subsets of ℝ (or of any interval I), and turning that into a probability space.

Proposition 4.5
Let I be an interval on the real line. There is no probability distribution P with the property that (I, 𝒫I, P) is a probability space which maps intervals of the same size to the same probability.

This proposition explains why we cannot have a simpler definition of probability space, where the set of events is always the powerset of the sample space.

4.3 Conditional probabilities and independence

One of the questions that appears frequently in the context of probability theory is that of how information can be used. In other words, can we say something more specific if we already know something about the situation at hand? This section is concerned with describing how we may use the axioms of probability to make this work.

4.3.1 Independence of events

Kolmogorov's axioms are not strong enough to allow us to calculate the probability of A ∩ B if we know the probabilities of A and B. This section sheds some light on the question of why there cannot be a general formula that does this.

When we throw two dice, one after the other, or when we throw a coin repeatedly, we are used to a convenient way of calculating the corresponding probabilities for the outcomes.

Example 4.30. Assume we record the outcome of a coin toss with H for head and T for tails. We assume the coin is fair and so the probability for each is 1/2. If we toss the coin twice then the possible outcomes are HH, HT, TH and TT, and the probability for each is 1/4. We may calculate the probability of HH, that is the first coin toss C₁ coming up H, and the second, C₂, likewise, as follows:
P((C₁ = H) ∩ (C₂ = H)) = P(C₁ = H) · P(C₂ = H) = 1/2 · 1/2 = 1/4.
But it is not safe to assume that for general events A and B we have that the probability of A ∩ B can be calculated by multiplying the probabilities of A and B, see Example 4.31.
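Example 4.30 can be checked by brute-force enumeration over the four equally likely outcomes; a small sketch with our own names:

```python
from fractions import Fraction
from itertools import product

# Two fair coin tosses: four equally likely outcomes. The product rule
# holds here because the tosses are modelled as not affecting each other.
S = set(product('HT', repeat=2))

def P(event):
    return Fraction(len(event), len(S))

first_H  = {w for w in S if w[0] == 'H'}
second_H = {w for w in S if w[1] == 'H'}
assert P(first_H & second_H) == P(first_H) * P(second_H) == Fraction(1, 4)
```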
Definition 30: independent events
Given a probability space (S, ℰ, P) we say that two events A and B are independent if and only if
P(A ∩ B) = P(A) · P(B).

What we mean by 'independent' here is that neither event has an effect on the other. We assume that when we throw a coin multiple times the outcome of one toss has no effect on the outcome of the next, and similarly for dice. We look at this issue again in Section 4.4.5 when we have random processes which are more easily described. In particular we talk about independence for processes with a continuous probability distribution.

Example 4.31. Let us look at a situation where we have events which are not independent. In Example 4.13 we discussed pulling socks from a drawer. We assume that we have a drawer with three red and three black socks from which we draw one sock at a time without looking inside. If you pick a red sock on the first draw, then the probability of finding a red sock on the second draw is changed.

The probability of drawing a red sock on the first attempt is P(S₁ = r) = 1/2, but what about the probability of drawing a red sock on the second attempt? Again it is best if we look at the tree that shows us how the draw progresses. (Tree: the first sock is red or black with probability 1/2 each; after a red first sock the second is red with probability 2/5 and black with 3/5, and after a black first sock the second is red with probability 3/5 and black with 2/5.)

We can see that the probability of drawing a red sock on the second attempt is
P(S₂ = r) = 1/2 · 2/5 + 1/2 · 3/5 = (2 + 3)/10 = 1/2.
But we can also see from the tree that
P((S₁ = r) ∩ (S₂ = r)) = 1/2 · 2/5 = 1/5,
which is not equal to
P(S₁ = r) · P(S₂ = r) = 1/2 · 1/2 = 1/4,
so (very much as expected) the two events are not independent.

Example 4.32. A more serious example is as follows. SIDS, or 'Sudden Infant Death Syndrome', refers to what is also known as 'cot death': young children die for no reason that can be ascertained. In 1999 an 'expert witness' told the court that the approximate probability of a child of an affluent family dying that way is one in 8500.
Since two children in the same family had died this way, the expert argued, the probability was one in 73 million that this would occur, and a jury convicted a young woman called Sally Clark of the murder of her two sons, based largely on this assessment. The conviction was originally upheld on appeal, but overturned on a second appeal a few years later. While Clark was released after three years in prison she later suffered from depression and died from alcohol poisoning a few years after that.

What was wrong with the expert's opinion? The number of 1 in 73 million came from multiplying 8500 with itself (although 72 million would have been more accurate), that is, arguing that if the probability of one child dying in this way is 1/8500, then the probability of two children dying in this way is 1/8500 · 1/8500.

But we may only multiply the two probabilities if the two events are independent, that is, if the death of a second child cannot possibly be related to the death of the first one. This explicitly assumes that there is no genetic or environmental component to SIDS, or that there may not be other circumstances which make a second death in the same family more likely. Since then, data have been studied that show that the assumption of the independence of two occurrences appears to be wrong.

While there were other issues with the original conviction, it is shocking that such evidence could be given by a medical expert without anybody realizing there was a fallacy involved. I hope that this example illustrates why it is important to be clear about the assumptions one makes, and to check whether these can be justified.

Note that if we know that two events are independent then we may derive from that the independence of other events.

Example 4.33. If A and B are independent events in a probability space with sample set S then A and S ∖ B are also independent. To prove this we have to work out the probability of the intersection of the two events.
We calculate

P(A ∩ (S ∖ B)) = P(A ∖ B)              A ∩ (S ∖ B) = A ∖ B
             = P(A) − P(A ∩ B)        Summary of Section 4.2.5
             = P(A) − P(A) · P(B)     A and B independent
             = P(A)(1 − P(B))         arithmetic
             = P(A) · P(S ∖ B)        P(S ∖ B) = 1 − P(B)

which establishes that the two given events are indeed independent.

Exercise 64. Show that if A and B are independent then so are S ∖ A and S ∖ B.

A common fallacy is to assume that two events being independent has something to do with them being disjoint, that is, there not being an outcome that belongs to both. The following exercise discusses why this is far from the truth.

Exercise 65. Assume that you have a probability space with two events A and B such that A and B are disjoint, that is A ∩ B = ∅. What can you say about P(A), P(B) and P(A ∩ B) under the circumstances? What can you say if you are told that A and B are independent? Give a sufficient and necessary condition for two disjoint events to be independent.

4.3.2 Conditional information

For example, if I have to guess the colour of somebody's eyes, but I already know something about the colour of their hair, then I can use that information to guide my choice.

Example 4.34. Let us assume we have a particular part of the population where 56% have dark hair and brown eyes, 14% have dark hair and blue eyes, 3% have fair hair and brown eyes and 27% have fair hair and blue eyes. If I know a person has been randomly picked from the population, and I have to guess the colour of their eyes, what should I say to have the best chance of being right?

              brown eyes   blue eyes
dark haired      56%          14%
fair haired       3%          27%

We can see from the numbers given that we are better off guessing brown (lacking additional information). But what if we can see that the person in question has fair hair? In that case we are better off guessing blue. What is the appropriate way of expressing these probabilities? This example is continued below.

What we are doing here can be pictured by assuming that in the sample space we have two sets, A and B. We are interested in the probability of A (say blue eye colour) already knowing that B holds (say fair hair).
In the picture above this means the probability that we are in the red set A, provided we already know that we are in the blue set B. What we are doing effectively is to change the sample space to B, and we want to know the probability of A ∩ B.

Proposition 4.6
If (S, ℰ, P) is a probability space and B an event with non-zero probability then a probability space is given by the following data:
• sample set B,
• set of events {B ∩ E | E ∈ ℰ},
• probability distribution P′ defined by B ∩ E ↦ P(B ∩ E)/P(B).

We can think of the new space as a restriction of the old space with sample set S to a new space with sample set B, where we have redistributed the probability entirely to the set B, and adjusted all the other probabilities accordingly.

Optional Exercise 13. Define a probability space that is an alternative to the one given in Proposition 4.6. Again assume that you have a probability space (S, ℰ, P) and a subset B of S with non-zero probability. Use
• sample set S,
• set of events ℰ,
• a probability density function that assigns to every event of the form B ∩ E, where E ∈ ℰ, the same probability as the function given in said proposition.

Optional Exercise 14. Show that the new set of events in Proposition 4.6 is a σ-algebra.

Exercise 66. For the probability distribution P′ from Proposition 4.6 carry out the following:
(a) Calculate P′(B).
(b) For an event E ∈ ℰ calculate P′(B ∩ E).
(c) Show that P′ is a probability distribution.

Definition 31: conditional probability
Let (S, ℰ, P) be a probability space, and let A and B be events, where B has a non-zero probability. We say that the conditional probability of A given B is given as

P(A | B) = P(A ∩ B)/P(B).

It is the probability of the event A ∩ B in the probability space based on the restricted sample set B given by Proposition 4.6. Note that if P(B) = 0 then P(A | B) is not defined, no matter what A is.

Example 4.35.
Continuing Example 4.34 we can see that the probability that a randomly selected person has blue eyes, given that he or she has fair hair, is

P(blue eyes | fair hair) = P(blue eyes and fair hair)/P(fair hair) = .27/.3 = .9.

In other words, if I am presented with a randomly selected person whose hair I happen to know to be fair then by guessing their eye colour is blue I have a 90% chance of being correct. On the other hand, if I can see the person has dark hair, then the chance that they have brown eyes is

P(brown eyes | dark hair) = P(brown eyes and dark hair)/P(dark hair) = .56/.7 = .8.

Hence we can use conditional probabilities to take into account additional information we have been given before making a decision.

Example 4.36. If we revisit Example 4.14 we can see that what we calculated was the probability that we have the bag B given that we have seen a gold coin G. According to the above

P(B | G) = P(B ∩ G)/P(G) = (1/3)/(1/2) = 2/3,

just as we concluded on our first encounter of this example.

EExercise 67. Assume that (S, ℰ, P) is a probability space with events A, A′ and B. Further assume that P(B) ≠ 0.
(a) If you know that A ⊆ A′ what can you say about P(A | B) and P(A′ | B)?
(b) If you know that A ∩ B = ∅ what can you say about P(A | B)?
(c) If you know that A and B are independent what can you say about P(A | B)?
(d) If you know that B ⊆ A what can you say about P(A | B)?
(e) What is P(B | B)?
(f) How do P(A ∩ B) and P(A | B) compare?
In each case justify your answer.

Exercise 68. Assume you know a family with two children.
(a) If you know the family has at least one girl what is the chance that both children are girls?
(b) If we know that the family's firstborn was a girl, what is the probability that both children are girls?
You may assume that every birth yields a girl or a boy with equal probability.

CExercise 69. Go back to the game described in Exercise 53 whose probability space is given in Example 4.26.
For this exercise the game remains the same: I draw two cards from the four available ones, and then I randomly drop one of them. Below you are asked to answer a number of questions about the situation either directly after the draw, or after I have dropped a card.
(a) What is the probability that I have at least one ace after the draw?
(b) What is the probability that I have two aces after the draw?
(c) What is the probability that the dropped card is an ace?
(d) What is the probability that I have the ace of spades given that I dropped a queen?
(e) What is the probability that at the end of the game I have the ace of spades given that I dropped an ace?
(f) What is the probability that the dropped card was a queen given that at the end of the game I have the ace of spades?
(g) In the original exercise you were asked to calculate the probability that the dropped card was an ace given that at the end of the game I have the ace of spades. Express this using conditional probabilities and recalculate the answer.
(h) Consider the following narrative: After the deal you ask me whether one of my cards is an ace, and I answer in the affirmative. You then ask me to drop a card, and to make sure I keep an ace. What is the probability that the dropped card is an ace?

Exercise 70. Assume you have a probability space (S, ℰ, P), A and B are events, and you know the following:
• P(A) > 0, P(B) > 0, P(A ∩ B) > 0;
• P(A | B) = P(B | A) and
• P(A ∪ B) = 1.
Show that P(A) > 1/2. Why is the condition P(A ∩ B) > 0 needed?

Note that it does make sense to apply the same ideas in the case where we have a probability density function.

Example 4.37. Assume that you have a probability density function describing an animal's location. Further assume that the space in question is centred on the animal's den. Let's assume the animal is a fox, and that we know that its presence is influenced by the presence of another animal, say a lynx.
To make the situation simpler let's say that the fox avoids a circle around the lynx. If we assume the fox avoids the lynx completely then the fox being in the white area of its range, given there is a lynx at the centre of the white circle, has a probability of 0. But that means the 'mass' of probability that resided in the white area has to go somewhere else (since the overall probability that the fox is somewhere in the area has to be equal to 1)! In the above example we haven't got enough information to decide where it goes. Further, if the situation is more interesting, and the lynx only inhibits, but does not prevent, the fox's presence, the analysis is more complicated. We return to this question in Section 4.4.5 where we restrict how we think of the events that occur, which makes it substantially easier to mathematically describe the situation.

4.3.3 Equalities for conditional probabilities

From the definition of the conditional probability we may derive some useful equalities. Recall that the probability of an event conditional on another is defined only if the latter has a probability greater than 0, but the following equality is true even if that probability is 0:

P(A | B) · P(B) = P(A ∩ B),

which is also known as the multiplication law. Note that the expression on the right hand side is symmetric in A and B since A ∩ B = B ∩ A, and so we have

P(A | B) · P(B) = P(A ∩ B) = P(B | A) · P(A).

If we like we can use this equality to determine P(B | A) from P(A | B), provided that P(A) ≠ 0. The equality

P(B | A) = P(A | B) · P(B) / P(A)

is known as Bayes's Theorem. It allows us to compute the probability of B given A, provided we have the probabilities of A given B, of B, and of A.

Example 4.38. Revisiting Example 4.35 we have calculated the probability that a fair-haired person has blue eyes. What about the probability that a blue-eyed person has fair hair? Using Bayes's law we have

P(fair hair | blue eyes) = P(blue eyes | fair hair) · P(fair hair) / P(blue eyes) = .9 · .3 / .41 ≈ 65.9%.
On the other hand, the probability that a brown-eyed person has dark hair is

P(dark hair | brown eyes) = P(brown eyes | dark hair) · P(dark hair) / P(brown eyes) = .8 · .7 / .59 ≈ 95%.

There are further equalities based around conditional probabilities that can be useful in practice. Sometimes the sample space can be split into disjoint events, where we know something about those. In particular, given an event B we know that B and S ∖ B cover the whole sample space S. This means we know that (see Exercise 60)

A = (A ∩ B) ∪ (A ∩ (S ∖ B)),

and this is a disjoint union. By Kolmogorov's axioms given in Definition 28 this implies

P(A) = P((A ∩ B) ∪ (A ∩ (S ∖ B))) = P(A ∩ B) + P(A ∩ (S ∖ B)),

and if we use the multiplication law twice, and the properties of probability distributions as needed, then we obtain the following rule:

P(A) = P(A | B) · P(B) + P(A | S ∖ B) · P(S ∖ B)
     = P(A | B) · P(B) + P(A | S ∖ B) · (1 − P(B)).        (*)

This law is a special case of a more general one discussed below. But even this restricted version is useful, for example, when there is a given property and whether or not that property holds has an influence on whether a second property holds. Note that if we pick B so that its probability is either 0 or 1 then the law does not help us in calculating the probability of A.

Example 4.39. Assume that motherboards from different suppliers have been stored in such a way that it is no longer possible to tell which motherboard came from which supplier. Further assume that subsequently it has become clear that those from Supplier 1 (S1) have a 5% chance of being faulty, while that chance is 10% for ones from Supplier 2 (S2). It is known that 70% of supplies in the warehouse came from Supplier 1, and the remainder from Supplier 2. What is the probability that a randomly chosen motherboard is defective? The rule (*) from above tells us that

P(defect) = P(defect | from S1) · P(from S1) + P(defect | from S2) · P(from S2)
          = .05 · .7 + .1 · .3 = .065.

Example 4.40.
The following is an important case that applies to diagnostic testing in those cases where there is some error (certainly medical tests fall into this category). Assume a test is being carried out to determine whether some test subject suffers from an undesirable condition. From previous experience it is known that
• if the subject suffers from the condition then with a probability of .99 the test will show this correctly and
• if the subject does not have the condition then with a probability of .95 the test will show this correctly.
We assume that for an arbitrary member of the test population the chance of suffering from the condition is .00001. If a subject tests positive for the condition, what is the probability that they have the condition?

We would like to calculate P(has condition | test positive). We do not have this data given, but we do have P(test pos | has cond) and P(has cond). If we apply Bayes's theorem we get

P(has cond | test pos) = P(test pos | has cond) · P(has cond) / P(test pos).

We miss P(test pos), but we may use rule (*) above to calculate

P(test pos) = P(test pos | has cond) · P(has cond)
            + P(test pos | doesn't have cond) · P(doesn't have cond)
            = .99 · .00001 + .05 · .99999 ≈ .05.

So we may calculate the desired probability as

P(has cond | test pos) = P(test pos | has cond) · P(has cond) / P(test pos) ≈ .99 · .00001 / .05 ≈ .0002.

So if we run this test, and in the event it comes back positive there's only a .02% chance that the subject is ill, would we think this is a good test? The issue in this example is the extremely low probability that anybody has the condition at all. If we change the numbers and instead assume that the chance that an arbitrary member of the test population has the condition is .1 then we get

P(test pos) = P(test pos | has cond) · P(has cond)
            + P(test pos | doesn't have cond) · P(doesn't have cond)
            = .99 · .1 + .05 · .9 = .144

and

P(has cond | test pos) = P(test pos | has cond) · P(has cond) / P(test pos) = .99 · .1 / .144 = .6875.
So in this case the chance that a subject who tests positive has the condition is almost 69%. In general, when you are given the outcome of a test you should ideally also be given enough data to judge what that information means!

Example 4.41 (The Monty Hall problem, again). At this point we have all we need to revisit the Monty Hall problem to look at it from the point of view of using conditional probabilities. We wish to calculate

P(car behind door we switch to | goat behind door opened by Monty).

Let's introduce some shortcut notation:
• We write 'car behind s' for the event that the car is behind the door s we would switch to.
• We write 'goat at m' for the event that a goat has been revealed behind the door m that Monty opened.

The question then becomes how we may calculate the probability we are interested in. In particular, if we use the definition of conditional probabilities we need to know the probability that the car is behind door s and the goat is behind door m, which is no easier to determine. If instead we employ Bayes's law, we are looking for

P(goat at m | car behind s) · P(car behind s) / P(goat at m).

That looks more manageable since
• the probability that Monty reveals a goat behind door m given that the car is behind door s is 1, since we know the car is behind the door we would switch to, which means that we have picked a door with a goat behind it, and so Monty has no choice but to open door m to show us the other goat, and
• the probability that the car is behind door s is 1/3.

What about P(goat at m)? We may use the law of total probability to calculate that. So far we have the door s we would switch to, the door m that Monty has opened to reveal a goat, and we use c for the door we chose originally. We obtain

P(goat at m) = P(goat at m | car behind s) · P(car behind s)
             + P(goat at m | car behind m) · P(car behind m)
             + P(goat at m | car behind c) · P(car behind c).

What can we say about the numbers that appear here?
• The probability that the car is behind door s is 1/3, in which case there is exactly one door (namely m) that Monty can open to show a goat, so P(goat at m | car behind s) is 1.
• The probability that Monty opens door m to reveal a goat if the car is behind door m is 0.
• We know that the probability that the car is behind our originally chosen door c is 1/3, and in that case the probability that the goat is revealed behind door m is 1/2, since there are two doors with goats behind them that Monty could open.

Altogether we obtain

P(goat at m) = 1 · 1/3 + 0 + 1/2 · 1/3 = 2/6 + 1/6 = 3/6 = 1/2,

and overall we get

P(goat at m | car behind s) · P(car behind s) / P(goat at m) = (1 · 1/3) / (1/2) = 2/3

as we expect. Note that using conditional probabilities allowed us to reason about this situation without worrying about which of the doors is s, m or c!

Our rule (*) from above is a special case of a more general law. Instead of splitting the sample space into two disjoint sets, B and S ∖ B, we split it into more parts. If B1, B2, ..., Bn is a collection of pairwise disjoint events such that

A ⊆ B1 ∪ B2 ∪ ··· ∪ Bn

then it is the case (see Exercise 60) that

A = (A ∩ B1) ∪ (A ∩ B2) ∪ ··· ∪ (A ∩ Bn),

and by Kolmogorov's axioms given in Definition 28 we may use the fact that the A ∩ Bi, for 1 ≤ i ≤ n, are pairwise disjoint (again see Exercise 60) to calculate the probability of A as

P(A) = P(A ∩ B1) + P(A ∩ B2) + ··· + P(A ∩ Bn)                          Def
     = P(A | B1)P(B1) + P(A | B2)P(B2) + ··· + P(A | Bn)P(Bn)           mult law
     = Σ_{i=1}^n P(A | Bi) · P(Bi).

This is sometimes referred to as the law of total probability. The way to think about it is that if we split the event A into disjoint parts of the form A ∩ Bi, then the probability of A can be recovered from the probabilities of the parts, and the probabilities of these parts can be calculated using the multiplication law.
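The law of total probability and Bayes's theorem can be evaluated mechanically; the sketch below (Python, exact fractions) redoes the Monty Hall computation above. The function name and event ordering are my own choices, not notation from these notes.

```python
from fractions import Fraction

def total_probability(conditionals_and_priors):
    """Law of total probability: P(A) = sum_i P(A | B_i) * P(B_i)
    for pairwise disjoint events B_i covering A."""
    return sum(p_a_given_b * p_b for p_a_given_b, p_b in conditionals_and_priors)

third = Fraction(1, 3)
# P(goat at m), split over the three possible positions of the car:
# behind the switch door s, behind the opened door m, behind our door c.
p_goat_at_m = total_probability([
    (Fraction(1), third),     # car behind s: Monty must open m
    (Fraction(0), third),     # car behind m: Monty cannot open m
    (Fraction(1, 2), third),  # car behind c: Monty picks one of two goat doors
])

# Bayes: P(car behind s | goat at m)
p_switch_wins = Fraction(1) * third / p_goat_at_m
print(p_goat_at_m, p_switch_wins)   # 1/2 2/3
```

Working with `Fraction` rather than floats keeps the arithmetic exact, so the familiar answer 2/3 comes out precisely rather than as 0.6666....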
Splitting a set into pairwise disjoint parts is also known as partitioning the set, and so we can think of this law as telling us that the probability of an event can be recovered from the probabilities of its parts, provided the probabilities for the parts can be calculated from the given data using the multiplication law. The law of total probability is used for a procedure known as Bayesian updating which is discussed in the following section. Examples for the application of this rule can be found there.

Summary
For events A and B, with P(B) ≠ 0, and a pairwise disjoint collection of events (Bi), where 1 ≤ i ≤ n, we have the following laws:

P(A | B) · P(B) = P(A ∩ B)
P(B | A) = P(A | B) · P(B) / P(A)
P(A) = P(A | B) · P(B) + P(A | S ∖ B) · (1 − P(B))
P(A) = Σ_{i=1}^n P(A | Bi) · P(Bi)    if A ⊆ ⋃_{i=1}^n Bi

In the course unit on data science you will use these laws in order to derive information from data, in particular in the part about Bayesian statistics.

Example 4.42. You have a friend who likes to occasionally bet on a horse, but no more than one bet on any given day. From talking to him about his bets, you have some statistical data. There's a five percent chance that he's won big and a twenty-five percent chance that he has won moderately, or else he has lost his stake. If he has won a significant amount of money there's a seventy percent chance that he has gone to the pub to celebrate, and if he's lost there's an eighty percent chance that he has gone to drown his losses, whereas if he's won a small amount there's only a twenty percent chance that you'll find him in the pub. If you know he has placed a bet today, and you go to the pub, what's the chance that you will find him there?

We can use the law of total probability to help with that. We partition the overall space into your friend having won big, moderately, or not at all.
We know the probabilities for each of these events, and also the conditional probability that he is in the pub for each of those, so the overall probability is

70/100 · 5/100 + 20/100 · 25/100 + 80/100 · 70/100
= 7/10 · 1/20 + 2/10 · 5/20 + 8/10 · 14/20
= (7 + 10 + 112)/200 = 129/200 = .645,

so there's a 64.5% chance that you'll find him in the pub.

Exercise 71. Let (S, ℰ, P) be a probability space, and assume that A, B and C are events. What might we mean when we refer to the probability of A, given B, given C? Can you find a way of expressing that probability? You may assume that B, C and B ∩ C all have non-zero probabilities.

Exercise 72. Prove that the law of total probability holds.

CExercise 73. Assume that you have found the following statistical facts about your favourite football team:
• If they score the first goal they win the game with a probability of .7.
• If they score, but the other team scores the first goal, your team has a probability of .35 of winning the game.
• If your team scores then the probability that the game is a draw is .1.
You have further worked out that in all the matches your team has played, in 55% of all games they have scored, and in 40% of those they have scored first. What is the probability that your team wins a randomly picked game? After further analysis you have worked out that they lose 80% of all games in which they haven't scored. What is the probability that a randomly picked game your team is involved in is a draw?

Exercise 74. One of your friends claims she has an unfair coin that shows heads 75% of the time. She gives you a coin, but you can't tell whether it's that one or a fair version. You toss the coin three times and get . What is the probability that the coin you were given is the unfair one?

Exercise 75. Assume you have an unfair coin that shows heads with probability p ∈ (0, 1]. You toss the coin until heads appears for the first time. Show that the probability that this happens after an even number of tosses is (1 − p)/(2 − p).
This is a tricky exercise. It depends on cleverly choosing events, and on using the law of total probability.

Exercise 76. Consider the following situation: Over a channel bits are transmitted. The chance that a bit is correctly received is p. From observing previous traffic it is known that the ratio of bits of value 1 to bits of value 0 is 4 to 3. If the sequence 011 is observed, what is the probability that this was transmitted?

4.3.4 Bayesian updating

In AI it is customary to model the uncertainty regarding a specific situation by keeping probabilities for each of the possible scenarios. As more information becomes available, for example through carrying out controlled experiments, those probabilities are updated to better reflect what is now known about the given situation. This is a way of implementing machine learning. It is also frequently used in spam detection software. In this section we look at how probabilities should be updated.

Example 4.43. Assume you are given a bag with three socks in it. You are told that every sock in the bag is either red or black. You are asked to guess how many red socks are in the bag. There are four cases: {0, 1, 2, 3}.

We model this situation by assigning probabilities to the four cases. At the beginning we know nothing, and so it makes sense to assign the same probability to every one of these. Our first attempt at modelling the situation is to set the following probabilities.

Original distribution
  0     1     2     3
 1/4   1/4   1/4   1/4

This expresses the fact that nothing is known at this stage. Assume somebody reaches into the bag and draws a red sock which they hold up before returning it to the bag. Now we have learned something we didn't know before: there is at least one red sock in the bag. This surely means that we should set P(0) to 0, but is this all we can do? The idea is that we should update all our probabilities based on this information.
The probability P(i) that we have i red socks in the bag should become P(i | R), that is, it should be the probability that there are i red socks given that the drawn sock was red. Bayes's Theorem helps us to calculate this number since it tells us that

P(i | R) = P(R | i) · P(i) / P(R).

Let us consider the various probabilities that occur in this expression:
• P(R | i). This is the probability that a red sock is drawn, given the total number i of red socks. This is known, and it is given by the following table:

    i        0    1     2    3
 P(R | i)    0   1/3   2/3   1

So if the number of red socks is i then the probability P(R | i) is i/3.
• P(i). We don't know how many red socks there are in the bag, but we are developing an estimated guess for the probability, and that is what we are going to use. So where this appears we use the probabilities provided by the first table, our original distribution.
• P(R). This is the probability that the first sock drawn is red, independent of how many red socks there are. It is not clear at first sight whether we can calculate that. The trick is to use the law of total probability, as described below.

We should pause for a moment to think about what the underlying probability space is here to make sensible use of the law of total probability. In the table above we have assigned probabilities to the potentially possible numbers of red socks in the bag. But by drawing a sock from the bag we have expanded the possible outcomes: these now have to be considered as combinations: they consist of the number of red socks in the bag, plus the outcome of drawing a sock from the bag. We can think of these as being encoded by
• a number from 0 to 3 (the number of red socks in the bag) and
• a colour, R or B, denoting the outcome of the draw.
In other words, for the moment we should think of the sample space as

{0R, 0B, 1R, 1B, 2R, 2B, 3R, 3B}.

Note that our original outcome i now becomes a shortcut for the event {iR, iB}.
If we draw further socks from the bag then each current outcome iX will become an event {iXR, iXB}. Returning to the probability that a red sock is drawn, P(R), we can now see that this is the probability of the event {0R, 1R, 2R, 3R}. Since we can split this event into the disjoint union

{0R} ∪ {1R} ∪ {2R} ∪ {3R},

the law of total probability tells us that

P(R) = P(R | 0)P(0) + P(R | 1)P(1) + P(R | 2)P(2) + P(R | 3)P(3)
     = 0 · 1/4 + 1/3 · 1/4 + 2/3 · 1/4 + 3/3 · 1/4 = 1/2.

This should be no surprise: at the moment all the events 0 to 3 are considered to be equally likely, which gives us a symmetry that makes drawing a red and drawing a black sock equally likely, based on what we know so far. We use this information to update our description of the situation.

First update
  0     1        2           3
  0    1/6   2/6 = 1/3   3/6 = 1/2

Note that the probability that there is just one red sock has gone down, and the probability that the socks are all red has gone up the most.

Assume another sock is drawn, and it is another red sock. This extends the sample space in that outcomes are now of the form iRR, iRB, iBR, iBB. But the way most implementations of the algorithm work is not to look at it from that point of view. Instead of keeping track of the colours of the socks drawn so far, the assumption is that everything we know about what happened so far is encoded in the probabilities that describe what we know about the current situation.

This has the advantage that what we have to do now looks very similar to what we did on the previous round of updates, and it means that one can write code that performs Bayesian updating which works for every round. So again we are seeking to update P(i) by setting it to

P(i | R) = P(R | i) · P(i) / P(R),

where now the P(i) are those calculated in the previous iteration, the first update to the distribution. Note that the value of P(R) has changed. It is now

P(R) = P(R | 0)P(0) + P(R | 1)P(1) + P(R | 2)P(2) + P(R | 3)P(3)
     = 0 · 0 + 1/3 · 1/6 + 2/3 · 1/3 + 3/3 · 1/2 = 7/9.

The updated probabilities are

Second update
  0     1      2      3
  0    1/14   4/14   9/14
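The two rounds of updating above can be reproduced with a few lines of code, which also illustrates the remark that one procedure works for every round. This is a sketch using exact rational arithmetic; the helper name `update` is mine, not notation from the notes.

```python
from fractions import Fraction

def update(prior, likelihood):
    """One round of Bayesian updating: prior[i] is P(i red socks),
    likelihood[i] is P(observation | i red socks).  Returns the
    posterior P(i red socks | observation) via Bayes's theorem,
    with P(observation) computed by the law of total probability."""
    evidence = sum(l * p for l, p in zip(likelihood, prior))
    return [l * p / evidence for l, p in zip(likelihood, prior)]

# P(red drawn | i red socks out of 3) = i/3
p_red = [Fraction(i, 3) for i in range(4)]

prior = [Fraction(1, 4)] * 4                        # nothing known yet
after_first_red = update(prior, p_red)              # first update
after_second_red = update(after_first_red, p_red)   # second update
print(after_first_red)    # 0, 1/6, 1/3, 1/2
print(after_second_red)   # 0, 1/14, 2/7, 9/14
```

Feeding the posterior of one round back in as the prior of the next is exactly the scheme described in the text; an observation of a black sock would use the likelihoods (3 − i)/3 instead.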
If instead the second drawn sock had been black then we would have to update P(i) to

P(i | B) = P(B | i) · P(i) / P(B),

where we can read off the probabilities of drawing a black sock given that there are i red socks from the table

    i        0    1     2    3
 P(B | i)    1   2/3   1/3   0

which means that P(B | i) = (3 − i)/3, and based on the probabilities after the first update we have

P(B) = P(B | 0)P(0) + P(B | 1)P(1) + P(B | 2)P(2) + P(B | 3)P(3)
     = 3/3 · 0 + 2/3 · 1/6 + 1/3 · 1/3 + 0 · 1/2 = 2/9,

leading to updated probabilities of

Alternative second update
  0     1     2     3
  0    1/2   1/2    0

Note that in this case the probabilities for both cases that have been ruled out, 0 and 3, have been set to 0. Based on what we have seen in this situation, that is a red sock being drawn followed by a black one, it seems reasonable for the probabilities of the remaining options to be equal.

We can see that Bayesian updating is a way of adjusting our model of the current situation by updating the probabilities we use to judge how likely we are to be in any of the given scenarios. The preceding example is comparatively simple, but there are two issues worth looking at in the context of this example. The first of these is already hinted at in the example: What is the underlying probability space in a case like this? The sample space changes with the number of socks drawn—one might think of it as evolving over time. At the stage when n socks have been drawn from the bag the outcomes are best described in the form of strings iX1X2···Xn, where i ∈ {0, 1, 2, 3} and Xj ∈ {R, B} for 1 ≤ j ≤ n. In other words, each outcome consists of the number of red socks, and the results of the n sock draws conducted. As we move from one sample space to the next, each outcome iX1X2···Xn splits into two new outcomes, iX1X2···XnR and iX1X2···XnB.
Note that what is happening here is that the number of red socks in the bag is fixed for the entirety of the experiment, and so the actual probability distribution for the first probability space (before the first sock is drawn) is one which
• assigns 1 to the actual number of red socks and
• 0 to all the other potential numbers of red socks under consideration.

If, for example, the number of red socks in the bag is 1 then the actual probability distribution is

 0   1   2   3
 0   1   0   0

Under those circumstances, the actual probabilities for the probability space based on the set of outcomes {0R, 0B, 1R, 1B, 2R, 2B, 3R, 3B} are

 0R   0B   1R    1B   2R   2B   3R   3B
 0    0   1/3   2/3   0    0    0    0

What Bayesian updating is trying to do is to approximate this actual probability distribution for the original set of outcomes in a number of steps. Note that since we do not know what the actual distribution does, it is at first sight surprising that with what little information we have, we can write a procedure that will succeed in approximating the correct distribution. The probabilities used for the updates are quite different from the actual ones given above. But if we keep conducting our random experiments then our approximated distribution will almost certainly converge towards the actual distribution—see Fact 13 for a more precisely worded statement.

The underlying probability space is one where the set of events is the powerset of the sample space, but many events have probability 0. We don't know what the distribution is, and so we cannot describe that space and use that description in our procedure. Note that we are very careful about which events play a role in our calculation. These are of two kinds:
• The first kind consists of events whose probability we are trying to estimate. These are the outcomes from the original sample space, which expand into events whose number of elements doubles each time we draw a sock.
• The second kind consists of events whose approximated probability is calculated by forming a 'weighted average' over all the events of the first kind. In other words, we are using all the data from our current approximation to give an approximated probability for those events. In the example, this is the probability of drawing a red/black sock.

The aim here is to ensure that we do not introduce any additional uncertainty or bias into our calculations.

The reason Bayesian updating is so useful is that it allows us to approximate the unknown probability distribution by conducting experiments (or observing events), with very little information being required for the purpose. At each stage we treat the present approximating distribution as if it were the actual distribution, and we are relying on the idea that over time, the available information will tell us enough to ensure that our approximation gets better.

Note that it will not necessarily get better on every step—whenever a comparatively unlikely event (according to the actual distribution) occurs, our approximation is going to get worse on the next step! But there is the Law of Large Numbers, Fact 13, which can be thought of as saying that if we keep repeating the same experiment (drawing a sock from the bag) often enough, then almost certainly we will see red socks appearing in the correct proportion.

It is worth pointing out that the idea in Bayesian updating relies on us being able to perform the same experiment more than once—if we don't put the drawn sock back into the bag the idea does not work.22

An interesting question is also what we can do if the number of socks in the bag is unknown. It is possible instead to consider the possible ratios between red and black socks. The most general case would require us to cope with infinite sums (since there are infinitely many possible ratios), and that is beyond the scope of this unit.
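The convergence behaviour appealed to above can be illustrated by simulation. In this sketch (our own code; the true number of red socks is assumed to be 1, unknown to the updater) repeated draws with replacement make the approximated distribution concentrate on the true hypothesis:

```python
import random

def bayes_update(prior, likelihood):
    # One Bayesian updating step for a finite list of hypotheses.
    unnorm = [l * p for l, p in zip(likelihood, prior)]
    total = sum(unnorm)  # P(observed outcome) by the law of total probability
    return [u / total for u in unnorm]

random.seed(1)
true_k = 1              # actual number of red socks; the updater never sees this
dist = [1 / 4] * 4      # initial approximation: all four hypotheses equally likely

for _ in range(500):    # draw a sock with replacement, 500 times
    drew_red = random.random() < true_k / 3
    likelihood = [k / 3 if drew_red else 1 - k / 3 for k in range(4)]
    dist = bayes_update(dist, likelihood)

# dist now puts almost all of the probability mass on k = 1
```

The first red draw sets the probability for k = 0 to 0 and the first black draw does the same for k = 3, while the long-run proportion of red draws separates k = 1 from k = 2, in line with the Law of Large Numbers argument in the text.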
Note also that there is no way of starting with a probability distribution on all possible ratios that assigns to each ratio the same probability in the way we did here, see Proposition 4.4. If the number of possible ratios is restricted, however, then one may employ the same idea as in the example above, see Exercise 79.

A fun example for Bayesian updating, created by my colleague Gavin Brown, can be found here: http://www.cs.man.ac.uk/~gbrown/BTTF/. We present a toy23 version of the following example to help you understand the more complex considerations in that example.

22Of course, if the drawn sock is not returned then after three draws we know how many red socks were in the bag originally.
23What do you call a toy version of a toy version of an example?

Example 4.44. Assume you have a robot that is in a room of size a × a metres that has been split into four quadrants. The robot has a sensor, indicated by the line, which is known to face west. The robot would like to determine which quadrant it is in. To do so it can invoke its sensor, which will detect how far it is to the nearest wall. Depending on whether the measured distance is smaller than a/2 or larger than a/2 the robot can then deduce whether it is in the quarter adjacent to that wall or not. However, the sensor is inaccurate, and will report a wrong distance 1/4 of the time. Assume the robot has some information encapsulated in the following distribution:

Original distribution
Q       NW    NE    SW    SE
P(Q)     0   2/6   3/6   1/6.

Assume the robot conducts a sensor reading and finds that the distance to the wall is less than a/2. We use c for 'close' to record this outcome (one might use f for 'far' if the measured distance is greater than a/2). We can determine the probability that it gets this reading for each of the four possibilities:

Q           NW    NE    SW    SE
P(c | Q)   3/4   1/4   3/4   1/4.

It is close to the wall if it is in one of the two western quadrants, and it gets the correct reading with probability 3/4.
If it is in one of the eastern quadrants then it shouldn't get this reading unless the measurement is inaccurate, which happens with probability 1/4. Using the law of total probability we may now calculate the probability that the robot gets this reading as

P(c) = P(c | NW) · P(NW) + P(c | NE) · P(NE) + P(c | SW) · P(SW) + P(c | SE) · P(SE)
     = 1/4 · 1/6 · (3 · 0 + 1 · 2 + 3 · 3 + 1 · 1)
     = 1/4 · 1/6 · 12
     = 1/2.

We may now use Bayes' Theorem to find the formula for the updated probabilities of our distribution. Assume Q is one of the four quadrants. Then we want to update P(Q) to

P(c | Q) · P(Q) / P(c),

leading to the following.

Updated distribution
Q        NW     NE     SW     SE
P(Q)      0   2/12   9/12   1/12.

We can see that the probability that the robot is in the SW quadrant has gone up substantially—and indeed, given the fact that this was already the most likely position, the sensor reading further confirmed that opinion.

The following example is a more complicated version of the previous one but considerably simpler than the one that used to appear in an AI lab.

Example 4.45. We extend the previous example as follows: It is not known which way the robot is facing. The location and orientation of the robot can then be described by a string of length 3, made up from the symbols {N, E, S, W}: The first two of the symbols give the quadrant in which the robot is, and the third its orientation. In the picture in Example 4.44 you can see a robot in one such state. The first symbol has to be N or S, and the second symbol has to be W or E, so altogether there are 2 · 2 · 4 = 16 possible states the robot could be in.

Again we assume that there is a probability distribution regarding which state the robot is in, either by assigning the same probability to each possible outcome, or by using partial information the robot has. The robot is using a probability space where the outcomes are as described above. Since this is a finite set we can calculate the probability for each potential event, that is each subset of the sample space, by having a probability for each of the sixteen cases.
The robot can perform the same sensor readings as before. This means there is an event of taking a sensor reading, and the outcome can be that the nearest wall in the direction the robot is facing is less than a/2 or more than a/2 away. Hence we should think of the sample space as being given by strings of length four, where the last symbol tells us whether the wall is close (c) or far (f). The robot, however, is only interested in the events consisting of the outcomes where the last symbol has been ignored.

So where in the table above we wrote, for example, NWW, the underlying event is really {NWWc, NWWf}. We call these events 'status events' because they tell us the potential status of the robot.

Querying the sensor is another event, which we can think of as getting the reading c, or getting the reading f, where the former is given by the set

{NWNc, NWEc, NWSc, NWWc, NENc, NEEc, NESc, NEWc, SWNc, SWEc, SWSc, SWWc, SENc, SEEc, SESc, SEWc}.

It is convenient to abbreviate that event with c.

Based on what we've said above it should be clear that we know something about the conditional probabilities for sensor readings. If the position of the robot is NWW then the nearest wall is close. The probability that the sensor reading will be c is therefore 3/4 (because the sensor is correct 75% of the time), and 1/4 that the reading will be f. This means that we know that

P(c | NWW) = 3/4,

and similarly we can determine the conditional probabilities for c and f given the various other status events.

How should the robot update information about its status? It should apply Bayesian updating. When the robot performs a sensor reading it should update the probability for all status events to reflect the result. If the sensor reading returns c then the probability that the robot is in, for example, state NEW should reduce, since if everything works properly the sensor should return f in that situation. The new value for the probability of NEW should be the probability of NEW given the outcome c. In other words, we would like to set P(NEW) to P(NEW | c).
To calculate that probability we can use Bayes's Theorem which tells us that

P(NEW | c) = P(c | NEW) · P(NEW) / P(c).

We know that P(c | NEW) is 1/4, and we know the current probability for NEW. Hence it only remains to calculate P(c). For this remember that c is a shortcut for all outcomes of the form ???c, that is, the ones whose last symbol is c. The sixteen status events form a pairwise disjoint collection of events with the property that c is a subset of their union, since

{NWNc, NWNf} ∪ {NWEc, NWEf} ∪ · · · ∪ {SEWc, SEWf} = ⋃_{i∈{N,S}, j∈{W,E}, k∈{N,E,S,W}} {ijkc, ijkf}

is the whole sample space. Hence we may use the law of total probability to deduce that

P(c) = Σ_{i∈{N,S}, j∈{W,E}, k∈{N,E,S,W}} P(c | ijk) · P(ijk).

This means we now can calculate the updated probability for NEW.

In general, given a status event S (for location), the robot should update the probability for S to account for the outcome of querying the sensor, so if the outcome is c, it should set P(S) to P(S | c). More generally, if we use d (for distance) for an element of the set {c, f} then the robot should set P(S) to P(S | d), after it has observed the event d. How do we calculate this? We are given

• the probabilities P(S),
• the probabilities P(d | S) for d ∈ {c, f}.

As discussed above Bayes's Theorem allows us to calculate the desired probability. It tells us that for each status event S we have

P(S | d) = P(d | S) · P(S) / P(d).

Looking at the probabilities that appear on the right hand side of this equality, we know P(d | S) from the basic setup (information about the robot's sensor), and we have a value for P(S) since that is what the robot is keeping track of. What about P(d)? Remember that this is a shortcut for all outcomes of the form ???d, so repeating what we have done above for the case where d is equal to c we can see that the status events form a pairwise disjoint collection of events with the property that d is a subset of their union, and hence we may use the law of total probability to deduce that

P(d) = Σ_{i∈{N,S}, j∈{W,E}, k∈{N,E,S,W}} P(d | ijk) · P(ijk).
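The sums over the sixteen status events can be written out directly in code. Below is a sketch in Python (our own encoding of the states; the uniform starting distribution is an illustrative assumption, and the sensor model is the one from the example):

```python
from itertools import product

# The sixteen status events: quadrant (N/S, W/E) plus facing direction.
states = [ns + we + o for ns, we, o in product("NS", "WE", "NESW")]

def p_close(state):
    # The wall the sensor faces is within a/2 exactly when the robot's
    # quadrant is adjacent to that wall; the sensor is right 3/4 of the time.
    ns, we, o = state
    wall_is_close = o == ns or o == we
    return 3 / 4 if wall_is_close else 1 / 4

# Illustrative starting distribution: all sixteen states equally likely.
dist = {s: 1 / 16 for s in states}

def update(dist, reading):
    # Bayesian update after observing the reading "c" or "f".
    def likelihood(s):
        return p_close(s) if reading == "c" else 1 - p_close(s)
    evidence = sum(likelihood(s) * p for s, p in dist.items())  # P(reading)
    return {s: likelihood(s) * p / evidence for s, p in dist.items()}

dist = update(dist, "c")  # the sensor reports that the wall is close
```

Starting from the uniform distribution, a reading of c raises the probability of states such as NWW (sensor facing a near wall) to 3/32 and lowers states such as NEW to 1/32.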
The expressions we have found here get quite unwieldy. We show how to adapt that notation to our toy example, and give these equalities using that notation. Instead of writing NWW to describe the potential location and orientation of the robot, let's call the events in question E_{i,j,k}, where

• i ∈ {0, 1}, where 0 stands for N and 1 for S,
• j ∈ {0, 1}, where 0 stands for W and 1 for E, and
• k ∈ {0, 1, 2, 3}, where 0 stands for N, 1 for E, 2 for S and 3 for W.

Our encoding means that E_{0,0,3} is equivalent to the status event NWW. We can then write the update rule for the probabilities as follows: After a sensor reading resulting in d (where d is still in {c, f}), the probability P(E_{i,j,k}) should be set to

P(E_{i,j,k} | d) = P(d | E_{i,j,k}) · P(E_{i,j,k}) / Σ_{i′∈{0,1}, j′∈{0,1}, k′∈{0,1,2,3}} P(d | E_{i′,j′,k′}) · P(E_{i′,j′,k′})
                = P(d | E_{i,j,k}) · P(E_{i,j,k}) / Σ_{i′,j′,k′} P(d | E_{i′,j′,k′}) · P(E_{i′,j′,k′}),

where the last line is a short-cut for the case when it is understood what values the variables i′, j′ and k′ are allowed to take.

In general, Bayesian updating is performed in the situation where we have the following.

• There are a number of possibilities that may apply, say E1, E2, . . . , En, which are events in some probability space such that they are disjoint, and their union is the whole sample space. It is assumed that there are estimates P(Ei) for all 1 ≤ i ≤ n.
• There is a way of collecting information about the situation, in such a way that the corresponding experiment has a number of possible outcomes O1, O2, . . . , Om.
• When the outcome O is observed then for each i the probability of Ei is updated to

P(Ei | O) = P(O | Ei) · P(Ei) / P(O),

where it is assumed that P(O | Ei) is known for all combinations, and where the calculation of P(O) is performed as

P(O) = Σ_{i=1..n} P(O | Ei) · P(Ei),

giving an overall update of P(Ei) to

P(O | Ei) · P(Ei) / Σ_{l=1..n} P(O | El) · P(El).

Tip

To perform Bayesian updating you need to perform the following steps:

• Determine the possibilities you want to distinguish between, say E1, E2, up to En.
Initialize the probability distribution by setting all probabilities to be equal, unless you have further information.
• Determine which random experiment you may conduct to find out more about the given situation. For each outcome O of this experiment, and for each possibility Ei from the first step, determine P(O | Ei). You must be able to find these numbers from the description of the situation. These numbers are used on every step of the calculation and they do not change.
• Assume you carry out the experiment once, and find the outcome O. Calculate

P(O) = P(O | E1) · P(E1) + P(O | E2) · P(E2) + · · · + P(O | En) · P(En),

where E1, E2, . . . , En are all the possibilities determined in step 1, the P(Ei) come from the current estimate of the probability distribution, and the P(O | Ei) were determined in step 2. This number has to be recalculated after each update to the distribution.
• Update the probability distribution by setting

P(Ei) = P(O | Ei) · P(Ei) / P(O),

where these numbers were determined in the previous steps, and repeat from step 3.

Example 4.46. In Example 4.44 we have the following:

• The possibilities we are trying to distinguish between are the four quadrants.
• The outcomes of the experiment that can be conducted to collect further information are c and f.
• One may determine the probability of recording c given that the current position is a particular quadrant (and similarly for f). The table for c is given in the example. We give the table for f here:

Q           NW    NE    SW    SE
P(f | Q)   1/4   3/4   3/4   1/4.

Note that if the robot turns to face in a different direction then these numbers change.
• One may now compute P(c) (or P(f)) using the law of total probability.
• It is now possible to update the distribution using Bayes's Theorem, and then one repeats from step 3.

Example 4.47. In Example 4.43 we have the following.

• The possibilities we are trying to distinguish between are the possible numbers of red socks, that is 0, 1, 2 or 3. At the start the distribution assigns to each outcome the probability 1/4.
• The outcomes of the experiment we can conduct repeatedly are the two possible colours of the sock drawn, R and B.
• The probability of drawing a red sock if the total number of red socks is k is k/3 (and the probability of drawing a black sock is 1 − k/3).
• One may now calculate P(R) (or P(B)) using the law of total probability.
• One may now update the distribution using Bayes's Theorem, and then repeat from step 3.

Example 4.48. In Example 4.45 we had the following.

• The possibilities we are trying to distinguish between are the various quadrants and the direction in which the robot's sensor is facing. We assume there is a given probability distribution at the start.
• The outcomes of the experiment we could conduct repeatedly are the two possible outcomes of using the sensor, c and f.
• We determine the probability of getting c (or f) for each possibility based on the probability of the sensor working accurately, and the currently assumed situation using the law of total probability.
• Using Bayes's Theorem one may use this data to update the probability distribution.

Every time we conduct an experiment we update our estimate of the probability distribution underlying the situation. In Bayesian statistics the following terminology is used:

• Let θ describe parameters whose probability distribution we are aiming to approximate.
• Let E describe evidence that we may collect (for example by carrying out a random experiment).
• Let O describe a particular outcome of that random experiment.
• Let P be the probability distribution for θ, where we use the current best approximation.

P(θ | O) = P(O | θ) · P(θ) / P(O),

where P(θ | O) is the posterior, P(O | θ) the likelihood, P(θ) the prior and P(O) the evidence.

The viewpoint there is that

• The posterior is the updated distribution based on what we know so far, or more generally in Bayesian statistics it describes what we want to know. It's called 'posterior' because it is what we know after we have collected (more) data.
• The prior describes our belief before we acquire more data/evidence.
• The likelihood is the probability that O happens given the parameters in θ.
• The evidence, also referred to as the normalization, can be hard to find—in Bayesian updating it's the best approximation to the probability that the observed event does happen.

You will meet these ideas once again in the data science unit in Semester 2.

CExercise 77. Imagine your friend claims to have an unfair coin, which they give to you. From the rather vague description they gave you you aren't sure whether the coin is fair, or whether it gives heads with probability 3/4, or whether it gives tails with that probability. You want to conduct Bayesian updating to work out which it is.

You are going to mimic having the coin as follows: Take two coins. Every time you would toss our fictitious coin, toss both your coins. If at least one of them shows H, assume the result was H, else assume it was T. The above procedure allows you to mimic an unfair coin using two fair ones.

Follow the instructions to carry out three coin tosses and the corresponding Bayesian updating steps. Hint: Read the text carefully: How many possibilities for the coin are there that you are trying to distinguish between?

Note that I expect you to really use a random device, and therefore for different students to have different sequences of coin tosses!

Exercise 78. Assume a friend is trying to send you a message which consists of 'yes' or 'no'. He's a bit mischievous, and what he is actually going to do is tell three of your friends something which he claims you can decode into a 'yes' or a 'no' each time.

You are very sceptical about whether you will be able to extract the correct message from your friends, and you only give yourself a 60% chance to do so correctly in each case. Carry out Bayesian updating to determine your friend's answer. Assume that the messages you extract from your three friends are 'yes', 'yes' and 'no' in that order. What do you think of the final distribution?
How confident are you that you have decoded the message correctly?

Exercise 79. Consider Example 4.43. Instead of knowing the total number of socks, all you know is that the ratio of red to black socks is an element of the following set:

{1/4, 1/3, 1/2, 2/3}.

What is the Bayesian update rule for this situation? Assume a black sock is drawn, followed by a red one. Starting from a probability distribution that assigns the value of 1/4 to each ratio, give the updated probabilities for each of the given ratios after each draw.

Optional Exercise 15. Assume you are asked to perform Bayesian updating in a case where there are only two possible options, and where information is gained by performing an experiment which also has two possible outcomes. The resulting case can be described using three parameters:

• the probability that we have assigned to the first case,
• the probability that tells us how likely Outcome 1 is if we are in Case 1, and
• the probability that tells us how likely Outcome 1 is if we are in Case 2.

Write down the rule for a Bayesian update in this situation. Can you say anything about subsequent calculations?

4.4 Random variables

Often when we study situations involving probabilities we want to carry out further calculations. For example, in complexity theory (see COMP11212 and COMP26120) we are frequently looking for the 'average case'—that is, we would like to know what happens 'on average'. By this one typically means taking all the possible cases, each weighted by its relative frequency (not all cases may be equally frequent), and forming the average over all those. For examples of what is meant by an 'average case' for two search algorithms see Examples 4.96 to 4.99. But in order to carry out these operations we have to be in a situation where we can calculate with the values that occur.
If we look at some of the examples studied then we can see that some of them naturally lend themselves to calculating averages (it is possible, for example, to ask for the average number of eyes shown when throwing two dice), and some don't (there's no average colour of a sock drawn from one of our bags of socks).

This is why people often design questionnaires by giving their respondents a scale to choose from. The university does this as well: When you are asked to fill in course unit questionnaires for all your units, then part of what you are asked to do is to assign numbers. 'On a scale of 1 to 5, how interesting did you find this unit.' This allows the university to form averages. But what does it mean that the average interest level of COMP11120 was 3.65 (value from 2014/15)?24

Certainly every time you assign numbers so that you may form averages, you should think about what those numbers are supposed to mean, and whether people who are asked to give you numbers are likely to understand the same as you, and as each other, by those numbers.

Nonetheless, forming averages can be a very useful action to perform, and that is why there is a name given to functions that turn the outcomes from some probability spaces into numbers. We see below that this does not merely allow us to calculate averages but also to describe particular events without knowing anything about the events or outcomes from the underlying probability space.

4.4.1 Random variables defined

Random variables are functions that translate the elements of a sample space, that is the possible outcomes from a random experiment, to real numbers. But this translation has to happen in such a way that we know what the probabilities for the resulting numbers are, and that requires a technical definition. In order to formulate that we have to define an additional concept.

Definition 32: measurable function

Let (S, ℰ, P) be a probability space.
The function f : S → R is measurable if and only if for all elements r of R the sets

• {s ∈ S | f(s) ≤ r} and
• {s ∈ S | r ≤ f(s)}

are events, that is, elements of ℰ.

Note that in the case where ℰ is the powerset of S, as is often the case for applications, every function from S to R is measurable.

Definition 33: random variable

Given a probability space (S, ℰ, P) a random variable over that space is a measurable function from S to R.

Example 4.49. When we toss a coin, but record the outcomes as numbers, say 0 for heads and 1 for tails, we have a random variable.

Example 4.50. If you have a population whose height distribution you know (compare Example 4.60) you may think of randomly picking a person and recording their height as a random variable.

Example 4.51. When you're playing a game of chance, and you assign a value of −1 to losing, 0 to a draw and 1 to a win, you have a random variable. You could also give 3 for a win, 1 for a draw, and 0 for a loss, and that would also result in a random variable.

Note that it is often tempting to define a random variable as a function from a sample set to a subset of R. Strictly speaking this does not satisfy the above definition. One should instead make the target set of the random variable R and observe that its range is a proper subset of R. If one changes the definition above then many of the results and definitions below become more complicated. Theorem 4.11 gives a technical result that argues that we could, instead of looking at all of R, restrict ourselves to the range of the function from the start.

Note that whenever we assign numbers to outcomes, and then carry out calculations with those numbers, we have to worry about whether our interpretation of those numbers makes sense.

24On these questionnaires they try to make the numbers slightly more meaningful by assigning 5 to 'agree' and 1 to 'disagree', but when does one move from one grade to another? Is it really meaningful to average those out?
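For a finite sample space whose set of events is the powerset, every assignment of numbers to outcomes is measurable, so a random variable can be modelled directly as a function together with a distribution on the outcomes. A small sketch of this (the scoring is the one from Example 4.51; the probabilities of winning, drawing and losing are our own illustrative assumptions):

```python
# Outcomes of a game of chance, with assumed probabilities.
outcomes = {"loss": 0.5, "draw": 0.3, "win": 0.2}

# The random variable from Example 4.51: -1 for a loss, 0 for a draw, 1 for a win.
X = {"loss": -1, "draw": 0, "win": 1}

def prob(predicate):
    # Probability of the event consisting of all outcomes whose value under X
    # satisfies the predicate.
    return sum(p for o, p in outcomes.items() if predicate(X[o]))

p_nonneg = prob(lambda x: x >= 0)   # P(X >= 0) = 0.3 + 0.2
p_loss = prob(lambda x: x <= -1)    # P(X <= -1) = 0.5
```

Note that the translation into numbers is what lets us form events such as (X ≥ 0) that never mention the original outcomes.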
In game theory it is customary to use any items (money or points) won or lost to encode the outcome of a game in a number, but that may not be a faithful description of what a win or loss means to the individual playing.

Whenever you have a probability space (S, ℰ, P) such that the set S of outcomes is a subset of R then you have a random variable, provided you can calculate the probabilities of all sets of the form S ∩ [r, ∞) and S ∩ (−∞, r], where r ∈ R.

For some random experiments one would naturally record the outcome as a number, and that gives a random variable, but in other cases one has to translate the outcome to a real number first. See the first example given above, but also more interestingly see the following example.

Example 4.52. If you are plotting the position of a butterfly in the form of two coordinates, (x, y), then to get a random variable you have to turn those two numbers into one. You could, for example, compute the distance of the butterfly from a fixed point, and that could be considered a random variable.

Technically this amounts to doing the following. We have a probability space with underlying sample set R × R, and a set of events based on the Borel σ-algebra, where all sets of the form [a, a′] × [b, b′], for a, a′, b, b′ ∈ R, are events. We assume that there is a probability density function describing the probability that the butterfly is at a given point (x, y). One suitable such function is

R × R → R+,    (x, y) ↦ e^(−(x² + y²)/2) / 2π.

[Plot of this density function over the (x, y) plane.]

To create a random variable we would like to apply the function

R × R → R,    (x, y) ↦ √(x² + y²)

to the location, which gives us the butterfly's distance from some chosen point that here is assumed to be (0, 0). We have taken two-dimensional data and turned it into a random variable, which requires the restriction to just one dimension. However, calculating the probability density function of this random variable is non-trivial.
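Although the density of the distance is non-trivial to derive, probabilities for this random variable are easy to estimate by simulation: sample positions from the given density, apply the distance function, and count. For this particular density the squared distance is exponentially distributed, so P(distance ≤ 1) = 1 − e^(−1/2) ≈ 0.39, which the estimate should roughly reproduce. A sketch (our own code, not part of the notes):

```python
import math
import random

random.seed(0)

def sample_position():
    # The density (1/2π)·exp(-(x² + y²)/2) is that of two independent
    # standard normal coordinates.
    return random.gauss(0, 1), random.gauss(0, 1)

n = 100_000
hits = 0
for _ in range(n):
    x, y = sample_position()
    distance = math.sqrt(x * x + y * y)  # the random variable from the example
    if distance <= 1:
        hits += 1

estimate = hits / n                # Monte Carlo estimate of P(distance <= 1)
exact = 1 - math.exp(-0.5)         # about 0.3935, for comparison
```

This is exactly the translation the example describes: each two-dimensional sample is mapped to a single real number before any probability is computed.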
Alternatively you could measure the distance relative to a north/south (or other) axis. For example, you could project your position onto its x-coordinate, and then you could calculate the probability density function of the resulting random variable as

R → R,    x ↦ ∫_{−∞}^{∞} e^(−(x² + y²)/2) / 2π dy.

Example 4.53. Consider Example 4.23 where we have given several probability spaces one might use to describe throwing two dice. If you pick as the space the one with outcomes

{(i, j) | i, j ∈ {1, 2, 3, 4, 5, 6}},

then the function X which maps the pair (i, j) from that set to the sum of eyes shown, i + j (viewed as an element of R), is a random variable25:

{1, 2, 3, 4, 5, 6} × {1, 2, 3, 4, 5, 6} → R,    (i, j) ↦ i + j.

Whenever we have a random variable we get an induced probability distribution. In order to calculate the probability that X takes the value 4 we have to calculate26

P({(i, j) ∈ {1, 2, 3, 4, 5, 6} × {1, 2, 3, 4, 5, 6} | X(i, j) = 4})
    = P({(1, 3), (2, 2), (3, 1)})
    = P({(1, 3)}) + P({(2, 2)}) + P({(3, 1)})
    = 1/36 + 1/36 + 1/36 = 3/36 = 1/12.

This is usually written in the shortcut notation of P(X = 4). But note that since we have translated our outcomes into real numbers we may also ask, for example, what the following probabilities are:

P(X ≤ 4)    P(X ≤ −4)    P(X ≤ 5.5)    P(X ≥ 10)

The events described here do not look as if they have anything to do with the original experiment of rolling two dice, but since we have translated the outcome from that experiment into real numbers we may construct such events. These probabilities can be calculated as follows:

• P(X ≤ 4). This can be calculated by splitting it into the possible outcomes satisfying that property.

P(X ≤ 4) = P((X = 2) ∪ (X = 3) ∪ (X = 4))
         = P(X = 2) + P(X = 3) + P(X = 4)
         = 1/36 + 2/36 + 3/36 = 6/36 = 1/6.

• P(X ≤ −4). Clearly there are no possible outcomes which satisfy this condition, so this probability is 0.

• P(X ≤ 5.5). This works similarly to the first calculation.
P(X ≤ 5.5) = P((X = 2) ∪ (X = 3) ∪ (X = 4) ∪ (X = 5))
           = P(X = 2) + P(X = 3) + P(X = 4) + P(X = 5)
           = 1/36 + 2/36 + 3/36 + 4/36 = 10/36 = 5/18.

• P(X ≥ 10). This is similar to the previous example.

P(X ≥ 10) = P(X = 10) + P(X = 11) + P(X = 12)
          = (3 + 2 + 1)/36 = 1/6.

Below we describe how this works for arbitrary random variables. In general, given a random variable X on a probability space (S, ℰ, P), and real numbers r and r′, we define

• (r ≤ X ≤ r′) = {s ∈ S | r ≤ X(s) ≤ r′},
• (r ≤ X) = {s ∈ S | r ≤ X(s)},    (r < X) = {s ∈ S | r < X(s)},
• (X ≤ r′) = {s ∈ S | X(s) ≤ r′},    (X < r) = {s ∈ S | X(s) < r}.

The general case is given by the following proposition.

Proposition 4.7
Let X be a random variable over the probability space (S, ℰ, P). The probability distribution of X is determined by the fact that, for any real interval I, we have

P(X ∈ I) = P({s ∈ S | X(s) ∈ I}).

In other words, if we are given an interval in R then in order to determine its probability we ask for the probability of the event given by all those elements of the original sample space which are mapped into that interval. Note that the sets that appear on the right hand side of the equals sign appear in the definition of measurability. This ensures that in the original probability space we have a probability for the set in question.

Definition 34: discrete/continuous random variable

A random variable is discrete if and only if its range is a countable27 subset of R. A random variable which is not discrete is continuous.

While there is a mathematical theory that allows the discrete case to be treated at the same time as the continuous one, covering the mathematics that allows this is beyond the scope of this course unit. In what follows the discrete case is frequently treated separately. In the text, and in some of the results given, some guidance is given on how the discrete case may be seen as a special case of the continuous one.

25Random variables are typically named using capital letters from the end of the alphabet.
26You may want to return to Example 4.23 for an explanation.
27What this means formally is discussed in Section 5.2. Every finite set is countable, and you may think of countable sets as ones that can be described in the form {si | i ∈ N}.

Note in particular that if a random variable has a finite range, then Proposition 4.7 indicates that we can treat it in much the same way as we did a probability space with a finite sample set where every set of the form {s}, for s ∈ S, is an event.

Example 4.54. If we look at Example 4.49 it is clear that there are only two possible outcomes of the given random variable X, namely 0 and 1, and that each of those occurs with probability 1/2. This means that when we calculate the probability P(X ≤ r), then this is completely determined by which of 0 and 1 is in the given interval. In particular we have

P(X ≤ r) = 0      if r < 0,
         = 1/2    if 0 ≤ r < 1,
         = 1      else.

Note that every discrete random variable has a range of the form {ri ∈ R | i ∈ N}, that is a subset of R that is indexed by the natural numbers. For such a random variable, say X, further note that X = ri, for i ∈ N, gives us a collection of pairwise disjoint events which collectively have probability 1, as you are asked to show in the following exercise.

Exercise 80. Let X be a discrete random variable with range {ri ∈ R | i ∈ N}. Show that

Σ_{i∈N} P(X = ri) = 1.

Example 4.55. Assume we are conducting a random experiment that consists of tossing a coin three times. In order to record the possible outcomes we can use strings of length three which give the outcomes for each toss, resulting in the sample space

S = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}.

This means we can describe the elements of S via

S = {s1s2s3 | s1, s2, s3 ∈ {H, T}}.

Each of these occurs with probability 1/8. We can turn this experiment into a random variable by converting each outcome into a number. Here we pick the number of heads shown. The resulting random variable, say X, has the following range:

{0, 1, 2, 3},

since the number of heads may range from 0 to 3.
In order to find the probability of each of the possible results we have to work out which events are mapped to which number. Recall that a random variable is a function, and our function is given by

S → R,    s1s2s3 ↦ number of H in s1s2s3.

The probability for each possible value of X is given by adding all the probabilities of outcomes from the original space which are mapped to it. For example,

P(X = 2) = P({s1s2s3 ∈ S | number of H is 2})
         = P({HHT, HTH, THH})
         = 1/8 + 1/8 + 1/8 = 3/8.

We can conveniently give these probabilities in a table. To help illustrate how those probabilities come about we also give the outcomes from the original space in the column of the number they are mapped to by X.

               HTT    HHT
               THT    HTH
        TTT    TTH    THH    HHH
X         0      1      2      3
P       1/8    3/8    3/8    1/8

This is the probability mass function for the random variable, which is formally defined below. We can now ask questions such as what is the probability that the number of heads is at most 2, or what is the average number of heads tossed.

Exercise 81. Consider the experiment that consists of tossing a fair coin four times. Consider the random variable X which records the number of heads thrown. Calculate the following probabilities.

(a) P(X = 2),
(b) P(X ≤ 3),
(c) P(X ≤ 2.5),
(d) P(X ≥ 3),
(e) P(X ≥ 10),
(f) P(X < −1),
(g) P((X = 1) ∪ (X = 3)),
(h) P(X is even). Note that this only makes sense because we know the range of X is a subset of N—it does not make sense to talk of evenness for numbers in R.

4.4.2 A technical discussion

What follows is a fairly technical discussion regarding why we can define probabilities in the way outlined above. The material from this subsection is not examinable, and you should feel free to skip it when reading the notes.

Optional Exercise 16. Show that if (S, ℰ, P) is a probability space, and if we have a random variable X : S → R, then given r and r′ in R we have that (r ≤ X ≤ r′) is an event. This means that P(r ≤ X ≤ r′) is always defined.

Proposition 4.8
Let (S, ℰ, P) be a probability space, and let f : S → R be a function.
If X is measurable (and so a random variable) then for every B in the Borel σ-algebra ℰ_B on R we have that {s ∈ S | Xs ∈ B} is an event, that is, an element of ℰ.

Proposition 4.9
Let (S, ℰ, P) be a probability space, and let X : S → R be a measurable function. For i ∈ N let I_i be an interval in R such that the I_i are pairwise disjoint. If we define, for i ∈ N,

E_i = {s ∈ S | Xs ∈ I_i},

then the E_i are pairwise disjoint and so

P(X ∈ ⋃_{i∈N} I_i) = P(⋃_{i∈N} E_i) = ∑_{i∈N} P E_i = ∑_{i∈N} P(X ∈ I_i).

As a consequence we get the following result:

Theorem 4.10
Let (S, ℰ, P) be a probability space, and X : S → R a measurable function. Then a probability space is given by R, the Borel σ-algebra ℰ_B, and the probability distribution

ℰ_B → [0, 1],    B ↦ P{s ∈ S | Xs ∈ B}.

Example 4.56. Looking back at Example 4.54 we can see how we have effectively defined a probability distribution for R. It can be described as follows: Given an element B of the Borel σ-algebra ℰ_B the probability of B is given as

P_X B = { 0     0, 1 ∉ B
        { 1/2   exactly one of 0, 1 in B
        { 1     else.

In general, if a random variable X has a finite range,28 say {r_1, r_2, . . . , r_n} in R, then given an interval I we have that

P(X ∈ I) = P(I ∩ {r_1, r_2, . . . , r_n}) = ∑_{i ∈ {1, 2, . . . , n}, r_i ∈ I} P(X = r_i).

In other words we add up all the probabilities for those elements of the range of X which are elements of I.

Alternatively we may restrict ourselves to the range of the underlying measurable function to define a probability space—in this way we remove those parts of R which are assigned a probability of 0.

Theorem 4.11
Let (S, ℰ, P) be a probability space, and X : S → R a measurable function. Then a probability space is given by the range of X, the σ-algebra

{B ∩ X[S] | B ∈ ℰ_B},

and the probability distribution

{B ∩ X[S] | B ∈ ℰ_B} → [0, 1],    B ∩ X[S] ↦ P{s ∈ S | Xs ∈ B}.

Example 4.57. If we once again look at Example 4.54 then the range of the random variable is {0, 1}. The set of events given in the previous theorem is then merely the powerset of this set.
The probability distribution is given by the following assignment:

P ∅ = 0
P {0} = P {1} = 1/2
P {0, 1} = 1.

28This result can be extended to the case where X has a range that can be expressed as {r_i | i ∈ N}.

Tip 1
Theorem 4.11 effectively tells us that it is okay to define a random variable as a function from some sample set to a subset of R. This can be useful when describing specific situations. Below we point out the first few times we do this, and after that we do it tacitly.

4.4.3 Calculating probabilities for random variables

Above we defined probabilities for random variables. You can think of them as translating the original outcomes into numbers in such a way that we can look at the probabilities of subsets of R instead of events from the original space. The previous section establishes that we can take the original probability distribution and transfer it to the random variable, which gives another probability distribution, this time over the real numbers, which means that all the usual results (see Sections 4.2.5 and 4.3.3) hold.

One of the advantages of considering random variables is that it allows us to compute probabilities with very little information, in particular without knowing too much about the original probability space.

Example 4.58. Assume that X is a random variable and that we know that

• P(X = 1) = 1/2,
• P(X = 2) = 1/4, and
• P(X ≥ 2) = 1/2.

This is enough to allow us to calculate, for example,

• P(X = 0) = 0,
• P(X ≤ .5) = 0,
• P(X > 2) = 1/4.

We can see from the given information that the total probability of 1 is distributed in the following way:

• 1/2 of the available ‘probability mass’ goes to 1;
• 1/4 of it goes to 2;
• the remaining 1/4 goes to the interval (2, ∞), and we cannot tell more precisely where it goes from the given data.

In particular this means that none of the probability goes to 0, or to any number below 1, which explains the first two claims.
We can also derive the final result more formally by noting that

(X ≥ 2) = (X = 2) ∪ (X > 2)

and that the two sets whose union we form are disjoint, which means we have

1/2 = P(X ≥ 2) = P(X = 2) + P(X > 2) = 1/4 + P(X > 2),

from which we may deduce the last result. Note in particular that we do not know whether X is a discrete or a continuous random variable! The fact that it has non-zero probability of being equal to 1 and 2 might suggest it is the former, but it could still be the case that the behaviour is continuous for values beyond 2. See Example 4.72 for a way of picturing some of this information.

Example 4.59. Recall Example 4.55, where we look at the random variable X given by the number of heads recorded when tossing a coin three times. For this random variable we can see, for example, that

P(X > 2) = P(X = 3) = 1/8,

that

P(X ≤ .5) = P(X = 0) = 1/8,

and that

P(X ∈ (−∞, −2] ∪ [5, ∞)) = 0.

Because we know that we may consider the outcomes as real numbers we can write down (and calculate) the probability of the random variable taking its value in any interval, for example. This is quite useful, and in Section 4.4.4 we demonstrate that we can also use this to graphically represent the given probability distribution.

Exercise 82. Assume you have a probability space with outcomes {1, 2, 3, 4, 5}, and that the following hold:

• The outcomes 1 and 2 are equally likely.
• The outcomes 3, 4 and 5 are equally likely.
• The outcomes of the first kind are three times as likely as the outcomes of the second kind.

A random variable X is given by the function defined by

X : {1, 2, 3, 4, 5} → R,    x ↦ { 1   x = 1 or x = 2
                                { 3   x = 3
                                { 5   else.

Compute the following: (a) P(X ≤ 1.5), (b) P(X ≥ 3), (c) P(2.5 ≤ X ≤ 3.2), (d) P(X ≥ 6).

Sometimes we want to take a random variable, which is a function that maps outcomes to real numbers, and apply another function to it so as to translate the outcomes.

Example 4.60.
Assume that I’ve been given the heights of a group of people, where the measurements have been carried out with great precision. If I have the measurements for everybody in the population I can give the probability distribution of the random experiment given by picking a person (randomly) from the group. One might want to treat this like a continuous random variable if a lot of people are involved. But maybe for my purposes I only care about how many people I have in much looser categories. Assume that I’m only interested in the following categories:

• people who are at most 140 cm tall,
• people who are from 140 to 160 cm tall,
• people who are from 160 to 180 cm tall, and
• people who are taller than 180 cm.29

I would like to count how many people out of the group belong to each category to construct a probability space which allows me to work out the probabilities that a randomly chosen person from that group falls into a particular category. I may create another random variable by composing X with the function g : R → R given by the following assignment:

x ↦ { 1   x ≤ 140
    { 2   140 < x ≤ 160
    { 3   160 < x ≤ 180
    { 4   180 < x.

I can then compute the probability for the new outcomes, given by the range of the composite g ∘ X, by counting how many people fall into each category and dividing by the total population count—which gives the same result as taking the original probability distribution for X and using g to translate it to the new outcomes.

We can see from the preceding example that it can be useful to take a given random variable and use a function on its possible values (here mapping actual heights to representatives of some height categories) to get a different (but related) random variable that better expresses whatever we are concerned with.

Example 4.61. In the robot example (Example 4.45) one might want to consider the orientation of the robot and view it as an angle from 0 to 360 degrees.
The orientation is a continuously varying entity, but for the purpose of performing calculations one might split it into a finite number of parts of equal size, creating a discretely valued random variable, which makes it easier to carry out calculations (Bayesian updating in that case).

The following result tells us that composing with a function R → R always gives us another random variable, provided that the function is well behaved.

Proposition 4.12
If X is a random variable and f : R → R is a measurable function then f ∘ X is a random variable. For the random variable f ∘ X and an interval I in R we have

P(f ∘ X ∈ I) = P(X ∈ {r ∈ R | f r ∈ I}).

Optional Exercise 17. Show that if X is a measurable function from some probability space to R, and if f : R → [a, a′] is measurable for the Borel probability space on the interval [a, a′], then their composite f ∘ X is measurable.

Example 4.62. Recall Example 4.53 of adding the eyes shown by two dice, which we may consider a random variable X. We might instead only wish to record whether this number is even or odd. Theorem 4.11 tells us that it is okay to view X as a function with target the set of natural numbers from 2 to 12. With that observation we may express our new object of interest by composing X with the following function:

g : {2, 3, 4, . . . , 11, 12} → {0, 1},    n ↦ n mod 2.

29Clearly one has to think about what should happen on the borderline—let’s assume here this belongs to the lower height category.

We may now compute the probabilities for g ∘ X as follows.

P(g ∘ X = 0) = P{n ∈ {2, 3, 4, . . . , 11, 12} | n mod 2 = 0}
             = P{2, 4, 6, 8, 10, 12}
             = 1/36 + 3/36 + 5/36 + 5/36 + 3/36 + 1/36
             = 18/36 = 1/2.

To calculate P(g ∘ X = 1) it is sufficient to note that the two probabilities have to add up to 1, and so this is also 1/2.

Example 4.63. Recall Example 4.55 where we considered the random variable X given by counting the number of heads that appear when tossing a fair coin three times.
We know that the range of X is {0, 1, 2, 3}, so we may think of X as a function from the original sample space to that set. The probabilities for the various outcomes are given in the following table.

r          0    1    2    3
P(X = r)  1/8  3/8  3/8  1/8

Now assume we are interested only in whether the number of heads is more than one away from the number of tails, or not. The outcomes 1 and 2 satisfy the new property, and the outcomes 0 and 3 do not. Consider the following function:

g : {0, 1, 2, 3} → {0, 1},    x ↦ { 0   x = 1 or x = 2
                                  { 1   else.

Once again we use Theorem 4.11 to think of X as a function with target set {0, 1, 2, 3}. Then composing X with g gives another random variable, with range {0, 1}, where 0 means the number of heads is at most one different from the number of tails, and 1 means the difference is larger. We can determine the probability for the new outcomes by adding the probabilities of the old outcomes which are mapped to it. Again we give a table that provides the probabilities for each outcome, and above each outcome of g ∘ X we give the outcomes from X that are mapped to it by g.

new outcomes     0                       1
old outcomes     1, 2                    0, 3
probabilities    P(X = 1) + P(X = 2)     P(X = 0) + P(X = 3)
                 = 3/8 + 3/8 = 3/4       = 1/8 + 1/8 = 1/4

Example 4.64. Assume that we are again starting with the random variable X that turns tossing a coin three times into the number of heads that appear among the three tosses, see the previous example. This time we want to change the random variable by only recording whether the number of heads is even or odd. This means we are composing the random variable (viewed as having target set {0, 1, 2, 3} as before) with the function

g : {0, 1, 2, 3} → {0, 1},    x ↦ x mod 2.

Then the probabilities for the possible values of the random variable g ∘ X are given in the following table.

new outcomes     0                       1
old outcomes     0, 2                    1, 3
probabilities    P(X = 0) + P(X = 2)     P(X = 1) + P(X = 3)
                 = 1/8 + 3/8 = 1/2       = 3/8 + 1/8 = 1/2

Example 4.65. In Example 4.60 the random variable had four possible values, namely {1, 2, 3, 4}.
Assume that the probability that a randomly chosen person from the monitored group fits into each category is given by the following table:

category       1    2    3    4
probability   1/2  1/4  1/8  1/8

We can now calculate with these probabilities much as if we had a discrete probability space at the start. For example, if I want to know the probability that a member of my population is below 160 cm, that is, belongs to category 1 or 2, then

P(X < 160) = P(g ∘ X = 1) + P(g ∘ X = 2) = 1/2 + 1/4 = 3/4.

Example 4.66. Assume that I am in the situation of Example 4.60, but now I am only interested in whether somebody is below 160 cm or above. Then I can take my previous random variable, which produced the possible values {1, 2, 3, 4}, and compose it with the function

{1, 2, 3, 4} → {1, 2},    x ↦ { 1   x = 1 or x = 2
                              { 2   else

to get a new random variable which only distinguishes between people with a height of at most 160 cm, who are in category 1, and those who are taller than 160 cm, who are in category 2.

Example 4.67. Assume we have a random variable X that has a range of values

{−n, −(n − 1), . . . , −2, −1, 0, 1, 2, . . . , n − 1, n}.

Maybe for some purposes we are not interested in the values as such, but only in how distant they are from the mid-point, 0. This might be because we are only interested in the difference between some value and 0, but not whether that difference is positive or negative (compare also Definition 39). By composing the random variable with the absolute value function

|·| : R → R,    x ↦ |x|,

we obtain a new random variable Y which takes its values in the set {0, 1, . . . , n}. To calculate probabilities for Y we note that

P(Y = i) = { P(X = i) + P(X = −i)   0 < i ≤ n
           { P(X = 0)               i = 0
           { 0                      else.

CExercise 83. Recall the unfair die from Exercise 57. Take as a random variable X the number of eyes shown. Calculate the following.

(a) P(X ≤ 3), (b) P(X ≥ 5), (c) P(4 ≤ X < 6), (d) P(X ≤ π), (e) P(X ≥ 7).

Now assume that the random variable Y is given by the sum of the eyes shown by two such dice. Calculate the following.

(f) P(Y ≤ 4.5), (g) P(Y ≥ 11.5).
Finally assume that we have the random variable Y and we compose it with the following function:

f : R → R,    x ↦ (x − 7)².

Calculate the following.

(h) P(f ∘ Y ≥ 6), (i) P(f ∘ Y ≤ .5).

4.4.4 Probability mass functions and cumulative distributions

There are many examples of random variables where we do not need to worry about all real numbers but only about those that appear in the range of the random variable. We can give a graphical presentation of how the probability is spread over that range. It is the equivalent of a probability density function for the case where we have discrete values.

Definition 35: probability mass function
Let X be a random variable with a countable range, say {r_i | i ∈ N}. The probability mass function (pmf) for X is given by

{r_i | i ∈ N} → [0, 1],    r_i ↦ P(X = r_i).

It is appropriate to think of a probability mass function as the discrete version of a probability density function.

Example 4.68. For the random variable that consists of assigning the total number of eyes to the throw of two dice, see Example 4.53, the pmf is given by

r          2     3     4     5     6     7     8     9     10    11    12
P(X = r)  1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

This is of course the original probability distribution from Example 4.2 for one of the sample spaces discussed there—if the outcomes are already described as numbers then this is what happens.

Example 4.69. If we toss a coin three times, see Example 4.55, and use the random variable that arises from assigning to each output the number of heads that appear, then we get the pmf as described in that example:

r          0    1    2    3
P(X = r)  1/8  3/8  3/8  1/8

The following is a version of Proposition 4.2 for random variables with finite range. It says that if we have a pmf for a random variable then to know the probability distribution for that random variable we merely need to know the probabilities for each of the values in that range.

Corollary 4.13
Let X be a random variable with finite range, say R_X, and pmf f.
Then there is a unique probability space (R_X, 𝒫 R_X, P) with the property that for all elements r ∈ R_X we have

P{r} = f r.

For this space we may calculate for all subsets A of R_X that

P A = ∑_{r∈A} f r.

Proof. This is an application of Proposition 4.2.

What this means is that if we have a probability mass function then we have a uniquely determined probability space, and so for a random variable with finite range all we need to understand the situation is the pmf. For this reason some people call a probability mass function a probability distribution.

In Section 4.2.4 the idea of a cumulative probability distribution is introduced. At this point we are ready to define that concept generally.

Definition 36: cumulative distribution function
Given a random variable X the cumulative distribution function (cdf) for X is the function

R → [0, 1],    r ↦ P(X ≤ r).

We are using here the fact that the real numbers are ordered, and so it makes sense to ask for the probability that the random variable is at most some given number. In particular we can meaningfully draw the graph of this function and visualize the probability distribution in a way that we only do when the outcomes are given as numbers. When we have a random variable which can take a finite number of values we have to draw a non-continuous function, and you may find this a bit odd at first. Look at the following example to see how that works.

Example 4.70. If we look at the situation from Example 4.69, where the pmf is described in the table

r          0    1    2    3
P(X = r)  1/8  3/8  3/8  1/8

then the corresponding cdf can be drawn as a step function rising from 0 to 1, with jumps of height 1/8, 3/8, 3/8 and 1/8 at the values 0, 1, 2 and 3 respectively. Note that when drawing discontinuous functions like this we have to specify what the value at a discontinuity is, the lower or the upper of the two lines. The convention used in the picture is interval notation, so that [ and ] mean that the point at the end of the line belongs to the graph, and ( as well as ) mean that it doesn’t.
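The step-function cdf described here can also be evaluated directly from the pmf; a minimal sketch (not from the notes), using the convention that the value at each jump is the upper one:

```python
from fractions import Fraction

# pmf of the number of heads in three fair coin tosses (Example 4.69)
pmf = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}

def cdf(r):
    """P(X <= r): sum the pmf over all values of the range that are <= r."""
    return sum(p for value, p in pmf.items() if value <= r)

# The cdf is 0 before the first value and 1 from the last value onwards.
print(cdf(-1), cdf(0), cdf(1.5), cdf(3))  # 0, 1/8, 1/2, 1
```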
An alternative way of drawing the same function is to use the convention where a filled circle indicates that the endpoint of the line is included, and an unfilled circle that it is excluded. In both pictures we can see that the function jumps to a higher accumulated probability as the next possible value of the random variable is reached. The probability of fewer than 0 heads is 0, the probability of getting at least 0 but fewer than 1 heads is 1/8, and so on.

Example 4.71. If we want to draw the graph of the floor function ⌊·⌋ : R → N, see page 43, we need this idea as well.

Example 4.72. We return to Example 4.58, where the information given about the random variable X and its probability distribution is the following:

• P(X = 1) = 1/2,
• P(X = 2) = 1/4, and
• P(X ≥ 2) = 1/2.

This is sufficient to be able to draw some of the cdf, but there is uncertainty: we know that the probability is 0 until the value 1 is reached, that it rises to .5 at that point, and that it rises further to .75 from 2. What we don’t know is when it takes on the value 1, or which values it takes between .75 and 1.

Example 4.73. An example for the continuous case is given in Section 4.2.4 in the form of Examples 4.28 and 4.29. Recall that in the case of a continuous random variable X with range contained in an interval I ⊆ R the probability distribution is given in the form of a probability density function, say f : I → R+. There are two cases.

• If the interval I is of the form (−∞, a′), (−∞, a′] or R then the cdf for X is given by

P(X ≤ r) = { ∫_{−∞}^{r} f    r ≤ a′
           { 1               else.

• If the interval I is of the form (a, a′), (a, a′], (a, ∞) or [a, ∞) then the cdf for X is given by

P(X ≤ r) = { 0              r ≤ a
           { ∫_{a}^{r} f    a ≤ r ≤ a′
           { 1              else.

In the case below I is [0, ∞), and we calculate the probability that X is below r.
Note that the derivative of a cumulative distribution function is the corresponding probability density function (which in the discrete case is the corresponding probability mass function). We have no time to discuss here exactly how the derivative is formed in the discrete case. However there is something that is easy to see. Assume you have a random variable X whose cdf F makes a jump at some point r. Then it has to be the case that the probability of the random variable at the point where the jump occurs is the difference of the two values, that is

P(X = r) = F(r) − lim_{n→∞} F(r − 1/n).

Proposition 4.14
Let X be a random variable. If F is its cumulative distribution function then its derivative is the corresponding probability density (mass density) function.

CExercise 84. For Exercise 57 consider the random variable X given by the number of eyes the die shows. Give its pmf, and draw a graph for its cdf. Then do the same with the random variable f ∘ Y from Exercise 83.

EExercise 85. Assume that teams are regularly playing in a ‘best out of five’ series against each other, compare Exercise 55. We assume here that the winner is determined via a random process. We are interested in the random variable X given by the number of matches Team A wins in a given series.

(a) Describe a probability space that describes a ‘best out of five’ series. For the probabilities assume that the two teams have an equal probability of winning any match.

(b) Describe the function that underlies this random variable by writing down a mathematical function that carries out the required assignment.

(c) For the case where Team A is equally matched by Team B, give the pmf and draw a graph for its cdf.

(d) Now assume that it is known that A wins the first match. We can now look at the random variable conditional on this event. Describe the pmf and cdf for the resulting random variable.
Hint: Because the event is part of the original probability space, but cannot be formulated for the outcomes of the random variable X, you cannot use the usual formula for conditional probabilities but have to analyse each case anew. If you are finding this part hard then looking ahead to Example 4.76 may help.

Exercise 86. Carry out the same tasks as for the previous exercise for a ‘best out of seven’ series.

4.4.5 Conditional probabilities for random variables

Recall that there is no example for conditional probabilities in the continuous case in Section 4.3 above. The reason for this is that describing the probability density function for the general case, where we may make no assumptions about the possible outcomes, requires mathematical techniques beyond this course unit. However, this is feasible once we restrict ourselves to random variables, where we know that the outcomes are elements of R. We revisit the idea of conditional probabilities, now confined to random variables. You can see below that in this case the definition for the continuous case is the same as that for the discrete one.

The conditional probability density function

Recall that given two events A and B, where P B ≠ 0, the conditional probability of A given B is defined as

P(A | B) = P(A ∩ B) / P B.

If X is a random variable with probability distribution function then we may define the conditional distribution of X given the event B (where we still assume P B ≠ 0) as

P(X ≤ r | B) = P((X ≤ r) ∩ B) / P B.

There is a conditional probability density function, which is once again the derivative of the corresponding distribution. The probability that the conditionally distributed random variable falls into a given interval is then the integral of that derivative over the given interval. If X is a discrete random variable with pmf f then given an event B with P B ≠ 0 we can calculate the pmf of the random variable (X | B), given B, by setting, for r in the range of X,

r ↦ { P(X = r) / P B    r ∈ B
    { 0                 else.
In other words, if we know that B happens, and r is a possible result of X not in B, then it has the probability 0, and otherwise the probability is adjusted by dividing through P B as expected.

Example 4.74. We return to Example 4.55. The pmf for the random variable X which gives the number of heads when a coin is tossed three times is as follows.

r          0    1    2    3
P(X = r)  1/8  3/8  3/8  1/8

Let B be the event that the result heads occurs at least once among the three tosses. The probability of B is P B = 7/8. We may calculate

P((X = r) | B) = P((X = r) ∩ B) / P(B),

where

P((X = r) ∩ B) = { P(X = r)   r ∈ B
                 { 0          else.

Hence the pmf of (X | B) is

r                0    1    2    3
P(X = r | B)     0   3/7  3/7  1/7

If B is the event that the number of heads is even then P B = 1/2 and the pmf of (X | B) is

r                0    1    2    3
P(X = r | B)    1/4   0   3/4   0

We look at the continuous case. Let X be a random variable with range R and probability density function f, let b be in R, and assume that B is the event B = (X ≤ b). We may calculate

P B = P(X ≤ b) = ∫_{−∞}^{b} f.

We might then wonder how to calculate, for r ∈ R,

P(X ≤ r | B).

What we do know is that if we have a probability density function g for the resulting random variable we can calculate this probability as

∫_{−∞}^{r} g.

We can work out what the probability density function g should do: if the argument is not in B it should return 0, and otherwise it should return the value of f adjusted by the probability of B. Assuming that the probability of B is non-zero, g is given by

g : R → R+,    r ↦ { f r / ∫_{−∞}^{b} f    r ≤ b
                   { 0                     else.

In the general case, where we make no assumptions about the shape of B, we merely assume that the probability of B is not zero. The probability density function of the random variable Y = (X | B) is given by

R → R+,    r ↦ { f r / P B    r ∈ B
               { 0            else.

Note that the range of Y is included in B.

Example 4.75. If we return to Example 4.29 we have a random variable X given by the time until the geyser next erupts. The probability density function is

f : [0, 90] → [0, 1],    x ↦ 1/90.

Consider the event B that the geyser hasn’t erupted in the 30 minutes we’ve already waited for it.
We may calculate the probability of B occurring by calculating the probability that the geyser does erupt in the first 30 minutes, and deducting that from one. The probability that the geyser erupts between minute 0 and 30 is

∫_{0}^{30} (1/90) dx = [x/90]_{0}^{30} = 30/90 − 0 = 1/3,

so

P B = 1 − 1/3 = 2/3.

The probability density function of the random variable (X | B) is then given by

g : [0, 90] → [0, 1],    x ↦ { 0      0 ≤ x ≤ 30
                             { 1/60   else.

The examples we have considered here only work if the event on which we are conditioning can be expressed in terms of outcomes of the random variable in question. Sometimes we wish to condition on an event that can only be formulated in the original probability space, see Exercise 85 for an example. In that case the various conditional probabilities have to be calculated more painstakingly, since we cannot apply the formulae derived above. We return to this idea in Section 4.4.6 after considering one more example.

Example 4.76. We return to the random variable X which counts the number of heads when tossing a coin three times, see Example 4.55, and contrast with Example 4.74. The pmf of X is given by the following table.

r          0    1    2    3
P(X = r)  1/8  3/8  3/8  1/8

Assume we wish to condition this random variable on the event B that the first toss is heads. The given pmf does not help us in calculating the pmf of (X | B). Instead we have to start over from the original probability space. We analyse the possible values of X and the probabilities with which they occur. Assume the first toss is heads. The possible numbers of heads among the three tosses are as follows.

• 0. This cannot occur.
• 1. This means the toss must be HTT. This occurs with probability 1/4.
• 2. This means the toss is HHT or HTH. This occurs with probability 1/2.
• 3. This means the toss is HHH. This occurs with probability 1/4.

Hence the pmf of (X | B) is

r                0    1    2    3
P(X = r | B)     0   1/4  1/2  1/4
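The case analysis above can be reproduced by enumerating only those outcomes that start with heads; a small sketch, not part of the original notes:

```python
from itertools import product
from fractions import Fraction

# All outcomes of three tosses; condition on B = "the first toss is heads".
outcomes = ["".join(s) for s in product("HT", repeat=3)]
given_B = [s for s in outcomes if s[0] == "H"]  # 4 equally likely outcomes

# Conditional pmf of the number of heads: count within the conditioned space.
cond_pmf = {k: Fraction(sum(1 for s in given_B if s.count("H") == k), len(given_B))
            for k in range(4)}

print(cond_pmf)  # probabilities 0, 1/4, 1/2, 1/4 for 0, 1, 2, 3 heads
```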
Note that it is also possible to perform Bayesian updating for random variables: in the case of a discrete random variable, the update procedure is just as described in Section 4.3.4. If the random variable is continuous then instead of updating the pmf by adjusting all the individual values we have to update the probability density function. Spelling out the resulting definition of the new probability density function goes beyond this course unit.

One random variable depending on another

The material in this subsection is not examinable. You may want to return to it if you ever have to cope with a situation where one random variable depends on another.

Recall Example 4.37, where we were wondering how to describe the probability density function for the location of a fox whose behaviour is influenced by the location of a lynx (if the latter is close enough). What we have there is one random process, describing the movements of the fox, conditional on another random process, namely the movement of the lynx.

We can only do this in the situation where we have a joint distribution, that is, a probability distribution, or a density function/pmf, that describes the combined probability. It is then the case that if h is the joint density function for random variables X and Y, we can derive density functions for X and Y, namely

• the probability density function for X is30

x ↦ ∫_{−∞}^{∞} h(x, y) dy,

• while that for Y is

y ↦ ∫_{−∞}^{∞} h(x, y) dx.

In this situation we can look at the density function for Y given (X = x), for some x ∈ R. We get

y ↦ h(x, y) / ∫_{−∞}^{∞} h(x, y) dy.

If instead we are interested in the probability distribution for Y given (x ≤ X ≤ x′), we have

P(Y ≤ y | x ≤ X ≤ x′) = ∫_{−∞}^{y} ( ∫_{x}^{x′} h(s, t) ds ) dt  /  ∫_{x}^{x′} ( ∫_{−∞}^{∞} h(s, t) dt ) ds.

If X and Y are discrete random variables then we can look at their joint pmf. This is a function that, given

30You can calculate with these integrals by treating the other variable as if it were a parameter, that is, you integrate the first expression over y and treat x as if it was a number.
You swap the treatment of the two variables for the second expression.

• a value x from the range of X and
• a value y from the range of Y,

returns the probability

P(X = x and Y = y).

Example 4.77. We return to the example of tossing a coin three times, see Example 4.55. The pmf of the random variable X, which counts the number of heads, is

r          0    1    2    3
P(X = r)  1/8  3/8  3/8  1/8

The random variable Y, which records the absolute value of the difference between the number of heads and tails, has a pmf given by the following table.

r          1    3
P(Y = r)  6/8  2/8

The joint pmf of X and Y is given by the following table.

Y \ X    0    1    2    3
1        0   3/8  3/8   0
3       1/8   0    0   1/8

Independent random variables

When we have two random variables which are independent from each other it becomes easier to calculate with both.

Definition 37: independent random variables
Two random variables X and Y are independent if and only if it is the case that for all elements B and B′ of the Borel algebra we have that

P(X ∈ B and Y ∈ B′) = P(X ∈ B) · P(Y ∈ B′).

In particular this means that

• if X is a random variable with density function f and
• Y is a random variable with density function g,

then the joint density function for X and Y is given by

(x, y) ↦ f x · g y.

We need this information when we wish to look at situations where we have several random variables, for example the failure of a number of pieces of equipment. This is easier if we assume that the failure of one is independent from the failure of the others, but this assumption is only justified if we can exclude factors that would affect more than one piece of equipment, such as a power surge at some location.

Example 4.78. We have already seen an example of this. When we look at the random variable X which gives us the number of eyes shown by the red die, and the random variable Y which gives us the number of eyes shown by the blue die, then their joint pmf is given as follows.
Y \ X    1     2     3     4     5     6
1       1/36  1/36  1/36  1/36  1/36  1/36
2       1/36  1/36  1/36  1/36  1/36  1/36
3       1/36  1/36  1/36  1/36  1/36  1/36
4       1/36  1/36  1/36  1/36  1/36  1/36
5       1/36  1/36  1/36  1/36  1/36  1/36
6       1/36  1/36  1/36  1/36  1/36  1/36

Exercise 87. Are the two random variables X and Y from Example 4.77 independent? Justify your answer.

EExercise 88. Assume that you are tasked by your boss with making sure that you have sufficiently many servers that in the course of the year the chance that all of them have failed is below 1%. Because you are able to place your servers at separate locations you are allowed to assume that one server failing has no effect on the other servers.

(a) Assume that the chance of one of your servers failing in a given year is .05. How many servers do you need to comply with your boss’s demand? How much safety do you get out of an extra server?

(b) Assume that the failure of one of your servers is governed by the probability density function31

f : [0, 365] → [0, 1],    x ↦ x² / (2 · 365³),

which we need to consider from x = 0 to x = 365 to cover the year. In other words, the probability that the server will have failed by the end of the year is given by the integral, from 0 to 365, over the given density function. How many servers do you have to buy and install to comply with the specification you were given?

4.4.6 Expected value and standard deviation

One of the motivations for introducing the notion of random variables is the ability to form averages.

Expected value

Example 4.79. Returning to the example of the number of heads when tossing a coin three times, Example 4.69, you may wonder what the average number of heads might be. This case is so simple that you can probably guess the answer, but in more complicated situations you will want to carry out analogous calculations. If we weigh each possible outcome by its probability then this

31I’m not claiming this is a realistic density function, but hopefully it’s not too bad to calculate with.
number is given by

0 · 1/8 + 1 · 3/8 + 2 · 3/8 + 3 · 1/8 = (0 + 3 + 6 + 3)/8 = 12/8 = 3/2,

so on average the number of heads is 1.5, which in this simple case you may have been able to guess. Note that if you wanted to bet on the outcome of this experiment then it does not make sense to bet on the expected value since it cannot occur.

We look at more interesting examples. Note that solving the following two examples requires knowledge beyond this course unit—they are included here to give you an idea of how powerful the idea is.

Example 4.80. Assume that we have strings which are generated in a random way, in that after each key stroke, with a probability of 1/2, another symbol is added to the string. We would like to calculate the average length of the strings so created. Before we can do this we have to specify when the random decision starts: Are all strings non-empty, or is there a chance that no symbol is ever added? We go for the latter case, but the calculation for the former is very similar.

As is often the case when picturing a step-wise process we can draw a tree that describes the situation. At each stage there is the random decision whether another symbol should be added or not. We give the length of each generated string.

(The tree: at each node, with probability 1/2 the string ends at the length reached so far (0, 1, 2, . . .), and with probability 1/2 another symbol is added and the process continues.)

What this means is that we have a random variable, namely the length of the generated string, and we can see that its probability mass function has the first few values given by the following table.

length n      0    1    2    3     4
probability  1/2  1/4  1/8  1/16  1/32

More precisely the pmf is given by the function

N → R,   n ↦ 1/2^(n+1).

What is the average string length? The idea is that we should weight each possible length with the probability that it occurs. This means that we should calculate

0 · 1/2 + 1 · 1/4 + 2 · 1/8 + · · · = ∑_{n∈N} n · 1/2^(n+1).

With a bit more mathematics than we can teach on this unit it may be calculated32 that this required number is 1.
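The claim that this series adds up to 1 can at least be checked numerically; the following short Python sketch (our own illustration, not part of the original notes) computes partial sums of the series:

```python
# Partial sums of sum_{n >= 0} n / 2^(n+1), the expected string length.
def partial_sum(terms):
    return sum(n / 2 ** (n + 1) for n in range(terms))

# The partial sums approach 1 as more terms are included.
for terms in (5, 10, 20, 50):
    print(terms, partial_sum(terms))
```

Already with 50 terms the partial sum agrees with 1 to within floating-point accuracy, since the omitted tail is tiny.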
So certainly when producing strings in this way we don't have to worry about there being a lot of long ones! But note that we have described a process for producing potentially infinite strings (with a probability of 0), and the power of the methods we use here is such that we can still calculate the average. We say more about how to cope with situations where we have to compute an infinite sum in Section 4.4.6.

Example 4.81. Assume we are tossing a coin until we get heads for the first time, and then we stop (compare Exercise 59 and Example 4.27). We wonder what the average number of coin tosses is. Again it makes sense to draw a tree.

(The tree: at each node, with probability 1/2 the experiment ends after the number of tosses reached so far (1, 2, 3, . . .), and with probability 1/2 we toss again.)

This is quite similar to the previous example! The pmf for this random variable is given by

N → R,   n ↦ 1/2^n.

The expected value is

∑_{n∈N} n · 1/2^n = 2.

Again calculating such expected values is not part of this unit, but it gives you one motivation why mathematicians care about what happens if infinitely many numbers are added up. We say more about how to cope with situations where we have to compute an infinite sum in Section 4.4.6, in particular Example 4.90 is relevant.

What is it that we have calculated in these examples?

32 In mathematical parlance, we have defined a series whose limit is 1.

Definition 38: expected value
Let X be a random variable with probability density function f. Then the expected value of X, written E(X), is given by

E(X) = ∫_{−∞}^{∞} x · f(x) dx.

Note that this definition does allow for the possibility that E(X) is infinite. This can never occur if X is a discrete random variable with a finite range, but in the other cases this is a possibility. Calculating with infinities is beyond the scope of this unit, and all the examples we study give a finite result.

Note that if X is a discrete random variable with range {x_i | i ∈ N}, then its expected value is

E(X) = ∑_{i∈N} x_i · P(X = x_i).

This means that if X is a discrete random variable with finite range {x_1, x_2, . . . , x_n}, then its expected value is

E(X) = x_1 P(X = x_1) + x_2 P(X = x_2) + · · · + x_n P(X = x_n) = ∑_{i=1}^{n} x_i P(X = x_i).
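The finite-range formula translates directly into code. Here is a minimal Python sketch (the function name expected_value and the dictionary representation of a pmf are our own choices, not notation from the notes), applied to the pmf of the number of heads in three coin tosses:

```python
from fractions import Fraction as F

# Expected value of a discrete random variable with finite range:
# E(X) = sum over i of x_i * P(X = x_i).
def expected_value(pmf):
    return sum(x * p for x, p in pmf.items())

# pmf of the number of heads in three tosses of a fair coin.
heads_pmf = {0: F(1, 8), 1: F(3, 8), 2: F(3, 8), 3: F(1, 8)}
print(expected_value(heads_pmf))  # 3/2
```

Using exact fractions avoids any rounding, so the result is exactly 3/2, matching the hand calculation.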
Example 4.82. In Example 4.79 the expected number of heads when tossing a coin three times is calculated as being 1.5. In Example 4.70 the cumulative distribution for that random variable is drawn:

(Figure: the cdf is a step function with value .125 on [0, 1), .5 on [1, 2), .875 on [2, 3) and 1 from 3 onwards; the area under it from 0 to 3 is shown in blue.)

The area under the function from 0 to 3, shown in blue above, is given by

.125 · 1 + .5 · 1 + .875 · 1 = 1.5,

which is the same as the expected value. In general this is always the connection between the expected value and the area under the cdf, and this is the best indication I can give that this area (and so an integral) has something to do with probabilities.

Note that in the discrete case, the expected value need not be in the range of X. In Example 4.79 the expected value is 1.5 heads in 3 tosses of a coin, which clearly is not a valid result of tossing a coin three times. Further note that even if the expected value is a possible outcome it need not in itself be particularly likely.

Example 4.83. Assume we are playing a game with a deck consisting of the four aces and the kings of spades and hearts, {A♣, A♠, A♡, A♢, K♠, K♡}. We each draw a card from the pack. If one of us has an ace and the other a king, the holder of the ace gets two pence from the other player. If we both have an ace, then if one of us has a black ace (♣ or ♠) and the other a red one, the holder of the black ace gets three pence from the other player. If we have aces of the same colour neither of us gets anything. If both of us have a king then the holder of the black king gets one penny from the other player.

We look at the random variable X formed by the number of pence gained or lost by one of the players (since the rules are symmetric it does not matter which player we pick). Its range and pmf are given in the following table.

x          −3    −2    −1     0     1     2     3
P(X = x)  2/15  4/15  1/30  2/15  1/30  4/15  2/15

We calculate the expected pay-off. It is

(1/30)(−3 · 4 + (−2) · 8 + (−1) · 1 + 0 · 4 + 1 · 1 + 2 · 8 + 3 · 4) = 0.
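As a sanity check of the table and the calculation above, the following Python sketch (our own illustration, using exact fractions) confirms that the probabilities sum to 1 and that the expected pay-off is 0:

```python
from fractions import Fraction as F

# pmf of the pay-off (in pence) in the card game above.
payoff_pmf = {
    -3: F(2, 15), -2: F(4, 15), -1: F(1, 30),
     0: F(2, 15),  1: F(1, 30),  2: F(4, 15), 3: F(2, 15),
}

total = sum(payoff_pmf.values())                 # probabilities sum to 1
expected = sum(x * p for x, p in payoff_pmf.items())
print(total, expected)  # 1 0
```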
We could have saved ourselves this calculation by making the following deductions: The game is completely symmetric, and wins for one player are paid for by the other.33 So if one player were to expect a gain the other player would have to expect a loss to make up for that gain, but the rules are exactly the same for both.

We note that the expected value 0 does not occur with a particularly high probability. Also note that the expected value does not have to be halfway between the extremes of the possible outcomes. This is illustrated (among other things) in the following example, where we calculate the average of an average to show that it is possible to have several layers of random variables, which still allow us to calculate an overall expected value.

Example 4.84. For a more down to earth example let us revisit Example 4.43. There we are faced with 4 possibilities regarding which situation we are in (given by the number of red socks in the bag). This gives us an opportunity to look at an expected value for different probability distributions. Here is a tree that describes the drawing of two socks (with replacement) from a bag that contains k red socks from a total of 3 socks.

33 This is a zero-sum game in the parlance of game theory.

(The tree: the first draw gives a red sock with probability k/3 and a black one with probability (3 − k)/3, and the same probabilities apply again on the second draw.)

We have here a random variable which maps the outcomes from this tree to the number of red socks drawn. Hence it maps the outcome rr to 2, the outcomes rb and br to 1, and the outcome bb to 0. The pmf of this random variable is

x          2         1              0
P(X = x)  k²/9   2k(3 − k)/9   (3 − k)²/9

Hence the expected value for the number of red socks is

2 · k²/9 + 2k(3 − k)/9 = (2k² + 6k − 2k²)/9 = 6k/9 = 2k/3.

So the expected value in each case is

k      0    1    2    3
E(X)   0   2/3  4/3   2

Note how the expected value varies with the underlying situation, and note that in none of the cases do we get as the expected value the halfway point between the two extremes 0 and 2. We can use these expected values to calculate an overall expected value based on our current estimate for the true probability distribution.
At the beginning, the probabilities of the possible situations k ∈ {0, 1, 2, 3} are equal, 1/4 for each. If we draw two socks (returning the sock to the bag after each draw) then we would expect to draw one red and one black sock on average. In the original Example 4.43, the first update given results in a changed distribution which is as follows.

k    0    1    2    3
p    0   1/6  1/3  1/2

If we want an overall expected value based on our current knowledge, which is given by the current distribution, then we should form an average where each of the previously calculated expected values is weighted by the probability that we think it's the correct one, giving an overall expected value of

0 · 0 + (2/3) · (1/6) + (4/3) · (1/3) + 2 · (1/2) = (2 + 8 + 18)/18 = 28/18 = 14/9 ≈ 1.56

for the situation after the first update in the original example.

Example 4.85. For a simple continuous example we return to the random variable that describes the amount of time until a geyser erupts from Examples 4.29 and 4.75. The expected time we have to wait until the geyser erupts is

∫₀⁹⁰ t/90 dt = [t²/(2 · 90)]₀⁹⁰ = 90²/(2 · 90) − 0 = 45,

which tells us that we have to wait 45 minutes on average, as expected.

Whenever we calculate an expected value we calculate a probability-weighted average, that is, we try to give some kind of number that occurs 'on average'. We should be careful when we use such calculations to make decisions—for example, the expected pay-off of playing some game being positive is by itself not a good enough reason to play that game. We've assigned numbers to certain outcomes, but these numbers might not adequately reflect our valuation of the situation.

Example 4.86. Assume somebody offers you a game: You toss a coin. If it gives heads, you pay a million pounds; if it's tails, you get a million and one pounds. The expected value of this game is 50 pence for you, but can you afford to lose this game?
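The arithmetic behind that expected value is tiny; a short Python check (our own illustration) makes the point that the positive expected value says nothing about the 50% chance of a ruinous loss:

```python
# Expected value of the game: lose 1,000,000 pounds on heads,
# win 1,000,001 pounds on tails, each with probability 1/2.
expected_pounds = 0.5 * (-1_000_000) + 0.5 * 1_000_001
print(expected_pounds)  # 0.5, i.e. 50 pence

# The probability of losing a million pounds is still one half.
prob_ruin = 0.5
```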
Whenever we use expected values to give an assessment of risk, apart from making sure we have our probabilities right, we should carefully check whether the numbers of the given random variable truly reflect how we judge the relevant outcomes.

Exercise 89. For the expected value given at the end of the previous example, what is the underlying random variable? Give its range and its pmf.

CExercise 90. You are invited to play the following game: There are three cards:
• One is black on both sides,
• one is red on both sides,
• one is black on one side and red on the other.

You and another person pay one pound each into a kitty. The three cards are put into a bag and mixed together. Without looking into the bag you draw a card. You pull it out of the bag in a way that only the upper side can be seen, and you place it on the table. The card is red on the side you can see. The other player bets that the card has the same colour on the hidden side as is showing. You're unsure whether you should bet on it having a different colour on the other side. The other player points out that it can't be the card that is black on both sides, so you have a 50-50 chance. The winner of the bet is to get the two pounds put into the kitty at the start. Should you accept this as a fair game, or should you ask for your pound back? Answer this question by calculating the expected value of the amount you have to pay.

Using conditioning to calculate expected values

Recall Example 4.81 where we determined the expected number of coin tosses until we get heads for the first time. If we use the definition of the expected value then we have to calculate with an infinite sum to find that number.

We can use conditional probabilities to help with this situation; see Section 4.4.5 for a general account of the probability distribution of a random variable conditioned on an event. In this section we are concerned with how to calculate the expected value of such a random variable.
We first look at the general case. Let X be a random variable with probability density function f and let A be an event with non-zero probability which is a subset of R. Then

E(X | A) = ∫_A x · (1/P(A)) · f(x) dx = (1/P(A)) ∫_A x · f(x) dx.

Note that the integral looks similar to the integral defining the expected value of X, but we cannot use one to calculate the other since the areas over which we integrate differ.

Example 4.87. In Example 4.85 we calculate the expected value of the time we have to wait when we visit the geyser from Example 4.29, which is 45 minutes. In Example 4.75 we give a probability distribution conditioned on the event A that the geyser has not erupted in the last thirty minutes. Applying the ideas above we calculate the expected value of (X | A), using the probability density function calculated in Example 4.75, which is

g : [0, 90] → [0, 1]
t ↦ { 0      0 ≤ t ≤ 30
    { 1/60   else.

We have

E(X | A) = ∫_{−∞}^{∞} t · g(t) dt = ∫₃₀⁹⁰ t/60 dt = [t²/(2 · 60)]₃₀⁹⁰ = (1/(2 · 60))(90² − 30²) = (1/120) · 7200 = 60.

This may strike you as unexpected: When we arrived we thought we might have to wait 45 minutes, but knowing that the geyser has not erupted for 30 minutes so far means that if we look at the conditional random variable we have to adjust our expectations to be rather more pessimistic!

In the discrete case the expected value can be expressed as follows. Let the range of X be given by {x_i | i ∈ N}, and let A be an event with non-zero probability. Then

E(X | A) = (1/P(A)) ∑_{i∈N, x_i∈A} x_i P(X = x_i).

If we further reduce this to the case where the range of X is finite, then we may calculate the elements of the finite set {x ∈ A | x is in the range of X}, say {x_1, x_2, . . . , x_n}, and then we have

E(X | A) = (1/P(A)) (x_1 P(X = x_1) + x_2 P(X = x_2) + · · · + x_n P(X = x_n)).

Example 4.88. Recall the random variable X from Example 4.55 of the number of heads when tossing a coin three times. We calculate its expected value as 1.5 in Example 4.79 and we calculate its conditional pmf for the event A that there is at least one head in Example 4.74.
The expected value of (X | A), whose pmf is (as per Example 4.74)

x    0    1    2    3
p    0   3/7  3/7  1/7,

is given by

E(X | A) = (1/7)(1 · 3 + 2 · 3 + 3 · 1) = 12/7.

Alternatively we can use the formula from above to carry out this calculation based on the pmf of X, which is given by

x    0    1    2    3
p   1/8  3/8  3/8  1/8.

The calculation then is

E(X | A) = (8/7) · (1/8)(1 · 3 + 2 · 3 + 3 · 1) = 12/7.

Exercise 91. For the random variable X of tossing a coin three times, and the event A from Example 4.74, calculate the expected value of the random variable Y = (X | A).

Note that above we assume that the event on which we are conditioning is an event that can be formulated regarding the outcomes of the given random variable X. What happens quite frequently is that one wishes to condition on an event that can only be formulated in the original probability space. In those cases there is no way of applying the formulae derived above.

Example 4.89. Consider the random variable X from Example 4.55 where we toss a coin three times. We might wish to condition on the event B that the first toss is heads. This is not something we can formulate by only referring to outcomes of this random variable. The pmf for the random variable (X | B) is calculated in Example 4.76, where it is given as follows.

x    0    1    2    3
p    0   1/4  1/2  1/4

With the help of that pmf we can calculate the expected value

E(X | B) = (1/4)(0 · 0 + 1 · 1 + 2 · 2 + 3 · 1) = 8/4 = 2,

but we cannot calculate this expected value from the expected value of X.

There is a useful technique for calculating expected values of random variables when it is easier to calculate expected values for conditioned versions of that random variable. Applications of the following result follow below.

Proposition 4.15
Let X be a random variable over the probability space (S, ℰ, P) and assume that we have pairwise disjoint events A_1, A_2, . . . , A_n such that S ⊆ A_1 ∪ A_2 ∪ · · · ∪ A_n. Then

E(X) = E(X | A_1) · P(A_1) + E(X | A_2) · P(A_2) + · · · + E(X | A_n) · P(A_n).

Proof. Proving the general case goes beyond what we cover on this course unit.
For the discrete case, let us assume that the range of X is given by {x_i ∈ R | i ∈ N}. Then

E(X) = ∑_{i∈N} x_i P(X = x_i)                                                      (definition)
     = ∑_{i∈N} x_i (P(X = x_i | A_1)P(A_1) + P(X = x_i | A_2)P(A_2) + · · · + P(X = x_i | A_n)P(A_n))    (law of total probability)
     = ∑_{i∈N} x_i P(X = x_i | A_1)P(A_1) + ∑_{i∈N} x_i P(X = x_i | A_2)P(A_2) + · · · + ∑_{i∈N} x_i P(X = x_i | A_n)P(A_n)
     = P(A_1) ∑_{i∈N} x_i P(X = x_i | A_1) + P(A_2) ∑_{i∈N} x_i P(X = x_i | A_2) + · · · + P(A_n) ∑_{i∈N} x_i P(X = x_i | A_n)
     = E(X | A_1)P(A_1) + E(X | A_2)P(A_2) + · · · + E(X | A_n)P(A_n).

This completes the proof.

We show how to use this idea to calculate expected values of a random variable conditioned over an event of the original probability space.

Example 4.90. We are interested in the random variable X which gives the number of tosses of a coin until we see heads for the first time. We would like to calculate its expected value E(X). Note that in Example 4.81 we required an infinite sum to find that value. Here we give an alternative method that works without infinite sums.

Since the probability of the first toss being heads is 1/2, and since this is also the probability of the first toss being tails, we may use the proposition above to write

E(X) = (1/2) E(X | first toss H) + (1/2) E(X | first toss T).

We look at the two expressions on the right hand side.

• If the first toss is heads then we may stop tossing our coin, and so the expected value of X, conditional on the first toss being heads, is 1.
• If the first toss is tails then it is as if we had not started to toss at all, and the expected value of X, conditional on that event, is one more than the expected value of X.

From these considerations we get

E(X) = (1/2) · 1 + (1/2)(1 + E(X)) = 1 + (1/2) E(X).

We can treat this as an equation in E(X) and solve it to give E(X) = 2.

For an alternative way of looking at the situation we draw the appropriate tree.

(The tree: the first toss gives H with probability 1/2, ending the experiment, or T with probability 1/2, from which the same two branches repeat.)

We can see that below the node labelled T in the picture, we have another copy of the same tree.
In other words, the tree branches to
• H, where it ends, or
• T, below which another copy of the whole tree appears.34

34 This can only work with infinite structures.

Because a copy of the infinite tree appears within itself we can use the trick of establishing an equation for E(X), where here we argue that the expected value, that is the 'average' number of tosses until the experiment ends, is given by
• with probability 1/2, the first toss results in H and the experiment ends after 1 toss and
• with probability 1/2, the first toss results in T, and then the expected number of additional tosses is the same as before, so the overall number of tosses is 1 added to the expected number of tosses.

This leads to the same equation as above, namely

E(X) = 1 + (1/2) E(X).

In general we can often avoid having to calculate with infinite sums by using similar techniques. Assume we have a random experiment which has a particular result r with probability p, and another result r′ with probability 1 − p, and that previous experiments have no effect on subsequent ones. If we are interested in the expected value E(X) of the random variable of how many times we have to repeat the experiment to get the second outcome, we can see that we have

E(X) = (1 − p) E(X | first outcome r′) + p E(X | first outcome r)
     = (1 − p) · 1 + p (1 + E(X))
     = (1 − p) + p (1 + E(X)) = 1 + p E(X).

This means that in this situation we get that

E(X) = 1/(1 − p).

Example 4.91. Assume we have a coin that shows heads with probability p, and tails with probability 1 − p. Let X be the random variable of the number of coin tosses required until we see heads for the first time. In Example 4.90 we calculate the expected value of X in the case of a fair coin. Here we want to establish that it is possible to condition on the two disjoint events, namely that the first toss gives heads, or that the first toss gives tails, and use those to express the expected value of X.

E(X) = ∑_{i∈N} i P(X = i)                                                          (definition)
     = ∑_{i∈N} i (P(X = i | fst toss H) P(fst toss H) + P(X = i | fst toss T) P(fst toss T))    (law of total probability)
     = ∑_{i∈N} i P(X = i | fst toss H) P(fst toss H)
       + ∑_{i∈N} i P(X = i | fst toss T) P(fst toss T)
     = E(X | fst toss H) P(fst toss H) + E(X | fst toss T) P(fst toss T)
     = E(X | fst toss H) p + E(X | fst toss T)(1 − p).

This idea generalizes to similar experiments with several outcomes. Assume there are n possible outcomes r_1, r_2, . . . , r_n and that
• for 1 ≤ i ≤ n − 1 outcome r_i occurs with probability p_i and
• outcome r_n occurs with probability 1 − (p_1 + p_2 + · · · + p_{n−1}).

Then the expected number E(X) of times we have to repeat the experiment to get outcome r_n has to satisfy the equation

E(X) = (1 − (p_1 + p_2 + · · · + p_{n−1})) · 1 + p_1 (1 + E(X | 1st r_1)) + · · · + p_{n−1} (1 + E(X | 1st r_{n−1}))
     = 1 + (p_1 + p_2 + · · · + p_{n−1}) E(X),

using, as above, that E(X | 1st r_i) = E(X), since a failed first attempt leaves us exactly where we started, and so we must have

E(X) = 1 / (1 − (p_1 + p_2 + · · · + p_{n−1})).

EExercise 92. Assume you have a fair coin.

(a) What is the expected number of tosses until you have two heads in a row for the first time?

(b) What is the expected number of tosses until you have heads immediately followed by tails for the first time?

(c) Assume you are invited by one of your friends to play the following game: A coin is tossed until either
• two heads occur in a row for the first time or
• we have heads immediately followed by tails for the first time.
In the first case you get 6 pounds and in the second case you have to pay the other player 5 pounds. Should you play this game?

Hint: Use the same idea as in Example 4.90. For the first part, check the situations you may find yourself in after two tosses.

Properties of expected values

We know from Proposition 4.12 that we may compose a random variable with a (measurable) function from its range to (a subset of) R and that this gives another random variable. But in general there is no easy formula for the expected value in that situation: Composing with a function will lead to a different probability density function, and forming the integral over that cannot in general be expressed in terms of the integral giving the expected value for the original random variable.
Even if the given random variable is discrete we do not get a simple formula: Assume that g is a measurable function from the range of a random variable X to a subset of R. Then the new random variable has an expected value of

E(g ∘ X) = ∑_{y ∈ range g} y · P(g ∘ X = y).

This indicates that there is no easy way to calculate the expected value of g ∘ X from that of X. This situation only changes when g is a very simple function.

Exercise 93. Let X be a random variable; consider the following function:

g : R → R,   x ↦ 1.

Calculate the expected value of the random variable g ∘ X.

If the function is a linear function (compare Chapter 0) then we can compute the expected value of g ∘ X from that of X. Assume we have a discrete random variable X with range {x_i ∈ R | i ∈ N}. Let a and b be real numbers. We can compose X with the function

R → R,   x ↦ ax + b.

What is the expected value of the resulting random variable? We can calculate

E(aX + b) = ∑_{i∈N} (a · x_i + b) · P(aX + b = a · x_i + b)
          = ∑_{i∈N} (a · x_i + b) · P(X = x_i)
          = ∑_{i∈N} (a · x_i · P(X = x_i) + b · P(X = x_i))
          = a ∑_{i∈N} x_i · P(X = x_i) + b ∑_{i∈N} P(X = x_i)
          = a · E(X) + b.

See Exercise 80 for an explanation of the last step.

Proposition 4.16
Let X be a random variable, and let a and b be real numbers. Then the random variable aX + b, which is formed by composing X with the function

R → R,   x ↦ ax + b,

has an expected value given by

E(aX + b) = a E(X) + b.

Proof. An argument for the discrete case is given above. The general argument proceeds as follows.

E(aX + b) = ∫_{−∞}^{∞} (ax + b) · f(x) dx = a ∫_{−∞}^{∞} x · f(x) dx + b ∫_{−∞}^{∞} f(x) dx = a E(X) + b.

If we have two random variables then we can say something about combining them.

Proposition 4.17
If X and Y are random variables then

E(X + Y) = E(X) + E(Y).

If X and Y are independent then we also have

E(X · Y) = E(X) · E(Y).

Example 4.92. This can be a very useful result when we want to calculate expected values. For example, if we want to calculate the expected number of heads when tossing a coin 20 times then carrying out a calculation where we look at all the possible permutations of results we might get is tough.
So instead of doing that we can think of the random variable X thus created as being the sum

∑_{1≤i≤20} X_i,

where X_i is the random variable we get from the number of heads on the ith toss of the coin. For each i we have E(X_i) = 1/2, so

E(X) = E(∑_{1≤i≤20} X_i) = ∑_{1≤i≤20} E(X_i) = ∑_{1≤i≤20} 1/2 = 20/2 = 10

and we have found an easy way to calculate this number. I assume many of you would have guessed this to be the expected value, but now we can be sure this answer is the correct one.

Exercise 94. For the situation where a 'best out of five' series is played carry out the following tasks.

(a) Calculate the expected value for the number of matches that occur in a 'best out of five' series, see Exercise 55.

(b) Calculate the number of matches one of the teams can expect to win, see Exercise 85.

(c) There is a connection between these two expected values. What is it, and can you explain why it has to be like that?

We are now able to paraphrase an important law that, for example, explains why Bayesian updating works. You will meet this idea again in COMP13212, Data Science, in one of the early lectures.

Fact 13: The Law of Large Numbers
Let X_i, for i ∈ N, be pairwise independent random variables with the same distribution. Further assume that the expected value of the X_i is m ∈ R, and that the random variables have a finite variance (see Definition 39). Then

(X_1 + X_2 + · · · + X_n)/n

converges towards m with probability 1, as n tends towards infinity.

Example 4.93. Assume we are tossing a coin, and we use the random variable X_i (say 0 for heads and 1 for tails) to express the ith coin toss. The expected value of each of the random variables X_i is 1/2. Then if we keep tossing the coin we find that with probability 1 the average

(X_1 + X_2 + · · · + X_n)/n

will move closer and closer to 1/2—in other words, the more often we toss the coin, the closer to 1 will be the ratio of heads to tails observed.

A rather simplified way of paraphrasing this law is to say that the more often we carry out a random process the closer the average of all our observations is to the expected value.
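The law can be illustrated by simulation; the following Python sketch (our own illustration, with a fixed seed so the run is reproducible) tosses a simulated fair coin and prints the running average, which drifts towards 1/2 as the number of tosses grows:

```python
import random

# Simulate the Law of Large Numbers for a fair coin: the average of
# n tosses (0 for heads, 1 for tails) should approach 1/2.
rng = random.Random(0)  # fixed seed for reproducibility

def running_average(n):
    return sum(rng.randint(0, 1) for _ in range(n)) / n

for n in (10, 1000, 100_000):
    print(n, running_average(n))
```

With only 10 tosses the average can easily be far from 1/2; with 100,000 tosses it is very close, which is the content of the law.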
4.4.7 Variance and standard deviation

The expected value of a random variable allows us to 'concentrate' its behaviour into just one number. But as Examples 4.79 and 4.83 illustrate, the expected value can be misleading regarding which values are likely to occur. One way of measuring how far a random variable deviates from its expected value is to do the following: Let X be a random variable.

• Calculate the expected value E(X) of X.
• Create a new random variable in two steps:
  – Subtract the expected value from X to form the random variable X − E(X).
  – To ensure that positive and negative differences from the expected value cannot cancel each other out (and to amplify differences), form the square of the previous random variable to give (X − E(X))².
• Calculate the expected value of the new random variable.

Example 4.94. We return to Example 4.69 of tossing a coin three times, where we count the number of heads seen to get a random variable X. We recall from Example 4.79 that the expected value of X is 1.5. If we form X − 1.5 we get a new random variable with range {−1.5, −.5, .5, 1.5} and pmf

x    −1.5  −.5   .5   1.5
p     1/8  3/8  3/8   1/8.

If we square the result we have the random variable (X − 1.5)² with range {.25, 2.25} and pmf

x       .25          2.25
p    6/8 = 3/4    2/8 = 1/4.

Its expected value is

.25 · 3/4 + 2.25 · 1/4 = (0.75 + 2.25)/4 = 3/4 = .75.

Hence the variance (see definition below) of the random variable X is .75.

Definition 39: variance
If X is a random variable with expected value E(X), its variance is given by

E((X − E(X))²).

As pointed out above, the variance amplifies larger deviations from the expected value by squaring the difference, and it returns the expected square of the difference. For some considerations it is preferable to undo this squaring at the end, leading to a slightly different way of measuring how far a random variable strays from its expected value.

Definition 40: standard deviation
If X is a random variable then its standard deviation is given by the square root of its variance.
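Definition 39 translates directly into a short computation; the following Python sketch (our own illustration, using exact fractions) recomputes the variance of Example 4.94 and the corresponding standard deviation:

```python
from fractions import Fraction as F
from math import sqrt

# Variance of X as E((X - E(X))^2), for the number of heads in three
# tosses of a fair coin (Example 4.94).
pmf = {0: F(1, 8), 1: F(3, 8), 2: F(3, 8), 3: F(1, 8)}

mean = sum(x * p for x, p in pmf.items())                    # 3/2
variance = sum((x - mean) ** 2 * p for x, p in pmf.items())
print(variance)        # 3/4
print(sqrt(variance))  # standard deviation, about 0.866
```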
The standard deviation gives an idea of what is 'normal' for a given distribution. If we only consider 'normal' those values which are equal to the expected value then this is too narrow for most purposes. If the average height in a given population is 167cm, then we don't consider somebody who measures 168cm far from the norm. Typically values which are within one standard deviation on either side of the average are considered 'normal'. If the standard deviation is large that means that there are a lot of data points away from the expected value, and we should not have too narrow an idea of what is 'normal'.

Example 4.95. In Example 4.94 the standard deviation is √.75 ≈ .87. This means that for the coin example, almost anything is normal. If we increase the number of coin tosses that changes.

The standard deviation can be thought of as giving us a measurement of the variability of the possible values of a random variable. For some purposes the variance (which is closely related) has nicer properties. These ideas will appear in the unit on data science in an early lecture. Related ideas are to use gathered data to calculate the sample variance, also known as the empirical variance.

Exercise 95.
(a) Show that for a random variable X with expected value E(X) the variance is E(X²) − (E(X))².
(b) Show that if X and Y are independent random variables then the variance of X + Y is the variance of X plus the variance of Y. Hint: You may want to use part (a).

4.5 Averages for algorithms

A very important application of expected values in computer science is that of the average complexity of an algorithm. You will meet this idea in COMP11212 and COMP26120 (and COMP36111). Mathematically it is quite tricky to make precise the average that is formed here. In subsequent course units you will not see formal derivations of the average complexity of an algorithm, and the examples we study below give you an idea why that would take up a great deal of time.
The examples we do look at serve as case studies that illustrate the procedure.

4.5.1 Linear search

Assume you have an array of integers (for example of student id numbers, pointing to the student file). Assume you are trying to find a particular id number in that array. A simple-minded algorithm for doing this will look at all the possible values in the array until the given number is found.

Code Example 4.1. Here's a code snippet that implements this search idea.

    for (int index=1; index < max_index; index++)
        if (array[index] == given_number)
            ...

This algorithm is known as linear search. How many times is the algorithm going to perform a look-up for the array on average? In other words, how often will array[index] be invoked? We begin by looking at an example.

Example 4.96. If the array has 8 entries then the chance that the entry we are looking for is any one of them is 1/8. If we are lucky, and we find the entry on the first attempt35 at array[1], then we have needed one look-up, whereas if we have to keep checking until we reach array[8] we need 8 look-ups. We have a random variable which takes its values in {1, 2, 3, 4, 5, 6, 7, 8}, and each of these values occurs with the same probability, namely 1/8. Hence the expected value for this random variable is

1 · 1/8 + 2 · 1/8 + · · · + 8 · 1/8 = ∑_{i=1}^{8} i · 1/8 = (1/8) ∑_{i=1}^{8} i = (1/8) · 8(8 + 1)/2 = (8 + 1)/2.

This means we have to expect 4.5 look-ups on average. Of course most real-world applications have considerably larger arrays. For this reason it pays to think about the general case.

Example 4.97. We now assume that we have an array with n entries, and that the chance of the searched-for entry being in any of the positions is the same, namely 1/n. We apply the same algorithm as before, namely looking at each entry until we find the one we are looking for.
In particular note that we are implicitly assuming that not finding the entry in the first position does not tell us anything about the probability of it being in the second (or any other) position.

If the looked-for entry is the first entry of the array then we need one look-up operation, if it is the second entry we need two look-ups, and so on until the end of the array. So we have a random variable that can take values in the set {1, 2, . . . , n}, and for which the probability that any one of them occurs is 1/n. Hence the expected value for this random variable is

1 · 1/n + 2 · 1/n + · · · + n · 1/n = ∑_{i=1}^{n} i · 1/n = (1/n) ∑_{i=1}^{n} i = (1/n) · n(n + 1)/2 = (n + 1)/2.

In other words we have to look through roughly half the array on average before finding the looked-for entry. You might have been able to work this out without any knowledge of random variables, but we can now put these ideas on a firm mathematical footing.

35 Typically arrays start at index 0, but for our example it makes life less complicated if we start at index 1.

People who study algorithms are also interested in the worst case, which in this example is that we have to perform n look-up operations until we finally find our number. So the average case of the algorithm is that the number of look-ups required is roughly half the size of the input, whereas the worst case is that it is the size of the input.

4.5.2 Binary search

In the above example we were using an algorithm that is not particularly clever. If the entries appear in the array sorted by their size then we can do much better. Assume we are trying to solve the same problem as in the previous example, but this time we have an array whose entries are sorted. In that case we can come up with a faster algorithm, effectively by making use of this extra information.

Here's the idea:36 The first index we try is the one halfway through the array, say the 4th entry. If the entry at that position is the one we were looking for then we are done.
If not, then if the entry at that position is below the one we are looking for then we know that the looked-for entry has to be to the right of the current position at a higher index, else to the left at a lower index. We now apply the same trick again: We find an entry roughly halfway through the appropriate half of the array. If the entry at the current position is below the one we are looking for. . . Of course we might be really lucky and have found our entry already!

What's the expected number of look-ups required for this algorithm? What we do on each step is to look up one entry, and split the remaining array in two parts whose sizes differ by at most 1. We look at a concrete example to better understand the situation.

Example 4.98. Again we assume that we have an array of size 8. Say our array looks as follows:

index: 1  2  3  4  5  6  7  8
entry: 1  3  4  7 15 16 17 23

If we look for the entry 17 we perform the following steps:

• We look at the entry at index 4, where we find the entry 7. This is smaller than the entry we are looking for. We know that if our number is in the array it has to be to the right of index 4.
• On the next step we look halfway through the indices 5, 6, 7 and 8. There are 4 entries, so (roughly) halfway along is at index 6. We find the entry 16, which is again smaller than the one we are looking for.
• We now have to look halfway along the indices 7 and 8. There are two entries, so halfway along is at index 7. We have found the number we were looking for, 17.

Here is a description of the algorithm when looking for an arbitrary number in this array: We assume that we cannot be sure that the entry is in the array at all (somebody might have given us an invalid id number). On the first step we look up the entry at index 4.
If this doesn't give us the entry we were looking for then this leaves us with the following cases:

• either our entry is smaller than the one at index 4, so if it is there it must be at indices 1, 2 or 3, in which case
  – we look up the entry at index 2, and if we are not successful then
    ∗ if our entry is below that at index 2 we look up index 1, or
    ∗ if our entry is above that at index 2 we look up index 3, or
• our entry is greater than the one at index 4, so if it is there at all it must be at indices 5, 6, 7 or 8, in which case we
  – look up the entry at index 6, and if that is not the correct one then
    ∗ if our entry is smaller than that at index 6 we look at index 5,
    ∗ if our entry is greater than that at index 6 we look at index 7,
      · and if it is not at index 7 we look at index 8.

This information is more usefully collected in a tree. Here the nodes are given labels where

• the first part is a list of indices we still have to look at, then there is a colon and
• the second part is the index we are currently looking at.

[1,2,3,4,5,6,7,8] : 4
    done                    with probability 1/8
    [1,2,3] : 2             with probability 3/8
        [1] : 1                 with probability 1/3
        done                    with probability 1/3
        [3] : 3                 with probability 1/3
    [5,6,7,8] : 6           with probability 1/2
        [5] : 5                 with probability 1/4
        done                    with probability 1/4
        [7,8] : 7               with probability 1/2
            done                    with probability 1/2
            [8] : 8                 with probability 1/2

We can see that in the worst case we have to look at indices 4, 6, 7 and 8, which makes four look-ups. We can also calculate the expected value for this situation:

• The probability that we need only one look-up is 1/8;
• we need two look-ups with probability 3/8 · 1/3 + 1/2 · 1/4 = 2/8;
• we need three look-ups with probability 3/8 · (1/3 + 1/3) + 1/2 · (1/4 + 1/2 · 1/2) = 4/8;
• we need four look-ups with probability 1/2 · 1/2 · 1/2 = 1/8.

Hence the expected value for the number of look-ups is

1 \cdot \frac{1}{8} + 2 \cdot \frac{2}{8} + 3 \cdot \frac{4}{8} + 4 \cdot \frac{1}{8} = \frac{21}{8} = 2.625.

Again we want to analyse the general case of this algorithm, which is known as binary search.

Example 4.99.
From the example above we can see that some cases are easier to analyse than others: If the elements of the array exactly fit into a tree then the calculation becomes much easier. If we look at the example of eight indices we can see that 7 indices would fit exactly into a tree with three levels of nodes. We can also see that we don't need to have separate nodes labelled 'done'; instead, we can just use the parent node to record that the search is over. In the case where there are seven entries in the array we could calculate the expected value using the following tree, where now we only list the index that is currently looked up for each node:

            4
        2       6
      1   3   5   7

The node on the top level requires one look-up, the two nodes on the second level require two look-ups, and the four nodes on the third level require three look-ups. Each of those nodes will be equally likely to hold our number. Hence we can see that the average number of look-ups is

1 \cdot 1 \cdot \frac{1}{7} + 2 \cdot 2 \cdot \frac{1}{7} + 3 \cdot 4 \cdot \frac{1}{7} = \frac{17}{7} \approx 2.43.

We can generalize the idea from the preceding example provided that the number of indices is of the form

2^0 + 2^1 + \cdots + 2^{k-1} = \sum_{i=0}^{k-1} 2^i = 2^k - 1.

We can think of the situation as being given as in the following tree, where on each level we give the number of look-ups required:

            1
        2       2
      3   3   3   3
     ...............

The expected number of look-ups can be described by a function f : N → N which behaves as follows:

f(2^{k+1} - 1) = 1 + f(2^k - 1),

because if we have an array with 2^{k+1} − 1 elements, which exactly fit into a tree with k + 1 layers, we need one look-up, and are then left with a tree with k layers, which requires f(2^k − 1) look-ups. This kind of description of a function is known as a recurrence relation, and we look at simple cases for solving these in Section 6.4.5, and give a few further examples in Chapter 8.

[36] I think you will have seen this if you have been at one of our Visit Days.
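The look-up counts from Examples 4.96, 4.98 and 4.99 can be checked with a short Python sketch. This is only a sanity check, not part of the analysis; it assumes 0-based indexing and the 'lower middle' probing convention used in Example 4.98, and the function names are mine.

```python
def linear_lookups(arr, target):
    # count how many entries linear search inspects before finding target
    count = 0
    for value in arr:
        count += 1
        if value == target:
            break
    return count

def binary_lookups(arr, target):
    # count how many entries binary search inspects in a sorted array
    lo, hi, count = 0, len(arr) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2      # the 'lower middle' index
        count += 1
        if arr[mid] == target:
            break
        if arr[mid] < target:
            lo = mid + 1          # continue in the right half
        else:
            hi = mid - 1          # continue in the left half
    return count

arr8 = [1, 3, 4, 7, 15, 16, 17, 23]   # the array from Example 4.98
avg_linear = sum(linear_lookups(arr8, x) for x in arr8) / 8   # (8+1)/2 = 4.5
avg_binary = sum(binary_lookups(arr8, x) for x in arr8) / 8   # 21/8 = 2.625

arr7 = [1, 3, 4, 7, 15, 16, 17]       # seven entries fit a full tree exactly
avg_full = sum(binary_lookups(arr7, x) for x in arr7) / 7     # 17/7, roughly 2.43
```

Averaging over all positions of the target reproduces the expected values computed above, because each position is assumed equally likely.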
Here we can analyse the situation fairly easily: We start counting the levels from the top, starting with level 0 and ending at level k − 1. Then level i has 2^i nodes, each needing i + 1 look-ups, and each holding the right value with probability 1/(2^k − 1). Hence the expected value of the number of look-ups is

\sum_{i=0}^{k-1} (i+1) \cdot \frac{2^i}{2^k - 1} = \frac{1}{2^k - 1} \sum_{i=0}^{k-1} (i+1) 2^i.

To check that we have derived the correct formula we can look at the case k = 3, that is seven entries in the array, and compare the result we get from the formula with the one calculated above. The formula gives approximately 2.43 look-ups, which agrees with the result previously calculated. We give a few (approximate) values of this sum:

k                          3     4     5     6     7     8
2^k − 1                    7     15    31    63    127   255
exp no of look-ups         2.43  3.27  4.16  5.1   6.06  7.03

As k grows large the sum given above approximates k. If we only have values for arrays of sizes of the form 2^k − 1, do we have to worry about the other cases? The answer is that one can show with a more complicated analysis that for an array with n entries, we require approximately log₂ n look-ups, even if n is not of the shape 2^k − 1. This means that the average number of look-ups for an array with n entries is approximately log₂ n.

We note that the worst case for this algorithm is that we have to look up one node on each level in the tree, which means that in the worst case the number of look-ups is the height of the tree, which is log₂ n. So here we are in a situation where the average case is the same as the worst case!

Occasionally it is easier to analyse particular problem sizes, and as long as the values for other sizes deviate in only a minor way from the function so deduced, this is sufficient for most purposes in computer science. You will learn in COMP11212 that we are typically only interested in the 'rate of growth' of a function describing the number of instructions required for a given problem size, and that all other aspects of the function in question are dropped from consideration.
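The closed formula can be evaluated mechanically; the sketch below (plain Python, no libraries, function name mine) reproduces the table above.

```python
def expected_lookups(k):
    # E = (1/(2^k - 1)) * sum_{i=0}^{k-1} (i+1) * 2^i,
    # the expected number of look-ups for an array of 2^k - 1 entries
    n = 2 ** k - 1
    return sum((i + 1) * 2 ** i for i in range(k)) / n

# two-decimal values for k = 3, ..., 8, as in the table above
table = {k: round(expected_lookups(k), 2) for k in range(3, 9)}
```

Evaluating the dictionary gives 2.43 for k = 3 (in agreement with Example 4.99) and values creeping ever closer to k as k grows, matching the claim that the sum approximates k.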
Often when looking at issues of complexity it is sufficient to have approximate counts, and more generally we only care about how quickly the number of instructions grows as the input grows large. We look a little into how one can measure the 'growth' of a function in Section 5.1.

EExercise 96. Assume you have an array whose entries are natural numbers, and you are given a natural number s that occurs in the array. You want to change the order of the entries in the array in such a way that it satisfies the following two conditions:

• All numbers which are less than s occur to the left of s and
• all numbers which are larger than s occur to the right of s.

This is a part of an important sorting algorithm called Quicksort. In what follows we make the assumption that the number s occurs in the array exactly once.[37] The way this algorithm is implemented is as follows:

• There are two pointers, low and high.
• At the start the low pointer points to the lowest index and the high pointer points to the highest index.
• You start a loop. This loop runs until the low pointer and the high pointer point at the same entry.
  – Look at the entry the low pointer points to.
    ∗ If the entry is less than s then increase the low pointer by one, check that it has not reached the index of the high pointer, and repeat.
    ∗ If the entry is greater than or equal to s then do the following.
      · Look at the entry the high pointer points to.
      · If the entry is greater than s then decrease the high pointer by one, check that it has not reached the low pointer, and repeat.
      · If the entry is less than or equal to s then swap the two entries.
  – Repeat, looking again at the low pointer.

(a) Carry out this algorithm for the following array and s = 17.

index: 1  2  3  4  5  6  7  8
entry: 19 2 17  5  1 27  0 31

How many times does the algorithm ask you to swap elements?

Now look at carrying out the algorithm for an arbitrary array of size n.

(b) In the best case, how many times does the algorithm have to swap elements? Justify your answer.
(c) Assume you have an array with five elements. In the worst case, how many times does the algorithm have to swap elements? Try to generalize your idea to an array with n elements. Justify your answer by describing how to construct an array where the worst case will occur.

(d) Assume that the element s occurs in the middle of the array and that the array has an odd number of entries. What is the average number of swaps the algorithm has to perform if you may assume[38] that you have an array with the property that, given an arbitrary element,

• the probability that it is less than s is 1/2 and
• the probability that it is greater than s is 1/2?

Write one sentence about how this changes if the probability that an arbitrary element of the array is less than s is p, and the probability that it is greater than s is 1 − p.

(e) On average, how many times does the algorithm have to swap elements if you may assume everything from the previous part, with the exception of the assumption that the element s is located in the middle of the array?

(f) Can you say how many times the algorithm has to swap elements on average if you are not allowed to make this assumption, but the element s still occurs in the middle of the array?

Note that this is a tricky exercise, and its main point is to show how difficult it is to properly calculate the average complexity of any algorithm. Hint: For any of the parts from (b) onwards, if you struggle to work out the general situation, try some small arrays to see whether you can see what happens.

You can see from the examples given, however, that a proper analysis can be quite tricky (the cases discussed above are relatively simple ones), and that one often has to make decisions about using approximations. When people claim that an algorithm has, say, an average case quadratic complexity then this has to be read as an approximate description of its behaviour as the input grows large. The preceding four examples give you an idea of what is meant by 'average number of steps'.
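The pointer scheme described in Exercise 96 can be transcribed into code. The sketch below is one possible reading of that description; the function name, the 0-based indexing and the example array are mine, and it is not a solution to any of the exercise parts.

```python
def partition(arr, s):
    # rearrange arr in place around the (unique) value s, counting swaps,
    # following the low/high pointer scheme of Exercise 96
    low, high = 0, len(arr) - 1
    swaps = 0
    while low < high:
        if arr[low] < s:
            low += 1              # entry already on the correct side
        elif arr[high] > s:
            high -= 1             # entry already on the correct side
        else:
            # arr[low] >= s and arr[high] <= s: swap the two entries
            arr[low], arr[high] = arr[high], arr[low]
            swaps += 1
    return swaps

a = [5, 9, 3, 7, 1]
n_swaps = partition(a, 7)
# afterwards every entry left of 7 is smaller, every entry right of it larger
```

Running the function on arrays of your own choosing is a convenient way of checking a hand-execution of the algorithm.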
Note that the typical assumption is that every possible configuration is equally likely (that is, in our example, that the sought-for number is equally likely to occur at any given index in the array), and that these assumptions are not always justified.

4.6 Some selected well-studied distributions

In many situations it is hard to determine the probability distribution of a given random variable from the given data. In those cases it is standard to make the assumption that it behaves according to some well-known distribution. Clearly if this assumption is not justified then any calculations based on it are not going to be of much practical use. When you are asked to cope in such a situation you should, at the very least, think about what you know about the given situation and which well-known distribution this suits best. We here give an overview of only a very small number of distributions. There is plenty of material available on this topic, and so there is no need to add to that.

4.6.1 Normal distributions

Normal distributions are used on many occasions. They are continuous probability distributions; although 'normal distribution' refers to a whole family, people often use this term in the singular. In its simplest form the probability density function of a normal distribution is given by

R → R, x ↦ \frac{1}{\sqrt{2\pi}} e^{-x^2/2}.

[37] Although it also works if the number s doesn't occur at all, in that you get a block of numbers less than s followed by a block of numbers greater than s.
[38] The assumption is equivalent to assuming that there are as many numbers below s in the array as there are numbers greater than or equal to s.

[Graph of this density over the interval [−2, 2]: the familiar bell curve, symmetric about 0.]

The expected value of a random variable with this probability density function is 0, and the standard deviation is 1. It is possible to create a normal distribution for a given expected value and a given standard deviation. Let μ and σ be real numbers, where σ > 0.
Then a random variable with probability density function

x ↦ \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2/(2\sigma^2)}

has expected value μ and standard deviation σ.

One of the reasons that this is such a useful distribution is that, under fairly general assumptions, it is the case that the average of a (large) number of random variables which are independent and identically distributed converges to a normally distributed random variable. For this reason random variables that are created from a number of independent processes obey a distribution which is close to a normal distribution. You will meet this idea once again in COMP13212, and we have just summarized the reason why the normal distribution often appears in applications. Normal distributions are known to occur in the natural world, for example as the velocities of molecules in an ideal gas. There are many resources available to study phenomena which follow these distributions.

4.6.2 Bernoulli and binomial distributions

We have used Bernoulli distributions already without naming them. Given a random variable X with two possible outcomes, say x and x′ in R, to give a probability distribution of the random variable it is sufficient to determine P(X = x), and all other probabilities are then uniquely determined (compare Corollary 4.13). In particular we know that

P(X = x′) = 1 − P(X = x),

since the probabilities of all possible outcomes have to add up to 1. Typically for a Bernoulli distribution we assume the only possible values of the random variable are 0 and 1.

Example 4.100. Tossing a coin is an experiment that follows a Bernoulli distribution, where one of heads or tails is assigned the value 1, and the other the value 0. You can think of this as the random variable that counts the number of heads (or tails) that appear in a single coin toss.

To make the notation less tedious, assume that P(X = 1) = p.
The expected value of this distribution is given by

0 \cdot (1-p) + 1 \cdot p = p,

and the variance is

E((X - p)^2) = E(X^2 - 2pX + p^2)
            = (0^2 - 2p \cdot 0 + p^2)(1-p) + (1^2 - 2p \cdot 1 + p^2)p
            = p^2(1-p) + (1 - 2p + p^2)p
            = p^2 - p^3 + p - 2p^2 + p^3
            = p - p^2
            = p(1-p).

The binomial distributions arise from assuming an experiment with a Bernoulli distribution is carried out repeatedly, in a way where the previous incarnations have no influence on the following ones, such as tossing a coin a number of times, and adding up the results (for example the number of heads that appear). You can find the description of the pmf, expected value, and standard deviation for these distributions from many sources, including online.

4.6.3 The Poisson distribution

The Poisson distribution is a discrete distribution that applies to processes of a particular kind, namely ones where

• we look at the probability of how many instances of a given event occur within a given time interval or a given space,
• we know the average rate for these events and
• the events occur independently from the time of the last event.

Typical examples are the following.

• The number of births per hour on a given day.
• The number of mutations in a set region of a chromosome.
• The number of particles emitted by a radioactive source within a given time span.
• The number of sightings of pods of dolphins along a given path followed by an observing plane.
• Failures of machines or components in a given time period.
• The number of calls to a helpline in a given time period.

It is assumed that the expected number of occurrences (on average) of the event in the given time frame is known, so assume this is given by λ ∈ R⁺. A random variable X obeying the Poisson distribution has the pmf

P(X = k) = \frac{e^{-\lambda} \lambda^k}{k!}.

Its expected value is λ, which is also the variance.

Example 4.101. Assume we have motherboards for which it is known that on average, .5% are faulty. If we pick a sample of 200 motherboards, what is the probability that three of them are faulty?
From the given data we would expect .005 × 200 = 1 faulty board on average in such a sample, but this does not tell us how to answer the question about the probability that we have three of them. If we assume that this event follows the Poisson distribution, with λ = 1, then we get

P(X = 3) = \frac{1^3 e^{-1}}{3!} \approx .06,

so the probability is 6%.

4.6.4 Additional exercises

We look at situations here for which I don't want you to make any assumptions about which part of the notes you should use to solve them.

Exercise 97. Consider the following marking scheme for multiple choice questions: Each question has precisely one correct answer among the four choices given, and students may pick as many of the available choices as they like. The marking scheme is as follows: For choosing the correct answer the student gets three marks, and for each chosen incorrect answer the student loses a mark. Show that if a student randomly chooses how many alternatives to include, and which ones those should be, the expected number of marks they get is 0.

Exercise 98. A lecturer believes that students have a better chance of doing well on his unit if they also take another unit at the same time. He looks at the numbers from the past academic year to see whether he can find statistical evidence for his belief. In the past year he had 200 students on his course, of which 40 got a very good mark. Of the students on his course 67 were enrolled on the other unit in question, and of these 27 received a very good mark. Do you think he is right in his belief?

Exercise 99. Assume you have a line with 11 points, from −5 to 5. There is an ant at point 0.

−5 −4 −3 −2 −1 0 1 2 3 4 5

Assume that with probability 1/2 the ant moves one point to the left, and with probability 1/2 it moves one point to the right. If it wants to make a step that causes it to leave the grid it stops. This exercise requires a lot of calculations and is a bit fiddly in places.
(a) What is the probability that the ant will have stopped after 10 steps?

(b) What is the expected position of the ant after 10 steps?

Exercise 100. This exercise is a generalization to 2 dimensions of the previous one, so you may want to solve that first. Assume you have an eleven-by-eleven grid, to which we may give coordinates from (−5, −5) to (5, 5). There is an ant on the grid, initially in position (0, 0). Assume that with probability 1/4 the ant selects one of the four directions (up, down, left or right) and takes one step in that direction. If it wants to move in a direction that would cause it to leave the grid it stops.

(a) What is the probability that the ant has stopped after ten steps?

(b) What is the expected position of the ant after ten steps?

Now assume that the ant is not allowed to change direction by more than 90 degrees on each step, and that each of the three possible directions is equally likely.

(c) What is the average distance that the ant will have from the starting point after five steps?

(d) What is the probability that the ant will have stopped after eight steps?

Exercise 101. Assume you are looking after a cluster containing 50 machines. One of your machines has been affected by an odd virus. Its behaviour is as follows:

• It randomly picks one of the other 49 machines in the cluster. It copies itself to that machine. It then becomes inert.
• If a machine that was infected previously becomes infected again it behaves as if it hadn't been infected before, that is, the virus is copied to one machine randomly picked from the other 49 machines in the cluster.

(a) What is the probability that after eight infection steps, the number of infected computers is 8? (In other words, no computer has been infected twice.)

(b) What is the expected number of infected computers after 5 infection steps? Hint: draw a tree where each node is labelled by the number of machines currently infected.
On the first step there is one such machine, on the second step there are two (you may want to think about why), and after the third step there can be two or three.

(c) Picture the tree that would fully describe the possible numbers of infected machines after 50 steps. How many paths in that tree lead to exactly three machines being infected? What is the probability for each of those paths?

CExercise 102. Calculate the expected values asked for in the following situations. Make sure you give a full calculation, not just a number, and be prepared to explain your calculation.

(a) You are staying at a guest house with seven rooms. You know from chatting to the owner that three rooms have couples staying, two rooms have singles, and one room is empty. At the breakfast buffet you get to know one of the other guests. What is the expected number of occupants of their room?

(b) You have lined up 10 pound coins. You flip each one of them, and then move to one side the ones that show heads. You flip the remaining ones again, and once more move to one side the ones that show heads. You flip the remaining ones again and once more move those showing heads to one side. How many coins do you expect to have put aside altogether?

(c) Assume you are offered the following game: You roll a die. You can decide to stop here and get the number of points shown on the die, or you can roll it again. After the second roll you again have the choice to obtain the number of points shown on that roll, or to roll one final time. Describe the strategy that maximises the expected number of points you win in this game, and give the number of points you may expect.

Exercise 103. Assume you have an animal that lives on the real interval from 0 to 1, and it is equally likely to be at any of these locations. Now assume we have a second animal of this kind. What is the expected distance between the two?
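Two of the calculations earlier in this section, the Bernoulli variance p(1 − p) and the Poisson probability from Example 4.101, can be checked numerically. The sketch below makes no claims beyond those calculations; the function names and the choice p = 0.3 are mine.

```python
import math

def bernoulli_variance(p):
    # E((X - p)^2) computed directly over the two outcomes 0 and 1
    mean = 0 * (1 - p) + 1 * p
    return (0 - mean) ** 2 * (1 - p) + (1 - mean) ** 2 * p

def poisson_pmf(lam, k):
    # P(X = k) = e^(-lam) * lam^k / k!
    return math.exp(-lam) * lam ** k / math.factorial(k)

var = bernoulli_variance(0.3)   # p(1 - p) = 0.21
prob = poisson_pmf(1, 3)        # e^(-1)/6, roughly .06 as in Example 4.101
```

Trying a few further values of p and λ is a quick way to convince yourself of the closed formulas.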
Chapter 5

Comparing sets and functions

In computer science we are interested in comparing functions to each other because when we decide which algorithm to choose we want to pick the one that shows the better behaviour for the given range of inputs. By 'better behaviour' we mean an algorithm that performs faster for the given inputs. As you will see in COMP11212, when we do this we only compare such functions regarding how fast they grow, and one of the aims of this section is to introduce that idea.

We also have to be able to compare sets with each other. In Chapter 4 there is frequently a distinction between three kinds of random processes:

• those with a finite number of outcomes,
• those with a countable[1] number of outcomes and
• those we consider continuous.

This chapter makes these ideas formal. There are other applications for these ideas, and we sketch one here:

• There are countably many Java programs.
• There are uncountably many functions from N to N.

This mismatch tells us that there are some functions from the natural numbers to the natural numbers which cannot be implemented by a Python or Java program.

5.1 Comparing functions

In Section 4.4.6 we discuss how to calculate the number of instructions that a program has to carry out on average. It is a first step towards analysing the efficiency of an algorithm. Sometimes we have a choice of programs (or algorithms) to solve a particular problem. For small problem sizes it won't matter too much which one we pick, but as the size of our problem grows (for example, sorting millions of entries in some array as opposed to a few tens) we need to think seriously about what is the best choice. It might be the case that some programs take so long (or require so many resources in the form of memory) that one cannot feasibly use them. To measure the efficiency of programs it is standard to count the number of instructions of some kind as a measure of how long the program takes, depending on the size of the problem.
The question then is how to compare such functions.

[1] These are the ones where the set of outcomes may be described in the form {s_i | i ∈ N}.

5.1.1 One function dominating another

Examples 4.97 and 4.99 in Chapter 4 give a measure of the efficiency of two algorithms, the first one being known as linear search while the second is called binary search. For these algorithms we counted the number of look-up operations performed to measure their complexity. For an array with n entries, the former has an average number of (n + 1)/2 look-ups to perform, while the latter requires approximately log₂ n look-ups. We picture the corresponding functions by drawing their graphs when viewing them as functions from R⁺ (or a subset thereof) to R⁺ instead of as functions from N to N. Consider the two functions

[1, ∞) → R⁺, x ↦ (x + 1)/2 and x ↦ log₂ x.

[Graph: the line (x + 1)/2 lies entirely above the curve log₂ x.]

We can see that for every input value binary search requires fewer look-ups than linear search. In this case it looks like an easy choice to make between the two. However, we have to bear in mind that binary search requires the given array to be sorted, and that does require additional computation time and power.

The picture suggests a definition for comparing functions.

Definition 41: dominate
Let S be a set of numbers, N, Z, Q or R, and let f and g be two functions from a set X to S. We say that f dominates g (or f is above g) if and only if for all x ∈ X it is the case that f(x) ≥ g(x).

When we draw the graphs of two functions where one dominates the other we can see that the graph of the first is entirely above the graph of the second (but the graphs are allowed to touch).

5.1.2 Eventual domination

Often the definition we have just given does not give us good guidance regarding which algorithm to choose.

Example 5.1. Consider the following three functions from [1, ∞) to R⁺: x ↦ 2^x, x ↦ x/2 + 1 and x ↦ log₂ x.

[Graph: the curve 2^x lies above the line x/2 + 1, which in turn lies above the curve log₂ x.]

The function 2^x dominates the function x/2 + 1, which in turn dominates the function log₂ x. But this notion is not sufficient for the intended application.
If we want to establish whether one program outperforms another then using this idea for, say, the functions giving the number of instructions as a function of the size of the input for each program, may not give a useful result. Consider the functions below, going from R⁺ to R⁺:

x ↦ x/2 + 1 and x ↦ x².

[Graph: near 0 the curve x² lies below the line x/2 + 1, but from some point onwards it lies above it.]

Neither function dominates the other. But clearly if the problem size is large (that is, we move to the right in the graph) then the function x/2 + 1 offers a much preferable solution. This idea is encapsulated by the following definition.[2]

[2] You will meet the following definition again in COMP11212, and COMP21620.

Definition 42: eventually dominate
Let S and S′ be sets of numbers from N, Z, Q or R, and let f and g be two functions from S to S′. We say that f eventually dominates g if and only if there exists N ∈ R such that for all x ∈ S with x ≥ N we have f(x) ≥ g(x).

We can think of this definition as saying that f dominates g if we restrict the source of f and g to {x ∈ S | x ≥ N}, or if we only look at the graphs of the two functions to the right of N. Note that there is no need to find the smallest N with this property; any such N will do!

Typically when we are interested in one function eventually dominating another in computer science, we are interested in functions from the natural numbers to some subset of the real numbers. When we try to draw the graph of such a function it is easier to draw it as a function from the real numbers to the real numbers. It is not a priori clear what happens when we change the source set of the function. Note that if f is a function from some set S to a subset of the real numbers then there is a very closely related function whose target is R, given by S → R, x ↦ f(x). The following result gives us information about extending the domain of definition of our function.

Proposition 5.1
Let S be a set of numbers from N, Z, Q or R, and let f and g be functions from S to R. Assume that f′ and g′ are functions from R to R such that

• f′ restricted to inputs from S is f and
• g′ restricted to inputs from S is g.
If f′ eventually dominates g′ then f eventually dominates g.

Proof. If f′ eventually dominates g′ then we can find N ∈ R such that for all x ∈ R with x ≥ N we have that f′(x) ≥ g′(x). To show that f eventually dominates g, assume we have x ∈ S with x ≥ N. We know the following:

f(x) = f′(x)      (assumption about f′)
     ≥ g′(x)
     = g(x)       (assumption about g′).

Hence we may argue with suitable functions from the real numbers to the real numbers.

Example 5.2. Consider the two functions

f : N → N, n ↦ n²    and    g : N → N, n ↦ 4n + 5.

Again we use graphs to picture the situation,[3] where we treat both expressions as functions from the non-negative reals to the reals. The preceding proposition tells us that considering the graphs tells us something about the functions given.

[Graph of n² and 4n + 5: the two lines cross at n = 5, where both take the value 25.]

Once the two lines have crossed (at n = 5) the graph of f stays above that of g. This suggests that we should try to find a proof that f eventually dominates g.

• First of all, we have to give a witness N for the 'exists' part of the statement. The graph helps us to choose N = 5, but note that every natural number larger than 5 would also work.
• Now that we have N we have to show that for all n ∈ N with n ≥ N, we have g(n) ≤ f(n). So let us assume that n ∈ N, and that n ≥ 5. Then

g(n) = 4n + 5
     ≤ 4n + n     (5 ≤ n, Fact 7)
     = 5n
     ≤ n · n      (5 ≤ n, Fact 7)
     = n²
     = f(n)

as required.

Example 5.3. Here's an alternative way of proving the same statement.

• Again, we have to give an N, but assume this time we have not drawn the graph. We have to guess an N such that for all n ≥ N we have 4n + 5 ≤ n². We can see that we require a number such that multiplying it with n is at least as large as multiplying n with 4 and adding 5. Say we're a bit unsure, and we are going to try to use N = 10 to be on the safe side.
• We have to show that for all n ∈ N, if n ≥ 10 then f(n) = n² ≥ 4n + 5 = g(n). So assume n ≥ 10. We work out that

g(n) = 4n + 5
     ≤ 4n + 10    (5 < 10, Fact 7)
     ≤ 4n + n     (10 ≤ n, Fact 7)
     = 5n
     ≤ 10n        (5 < 10, Fact 7)
     ≤ n²         (10 ≤ n, Fact 7)
     = f(n).

Note that the shape of the proof has not changed much at all.

Example 5.4. We give another variation on this proof.
• Assume we use N = 10 again, but this time we produce a proof where we start by looking at the larger function.
• Let n ≥ 10. Then

f(n) = n²
     = n · n
     ≥ 10n        (n ≥ 10, Fact 7)
     = 4n + 6n
     ≥ 4n + 60    (n ≥ 10, Fact 7)
     ≥ 4n + 5     (60 > 5, Fact 7)
     = g(n).

Example 5.5. Another variant of a proof of the same statement: Instead of using the assumption n ≥ N for whichever N we pick, we express this as writing n = N + k, where k is an element of N. For N = 5 the proof could then go like this:

g(5 + k) = 4(5 + k) + 5      (def g)
         = 25 + 4k           (calculations in N)
         ≤ 25 + 10k          (4 ≤ 10, Fact 7)
         ≤ 25 + 10k + k²     (0 ≤ k², Fact 7)
         = (5 + k)²          (calculations in N)
         = f(5 + k)          (def f).

[3] But note that drawing graphs by hand can be time-consuming, and that in order to help with answering the question whether one function is eventually dominated by another a quick imprecise sketch can be sufficient.

You can see from these examples that there are typically many ways of proving the desired statement. Different strategies are outlined in those examples, and you can pick whichever one you prefer in order to solve these kinds of questions.

CExercise 104. Determine whether one of the two functions given eventually dominates the other. Give a justification for your answer. You should not use advanced concepts such as limits or derivatives, just basic facts about numbers.

(a) log(x + 1) and x as functions from R to R.
(b) log(x + 1) and x² as functions from R⁺ to R⁺.
(c) x² and the constant function 1,000,000 as functions from Z to Z.
(d) sin and cos as functions from R to R.

5.2 Comparing sets

In the introduction to this chapter we have argued that it is important to be able to compare the sizes of different sets. It turns out that the notions of injective and surjective functions from Section 2.6 are useful for this purpose.

5.2.1 Comparative sizes of sets

In particular, if there is an injective function from a set A to a set B, then for every element of A there is an element of B, and all these elements are different. Hence we know that all elements of A 'fit into' B, and B must be at least as big as A.
Definition 43: comparison of set size

Let A and B be sets. We say that the size of A is less than or equal to that of B if and only if there is an injection from A to B.

Example 5.6. We have an injection

{0, 1, 2, 3, 4} → {0, 1, 2, 3, 4, 5}

given by the assignment x ↦ x, which is clearly an injection. So the size of the set {0, 1, 2, 3, 4} is less than or equal to the size of the set {0, 1, 2, 3, 4, 5}.

This may seem like a trivial observation. Our definition really only comes into its own once we consider infinite sets.

Lemma 5.2
If A and B are sets with finitely many elements then the size of A is less than or equal to the size of B if and only if the number of elements of A is less than or equal to the number of elements of B.

Proof. Note that in Exercise 39 it is shown that the number of elements in the image of a set under an injection is the same as the number of elements of the original set. We show both implications separately.

• Assume that the size of A is less than or equal to the size of B. Then there is an injection, say f, from A to B. By Exercise 39 we know that the number of elements of the image f[A] of A under f is the same as the number of elements of A. Since f[A] is a subset of B we know that B has at least as many elements as A.

• Assume that the number of elements of A is less than or equal to the number of elements of B. This means that if we name the elements of A, say a₁, a₂, ..., aₘ, and those of B, say b₁, b₂, ..., bₙ, then n ≥ m. If we now define the function from A to B which maps aᵢ to bᵢ for each i in {1, 2, ..., m}, then it is an injection.

Here is an example with infinite sets.

Example 5.7. The natural numbers N can be mapped via an injection into the integers Z by defining n ↦ n. This is clearly an injection. Hence the size of N is less than or equal to the size of Z.

If I had asked in the lecture whether the size of Z is at least that of N I am sure everybody would have told me that this is true. You may find the following example less intuitive.
It shows that once we have sets with infinitely many elements our intuitions about their sizes become suspect.

Example 5.8. What you might find more surprising is that Z also has a size smaller than or equal to that of N. We give an injection f : Z → N by setting⁴

f(n) = { 2n          if n ≥ 0
       { −(2n + 1)   else.

This is an injection for the following reason. Let m and n be in Z. We have to show that f(m) = f(n) implies m = n. Since the definition of f is by cases we have to distinguish several cases in this proof.

• m ≥ 0 and n ≥ 0. If 2m = f(m) = f(n) = 2n we may conclude m = n.

• m < 0 and n < 0. If −(2m + 1) = f(m) = f(n) = −(2n + 1) we may conclude that 2m + 1 = 2n + 1 and so m = n as required.

• m ≥ 0 and n < 0. If 2m = f(m) = f(n) = −(2n + 1) we get 2m = −2n − 1, where the left side is even and the right side is odd, which can never hold for m, n in Z.

• m < 0 and n ≥ 0. This case is identical to the previous one where m and n have been swapped.

Exercise 105. Show the following statements.

(a) The size of every set is less than or equal to itself.
(b) If the size of the set A is less than or equal to the size of the set B, and if the size of the set B is less than or equal to the size of the set C, then the size of A is less than or equal to the size of C.

This means that we have defined a reflexive and transitive binary relation, which means we can think of it as a kind of order. This idea is looked at in more detail in Chapter 7.5.

5.2.2 Two sets having the same size

What does it mean that N is at least as big as Z, and Z is at least as big as N? It means that they can be thought of as having the same size.

Definition 44: same set size

We say that two sets A and B have the same size if and only if
• the size of A is less than or equal to the size of B and
• the size of B is less than or equal to the size of A.

The previous two examples show that N and Z have the same size. You may wonder how strongly two sets are connected that have the same size. The answer is given by a theorem known as Cantor-Schröder-Bernstein, or sometimes without Cantor's name attached to it. It says that if two sets have the same size then there is a bijection between them.
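The injection from Z to N defined in Example 5.8 is easy to check mechanically for small inputs. Here is a short Python sketch (our illustration, not part of the notes) of the function together with a spot-check of injectivity on a finite range of integers.

```python
def f(n):
    """The injection Z -> N of Example 5.8: non-negative integers
    are sent to even numbers, negative ones to odd numbers."""
    return 2 * n if n >= 0 else -(2 * n + 1)

# Spot-check injectivity on a finite range: all images are distinct.
values = [f(n) for n in range(-1000, 1001)]
print(len(set(values)) == len(values))  # True
```

In fact f(0), f(−1), f(1), f(−2), ... run through 0, 1, 2, 3, ..., so this particular function hits every natural number; together with Example 5.7 this fits the claim that N and Z have the same size.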
⁴ Compare this to the function from the mid-term test in 2015/16.

Optional Exercise 18. Prove the Cantor-Schröder-Bernstein Theorem, or find a proof somewhere and try to understand it. There are quite a few online references available.

EExercise 106. Show that the following sets have the same size.

(a) N and N × N;
(b) N and Nᵏ where k is a finite number;
(c) N and Q;
(d) the set of functions from some set S to the two element set {0, 1} and the powerset 𝒫(S) of S.

For the sets N and Q, as well as the powerset construction, only use facts from Chapter 0.

Exercise 107. Show that if there is a bijection from A to B then A and B have the same size.

You may have wondered why we used injections to determine the size of a set, and whether we could not have done this using surjections. The following exercise answers that question.

Exercise 108. Show that given a function f : A → B the following are equivalent:

(i) f is a surjection and
(ii) the size of A is at least the size of B.

5.2.3 Infinite sets

We give a formal definition of infinity based on a notion known as Dedekind infinite.

Definition 45: infinite set

A set A is infinite if and only if there is an injection from A to a proper subset of A.

Proposition 5.3
A set A is infinite if and only if there is an injective function from A to itself which is not surjective.

Proof. We show the statement in two parts.

Assume that the set A is infinite. Then there is an injective function f : A → A′, where A′ is a proper subset of A. The same assignment defines a function from A to A, which is obviously also injective, but it is not surjective since we know there is an element of A which is not in A′, and so cannot be in the image of f.

Assume that we have a function f : A → A which is injective but not surjective. Then there is an element x of A which is not in the image of f, that is, there is no x′ ∈ A with f(x′) = x. We define a new function

A → A ∖ {x}, a ↦ f(a).

We note that this function is injective since f is, and we note that its target A ∖ {x} is a proper subset of A.

Example 5.9. We show that there are infinitely many Python programs.
We have to give an injective function from the set of Python programs to itself whose range does not include all Python programs. We do this as follows: Given a Python program we map it to the same Python program to which the line

print("Hello world!")

has been added. This function is injective: If we have two Python programs that are mapped to the same program then they must be the same program once that new last line has been removed. This function is not surjective since there are many programs which do not contain that line and so are not in the image of the function.

Hence this assignment from the set of all Python programs to itself is injective but not surjective, and so this set is infinite by Proposition 5.3.

Example 5.10. We show that the set of finite subsets of N is infinite. We give an injective function from that set to itself and show that it is not surjective.

Given a finite non-empty subset {k₁, k₂, ..., kₙ} of N we map it to the set

{k₁, k₂, ..., kₙ, (k₁ + k₂ + ··· + kₙ)},

and we map the empty set to itself. In other words a finite subset {k₁, k₂, ..., kₙ} of N is mapped to

{k₁, k₂, ..., kₙ, (k₁ + k₂ + ··· + kₙ)}   if n > 0
∅                                         else.

This assignment maps a given non-empty set to the set where the sum of all the elements of that set has been added as an extra element. We observe that the extra element is always the largest element of the resulting set. Note that if the set we start with has only one element then it is mapped to itself by this function since no extra element is added.

This function is injective. If two sets are mapped to the same set then in particular their greatest elements must be equal, so the original sets must have had elements which add up to the same number. Moreover, all elements (if any) of the set which are below the largest element must also correspond to each other, so the sets must have been equal and our function is injective.

This function is not surjective since the set {1, 2} is not in the image of this function.
By Proposition 5.3 we know that the given set is infinite.

We show that all infinite sets are at least as big as the set of natural numbers N.

Proposition 5.4
If A is an infinite set then there is an injection from N to A.

Proof. Let f : A → A be the function that shows that A is infinite, that is, we assume that f is injective but not surjective. Pick an element a of A which is not in the image of f. We define a function g : N → A as follows: We set

g(0) = a,
g(1) = f(g(0)) = f(a),
g(2) = f(g(1)) = f²(a),
g(3) = f(g(2)) = f³(a),
...

More generally, we set g(n) = fⁿ(a), where the power indicates applying f the given number of times.

We have to show that the resulting function is injective. If we have m and n in N with g(m) = g(n), then this means

fᵐ(a) = g(m) = g(n) = fⁿ(a);

since f is injective we can conclude from this that fᵐ⁻¹(a) = fⁿ⁻¹(a), and we can continue removing f on both sides until we have deleted all fs on one side of the equality. This means we have

a = fᵏ(a)

for some k ∈ N. But a is not in the image of f, so we must have that k = 0, and so removing m many fs on one side is the same as removing n many fs on the other side, which means we must have m = n.

This means that the size of N is less than or equal to that of every infinite set. Since N itself is infinite (see Exercise 110), in this sense N is the smallest infinite set.

To ensure that our notion of infinity fits well with our notion of comparing the sizes of sets we establish the following proposition.

Proposition 5.5
If the size of N is less than or equal to the size of a set A then A is infinite.

Proof. Let f : N → A be the injective function which establishes that the size of N is less than or equal to the size of A. We split A into two parts as follows. Let

A₁ = {y ∈ A | there is n ∈ N such that f(n) = y}

and

A₂ = {y ∈ A | for all n ∈ N, f(n) ≠ y}.

Then A = A₁ ∪ A₂, since every element of A is either in the range of f, and so in A₁, or it is not in the range of f and so in A₂. Note that A₁ and A₂ are disjoint, so every element of A is either in A₁ or in A₂, but no element can be in both sets.

Note that since f is injective, for every y in A₁ there is a unique n ∈ N with f(n) = y.
Based on this we define a function g from A to itself as follows.

g(y) = { f(n + 1)   if y ∈ A₁, where n is the unique element of N with f(n) = y,
       { y          if y ∈ A₂.

We claim that this function is injective, but not surjective.

To show injectivity we have to consider four cases, similar to Example 5.8. Let y and y′ be elements of A.

• y ∈ A₁ and y′ ∈ A₁. Then there exist unique elements n and n′ in N with f(n) = y and f(n′) = y′, and if f(n + 1) = g(y) = g(y′) = f(n′ + 1) then by injectivity of f we have n = n′, and so y = f(n) = f(n′) = y′.

• y ∈ A₂ and y′ ∈ A₂. If y = g(y) = g(y′) = y′ we immediately have y = y′.

• y ∈ A₁ and y′ ∈ A₂. We know that there exists a unique n ∈ N with f(n) = y. But now g(y) = f(n + 1) is an element of A₁, while g(y′) = y′ is an element of A₂, and so the two cannot be equal.

• y ∈ A₂ and y′ ∈ A₁. This case is identical to the previous one where y and y′ have been swapped.

We can see that the function g maps the set A₁ to itself, while it maps A₂ to itself as well, leaving every element of A₂ as it is.

The function g is not surjective since no element is mapped to f(0): Clearly f(0) is in A₁, so by the previous observation if it were in the image of g there would have to be an element y ∈ A₁ with g(y) = f(0). But for any such y we know that there exists a unique n ∈ N with f(n) = y, and so we would have g(y) = f(n + 1) = f(0), which by injectivity of f would imply n + 1 = 0, but no such number n exists in N.

Exercise 109. Show that if a set has a finite number of elements then it is not infinite.

CExercise 110. Show that the following sets are infinite by proving that they satisfy Definition 45.

(a) N,
(b) R,
(c) the set of functions from N to the two element set {0, 1} or the powerset 𝒫(N) (you choose),
(d) every superset of an infinite set,
(e) any set which is the target of an injective function whose source is infinite.

Optional Exercise 19. Show that if a set is not infinite then it has a finite number of elements.

Exercise 111. Show that if S is a set with a finite number of elements then so is its powerset 𝒫(S). Do so by determining the number of elements of 𝒫(S).

5.2.4 Countable and uncountable sets

In computer science we particularly care about sets whose size is at most as big as that of the natural numbers.
This is because given a finite number of symbols there are only countably many strings (and so programs) that can be expressed using those symbols.

Definition 46: countable/uncountable

A set is countable if and only if there is an injection from it to the natural numbers. A set is uncountable if and only if there is no injection from it to the natural numbers. A set is countably infinite if it is both countable and infinite.

Note that every finite set is countable. Examples of countably infinite sets are:

• The set of natural numbers N.
• The set of integers Z.
• The set of rational numbers Q.
• The set of finite subsets of N.
• The set of all programs in your favourite programming language.
• The set of all strings over a finite alphabet.

Examples of uncountable sets are:

• The set of real numbers R,
• the set of complex numbers C,
• the set of all subsets of N, that is the powerset 𝒫(N),
• the set of all functions from N to N.

Note that the last example, together with the following exercise, illustrates that there are functions from N to N for which we cannot write a computer program!

Optional Exercise 20. Assume we have a finite set of symbols, say Σ.

(a) Show that Σⁿ is finite for every n ∈ N.
(b) Show that ⋃_{n∈N} Σⁿ is countable.
(c) Show that there is a bijection between the set of finite strings built with symbols from Σ and the set ⋃_{n∈N} Σⁿ.
(d) Conclude that there are countably many strings over the alphabet Σ.
(e) Put together a set of symbols such that every Python program can be built from those symbols.
(f) Prove that there is an injection from the set of Python programs to the set of strings over this set of symbols.
(g) Conclude that the set of Python programs is countable.

Proposition 5.6
A set is countable if and only if its size is at most that of N.

Proof. If a set A is countable then by definition of that notion there exists an injection from A to N, and by the definition of the size of a set this means that the size of A is less than or equal to that of N.

Assume that A is at most as big as N.
Then there is an injection from A to N and so A is countable.

In Chapter 4 the notion of a countable set appears. Indeed, in general, the definition of σ-algebra should refer to countable sets instead of talking about sets indexed by the natural numbers. In that chapter the notion is avoided as far as possible since the formal definition does not appear until a later chapter. We use this opportunity to connect the two ideas.

Proposition 5.7
If A is countable then there is a way of listing all its elements, that is, there is a surjective function from N to A, allowing us to list all the elements of A as

a₀, a₁, a₂, a₃, ...

and we may think of them as a₀, a₁, a₂, ... where we delete any repeated elements from the list. If A is a set such that there is a surjective function from N to A then A is countable.

Proof. We prove the first statement. Since A is countable there is an injective function f : A → N. We use this function as follows: By Proposition 2.2 there is a function g : N → A with the property that g ∘ f = id_A. This function g is surjective, since given x ∈ A we know that

x = id_A(x) = g(f(x)),

so we have found an element of N, namely f(x), which is mapped by g to x. This completes the proof.

To prove the second statement assume we have a surjective function g : N → A. By Exercise 108 this means that the size of A is at most the size of N, and by Proposition 5.6 we have completed the proof.

Optional Exercise 21. Show that every uncountable set is at least as big as any countable set.

Exercise 112. Show that every subset of a countable set is countable. Conclude that every superset of an uncountable set is uncountable.

Optional Exercise 22. Show that the following sets do not have the same size.

(a) Any set and its powerset;
(b) N and R.

Conclude that R is not countable.

As a consequence of Exercises 20 and 22 we can see, for example, that there are more real numbers than there are Python programs.
This means that we cannot hope to write a Python program that outputs the digits of a given real number, one at a time, for every real number.

Optional Exercise 23. Show that any two countably infinite sets have the same size.

Exercise 113. Give the sizes of the following sets:

(a) {a, b, c},
(b) {∅, {∅}, {{∅}}},
(c) the set of regular expressions over the alphabet {0, 1},
(d) the set of finite state machines over the alphabet {0, 1},
(e) the set of regular languages over the alphabet {0, 1},
(f) the set of real numbers in the interval [0, 1],
(g) the set of subsets of the real interval [0, 1].

Glossary⁵

σ-algebra 136
The set of events of a probability space. Contains the whole set of outcomes and is closed under the complement operation and forming unions of countable collections of sets.

absolute, |·| 5, 50, 53
Defined for various sets of numbers, here extended to complex numbers. Given a complex number a + bi we have |a + bi| = √(a² + b²).

and 63
Connects two properties or statements, both of which are expected to hold.

anti-symmetric 394
A binary relation on a set is anti-symmetric if (x, x′) and (x′, x) both being in the relation implies x = x′.

argument 53
The argument of a complex number is the angle it encloses with the positive branch of the real axis.

associative 80
A binary operation is associative if and only if it gives the same result when applied to three inputs, no matter whether it is first applied to the first two, or first applied to the last two of these.

Bayes's Theorem 161
The equality which says that, given events A and B, the probability of A given B is the probability of B given A, multiplied by the probability of A and divided by that of B.

bijective 104
A function is bijective if and only if it is both injective and surjective. A bijective function is called a bijection.

binary operation 29, 79
A function of the type S × S → S, which takes two elements of a set S as input and produces another element of S.
⁵ Note that entries for concepts from Chapters 6–8 are provided for information and page numbers given here may not match the printed notes for those chapters.

binary relation 44
A connection between a source set S and a target set T which is not necessarily a function. It is specified by the collection of all pairs of the form (s, t) in S × T that belong to it.

binary tree with labels from a set L 294, 699
This is a tree where each node has a label from L and where each node has either 0, 1 or 2 children. Formally this is another recursively defined notion.

C 48
The complex numbers as a set with a number of operations.

coefficient 389
The coefficients of a polynomial are the numbers that appear as factors in front of a power of the variable.

commutative 83
A binary operation is commutative if and only if it gives the same result when its two inputs are swapped.

complement 19
The complement of a set S is always taken with respect to an underlying set, and it consists of those elements of the underlying set which do not belong to S.

composite of functions 32
The composite of two functions is defined provided the target of the first is the source of the second. It is the function resulting from taking an element of the source of the first function, applying the first function, and then applying the second to the result.

composite of partial functions 337
Similar to the composite of two functions, but the result is undefined if either of the two functions is not defined where required.

conditional probability 158
Given two events A and B, where B has non-zero probability, the conditional probability of A given B is the probability of A ∩ B divided by the probability of B.

conjugate, ·̄ 57
The conjugate of a complex number z = a + bi is a − bi.

continuous 185
A random variable is continuous if and only if it is not discrete.

countable 252
A set is countable if and only if there is an injective function from it to N.

countably infinite 252
A set is countably infinite if it is both countable and infinite.
cumulative distribution function (cdf) 198
The cdf of a random variable maps each element x of R to the probability that the random variable has a value less than or equal to x.

definition by cases 41
A way of piecing together functions to give a new function.

degree of a polynomial 389
The degree of a polynomial is the largest index whose coefficient is unequal to 0.

directed graph 340
A set (of nodes) connected by edges; can be described using a binary relation on the set.

discrete 185
A random variable is discrete if and only if its range is countable.

disjoint 19
Two sets are disjoint if they have no elements in common.

disjoint union 18
The union of a family of sets which do not overlap.

div 6
The (integer) quotient of two numbers when using integer division.

divides 4, 7
A number m divides a number n in some set of numbers if there exists a number k with the property that n = km.

divisible 4, 7
We say for natural numbers (or integers) that n is divisible by m if and only if n leaves remainder 0 when divided by m using integer division.

domain of definition 336
For a partial function it is the set consisting of all those elements of the source set for which the partial function is defined.

dominate 240
A function f from a set to N, Z, Q or R dominates another function g with the same source and target if and only if the graph of f lies entirely above the graph of g (graphs touching is allowed).

empty set, ∅ 16
A set which has no elements.

equivalence class with respect to R generated by x, [x] 373
The set of all elements which are related to x by the equivalence relation R.

equivalence relation 357
A binary relation on a set is an equivalence relation if it is reflexive, symmetric and transitive.

equivalence relation generated by a binary relation R 359
The transitive closure of the symmetric closure of the reflexive closure of R.

even 4, 7
An integer (or natural number) is even if and only if it is divisible by 2.
eventually dominate 242
A function f from N to N eventually dominates another function g with the same source and target if and only if there is some number beyond which the graph of f lies above that of g (graphs touching is allowed). The analogous definition works for functions with source and target Z, Q or R.

expected value 211
The expected value of a random variable can be thought of as the average value it takes. It is given by the integral of the product of each number with the probability that it is the value of the random variable. If the random variable is discrete then this is given by a sum.

for all 69
Expresses a statement or property that holds for all the entities specified.

full binary tree with labels from a set L 285
This is a tree where each node has a label from L and where each node has either 0 or 2 children. Formally this is another recursively defined notion.

function 30
A function has a source and a target, and contains instructions to turn an element of the source set into an element of the target set. Where partial functions are discussed sometimes known as total function.

graph of a function 35
The graph of a function f with source S and target T consists of all those pairs in S × T which are of the form (s, f(s)).

greatest element, ⊤ 408
An element which is greater than or equal to every element of the given poset.

greatest lower bound, infimum 413
An element of a poset (P, ≤) is a greatest lower bound of a given subset of that poset if it is both a lower bound and greater than or equal to every lower bound of the given set.

group 90
A set with an associative binary operation which has a unit and in which every element has an inverse.

identity function 32
The identity function on a set is a function from that set to itself which returns its input as the output.

identity relation 330
The identity relation on a set relates every element of the set to itself, and to nothing else.
if and only if 67
Connects two properties or statements, and it is expected that one holds precisely when the other holds.

image of a set, [·] 33
The image of a set consists of the images of all its elements, and one writes f[S] for the image of the set S under the function f.

image of an element 33
The image of an element under a function is the output of that function for the given element as the input.

imaginary part 48
Every complex number a + bi has an imaginary part b.

implies 66
Connects two properties or statements, and if the first of these holds then the second is expected to also hold.

independent 154, 207
Two events are independent if and only if the probability of their intersection is the product of their probabilities. Two random variables are independent if and only if for every two events it is the case that the probability that the two variables take values in the product of those events is the product of the probabilities that each random variable takes its value in the corresponding event.

infinite 248
A set is infinite if and only if there is an injection from it to a proper subset of itself.

injective 92
A function is injective if and only if the same output can only arise from having the same input. An injective function is called an injection.

integer 5
A whole number that may be positive or negative.

integer division 3, 6
Integer division is an operation on integers; given two integers m and n where n ≠ 0, we get an integer quotient m div n and a remainder m mod n.
inverse function 105 A function is the inverse of another if and only if the compose (either way round) to give an identity function. law of total probability 164 A rule that allows us to express the probability of an event from probabilities that split the event up into disjoint parts. least element, ⊥ 408 An element which is less than or equal to every element of the given poset. least upper bound, supremum 413 An element of a poset (,≤) is a least upper bound of a given subset of that poset if it is both, a upper bound and less than or equal to every upper bound of the given set. list over a set 266 A list over a set is a recursively dened concept consisting of an ordered tuple of elements of the given set. lower bound 411 An element of a poset (,≤) is a lower bound for a given subset of if it is less than or equal to every element of that set. maximal element 406 An element which does not have any elements above it. measurable 181 A function from the sample set of a probability space to the real numbers is measurable if and only if for every interval it is the case that the set of all outcomes mapped to that interval is an event. minimal element 406 An element which does not have any elements below it. mod 6 The remainder when using integer division. monoid 88 A set with an associative binary operation which has a unit. 266 multiplication law 161 The equality which says that given events and , the probability of the intersection of and is that of given multiplied with that of . N 1, 306 The natural numbers as a set with a number of operations. This set and its operations are formally dened in Section 6.4. natural number 1 One of the ‘counting numbers’, 0, 1, 2, 3,. . . . odd 4, 7 An integer (or natural number) number is odd if it is not even or, equivalently, if it leaves a remainder of 1 when divided by 2. opposite relation of , op 330 The relation consisting of those pairs (, ) for which (, ) is in . 
or 65
Connects two properties or statements, at least one of which is expected to hold.

ordered binary tree with labels from a set L 295
Such a tree is ordered if the set L is ordered, and if for every node, all the nodes in the left subtree have a label below that of the current node, while all the nodes in the right subtree have a label above.

pairwise disjoint 137
A collection of sets has this property if any two of them have an empty intersection.

partial function 335
An assignment where every element of the source set is assigned at most one element of the target set; one may think of this as a function which is undefined for some of its inputs.

partial order 395
A binary relation on a set is a partial order provided it is reflexive, anti-symmetric and transitive.

polar coordinates 53
A description for complex numbers based on the absolute value and an angle known as the argument.

polynomial equation 13
An equation of the form ∑ⁿᵢ₌₀ aᵢxⁱ = 0.

polynomial function 37
A function from numbers to numbers whose instruction is of the form: x is mapped to ∑ⁿᵢ₌₀ aᵢxⁱ (where the aᵢ are from the appropriate set of numbers).

poset 395
A set with a partial order, also known as a partially ordered set.

powerset, 𝒫 30
The powerset 𝒫(S) of a set S is the set of all subsets of S.

prime 76
A natural number or an integer is prime if its dividing a product implies its dividing one of the factors.

probability density function 146
A function from some real interval to R+ with the property that its integral over the interval is 1 and whose integral over subintervals always exists.

probability distribution 137
A function from the set of events that has the property that the probability of a countable family of pairwise disjoint sets is the sum of the probabilities of its elements.

probability mass function (pmf) 197
The pmf of a discrete random variable maps each element of the range of that random variable to the probability that it occurs.
probability space 137
A sample set together with a set of events and a probability distribution.

product of two sets 27
A way of forming a new set by taking all the ordered pairs whose first element is from the first set, and whose second element is from the second set.

proper subset 17
A set S is a proper subset of the set T if and only if S is a subset of T and there is at least one element of T which is not in S.

Q 9, 389
The set of all rational numbers together with a variety of operations, formally defined in Definition 0.1.3.

R+ 12
The set of all real numbers greater than or equal to 0.

R 11
The set of all real numbers.

random variable 181
A random variable is a measurable function from the set of outcomes of some probability space to the real numbers.

range of a function 33
The range of a function is the set of all elements which appear as the output for at least one of the inputs, that is, it is the collection of the images of all the possible inputs.

rational number 9
A number is rational if it can be written as the fraction of two integers. A formal definition is given on page 393 (and the preceding pages).

real number 11
We do not give a formal definition of the real numbers in this text.

real part 48
Every complex number a + bi has a real part a.

reflexive 342
A binary relation on a set is reflexive if it relates each element of the set to itself.

reflexive closure 344
The reflexive closure of a binary relation on a set is formed by adding all pairs of the form (x, x) to the relation.

relational composite 330
A generalization of composition for (partial) functions.

remainder for integer division, mod 3
The integer m mod n is defined to be the remainder left when dividing m by n in the integers.

set difference, ∖ 19
The set difference S ∖ T consists of all those elements of S which are not in T.

size of a set 245, 247
A set is smaller than another if there exists an injective function from the first to the second. Two sets have the same size if each is smaller than or equal in size to the other.

standard deviation 224
The standard deviation of a random variable is given by the square root of its variance.

string over a set 297
A formal word constructed by putting together symbols from the given set.
standard deviation 224 The standard deviation of a random variable is given by the square root of its variance. string over a set 297 A formal word constructed by putting together symbols from . 269 surjective 98 A function is surjective if and only if every element of the target appears as the output for at least one element of the input. This means that the image of the function is the whole target set. A surjective function is called a surjection. symmetric closure 345 The symmetric closure of a binary relation on a set is formed by taking the union of the relation with its opposite. there exists 71 Expresses the fact that a statement or property holds for at least one of the entities specied. total order 399 A total order is a partial order in which every two elements are comparable. transitive 348 A binary relation on a set is transitive provided that (, ′) and (′, ′′) being in the relation implies that (, ′′) is in the relation. transitive closure 349 The transitive closure of a binary relation on a set is formed by adding all pairs of elements (1, ) for which there is a list of elements 1, 2 to in such that (, +1) is in the relation. union, ∪ 17 The union of two sets and is written as ∪ . It consists of all elements of the underlying set that belong to at least one of and . The symbol ⋃︀ is used for the union of a collection of sets. uncountable 252 A set is uncountable if it is not countable. unique existence 72 A more complicated statement requiring the existence of an entity, and the fact that this entities is unique with the properties specied. unit 85 An element of a set is a unit for a binary operation on that set if and only if applying the operation to that, plus any of the other elements, returns that other element. upper bound 411 An element of a poset (,≤) is an upper bound for a given subset of if it is greater than or equal to every element of that set. 
variance 224 The variance of a random variable with expected value E is given by the expected value of the random variable constructed by squaring the result of subtracting E from the original random variable.
Z 5, 385, 386 The integers with various operations; see Definition 0.1.2 for a formal account.

COMP11120, Semester 1 Exercise Sheet 0 (for feedback only)
For examples classes in Week 1

Core Exercises for this week
CExercise 8 on page 30.
CExercise 9 on page 36.
CExercise 10 on page 46.

Extensional Exercises for this week
EExercise 7 on page 25.
EExercise 11 on page 47.

Remember that
• if you are stuck on an exercise, move on to the next one after ten minutes, but write down why you got stuck so that the GTA can see what you were trying to do, and consider coming back to that exercise again later;
• you may only use concepts which are defined in these notes (Chapter 0 establishes concepts for numbers), and for every concept you do use you should find the definition in the notes and work with that;
• you should justify each step in your proofs;
• in the examples classes you will find out more about constructing better solutions;
• solutions are published each week and you should study those to improve.

You should make sure this week that you understand the content and in particular the notation used in Chapter 0.

COMP11120, Semester 1 Exercise Sheet 1
For examples classes in Week 2

Core Exercises for this week
CExercise 13 on page 55.
CExercise 17 on page 58.
CExercise 22 on page 60. Carry out your proof in the style of that given on page 54 as far as you can.

Extensional Exercises for this week
EExercise 19 on page 59. Carry out your proof in the style of that given on page 54 as far as you can.
EExercise 20 on page 60.
Remember that
• if you are stuck on an exercise, move on to the next one after ten minutes, but write down why you got stuck so that the GTA can see what you were trying to do, and consider coming back to that exercise again later;
• you may only use concepts which are defined in these notes (Chapter 0 establishes concepts for numbers), and for every concept you do use you should find the definition in the notes and work with that;
• you should justify each step in your proofs;
• in the examples classes you will find out more about constructing better solutions;
• solutions are published each week and you should study those to improve.

Exercises you could potentially do this week are all those in Chapter 1.

COMP11120, Semester 1 Exercise Sheet 2
For examples classes in Week 3

Core Exercises for this week
CExercise 25 on page 81. Do three of the parts, one from (a)–(c) and two from (d)–(g).
CExercise 27 on page 85. Do three of the parts, one from (a)–(d), one from (e)–(f) and one from (g)–(i).
CExercise 28 on page 88. Do two of the parts, one from (a)–(d) and one from (e)–(g).

Extensional Exercises for this week
EExercise 29 on page 91. Do two of the parts, one from (a)–(b) and one from (c)–(f).
EExercise 34 on page 93.

Remember that
• if you are stuck on an exercise, move on to the next one after ten minutes, but write down why you got stuck so that the GTA can see what you were trying to do, and consider coming back to that exercise again later;
• you may only use concepts which are defined in these notes (Chapter 0 establishes concepts for numbers), and for every concept you do use you should find the definition in the notes and work with that;
• you should justify each step in your proofs;
• in the examples classes you will find out more about constructing better solutions;
• solutions are published each week and you should study those to improve.

Exercises you could do this week are those in Sections 2.1 to 2.5.
COMP11120, Semester 1 Exercise Sheet 3
For examples classes in Week 4

Core Exercises for this week
CExercise 37 on page 99. Do three of the parts, one from (a)–(c), one from (d)–(f), and one from (g)–(i). Hint: if you find this hard then try to do the previous exercise first, where you know what the answer is in each case.
CExercise 41 on page 106. Do three of the parts, one from (a)–(d), one from (e)–(f), and one from (g)–(h). Hint: if you find this hard then try to do the previous exercise first, where you know what the answer is in each case.
CExercise 43 on page 107. Do two of the parts, one from (a)–(c) and one from (d)–(f).

Extensional Exercises for this week
EExercise 38 on page 100. Do any three parts.
EExercise 47 on page 116.

Remember that
• if you are stuck on an exercise, move on to the next one after ten minutes, but write down why you got stuck so that the GTA can see what you were trying to do, and consider coming back to that exercise again later;
• you may only use concepts which are defined in these notes (Chapter 0 establishes concepts for numbers), and for every concept you do use you should find the definition in the notes and work with that;
• you should justify each step in your proofs;
• in the examples classes you will find out more about constructing better solutions;
• solutions are published each week and you should study those to improve.

Exercises you could do this week are those in Section 2.6.

COMP11120, Semester 1 Exercise Sheet 7
For examples classes in Week 8

Core Exercises for this week
Where the answers are probabilities don’t just give a number; give an expression that explains how you got to that number!
CExercise 51 on page 126. Do three of the parts, one from (a)–(d), one from (e)–(f) and one from (g)–(i).
CExercise 53 on page 136.
CExercise 57 on page 137.

Extensional Exercises for this week
EExercise 55 on page 137.
EExercise 60 on page 141. This is ahead of the lecture material but only requires calculating with sets.
It covers important ideas for material to come.

Remember that
• if you are stuck on an exercise, move on to the next one after ten minutes, but write down why you got stuck so that the GTA can see what you were trying to do, and consider coming back to that exercise again later;
• you may only use concepts which are defined in these notes (Chapter 0 establishes concepts for numbers), and for every concept you do use you should find the definition in the notes and work with that;
• you should justify each step in your proofs;
• in the examples classes you will find out more about constructing better solutions;
• solutions are published each week and you should study those to improve.

Exercises you could do this week are those in Section 4.1.

COMP11120, Semester 1 Exercise Sheet 8
For examples classes in Week 9

Core Exercises for this week
Where the answers are probabilities don’t just give a number; give an expression that explains how you got to that number!
CExercise 62 on page 149. Do one from Exercise 51 (a)–(l) and two from Exercises 57 to 59.
CExercise 69 on page 163.
CExercise 73 on page 170.

Extensional Exercises for this week
EExercise 63 on page 153. Do any two parts.
EExercise 67 on page 163.

Remember that
• if you are stuck on an exercise, move on to the next one after ten minutes, but write down why you got stuck so that the GTA can see what you were trying to do, and consider coming back to that exercise again later;
• you may only use concepts which are defined in these notes (Chapter 0 establishes concepts for numbers), and for every concept you do use you should find the definition in the notes and work with that;
• you should justify each step in your proofs;
• in the examples classes you will find out more about constructing better solutions;
• solutions are published each week and you should study those to improve.

Exercises you could do this week are those in Sections 4.2 to 4.3.3.
COMP11120, Semester 1 Exercise Sheet 9
For examples classes in Week 10

Core Exercises for this week
Where the answers are probabilities don’t just give a number; give an expression that explains how you got to that number!
CExercise 77 on page 184.
CExercise 83 on page 201.
CExercise 84 on page 207.

Extensional Exercises for this week
EExercise 85 on page 207.
EExercise 88 on page 213.

Remember that
• if you are stuck on an exercise, move on to the next one after ten minutes, but write down why you got stuck so that the GTA can see what you were trying to do, and consider coming back to that exercise again later;
• you may only use concepts which are defined in these notes (Chapter 0 establishes concepts for numbers), and for every concept you do use you should find the definition in the notes and work with that;
• you should justify each step in your proofs;
• in the examples classes you will find out more about constructing better solutions;
• solutions are published each week and you should study those to improve.

Exercises you could do this week are those in Section 2.6.

COMP11120, Semester 1 Exercise Sheet 10
For examples classes in Week 11

Core Exercises for this week
Where the answers are probabilities don’t just give a number; give an expression that explains how you got to that number!
CExercise 90 on page 219.
CExercise 102 on page 243.
CExercise 104 on page 250. Do one from (a)–(b) and one from (c)–(d).

Extensional Exercises for this week
EExercise 92 on page 225.
EExercise 96 on page 236. Carry out parts (a)–(d).
Remember that
• if you are stuck on an exercise, move on to the next one after ten minutes, but write down why you got stuck so that the GTA can see what you were trying to do, and consider coming back to that exercise again later;
• you may only use concepts which are defined in these notes (Chapter 0 establishes concepts for numbers), and for every concept you do use you should find the definition in the notes and work with that;
• you should justify each step in your proofs;
• in the examples classes you will find out more about constructing better solutions;
• solutions are published each week and you should study those to improve.

Exercises you could do this week are those in Section 2.6.