
Algorithms and Data Structures

Steven J. Rosenberg

Contents

Chapter 0. Algorithms: Beginnings
  1. Algorithm: a definition
  2. Time- and Space-Complexity
  3. The Counting Principle

Chapter 1. Stacks and Queues
  1. Introduction
  2. The Stack ADT
  3. The Queue ADT

Chapter 2. Big-O and Asymptotic Complexity
  1. Introduction
  2. Big-O Notation
  3. Asymptotic Complexity
  4. Asymptotic Complexity and Limits

Chapter 3. State Analysis
  1. Introduction
  2. State Analysis
  3. Example: Sudoku Solver

Chapter 4. Graphs
  1. Introduction
  2. The Graph Data Type
  3. Graph Traversal
  4. Applications of Graphs
  5. Weighted Graphs

Chapter 5. Hashing
  1. Introduction
  2. Hashing Basics
  3. Collision Resolution
  4. Proof of Theorem Q
  5. Implementation of HashTable in C#
  6. Time-Complexity of Hashing
  7. Deletion from a Hash Table
  8. Alternatives to Probing
  9. Applications of Hashing

Chapter 6. Information Theory and Data Compression
  1. Introduction
  2. Shannon's Theory of Information
  3. Huffman Coding
  4. Optimality of the Huffman Code
  5. Connections Between Codes and Entropy
  6. Data Compression in Practice

Chapter 7. Game Strategy
  1. Introduction
  2. Decision Trees
  3. The Minimax Algorithm
  4. The Truncated Minimax Algorithm
  5. Implementing Chess with the Minimax Algorithm
  6. Pruning

CHAPTER 0

Algorithms: Beginnings

In this chapter, we answer two basic questions central to this course: (1) What is an algorithm? and (2) How can we measure the efficiency of an algorithm? You are advised to read slowly; pause often to answer questions in the text, and ask questions of your own; and have paper or an editor ready to work out everything for yourself. There is nothing extra in this text: everything written here is important, so don't skip pieces of the text unless you already know the material.

1. Algorithm: a definition

In the previous courses CSCI 201 and CSCI 202, you have seen several examples of algorithms. Before reading more, ask yourself: How would you define the word "algorithm" in general? Try to be as careful as you can.


Definition 0.1 (Algorithm). An algorithm is a finite list of instructions written in a formal language to perform a specific kind of task, which will always terminate (end) after a finite number of instructions have been executed. An algorithm must always complete its task correctly!

Remark 0.2. For us, the formal language we use to write an algorithm will usually be a programming language such as C++ or C#. But there are other choices of a formal language: we could use any language with a precise syntax where each instruction has an unambiguous interpretation. In a more advanced course, we would use a mathematical language instead of a programming language. On the other hand, we can also be less formal and write instructions in pseudo-code; this is a popular choice for algorithm descriptions.

Remark 0.3. If you are not familiar with the idea of pseudo-code, it means instructions that are written in a language that is in between a natural human language (like English) and a formal programming language. Pseudo-code can be seen as either an informal, unofficial programming language, or as a slightly more formal way to write instructions than typical English. By its very nature, there is no official standard for pseudo-code, or else it would be actual code!

Remark 0.4. The least precise part of our definition of algorithm is the word task. Let us try to make the idea of a task more precise before we go farther. Try to think of some examples of tasks in computer science before reading more.


Example 0.5 (First Task Example). Add two given integers and return their sum.

Example 0.6 (Second Task Example). Count the total number of a's in a given string, and return this number.

Example 0.7 (Third Task Example). Sort a given collection of strings.

After looking over these examples, ask yourself: What kind of a thing is a "task" in general?


In general, a "task" seems to be something with input and output, which has a single correct output for each possible input. From the point of view of math, then, a task is what we call a function.

Remark 0.8. In computer programming, something with input and output is naturally represented by a method. In some programming languages, the word function is actually used instead of the word method.

Definition 0.9 (Formal Definition of Task). A task is a function t : I → O, where I is the set of all possible input values, and O is the set of all possible output values.

Remark 0.10 (Function Notation). In case you are not familiar with the notation in the definition of task above, we explain it here. We read t : I → O as "t goes from I to O" or "t is a function from I to O." Here, t is our task (which is a function). I is the domain of t, and O is the range (or codomain) of t. Recall that if x is a value of the domain, then we write x ∈ I, and if y is the output value corresponding to input x, then we write y = t(x), which we read "y equals t of x."

Example 0.11. Formally, the task of adding 2 integers is the function t : Z² → Z given by the formula t(a, b) = a + b. Recall from C++ that the plus sign + is actually a binary operator, which just means a function that takes 2 inputs of the same type and returns an output of that type. So in fact, we did not need a new symbol t; we could have written + instead!

Remark 0.12 (Cartesian Power Notation). In case you are not familiar with the notation Z², we explain it here. In general, if S is any set, then we define S² to be the set of all ordered pairs of elements of S. That is, S² contains all pairs (a, b) where both a and b are elements of S. You may be familiar with this notation for the x, y-plane, where we write R² = {(x, y) : x, y ∈ R}. Thus for example we have (1, √2) ∈ R². We think of (1, √2) as a point whose x-coordinate is 1 and whose y-coordinate is √2. The technical description of S² is the second Cartesian power of S. It's really just a practical way of capturing two elements of the same set S in one value (an ordered pair).

Example 0.13. Consider the task t of counting the total number of a's in a given string. Then we have t : S → N, where S is the set of all strings and N is the set of all natural numbers, that is, non-negative integers. If x ∈ S, then we have t(x) = the total number of a's in x.

Next, consider the task of deciding whether a given natural number is prime. Before reading farther, try to write this task formally as a function.


Example 0.14. Let t be the task of deciding whether a given natural number is prime. Then the domain of t is the set N of all natural numbers. The output value of t will be either true or false, that is, one of the two possible boolean values. So we have t : N → {true, false}, and

    t(x) = true,  if x is prime;
    t(x) = false, otherwise.

Remark 0.15. The task of deciding whether a given natural number is prime is known as PRIMES.

2. Time- and Space-Complexity

Now that we have some understanding of what we mean by an algorithm, we would like to talk about how to measure the efficiency of an algorithm. Efficiency can be expressed by answering two natural questions:
(1) How long does the algorithm take to run? We call this the time-complexity of the algorithm; and
(2) How much memory does the algorithm need while it runs? We call this the space-complexity of the algorithm.

We will focus almost entirely on the first measure of efficiency, the time-complexity. But when we try to say how long an algorithm takes to run, we find that there are (at least) two problems. Think about why we cannot simply say "this sorting algorithm takes 5 milliseconds to run," for example.


One problem with simply asking how long it takes a given algorithm to run is that the runtime can vary depending on the input; we expect it will take longer to sort a large list than a small list, for example. Luckily, we recognize this behavior: once again, something whose output changes depending on input is naturally a function. So our first insight is that time-complexity should be a function, not just a single number.

The second problem with asking how long a given algorithm takes to run is that the answer depends on which machine it runs on. We can solve this problem by asking not how much time it takes to run, but instead how many instructions are executed during the run; if some instructions are executed multiple times, we keep adding to the total. This is especially important in loops and recursion.

So we have reached the following less naive definition of time-complexity: the time-complexity of an algorithm is the total number of instructions executed, as a function of the input. But this will not be our final version of the definition of time-complexity. The reason is that the input of our algorithms can come in any form: the input can have any data type. If one algorithm sorts names, and another algorithm sorts numbers, then we would have no way to really compare these two algorithms right now – even if we use the "same" algorithm, let's say mergesort, in both cases! The problem here is that the name-sorting algorithm takes lists of strings as input, while the number-sorting algorithm takes lists of (say) integers as input. We would like both time-complexities to have the same domain, so that we can compare them!

The accepted solution to finding a common domain is as follows. Instead of looking at time-complexity as a function of the actual input to our algorithm, we view time-complexity as a function of the size of the input to our algorithm. By size, we mean the size in memory, i.e., how many bits (or bytes) the input needs to be stored.

On our long journey to define "time-complexity," there is just one more twist before we reach our goal. The problem is that even if two input values have the same size, our algorithm may execute a different total number of instructions. For example, the two numbers a = 2^127 − 5 and b = 2^127 − 4 both take 127 bits to store in memory, but it will take much less time to determine whether b is prime than to determine whether a is prime using a loop to test for factors! This is because b is even, while a is prime. We will be interested in the slowest, or "worst-case," time-complexity, as well as the fastest (best-case), and the average. So our final conclusion will be to define not just one but three kinds of time-complexity.

Definition 0.16 (Time-Complexity). Let A be an algorithm. Then the worst-case (best-case, average-case) time-complexity of A is the function f : N → N, where f(n) is the maximum (minimum, average) number of instructions executed by the algorithm over all inputs of size n bytes.

Remark 0.17. We are usually most interested in the worst-case time-complexity, so that we can give a guarantee that a program will run at least as fast as required. Best-case time-complexity is also of some interest, especially if we can somehow manipulate the input into a best-case form without changing the output of the


algorithm. Average-case time-complexity is very important when we run the algorithm repeatedly on large batches of input; but we will not say much in this course about average-case time-complexity, since it requires some knowledge of probability theory. The exception is Knuth's classic study of the average-case time-complexity of hashing in Unit 3.

Now let's do some examples of computing the time-complexity of an algorithm.

Example 0.18. Let t be the task of counting the total number of a's in a given string. Let S be the set of all strings (in ASCII format, say). It's not hard for us to create an algorithm A for this task using the C++ language, by writing a method as follows.

    unsigned int num_as(string x)
    {
        unsigned int count = 0;
        for (unsigned int i = 0; i < x.length(); i++)
        {
            if (x.at(i) == 'a')
                count++;
        }
        return count;
    }

First we ask: what is the input size in bytes? If we assume that C++ strings truly use ASCII characters, then each character of x uses 1 byte, so the input size is n = x.length().

Next, we must find out how many instructions are executed by this algorithm as a function of n. To do this, it is convenient to number each statement of the code; often, each line of code is a single statement, but not always! For example, the for loop combines 3 statements into one line of code; and other lines contain a single curly brace, { or }, which are not statements at all. Before reading any more, you should try to fill in the last column in the table below. Remember that some of your answers may involve n.

    statement #   statement                  # times executed
    1             unsigned int count = 0
    2             unsigned int i = 0
    3             i < x.length()
    4             i++
    5             x.at(i) == 'a'
    6             count++
    7             return count


Below is the completed table:

    statement #   statement                  # times executed
    1             unsigned int count = 0     1
    2             unsigned int i = 0         1
    3             i < x.length()             n + 1
    4             i++                        n
    5             x.at(i) == 'a'             n
    6             count++                    0 to n
    7             return count               1

There are some important things to note in this table:
(1) Not surprisingly, statements inside of a loop can be repeated as many times as the loop iterates. In this loop, the number of iterations is n, the length of the string x.
(2) Statement #2 is actually outside of the for loop, and only executed once. Remember that the loop circles back to the second statement in the loop header, which is our statement #3.
(3) Statement #3 is executed n + 1 times, not just n times. This accounts for the final pass through the loop, when the loop condition is false.
(4) Statement #6 is executed as many times as there are a's in x. This is an example of a situation where two different input values (our x's) with the same size can produce different run times. In this example, we say that the worst-case scenario occurs when the input x is entirely made of a's, and the best-case scenario occurs when x contains no a's.
(5) Notice that we count statement #3 and statement #5 as instructions, even though these pieces of code are only conditions and "don't do anything". In fact, they do something: they evaluate boolean expressions. Moreover, many computer scientists only count conditional expressions when computing time-complexity, since they determine the flow of the program.

Finally, we add up the values in the last column to get the total number of instructions executed, including multiplicity. We find that the maximum is 4n + 4 and the minimum is 3n + 4. We conclude that the worst-case time-complexity (WCTC) of algorithm A is f(n) = 4n + 4 and the best-case time-complexity of A is g(n) = 3n + 4.

It is sometimes also helpful to graph a time-complexity. This is possible since we defined a time-complexity to be a function from N to N (imagine how difficult it would be to make a graph of f if the domain of f contained strings instead of numbers!). In this example, we get linear functions:

[Figure: the graphs of y = f(n) = 4n + 4 and y = g(n) = 3n + 4, two increasing lines meeting the y-axis at height 4, with the worst-case line above the best-case line.]

Notice that the worst-case time-complexity function is always at or above the best-case, as it should be.

Remark 0.19. A common mistake when a beginning student is asked for the best-case scenario in Example 0.18 is to answer "The best case occurs when n = 0". This answer shows thinking that is partly correct but with a fatal mistake. It is correct that the best case means the case that has the smallest number of instructions executed. Since y = 4 is the lowest point on the graph, does this mean that the best case occurs when n = 0? No! We must remember that time-complexity is a FUNCTION, not a number. For every possible input size, there is a best-case scenario. We cannot limit the input size when describing the best case! It is correct to say that the best-case scenario occurs when the input string contains no a's, but not correct to say that the best-case scenario occurs when the input string contains no characters.

Remark 0.20. Notice that the code in Example 0.18 contains method calls: namely, a call to the length method of the string class in statement #3, and a call to the at method in statement #5. We know as programmers that it is possible for a method to be long and complicated, but in our analysis of this example, we assumed that each of these methods only needed 1 instruction to execute. In this example, the method calls are (probably) very short, so our assumption is justifiable. But in general, we should be careful to ask about the time-complexity of every method call within an algorithm when analyzing time-complexity. This can be especially tricky in a language such as C++, where method calls can be "hidden" (implicit), as in calls to operator overloads.

Remark 0.21. A very careful reader may still complain about our work in Example 0.18. When we computed time-complexity and drew the graphs, we allowed the input size n to be any natural number; in particular, n could be as large as we like. But in our code, we used the built-in data type unsigned int to store numbers up to n, even though we know that this data type has a fixed amount of storage space (usually 4 bytes). So to be very precise, our analysis is only good for input sizes n < 2^32. If we want to make our code completely correct, then we should use an unlimited-precision type (such as the BigInteger type in Java) to store our numbers.

3. The Counting Principle

In this section, we introduce a simple technique that allows us to count the number of ways to make certain choices, or the number of elements in certain sets. Although


the principle is simple, it can be tricky to apply the principle correctly in a complex situation.

Lemma 0.22 (The Counting Principle, also known as the Multiplication Rule). Suppose that we have n choices to make, and for the ith choice we have t_i different options. Then the total number of possible outcomes is equal to the product t_1 · t_2 · ... · t_n.

Example 0.23. When n = 2, let a = t_1 and b = t_2. Then the Counting Principle says that the number of ways to choose one option out of a possibilities, followed by choosing one option out of b possibilities, is just equal to the product a · b. This really is the same principle that we can use to define the concept of multiplication of numbers in the first place: if you have 5 sets of 4 things each, then you have 5 · 4 = 20 things total!

Example 0.24. Let v_1, v_2, . . . , v_n be n distinct things. How many different orders are there to list these things? Consider this problem as a series of n choices, where we can choose any of the remaining things in each choice. For our first choice, there are all n things as options; for our second choice, there are n − 1 things left as options; and so on, until our last choice, when there is only 1 thing left to choose. So the total number of lists we can make with n things is equal to n · (n − 1) · (n − 2) · ... · 3 · 2 · 1 = n!.

Example 0.25. Let us count the number of non-negative whole numbers that can be written with at most d digits, using base b. Recall that in base b, we are allowed to use digits representing the numbers 0, 1, 2, . . . , (b − 1). In any case, there are b possible values for each digit. Before reading on, you should try to answer our question: how many base-b numbers can be written if you can use up to d digits?


If you answered b · d, you are not alone, but sadly, you are incorrect. Let's think about this question as a series of choices. Each digit involves a choice among b different options. There are d choices to make. So the total number of possible outcomes is equal to the product b · b · b · ... · b, where there are d factors. This is just b^d. This answer makes sense if you are familiar with the usual base 10. There are, for example, 1000 different non-negative whole numbers that we can write in base 10 using up to 3 digits: they are 0, 1, 2, 3, 4, . . . , 999.
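As a quick sanity check of these two counts (a sketch added here, not part of the original text), the following C++ snippet computes the n! orderings of Example 0.24 and the b^d numbers of Example 0.25 directly as products of per-choice option counts:

    #include <iostream>
    using namespace std;

    // Number of ways to list n distinct things: n * (n-1) * ... * 1 = n!
    unsigned long long orderings(unsigned int n)
    {
        unsigned long long total = 1;
        for (unsigned int options = n; options >= 1; options--)
            total *= options;                 // one factor per choice
        return total;
    }

    // Number of non-negative whole numbers with at most d digits in base b: b^d
    unsigned long long atMostDDigits(unsigned int b, unsigned int d)
    {
        unsigned long long total = 1;
        for (unsigned int i = 0; i < d; i++)
            total *= b;                       // b options for each of the d digits
        return total;
    }

    int main()
    {
        cout << orderings(4) << '\n';         // prints 24
        cout << atMostDDigits(10, 3) << '\n'; // prints 1000
        return 0;
    }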

Example 0.26. Let’s analyze the time-complexity of the BubbleSort algorithm. Recall that the input of a sorting algorithm is a collection, and the output is the same collection, but sorted. In our code, we use the vector class from the C++ Standard Template Library to store our collection. We assume that type T supports the > operator for comparing two objects.

    template <typename T>
    void BubbleSort(vector<T>& toSort)
    {
        int len = toSort.size();
        for (int i = 0; i < len; i++)
        {
            for (int j = 0; j < len - 1; j++)
            {
                if (toSort[j] > toSort[j + 1])
                {
                    T temp = toSort[j];
                    toSort[j] = toSort[j + 1];
                    toSort[j + 1] = temp;
                }
            }
        }
    }

Here, the input is the collection called toSort. What is the input size? How much memory does it take to store this collection? In C++, the answer is toSort.size() * sizeof(T). For convenience, let us assume that type T objects take 1 byte of memory, so that the input size is n = len = toSort.size(). Before going farther, the reader should try to complete the table below:


    statement #   statement                   # times executed
    1             int len = toSort.size()
    2             int i = 0
    3             i < len
    4             i++
    5             int j = 0
    6             j < len - 1
    7             j++
    8             toSort[j] > toSort[j + 1]
    9             T temp = toSort[j]
    10            toSort[j] = toSort[j + 1]
    11            toSort[j + 1] = temp


Below is the completed table:

    statement #   statement                   # times executed
    1             int len = toSort.size()     1
    2             int i = 0                   1
    3             i < len                     n + 1
    4             i++                         n
    5             int j = 0                   n
    6             j < len - 1                 n · n
    7             j++                         n · (n − 1)
    8             toSort[j] > toSort[j + 1]   n · (n − 1)
    9             T temp = toSort[j]          0 to n · (n − 1)
    10            toSort[j] = toSort[j + 1]   0 to n · (n − 1)
    11            toSort[j + 1] = temp        0 to n · (n − 1)

Note that we used a form of the Counting Principle to count the total number of executions of statements in the inner loop: everything gets multiplied by n because of the outer loop. From the table, we find that the worst-case time-complexity is f(n) = 6n² − 2n + 3 and the best-case time-complexity is g(n) = 3n² + n + 3. Both of these functions are quadratic, so their graphs are parabolas.

Before reading farther, you should try to describe the best-case and worst-case scenarios for the BubbleSort algorithm. Your answers should only involve properties of the input collection toSort, and NOT involve any of the details of how the algorithm works!


The best-case scenario occurs when the input collection is already sorted before it is passed to the BubbleSort algorithm. The worst-case scenario occurs when the input collection comes in the exact reverse of the desired sort order: that is, when the collection starts out sorted in descending order.

Note that we do not want to say "the best-case scenario occurs when statements #9, 10, and 11 are never executed" as our final answer. Even though this is true, it does not describe the situation in terms of how the input looks.
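As a quick numerical check of these formulas (a worked example added here, not in the original text): for a collection of size n = 5, the worst-case formula gives f(5) = 6 · 25 − 2 · 5 + 3 = 143 executed statements, while the best-case formula gives g(5) = 3 · 25 + 5 + 3 = 83.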

CHAPTER 1

Stacks and Queues

1. Introduction

Recall that the most important feature of object-oriented programming (OOP) is the ability to create user-defined data types. In the C family of languages, including C++, Java, and C#, we implement a data type using a class. We distinguish between an abstract data type (ADT) and a class. An ADT exists at the conceptual level, and does not have any actual code tied to it. A class is an implementation of an ADT, including all necessary code; the same ADT may have many different classes that implement it, and may even be implemented in many different programming languages.

We divide an ADT into two parts: a set of attributes, which describe the data stored by the ADT; and a set of operations, which describe what the ADT can do (the actions or "verbs"). When we implement an ADT using a class, the attributes become instance variables, and the operations become methods.

2. The Stack ADT

A stack is a collection of things piled on top of each other: imagine a stack of books, for example. Unlike an array, a stack is not indexed: we cannot ask for, say, the 45th element of a stack. A stack may only be modified or viewed from the top: we don't want to try to pull out a book from the middle of our pile! We use the following table to describe the stack ADT:

    Attributes:
    ◦ a collection of items, stored from top to bottom

    Operations:
    ◦ push an item onto the top
    ◦ pop an item off the top
    ◦ peek at the top item without changing it
    ◦ determine whether the stack is empty
    ◦ empty the stack
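As a concrete illustration (an addition to the text, using the C++ standard library rather than a course-provided class), std::stack supports essentially these operations; only "empty the stack" has no single built-in call and would be done with repeated pops:

    #include <iostream>
    #include <stack>
    #include <string>
    using namespace std;

    int main()
    {
        stack<string> books;             // a stack of book titles

        books.push("Dune");              // push items onto the top
        books.push("Emma");
        books.push("Ivanhoe");

        cout << books.top() << '\n';     // peek at the top item: Ivanhoe
        books.pop();                     // pop the top item off
        cout << books.top() << '\n';     // now Emma is on top

        cout << boolalpha << books.empty() << '\n'; // is the stack empty? false
        return 0;
    }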

Remark 1.1. To push just means to add one more item to a stack; to pop means to remove the top item from the stack; and to peek means to look at. These three words are peculiar to the stack data type. We like to picture a stack as a vertical pile of cells or "boxes", which grows taller with each push, and shorter with each pop. Each cell contains one item. Consistent with this view, we say that an item is "on" a stack instead of in a stack; and we talk about the "height" of a stack


instead of the size or length. So the height of a stack means the number of items on the stack.

Remark 1.2. Sometimes beginning programmers want to add more operations to a stack, such as the ability to peek at any item in the stack, for example the 45th item below the top. These programmers are usually thinking either that they are improving the stack type by expanding what it can do, or else that they need to add this ability in order to solve a particular programming problem that they have. Both reasons are bad. The stack ADT is defined by the table above, which tells us both the abilities and the limitations of a stack. If we add more abilities, then we no longer have a stack! And if we need more abilities to do a certain job, then we should either work within the limitations of the stack, or else use a different data type. The same principle holds for any ADT.

Next we give some examples of how stacks are used.

Example 1.3 (The Call Stack). In many modern programming languages, including C++ and C#, a stack is used to keep track of method calls. This stack is known as the call stack. The call stack starts out empty. Every time a method is called, the current execution point (the point where the code is currently running) is pushed to the call stack, along with other information. Every time a method returns, the call stack is popped, and the runtime environment gives control back to the appropriate execution point in the code. If you are not familiar with the call stack and its use in Visual Studio development, you should watch the video "Call Stack in Visual Studio debugging".

Example 1.4 (Delimiter Matching). A delimiter is a symbol (or string of symbols) used to mark the boundaries of (or "delimit") a region of code or text. You are already familiar with many delimiters from Java and C++, which will be used similarly in C#. Many delimiters come in pairs, with an opening delimiter and a corresponding closing delimiter; examples of paired delimiters are:

    curly braces: { and }
    parentheses: ( and )
    square brackets: [ and ]
    multi-line comment markers: /* and */

Some delimiters, such as ", use the same symbols for both opening and closing. Finally, there are delimiters like //, which marks the beginning of a single-line comment but has no corresponding closer.

When a compiler (or interpreter) is run against source code, one of the first tasks it does is to scan the code and identify the various logical units in it, such as classes, methods, variables, keywords, etc.; this task is known as parsing. An important part of parsing is identifying regions enclosed by corresponding opening and closing delimiters. In order for code to be parsable – to even make sense, to have correct syntax – it is necessary for every opening delimiter to be matched by the correct closing delimiter. For example, consider the code fragment below:

    int foo(int x)


    {
        for(int i = 0; i < 4; i++)
        {
            if(i*i == x)
                return i;
        return -1;
    }

Here, the opening curly brace of the for loop is unmatched, so this code won't compile. (But note that we could equally well say that it is the opening curly brace of the foo method that is unmatched! It is only the spacing which suggests otherwise.)

We can use a stack to help test whether all delimiters match in a given string of code or text. See the code in the file testDelimiters.cpp for an implementation of a stack-based delimiter matching algorithm in C++ in the method delimitersMatch. Let's run through an example of this algorithm manually now. Consider the string

    s = "{x = ((3 + 5) * 9 - 5 ) / 2; }"

Before reading farther, you should try to step through the delimitersMatch method by hand and write down what happens at each step: how does the stack look, when do variables change their values, etc.?


We now analyze the code in the delimitersMatch method. The input string is scanned one character at a time. Any character that is not a delimiter is simply ignored. When an opening delimiter is encountered, it is pushed to the stack. When a closing delimiter is encountered, the top character of the stack is popped (if possible) and the two characters are compared to see whether they correspond. If a mismatch occurs during the loop, or if the loop ends with some openers still in the stack, the method returns false; otherwise, it returns true.

Here is how the stack looks as the method runs against the input string s = "{x = ((3 + 5) * 9 - 5 ) / 2; }" (top of the stack on the right):

    (empty)    before the scan begins
    {          after reading '{'
    { (        after reading the first '('
    { ( (      after reading the second '('
    { (        after reading the first ')'
    {          after reading the second ')'
    (empty)    after reading '}'

Note that the stack begins and ends the run empty. If you carefully step through the code one statement at a time, you will see that every time a closing delimiter is encountered, the stack contains the corresponding opening delimiter at the top. When the loop ends, the stack is empty. Therefore, this string passes the test for its delimiters to match.
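The file testDelimiters.cpp itself is not reproduced in this text. The following is only a minimal C++ sketch of what a delimitersMatch method consistent with the description above might look like; the helper names isOpener, isCloser, and matches are invented for this sketch:

    #include <stack>
    #include <string>
    using namespace std;

    static bool isOpener(char c) { return c == '{' || c == '(' || c == '['; }
    static bool isCloser(char c) { return c == '}' || c == ')' || c == ']'; }
    static bool matches(char open, char close)
    {
        return (open == '{' && close == '}')
            || (open == '(' && close == ')')
            || (open == '[' && close == ']');
    }

    bool delimitersMatch(const string& s)
    {
        stack<char> st;
        for (char c : s)
        {
            if (isOpener(c))
                st.push(c);                 // remember the opener
            else if (isCloser(c))
            {
                if (st.empty() || !matches(st.top(), c))
                    return false;           // closer with no matching opener
                st.pop();                   // matched pair; discard the opener
            }
            // every other character is ignored
        }
        return st.empty();                  // leftover openers are unmatched
    }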


3. The Queue ADT

In England, a "queue" (pronounced "cue", the same as how you would pronounce the letter Q) is a waiting line: imagine a line at a fast-food restaurant or at a cafeteria food station. Functionally, the big difference between a queue and a stack is that with a stack, the item most recently added is the item that can be removed next (a stack is a last-in, first-out data structure, or LIFO); but with a queue, the item that was added longest ago is the one that can be removed next (a queue is first-in, first-out, or FIFO). This makes sense because the person who has waited longest in a queue deserves to be served next.

The other differences between the Queue and Stack data types are merely conventions of naming and visual representation. By convention, we represent a queue horizontally (whereas a stack is drawn vertically). We also say enqueue instead of push, and we say dequeue instead of pop. We use the following table to describe the queue ADT:

    Attributes:
    ◦ a collection of items, stored from front to back

    Operations:
    ◦ enqueue an item to the back of the queue
    ◦ dequeue an item from the front of the queue
    ◦ peek at the front item without changing it
    ◦ determine whether the queue is empty
    ◦ empty the queue
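Again as an illustration added to the text, the C++ standard library's std::queue maps directly onto these operations (push enqueues at the back, pop dequeues from the front):

    #include <iostream>
    #include <queue>
    #include <string>
    using namespace std;

    int main()
    {
        queue<string> line;              // a waiting line of customers

        line.push("Alice");              // enqueue to the back
        line.push("Bob");
        line.push("Carol");

        while (!line.empty())
        {
            cout << line.front() << " is served next\n"; // peek at the front
            line.pop();                  // dequeue from the front
        }
        return 0;
    }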

CHAPTER 2

Big-O and Asymptotic Complexity

1. Introduction

We have seen that "time-complexity" does not actually measure time, but rather the total number of instructions executed as an algorithm runs. But unless we are programming directly in the native machine language of our computer, it is usually not possible to say exactly how many instructions are actually going to be executed by our code: this is because our source code will be compiled down to a lower-level language before it is run.

Furthermore, if we are really interested in the running time of our algorithm, then there are more complications. For example, different instructions, even in machine language, will each take a different amount of time (number of clock cycles) to execute; and of course, different machines run at different speeds. Maybe one machine runs twice as fast as another; maybe the slowest machine-language instruction takes 18 clock cycles while the fastest takes only 1 clock cycle; and maybe each statement in our source code compiles into between 1 and 20 machine-language instructions. In each case, estimating the running time by the total number of instructions executed will be wrong by at most a constant factor: a factor of 2, or 18, or 20, in our examples above.

For this reason, computer scientists employ a way to talk about time-complexities that "smooths out" any difference between two functions that only differ by at most a constant factor. The language we use to do this includes "Big-O" notation, "Big-Theta" notation, and "asymptotic time-complexity". These ideas are related to each other, and allow us to see the big picture when talking about the time-complexity of an algorithm.

2. Big-O Notation

The first concept we study in this chapter will allow us to compare two functions and say whether one function grows no faster than the other. We most often apply this to time-complexity functions, but the definitions we give are more general.

Definition 2.1. Let f and g be functions whose domain and range consist of non-negative numbers. We write f = O(g) if there exist positive constants c and N such that f(x) ≤ c · g(x) for all x ≥ N. We read this as "f is big oh of g", and sometimes say that f grows no faster than g.


Remark 2.2. c is a constant "scaling factor" that does not depend on x. N is a threshold value, beyond which f is at most g scaled by c. To say f = O(g) means that eventually, f(x) is no bigger than a constant multiple of g(x); here "eventually" means for all x ≥ N, and the constant in question is c.

Remark 2.3. Some people do not like the notation "f = O(g)" because the equals sign is misleading: f is a function, but "O(g)" does not represent a single function, so they are not truly equal. These people prefer to write "f is O(g)" to avoid the illusion of equality. To be totally accurate, we could define O(g) to be the set of all functions which grow no faster than g, and then write f ∈ O(g) instead of f = O(g).

Example 2.4. We claim that 4x + 4 = O(x). To show this, let f(x) = 4x + 4 and g(x) = x. We must find positive constants c and N so that 4x + 4 ≤ cx whenever x ≥ N. Since we want cx to be bigger than 4x + 4, we choose c = 5 (remember, we are allowed to choose any positive number c if we can show that it works!). Then we solve 4x + 4 ≤ 5x to find x ≥ 4; thus, we choose N = 4. Notice that even though 4x + 4 is much bigger than x when x is large, they only differ by (roughly) a constant factor. We can now say that 4x + 4 grows no faster than x.

Example 2.5. Let f(x) = x and let g(x) = x². The reader should try to decide whether f = O(g), and also whether g = O(f), before reading more.


We can see that in fact x = O(x²) by choosing c = 1 and N = 1 in the definition of big-O notation. But we claim that x² ≠ O(x). This is because no matter what positive value we choose for c, we will have x² > cx whenever x > c, so it is not possible that for all large enough x we will have x² ≤ cx.

3. Asymptotic Complexity

Next, we look at an important special case of big-O relationships, when two functions are big-O of each other.

Definition 2.6 (Asymptotic Complexity). We say that two functions f and g have the same asymptotic complexity, or that f and g grow at the same rate, if both f = O(g) and g = O(f). The notation for this is f = Θ(g) or, since the situation is symmetric, g = Θ(f). (Note: Θ is the Greek letter theta, in capitalized form.) The asymptotic complexity class of f is the set of all functions g such that g = Θ(f).

Example 2.7. We saw in Example 2.4 that 4x + 4 = O(x). It is also true that x = O(4x + 4), as we can see by taking c = 1 and N = 1 in the definition of big-O (for example). Therefore, 4x + 4 = Θ(x); the two functions x and 4x + 4 are in the same asymptotic complexity class. In fact, any linear function ax + b where a > 0 is in this same class. This illustrates how the idea of asymptotic complexity simplifies the world of functions for us: the simple function y = x represents all linear functions, asymptotically.

In general, if we have any non-negative polynomial function f, then the asymptotic complexity class of f is determined by the degree of f alone. Just as ax + b = Θ(x) (assuming a > 0), we also have ax² + bx + c = Θ(x²), and similarly for degree-3 polynomials, and so on. This is because the leading term of a polynomial is much bigger than all of the other terms when x is large enough; we say that the leading term dominates the function.

Even more generally, given a function f, we would like to find the "simplest" function in the same asymptotic complexity class as f, in order to give us a good idea of how fast f(x) grows as x → ∞. In the following sequence, we list some of the most common representative functions for asymptotic complexity classes which occur in computer science. Because time-complexity functions have the natural numbers as their domain, we switch our choice of variable from x to n, as is traditional.

[... source text missing here: the remainder of Chapter 2 through the middle of Chapter 6, Section 5; the text resumes mid-proof of Proposition 6.32 ...]

... P(A) > 0 and P(B) > 0. Then we have probability spaces X_A and X_B as in Definition 6.9. Let W = {A, B} with probabilities P(A) and P(B). Let c_A and c_B be the binary codes obtained from the trees L and R.


Now we have

    len(c) = Σ_{m∈S} P(m) · len(c(m))
           = Σ_{m∈A} P(m) · len(c(m)) + Σ_{m∈B} P(m) · len(c(m))
           = Σ_{m∈A} P(m) · len("0" + c_A(m)) + Σ_{m∈B} P(m) · len("1" + c_B(m))
           = 1 + Σ_{m∈A} P(m) · len(c_A(m)) + Σ_{m∈B} P(m) · len(c_B(m))
           = 1 + P(A) · Σ_{m∈A} P_A(m) · len(c_A(m)) + P(B) · Σ_{m∈B} P_B(m) · len(c_B(m))
           = 1 + P(A) · len(c_A) + P(B) · len(c_B)
           ≥ 1 + P(A) · H(A) + P(B) · H(B)        (by inductive hypothesis)
           ≥ H(W) + P(A) · H(A) + P(B) · H(B)     (by Lemma 6.31)
           = H(X).                                (by Entropy Axiom 3)  □

Remark 6.33. We emphasize that in Proposition 6.32, c can be any binary code with the prefix property, not necessarily the Huffman code. Thus, the entropy is an ultimate universal bound on the efficiency of any proper binary code.

To prove our next big result, we need some lemmas: one involves the procedure in Huffman's algorithm, and the others are technical math results.

Lemma 6.34. Suppose that x and y are two distinct non-leaf nodes in the Huffman tree T_H. Let a and b be the root probabilities of x's left and right child, respectively; let c and d be the root probabilities of y's left and right child. Then the open intervals (a, b) and (c, d) are disjoint.

Proof. The vertex x must have been created as the root node of a new tree T obtained by merging two trees t and t′ in Step 3 of Huffman's algorithm; similarly for y, at a different iteration of the algorithm. We must have a ≤ b and c ≤ d because of the instructions in Steps 2 and 3 of Huffman's algorithm. Without loss of generality, x was created before y. It follows that y's left child, or one of the nodes below y's left child in T_H, must have been the root of some tree τ in the collection T which existed at the time that x was created. If c < b, then certainly P(τ.root) < b also, so we would have chosen τ instead of t′ at the point when we merged t and t′. This contradiction shows that b ≤ c. So a ≤ b ≤ c ≤ d, and the result follows. □

Lemma 6.35. Let 0 ≤ a ≤ b ≤ 1. Then

    c := (a + b) · [1 + (a/(a+b)) · log₂(a/(a+b)) + (b/(a+b)) · log₂(b/(a+b))] ≤ b − a.


Proof. We have

    c − (b − a) = (a + b) + a · log₂(a/(a+b)) + b · log₂(b/(a+b)) − (b − a)
                = 2a + a · log₂(a) + b · log₂(b) − (a + b) · log₂(a + b).

Let f(x) = 2a + a · log₂(a) + x · log₂(x) − (a + x) · log₂(a + x) for x ∈ [a, 1]. Then we must show that f(b) ≤ 0. We have

    f(a) = 2a + a · log₂(a) + a · log₂(a) − 2a · log₂(2a)
         = 2a + 2a · log₂(a) − 2a · log₂(2a)
         = 2a + 2a · log₂(a) − 2a · log₂(a) − 2a · log₂(2)
         = 2a − 2a · log₂(2)
         = 2a − 2a
         = 0.

Differentiating with respect to x,

    f′(x) = log₂(x) + 1/ln(2) − log₂(a + x) − 1/ln(2)
          = log₂(x) − log₂(a + x)
          = log₂(x/(a + x)).

Then f″(x) = (1/x − 1/(a + x))/ln(2) = a/(x · (a + x) · ln(2)) ≥ 0 for all x ∈ [a, 1], so f′ is non-decreasing on [a, 1]. Since f′(1) = log₂(1/(1 + a)) ≤ log₂(1) = 0, we get f′(x) ≤ 0 for all x ∈ [a, 1]. Since f(a) = 0, this gives f(x) ≤ 0 for all x ∈ [a, 1]. In particular, we have f(b) ≤ 0, as desired. □

f ′ (x) = log2 (x) + 1/ ln(2) − log2 (a + x) − 1/ ln(2) = log2 (x) − log2 (a + x)   x . = log2 a+x We have f ′′ (x) = 1/x − 1/(a + x) = a/(a + x) ≥ 0 for all x ∈ [a, 1], so f ′ is non-decreasing in [a, 1]. f ′ (1) = log2 (1/(1 + a)) ≤ log2 (1) = 0, so f ′ (x) ≤ 0 for all x ∈ [a, 1]. Since f (a) = 0, this gives f (x) ≤ 0 for all x ∈ [a, 1]. In particular, we have f (b) ≤ 0, as desired. □ Lemma 6.36. Let c be the Huffman code for a finite probability space X = (S, P ). Let TH be the Huffman code for X. Let ∆(X) = len(c) − H(X). For real numbers a, b with 0 ≤ a ≤ b ≤ 1, let      a b a b g(a, b) = (a + b) · 1 + log2 + log2 . a+b a+b a+b a+b Then we have ∆(X) =

X

g(P (x.leftChild), P (x.rightChild)).

x in TH x:non−leaf

Proof. We induct on |S|. Let r be the root of T_H, and let A and B be the sets of messages in the left and right subtrees L and R of T_H. Let c_A and c_B be the binary codes obtained from L and R, respectively. Let a = P(r.leftChild) = P(A) and b = P(r.rightChild) = P(B). Let X_A = (A, P_A) and X_B = (B, P_B) be the conditioned probability spaces, and W = {A, B}.

Base case: |S| = 2. Then |A| = |B| = 1, and both codewords have length 1. So we have len(c) = 1. Thus ∆(X) = 1 − H(X) = 1 + a · log₂(a) + b · log₂(b) = g(a, b), since a + b = 1. Since the root of T_H is the only non-leaf, this is what we wanted to show.


Inductive step: By Entropy Axiom 3, we have H(X) = H(W) + a · H(X_A) + b · H(X_B). As in the proof of Proposition 6.32, we have len(c) = 1 + a · len(c_A) + b · len(c_B). Note that P_A(x) = P(x)/a for any node x in L, so we have P(x)/(P(x) + P(y)) = P_A(x)/(P_A(x) + P_A(y)) for any nodes x, y in L; therefore

(6.2)    g(P(x), P(y)) = a · g(P_A(x), P_A(y));

and likewise with A replaced by B (and L replaced by R). So, writing each sum below over the non-leaf nodes x of the indicated tree,

    ∆(X) = len(c) − H(X)
         = 1 + a · len(c_A) + b · len(c_B) − H(W) − a · H(X_A) − b · H(X_B)
         = 1 − H(W) + a · ∆(X_A) + b · ∆(X_B)
         = 1 − H(W) + a · Σ_{x in L} g(P_A(x.leftChild), P_A(x.rightChild))
                     + b · Σ_{x in R} g(P_B(x.leftChild), P_B(x.rightChild))      (by inductive hypothesis)
         = 1 − H(W) + Σ_{x in L} g(P(x.leftChild), P(x.rightChild))
                     + Σ_{x in R} g(P(x.leftChild), P(x.rightChild))              (by Equation 6.2)
         = 1 − H(W) + Σ_{x in T_H, x ≠ r} g(P(x.leftChild), P(x.rightChild))
         = 1 + a · log₂(a) + b · log₂(b) + Σ_{x in T_H, x ≠ r} g(P(x.leftChild), P(x.rightChild))
         = g(P(r.leftChild), P(r.rightChild)) + Σ_{x in T_H, x ≠ r} g(P(x.leftChild), P(x.rightChild))   (as in the base case)
         = Σ_{x in T_H} g(P(x.leftChild), P(x.rightChild)).

This completes the induction. □



Theorem 6.37. Let c be the Huffman code for X = (S, P), where |S| ≥ 2. Then we have H(X) ≤ len(c) ≤ H(X) + 1.

Proof. We know that c has the prefix property by Proposition 6.25, so H(X) ≤ len(c) by Proposition 6.32.


For the other inequality, let ∆(X) = len(c) − H(X), as in Lemma 6.36. Then we have

    ∆(X) = Σ_{x non-leaf in T_H} g(P(x.leftChild), P(x.rightChild))     (by Lemma 6.36)
         ≤ Σ_{x non-leaf in T_H} [P(x.rightChild) − P(x.leftChild)]     (by Lemma 6.35)
         ≤ 1.                                                           (by Lemma 6.34)

The last inequality follows since all of the probabilities are between 0 and 1 and the intervals (P(x.leftChild), P(x.rightChild)) do not overlap. This completes the proof. □

At the start of this chapter, we discussed block transmission of messages. Now that we are dealing with probabilities, it is natural to ask how we should model block transmission of messages chosen from a finite probability space. The usual convention is to assume that the messages are independent of each other. Recall from Definition 5.32 that two events are independent if the probability that both happen is the product of their individual probabilities.

Lemma 6.38. Let X = (S, P) be a finite probability space, and let X^n be the probability space modeling a block transmission of n independent messages from S. Then we have H(X^n) = n · H(X).

Proof. We have X^n = (S^n, P^n), where as usual S^n is the nth Cartesian power of S, and where P^n(a_1, a_2, . . . , a_n) = P(a_1) · P(a_2) · ... · P(a_n) for a_1, a_2, . . . , a_n ∈ S. We prove the case n = 2; the higher cases don't use any extra ideas, but are messier


to write down. We have

    H(X²) = − Σ_{(a₁,a₂)∈S²} P²(a₁, a₂) · log₂(P²(a₁, a₂))
          = − Σ_{(a₁,a₂)∈S²} P(a₁) · P(a₂) · log₂(P(a₁) · P(a₂))
          = − Σ_{(a₁,a₂)∈S²} P(a₁) · P(a₂) · [log₂(P(a₁)) + log₂(P(a₂))]
          = − Σ_{(a₁,a₂)∈S²} P(a₁) · P(a₂) · log₂(P(a₁)) − Σ_{(a₁,a₂)∈S²} P(a₁) · P(a₂) · log₂(P(a₂))
          = − Σ_{a₁∈S} Σ_{a₂∈S} P(a₁) · P(a₂) · log₂(P(a₁)) − Σ_{a₁∈S} Σ_{a₂∈S} P(a₁) · P(a₂) · log₂(P(a₂))
          = − (Σ_{a₁∈S} P(a₁) · log₂(P(a₁))) · (Σ_{a₂∈S} P(a₂)) + Σ_{a₁∈S} P(a₁) · (Σ_{a₂∈S} −P(a₂) · log₂(P(a₂)))
          = − (Σ_{a₁∈S} P(a₁) · log₂(P(a₁))) · 1 + Σ_{a₁∈S} P(a₁) · H(X)
          = H(X) + H(X) · Σ_{a₁∈S} P(a₁)
          = H(X) + H(X) · 1
          = 2 · H(X),

as desired. □



Remark 6.39. For those readers who are familiar with probability theory, there is a simpler proof of Lemma 6.38. We note first that H(X) is just the expected value of the random variable Y given by Y(a) = −log₂(P(a)) for a ∈ S. Substituting the relevant terms for X^n, we see that H(X^n) is the expected value of the random variable Z given by Z(a₁, . . . , a_n) = −log₂(P(a₁) · ... · P(a_n)). In X^n, if we let Z_i(a₁, . . . , a_n) = −log₂(P(a_i)), we have Z = Z₁ + · · · + Z_n, so H(X^n) = E(Z) = E(Z₁ + · · · + Z_n) = E(Z₁) + · · · + E(Z_n) = n · E(Z₁) = n · H(X).

The final result of this section gives another relationship between codes and entropy.

Theorem 6.40. Let X be a finite probability space. For each positive integer n, let c_n be an optimal proper binary code for X^n, in the sense that the average codeword length of c_n is minimal. Then we have

    lim_{n→∞} len(c_n)/n = H(X).

That is, the entropy of X is the limit of the average number of bits per message needed to transmit a block of messages from X, as the block size goes to infinity.

Proof. We have H(X^n) ≤ len(c_n) by Proposition 6.32 (since c_n is a proper code), and len(c_n) ≤ H(X^n) + 1 by Theorem 6.37 (since c_n is optimal, its average codeword length is at most that of the Huffman code for X^n). Using Lemma 6.38, we find n · H(X) ≤ len(c_n) ≤ n · H(X) + 1, and so H(X) ≤ len(c_n)/n ≤ H(X) + 1/n. Now the Squeeze Theorem gives the desired result. □
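To make these quantities concrete, here is a small C++ sketch (an addition, not from the text) that evaluates the entropy formula H(X) = −Σ_{a∈S} P(a) · log₂(P(a)) for a finite distribution; run on the probabilities 29/54, 18/54, 7/54 from Example 6.44 below, it prints approximately 1.392:

    #include <cmath>
    #include <iostream>
    #include <vector>
    using namespace std;

    // Shannon entropy: H = -sum of p * log2(p) over all message probabilities.
    double entropy(const vector<double>& probs)
    {
        double h = 0.0;
        for (double p : probs)
            if (p > 0.0)                 // a zero-probability term contributes 0
                h -= p * log2(p);
        return h;
    }

    int main()
    {
        vector<double> probs = {29.0/54, 18.0/54, 7.0/54};
        cout << entropy(probs) << '\n';  // approximately 1.392
        return 0;
    }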


We now have three different ways to describe entropy: first, as the unique solution to the Entropy Axioms; second, by the explicit formula 6.1 in Shannon's Theorem (Theorem 6.12); and third, as the limit of the average number of bits per message in optimal codes for block transmissions as the block size approaches infinity. Each of these three descriptions is a complete characterization of Shannon's entropy, and each is important to understand.

6. Data Compression in Practice

We now know quite a bit of information theory, but not so much how to apply it in a real situation. As of now, in order to compress data, we need to start with a finite probability space. This is fine if we are compressing messages about whether your friend won the lottery, where we know the probabilities of the outcomes in advance; but what if we want to compress some ordinary text file? We need to get from the input text file to a suitable probability space so that we can apply Huffman's algorithm.

One natural way to get probabilities from text is by computing the frequency with which each message actually occurs in the text. For example, if we choose our message unit to be a single character, then we would compute the probability of the message 'e' to be

    P('e') = (total number of e's in the text) / (total number of characters in the text).

After computing probabilities, we can apply Huffman's algorithm, and then write the compressed output file by concatenating the codewords corresponding to the "plainwords", as we will call the original messages in the input text file.

But there is another task to be done if we want to be able to decompress: that is, translate the output file back into the original file. Namely, we also need to write the code itself in the output file. This "code dictionary" should come at the start of the output file, and needs to list all of the plainword/codeword pairs, together with whatever other technical information is needed to ensure that decompression is possible.

The following algorithms accomplish the tasks we have just described.

Algorithm 6.3 (Compression Algorithm).
Input: A file Fplain; a message unit U (for example, char or word).
Output: A file Fcode.
Procedure:
1. Parse Fplain into individual message units: Fplain = a1 + a2 + a3 + ... + an, where + means concatenation. Set S = {a1, a2, a3, ..., an}.
2. Compute the probability P(a) for each message a in S.
3. Run Huffman's algorithm on the space X = (S, P) to get a code c.
4. Write the output file Fcode = D + c(a1) + c(a2) + c(a3) + ... + c(an), where D is the code dictionary.
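As an illustration of Steps 1 and 2 for single-character message units, here is a C++ sketch (an addition to the text; the input file name is a placeholder, and this is not the course's reference implementation):

    #include <fstream>
    #include <iostream>
    #include <map>
    using namespace std;

    int main()
    {
        ifstream in("Fplain.txt", ios::binary);   // placeholder input file name
        map<unsigned char, unsigned long long> freq;
        unsigned long long total = 0;

        char ch;
        while (in.get(ch))               // first of the two passes over the file
        {
            freq[(unsigned char)ch]++;
            total++;
        }

        // Report P(a) = (occurrences of a) / (total characters) for each unit a.
        for (auto& [c, count] : freq)
            cout << c << ": P = " << count << "/" << total << '\n';
        return 0;
    }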

6. DATA COMPRESSION IN PRACTICE

101

Algorithm 6.4 (Decompression Algorithm).
Input: A file Fcode.
Output: A file Fplain.
Procedure:
1. Parse the code dictionary D from Fcode to get a message set S and code c on S.
2. Use the Decoding Algorithm (Algorithm 6.1) to write the output file Fplain.

Remark 6.41. We have written the previous two algorithms at a very abstract level, without giving many details. In particular, we have not specified the format of the code dictionary D. In Lab 4, you are not required to write D, but if you choose to do so, you can find details in the Lab 4 assignment.

Remark 6.42. In most file systems in use today, the smallest unit of storage is a byte (not a bit). But in order to use the compression/decompression algorithms effectively, we must be able to manipulate data at the level of bits, even though we must read and write data one byte at a time. This requires some extra work to program: if we write the bits 0 and 1 as ordinary characters in our output file, then we have wasted 8 times the storage space we needed, since a character takes 8 bits (a whole byte).

Remark 6.43. It is a good idea to read the input file twice in the Compression Algorithm: once to collect the message set S together with the frequencies of the messages, and the second time to produce the codewords for the output file. Do not try to read the input file just once and store the entire input file in program memory: consider that the point of data compression is to reduce the size of enormous files, so your input file may be many gigabytes – too big to fit in the working memory of your program. On the other hand, if our message unit is one ASCII character, there is no need to scan the input file separately for each character: this would require 256 separate scans!

Example 6.44. Let our message unit be a single ASCII character, and consider the file Fplain which contains the following text:

    AABCCAAABBAAAABCCAABBAABAACABABABABAABCCABAAAABBBABABA

We will apply the Compression Algorithm.
1. S = {A, B, C}.
2. P(A) = 29/54, P(B) = 18/54, P(C) = 7/54.
3. Huffman's algorithm applied to X = (S, P) produces the following tree (we leave the details to the reader):


    T_H:        •
             0 / \ 1
              •     A
           0 / \ 1
            C     B

Therefore the Huffman code c is given by c_A = 1, c_B = 01, and c_C = 00.
4. We ignore the code dictionary D here. So the output file will contain just the codewords for the input file plainwords, in the same order. The output file therefore consists of c_A + c_A + c_B + c_C + c_C + c_A + ... = 1 + 1 + 01 + 00 + 00 + 1 + .... In full, the output file consists of the following binary string:

    1101000011101011111010000110101110111001011011011011101000010111110101011011011

In order to write this file in a real file system, we need to break it up into bytes. So we would actually write the following bytes:

    11010000 11101011 11101000 01101011 10111001 01101101 10111010 00010111 11010101 1011011

But notice that the final "byte" only contains 7 bits. Now that we see it, we realize that this kind of thing will happen often – in fact, 7/8 of the time, if the number of bits of output is uniformly distributed mod 8. So we need to decide what to do with the final byte. A common solution is to add enough extra bits at the end of the final byte in order to reach 8 bits; this is called padding. We may pad using 0s or 1s; the important thing is to know that we are padding at the end instead of at the beginning, say, because this will affect how we decompress. To be definite, let's pad using 0s. So we would actually output the following bytes to Fcode, again ignoring D:

    11010000 11101011 11101000 01101011 10111001 01101101 10111010 00010111 11010101 10110110

Note that Fcode contains 10 bytes, while Fplain contains 54 bytes. This is quite a lot of compression! Again, though, adding D to Fcode will reduce the efficiency of compression. We also note that the average codeword length is len(c) = 79/54 ≈ 1.463, and the entropy of X is H(X) ≈ 1.392. This corroborates Theorem 6.37.

Remark 6.45. Because of padding, our code dictionary D should include the information about how many bits were added to the end of the final byte: the amount of padding, p. Because 0 ≤ p ≤ 7, we need 3 bits to store p. Calculating the padding amount p is complicated by the fact that D itself contributes to the total length of the output file!
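Here is a sketch (added for illustration, not the course code) of how codeword bits might be packed into bytes with 0-padding at the end, as in the example and Remark 6.45; the amount of padding is returned so that it can be recorded in the code dictionary D:

    #include <cstdint>
    #include <iostream>
    #include <string>
    #include <vector>
    using namespace std;

    // Pack a string of '0'/'1' characters into bytes, padding the final
    // byte on the right with 0s; `padding` receives the number of added bits.
    vector<uint8_t> packBits(const string& bits, int& padding)
    {
        vector<uint8_t> out;
        uint8_t current = 0;
        int used = 0;                        // bits already filled in `current`
        for (char b : bits)
        {
            current = (uint8_t)((current << 1) | (b == '1'));
            if (++used == 8) { out.push_back(current); current = 0; used = 0; }
        }
        padding = (used == 0) ? 0 : 8 - used;
        if (used > 0)
            out.push_back((uint8_t)(current << padding)); // pad with 0s
        return out;
    }

    int main()
    {
        int padding;
        vector<uint8_t> bytes = packBits("1101000011101011", padding);
        cout << bytes.size() << " bytes, " << padding << " padding bits\n"; // 2, 0
        return 0;
    }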

CHAPTER 7

Game Strategy

1. Introduction

In this section, we explore another of the most classic areas of computer science: game strategy. Nowadays we might take it for granted that computers are the best players at just about any game you can name, but for many years, one of the most important challenges in computer science was to create a chess program capable of defeating the human world chess champion. This goal was held up as a "gold standard" of artificial intelligence, the branch of computer science which studies complex decision-making under conditions of limited information.

In order to reach this goal of creating artificial intelligence in a game algorithm, we first step back and ask what kind of games we will study. We will limit ourselves to games with the following properties:
(1) There are exactly 2 players, whom we may call P1 and P2.
(2) The 2 players alternate turns: first P1 takes a turn, then P2, then P1 again, etc.; or vice-versa.
(3) Both players always have complete information about the state of the game.
(4) The game is deterministic: at each turn, the player chooses a "move" which completely determines the next state of the game.
(5) The game is finite: that is,
    • At each turn, the number of available moves is finite; and
    • Every game ends after a finite number of moves.
(6) At the end state of any game, one of 3 possibilities must occur: either P1 wins, or P2 wins, or the game is a draw (i.e. a tie).

Definition 7.1. A game which satisfies all of the properties above is called a finite 2-player game of pure strategy.

Example 7.2. The reader should try to decide which of the following well-known games satisfy our properties: tic-tac-toe; checkers; chess; Monopoly; Stratego; blackjack; Go; bridge. Think about this and write down your answers before reading more.


We claim that tic-tac-toe, checkers, chess, and Go meet our requirements, but the other games we listed do not. Monopoly involves rolling dice, so it is not deterministic. In Stratego and blackjack, the complete state of the game is not generally known to both players; additionally, blackjack is non-deterministic, since the cards should be shuffled randomly. Bridge suffers from both of the above deficiencies, and also requires 4 players instead of 2.

2. Decision Trees

The requirements we imposed on our game allow us to represent the game using a convenient data structure with which we are already well familiar: namely, a rooted tree.

Definition 7.3. Suppose G is a finite 2-player game of pure strategy. Then the decision tree of G is the tree T described as follows: a vertex of T is a single state of the game; the root of T is the game's initial state; an edge (a, b) exists whenever state b can be reached from state a by making a single move.

Remark 7.4. A leaf of the decision tree corresponds to a finished game, and may therefore be labeled with either P1, P2, or DRAW, indicating who won the game. Furthermore, we consider each edge to be directed away from the root, i.e., to go in one direction only, corresponding to the flow of the game over time; this is why we used an ordered pair (a, b) to represent an edge, instead of the unordered set {a, b}.

Example 7.5. In this example, we illustrate a possible decision tree for a very “small” game (Figure 7.1). In this game, player P1 has the first move, and there are 6 choices available to them. After that, player P2 can move – unless P1 chose the 2nd or 4th move, in which case the game is already over! In all cases, the game has ended after two turns. Note that even though technically the states at depth 2 in this tree represent player P1's turn, there are actually no moves to make there, since these are end states of the game; similarly for the leaf nodes at depth 1, which are technically P2's turn.

[Figure 7.1. An example of a small decision tree]
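Before going further, it may help to fix a concrete representation of decision-tree nodes. The following C# sketch is ours (the type and member names are illustrative, not from the text); it matches Definition 7.3 and Remark 7.4, and the later sketches in this chapter build on it:

using System.Collections.Generic;

// The possible results at a finished game (a leaf of the decision tree).
enum Outcome { P2Win = -1, Draw = 0, P1Win = 1 }

// One state of the game, together with the states reachable in one move.
class GameNode
{
    public int Player;                    // 1 or 2: whose turn at this state
    public List<GameNode> Children = new List<GameNode>();
    public Outcome? Result;               // set only at leaves (finished games)
    public bool IsLeaf => Children.Count == 0;
}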


Remark 7.6. The decision tree of a game represents every possible way the game can be played. Often, the decision tree will be enormous, because of the exponential growth of the number of nodes at each level as we move down from the root. This is in contrast to the Huffman tree, in which the number of nodes is just 2n − 1, where n is the number of messages.

Remark 7.7. When we form the decision tree of a game, we consider the result of every move to be a unique state that is not repeated anywhere else in the decision tree. One way to formalize this is to say that we consider the history of the players' moves to be part of the game's state. Without this assumption, we could get cycles, and we wouldn't have a tree. A good example to keep in mind is the game of chess. It is possible for the chess board to look exactly the same at two different stages of a game: all the same pieces in the same positions with nothing added or taken away; but we still consider these to be two different states of the game. This is important for the rules of chess, not just for our definition of a decision tree: if, for example, the same position has re-occurred as a result of moving a rook away from its initial position and then back again, then it is illegal to castle after this move, even if the configuration of the chess board is identical to an earlier configuration where castling was legal. In summary, the current configuration of the board is not enough to determine the state of the game.

Next we want to understand the concept of “strategy”. The reader should think hard about how to define the term “strategy” in a game, and should try to make the definition precise enough to apply to our mathematical model of a game as a decision tree, before reading further.


The word “strategy” is derived from a Greek word meaning “general”. A general is a military leader who makes top-level decisions with the goal of securing ultimate victory. A good strategy should have contingency plans that anticipate the enemy's possible moves, with different responses pre-selected depending on what the enemy does. We can carry over this idea to our decision tree by defining a strategy to be a way for a player to decide how to move in any given situation. That is the content of the following definition.

Definition 7.8. Let T = (V, E) be the decision tree of a game G. For i ∈ {1, 2}, let Vi = {x ∈ V : state x is player Pi's turn}. A strategy for the player Pi is a function s : Vi → V such that for all x ∈ Vi, we have (x, s(x)) ∈ E.

Notice that a strategy only applies to one player. A strategy may or may not be a “good” strategy. But some strategies are better than others, as the following definition illustrates.

Definition 7.9. A winning strategy for player Pi is a strategy w for Pi such that for every path (x0, x1, . . . , xn) in T where x0 is the root and xn is a leaf, if Pi has always followed the strategy w in this path, then Pi has won the game at state xn. By “followed the strategy w”, we mean that xj+1 = w(xj) for all xj ∈ Vi: that is, whenever it is player Pi's turn in this path, Pi chose the move dictated by the strategy w.

A player is guaranteed to win the game if they follow a winning strategy. But a winning strategy may not exist for a given player, or indeed for either player. How can we find a winning strategy? Or, failing that, how can we find an optimal strategy? What does “optimal strategy” even mean, and must there always be an optimal strategy? These are the main problems of the present chapter. The reader is invited to think about how to find an optimal strategy – how to decide which move is best from any given state of the game. Please try this before reading further.
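In code, Definition 7.8 translates naturally into a function type. A minimal C# rendering (ours), using the hypothetical GameNode type above:

// A strategy for one player: given a node where it is that player's
// turn, return the chosen child, so that the pair (x, s(x)) is an edge.
delegate GameNode Strategy(GameNode x);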


3. The Minimax Algorithm

First, let us make the concept of “optimal strategy” precise. You might think at first that an optimal strategy always gives at least as good a final result as any other given strategy for the same player; but this is not always possible, because even if the best result we can guarantee for our player is a draw, it may be possible for us to follow a “losing” strategy, yet our opponent could make a terrible move which lets us win. Therefore, the idea when we compare two strategies is to compare the worst possible result of the game when following each strategy:

Definition 7.10. Let G be a finite 2-player game of pure strategy. Let s be a strategy for player Pi in G. The worst-case outcome of s is the worst result that can occur at the final state of G when player Pi follows strategy s; we order the results in the natural way, where P2 < DRAW < P1 for i = 1, and P1 < DRAW < P2 for i = 2. An optimal strategy for Pi is a strategy whose worst-case outcome is at least as good as the worst-case outcome of any other strategy for Pi.

Remark 7.11. A winning strategy for Pi is a strategy whose worst-case outcome is a win for Pi. A winning strategy must be optimal, but an optimal strategy may not be winning.

From our definition, it is clear that an optimal strategy for a given finite 2-player game of pure strategy always exists: after all, there are only a finite number of strategies for the game.

To make it easier to compare two final states of a game, we adopt the convention of using the label 1 for a state where P1 won, using -1 when P2 won, and using 0 for a draw.

Now we are ready to present the main algorithm of this chapter, the Minimax Algorithm. The Minimax Algorithm will tell us the worst-case outcome for each player at every state of the game, by labeling each node of the decision tree with the appropriate number (-1, 0, or 1). The key idea is to work from the bottom of the tree upwards to the root. For if we know the worst-case outcome of taking each possible move (as we certainly do when we are near the very end of the game), then we can deduce the worst-case outcome of the present state quite easily, depending on whose turn it is: player P1 wants the highest possible value, and player P2 wants the lowest possible value.

Example 7.12. In the game illustrated by the decision tree of Figure 7.2, if the root node is player P1's turn, then it seems reasonable that move m2 would be chosen, since that move gives P1 the win. But if the root node is player P2's turn, then we would expect either move m1, move m3, or move m5 to be chosen, giving P2 the win.

[Figure 7.2. A decision tree of depth 1]


Applying the basic idea of Example 7.12 recursively from the bottom of the decision tree upwards, we get the following algorithm; its importance lies not so much in the intuitive reasonableness of the procedure, but rather in the fact, which we will prove very soon, that this algorithm is optimal.

Algorithm 7.1 (The Minimax Algorithm).
Input: The decision tree T of a finite 2-player game of pure strategy, with the leaves labeled -1, 0, or 1, according to which player won
Output: the same tree T, but with every node labeled with a number
Procedure:
while there exists an unlabeled node of T {
    Let x be a lowest unlabeled node of T;
    Label x with the maximum value of x's children, if state x is player 1's turn;
    Label x with the minimum value of x's children, if state x is player 2's turn
}

Example 7.13. Suppose we take the decision tree in Example 7.5, re-labeled with our new convention of using -1, 0, and 1 instead of P2, DRAW, and P1. We show the resulting decision tree in Figure 7.3.

[Figure 7.3. The decision tree of Example 7.5, re-labeled using -1, 0, 1]
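When the tree is held in memory, the bottom-up labeling of Algorithm 7.1 is most easily realized as a recursion. The following C# sketch is our illustration, built on the GameNode type from Section 2; it computes the Minimax label of a single node, assuming the leaf Results are already set:

using System;

// Returns the Minimax label of x: max over children on P1's turn,
// min over children on P2's turn; leaf labels are given.
static int Minimax(GameNode x)
{
    if (x.IsLeaf) return (int)x.Result.Value;
    int best = (x.Player == 1) ? int.MinValue : int.MaxValue;
    foreach (var y in x.Children)
    {
        int v = Minimax(y);
        best = (x.Player == 1) ? Math.Max(best, v) : Math.Min(best, v);
    }
    return best;
}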

Let us use the Minimax Algorithm to label the remaining nodes. First we have to know whose turn it is at the root state. There are only 2 cases:

(a) Suppose the root state is player P1's turn. Then whose turn it is at the other states is completely determined by our assumption that players alternate turns. The result is shown in Figure 7.4.

Now we apply the Minimax Algorithm to label the unlabeled nodes from the bottom up; we can first label all nodes at depth 1 which are not yet labeled, as shown in Figure 7.5. Note that we never change the labels of leaf nodes: these leaf labels are given to us, and represent the actual end result of a game.


[Figure 7.4. The tree of Figure 7.3 with P1 to move first]

[Figure 7.5. The tree of Figure 7.4 after one round of Minimax labeling]

Finally, we just need to label the root node by maximizing the values of its children, since at the root state P1 gets to move. This gives Figure 7.6.

[Figure 7.6. The tree of Figure 7.4 after Minimax labeling is complete]

Not surprisingly, player P1 has a winning strategy in the game of Figure 7.4, since they can just take the move “straight down” which leads to their immediate win.

(b) Next, we assume that the root state in Figure 7.3 is Player P2's turn. Applying the Minimax Algorithm, the final result is shown in Figure 7.7.

In this situation, it is Player P2 who has a winning strategy: namely, taking the move to the far right.


[Figure 7.7. The tree of Figure 7.3 with P2 to move, after Minimax labeling is complete]

In our small examples so far, each player seems to do best when they take a move which leads to an optimal label for that same player. In fact, corresponding to the Minimax Algorithm is a strategy which does precisely that.

Definition 7.14 (The Minimax Strategy). Let G be a finite 2-player game of pure strategy. Let T be the decision tree of G, fully labeled by the Minimax Algorithm. The Minimax Strategy for player Pi is the strategy µ such that for all x ∈ Vi, we have µ(x) = the child of x with the label that is optimal for Pi. Here, “optimal for Pi” means maximum if i = 1 and minimum if i = 2.

Remark 7.15. In case of a tie for the optimal child label, we do not specify which child to choose. In practice, we shall see that ties seldom occur after we introduce the Truncated Minimax Algorithm below.

Example 7.16. In this example, we illustrate the Minimax Strategy for the game of Example 7.13, assuming that the root state is player P1's turn. We have drawn nodes in V1 (recall, these are the nodes where it is player P1's turn) in blue, and nodes from V2 in red. We have also drawn moves in P1's Minimax Strategy in blue, and moves in P2's Minimax Strategy in red; the remaining moves have been drawn as dashed lines to indicate they are not important here. In cases of a tie in the value of the best move, we have always chosen the left-most move with that value. The result is shown in Figure 7.8.

[Figure 7.8. The tree of Figure 7.4 showing the Minimax Strategy]
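Definition 7.14 is easy to express in code. The following C# sketch (ours, built on the GameNode type and the Minimax function above) picks the optimal child; it recomputes child labels on each call, which is wasteful but keeps the sketch short:

// The Minimax Strategy: return the child of x whose Minimax label is
// optimal for the player to move; ties resolve to the left-most child.
static GameNode MinimaxStrategy(GameNode x)
{
    GameNode best = null;
    foreach (var y in x.Children)
        if (best == null
            || (x.Player == 1 && Minimax(y) > Minimax(best))
            || (x.Player == 2 && Minimax(y) < Minimax(best)))
            best = y;
    return best;
}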

Lemma 7.17. Let G be a finite 2-player game of pure strategy with decision tree T = (V, E). Let λ : V → Z denote the labeling of the nodes of T given by the Minimax Algorithm. Then the Minimax Strategy µ for player Pi has the property that λ(x) = λ(µ(x)) for every x ∈ Vi.

Proof. Both the Minimax Algorithm and the Minimax Strategy for Pi choose the child of x which is optimal for Pi. □

The following result lets us understand the relationship between the Minimax Algorithm and optimal strategies. To make the proof work out, it will be useful to first introduce the idea of a “subgame” of a game.

Definition 7.18. Let G be a finite 2-player game of pure strategy. A subgame G̃ of G is the part of G that remains to be played after reaching some state x of G. If T is the decision tree of G, then the decision tree of G̃ is just the subtree of T rooted at x. We denote this tree by T^x, and the corresponding subgame of G by Gx.

Lemma 7.19. Let G be a finite 2-player game of pure strategy with decision tree T = (V, E). Then:
(i) for each z ∈ V, and for each i ∈ {1, 2}, the Minimax Algorithm labels z with the worst-case outcome of the Minimax Strategy for player Pi applied to the game Gz;
(ii) for each i ∈ {1, 2}, the Minimax Strategy for Pi on G is optimal.

Proof. First, note that for any z ∈ V, there is always at least one optimal strategy for a given player Pi in the subgame Gz; there may be many optimal strategies, but all such optimal strategies have the same worst-case outcome for Pi. Next, note that the Minimax Algorithm on any subgame is consistent with the Minimax Algorithm applied to the full game.

Now we proceed by induction on the number of levels of the decision tree T. In the base case, the initial state is also a final state of the game, so there is only one strategy for this game, namely, to do nothing! Inductively, let r be the initial (root) state of the game, and suppose our result is true for any game whose decision tree has fewer levels than T^r = T. By the consistency of the Minimax Algorithm and the inductive hypothesis, our result is true for every node z of T below r, so we only need to prove our result for the case z = r.

Let s be any strategy for G. First suppose that r ∈ V1 and s is a strategy for P1. Let x = s(r). Because strategy s will definitely reach state x, the worst-case outcome of s is no better than the worst-case outcome of an optimal strategy for player P1 in the game Gx. By the inductive hypothesis, the Minimax Strategy is optimal for player P1 on Gx, and the Minimax label of x is the worst-case outcome of this optimal strategy for Gx. But also, the Minimax label of r is the maximum of the labels of the children of r. Let λ denote the Minimax labeling, w denote the worst-case outcome of a strategy, µ denote the Minimax strategy for P1 on G, and µx denote the Minimax strategy for P1 on Gx. Then we have

λ(r) ≥ λ(x) = w(µx) ≥ w(s).    (7.1)


We also have λ(r) = λ(y) = w(µy) for some child y of r, where λ(y) is maximum and µ(r) = y. Therefore, w(µ) = w(µy) = λ(y) = λ(r). This gives (i). Since w(µ) = λ(r) ≥ w(s), and s was arbitrary, we also have (ii) for i = 1.

Next suppose that r ∈ V1 but s is a strategy for P2. Let µ̃ denote the Minimax strategy for P2 on G, and similarly for µ̃x when x is a child of r. Let y = µ(r) as before. Then we have w(µ̃) = max{w(µ̃x) : x is a child of r}, since player P2 does not get to choose which move is made from state r, and since higher outcomes are worse for player P2. By the Minimax procedure, we have λ(r) = max{λ(x) : x is a child of r}. By the inductive hypothesis, we have w(µ̃x) = λ(x) for each child x of r. By Lemma 7.17, we have λ(r) = λ(µ(r)) = λ(y). So we have

w(µ̃) = max{w(µ̃x) : x is a child of r}    (7.2)
      = max{λ(x) : x is a child of r}      (7.3)
      = λ(r).                              (7.4)

This is (i). Similarly, we have

w(s) = max{w(sx) : x is a child of r}      (7.5)
     ≥ max{w(µ̃x) : x is a child of r}     (7.6)
     = max{λ(x) : x is a child of r}       (7.7)
     = λ(r),                               (7.8)

where sx denotes the restriction of s to T^x. Therefore w(s) ≥ w(µ̃), which gives (ii) for i = 2. The situation r ∈ V2 is completely symmetric. □

Remark 7.20. A more positive way to say “worst-case outcome” is to say “best result that can be guaranteed”. Thus, we have proved that the Minimax Algorithm labels each node x with the best result that can be guaranteed in the game if we are at state x and follow the Minimax Strategy. This is also the best result that can be guaranteed under any strategy, since the Minimax Strategy is optimal. But the best result for which player? Remarkably, we have shown that the Minimax label of a node x is the best result that can be guaranteed once state x is reached, no matter which player we consider and no matter whose turn it is at state x. If both players follow the Minimax Strategy, then in fact this is the result that will occur.

4. The Truncated Minimax Algorithm

For many games (especially the more interesting ones), the decision tree T is far too big to run the Minimax Algorithm in a reasonable amount of time. Instead, we artificially remove everything below a certain level of T ; this is called truncation of T (infinitive verb: “to truncate”). But when we truncate T , we often create artificial leaves at the bottom of the new tree, which are not actually end states of our game. In order to find a good strategy, it would be nice if we could at least get some idea of which player is more likely to win the game at these artificial leaf states – who is ahead, and by how much? For then we could apply the Minimax Algorithm as before.


Definition 7.21. Let T be a rooted tree, and let d be a non-negative integer. Then Td denotes the truncation of T at depth d.

Next, we formalize what we mean by measuring which player is ahead at any given state of a game. The idea is to replace the simple but certain measures -1, 0, 1 that we used in the actual decision tree with more general (real) numbers that tell us (roughly) how far ahead the first player is compared to the second player.

Definition 7.22. Let G be a finite 2-player game of pure strategy with decision tree T = (V, E). An objective function for G is a function f : V → R.

Remark 7.23. An objective function is supposed to measure the relative advantage of player P1 over player P2 at any state of the game. But our definition does not say anything about how to measure this advantage; we allow any numerical function to be a valid objective function! In practice, deciding on a good objective function for a given game is one of the most important challenges in building a strong game-playing program; the details will depend strongly on the game being played. One of the guiding principles to keep in mind is that we apply an objective function at the bottom of the (artificial) decision tree, in order to save time, so the objective function itself should not have high time-complexity; in particular, it should not involve (too much) look-ahead in the actual decision tree.

Now we are ready to present the practical variation of the most important algorithm of this chapter; to distinguish it from the previous algorithm, we call this one the Truncated Minimax Algorithm, and we call the original the Full Minimax Algorithm from now on.

Algorithm 7.2 (Truncated Minimax Algorithm).
Input: A game G; a search depth d; an objective function f
Output: A labeling of the top d levels of the decision tree of G
Procedure:
1. Form the truncated decision tree Td of G
2. Label all LEAVES of Td using f
3. Apply the Full Minimax Procedure to Td.
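In practice the truncated tree is rarely built explicitly; the truncation and the Minimax pass are fused into one recursion. Here is a C# sketch of ours, on the GameNode type from earlier; we assume that at a true leaf, f returns the outcome value (for example ±∞ or 0):

using System;

// Truncated Minimax: recurse to depth d, applying the objective
// function f at artificial leaves (depth 0) and at true leaves.
static double TruncatedMinimax(GameNode x, int d, Func<GameNode, double> f)
{
    if (x.IsLeaf || d == 0) return f(x);
    double best = (x.Player == 1) ? double.NegativeInfinity
                                  : double.PositiveInfinity;
    foreach (var y in x.Children)
    {
        double v = TruncatedMinimax(y, d - 1, f);
        best = (x.Player == 1) ? Math.Max(best, v) : Math.Min(best, v);
    }
    return best;
}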

Example 7.24. Suppose that we have a game whose decision tree is shown in Figure 7.9. The nodes here are labeled with the values of an objective function.

We note that this example is just for the purpose of illustrating the Truncated Minimax Algorithm; in a realistic example with a good objective function, we expect the leaf nodes to be labeled with ±∞ or 0 (or something “close” to those values) since these represent final states of the game. We will apply the Truncated Minimax Algorithm with a search depth of 2. Step 1: Truncate the given tree at depth 2. The result is shown in Figure 7.10.


[Figure 7.9. A decision tree T labeled with an objective function]

[Figure 7.10. The truncated decision tree T2 labeled with the objective function]

Next, we discard the objective function values EXCEPT at the leaves of the tree T2, to get Figure 7.11.

[Figure 7.11. The truncated decision tree T2 with only the leaves labeled with the objective function]

Finally, we apply the Full Minimax Algorithm to get the result shown in Figure 7.12.


[Figure 7.12. The truncated decision tree T2 labeled with the final Minimax values]

Example 7.25 (The Objective Function for Chess). We assume that the reader is familiar with the pieces and rules of the game of chess. The goal of chess originally was to capture the opponent's king; but over time, the rules changed so that a game of chess is officially over when a king could be captured two turns after the current state, no matter what move is made next. This last situation is called checkmate. Besides checkmate, a chess game can end in a draw: when the same board position has occurred 3 times, when the player whose turn it is has no moves available, or when both players agree to a draw. Ordinary chess games have dozens of possible moves at each turn and can last for several dozen turns, so the decision tree for chess is too big to traverse completely. So we want to find an objective function which tells us, quickly, which player is ahead at a given state, and by how much. Several measurements can be used to form an objective function for chess:

(1) Material: We can assign a value to every piece, and add up the values of all pieces that are currently on the board to get a total (a material count is sketched in code after this example). The pieces in chess are sometimes called material. To be consistent with the Minimax Algorithm, we should assign positive values to white pieces and the corresponding negative values to black pieces. But we do not typically restrict ourselves to the range -1 to 1. Traditionally, a pawn is the “unit” standard, with a value of ±1 point, and the other pieces have higher values. A queen is usually assigned a value of about ±9. The assignment of piece values relative to pawn values is one of the opportunities for creativity and variability when making a chess engine. In fact, part of what we call the “style” of the best human chess players has to do with how much value they place on each piece; some commercial chess programs have been hard-coded with different sets of piece values to simulate the style of various famous human players of the past. More recently, chess engines have been programmed using a form of machine learning to teach themselves the best values to assign each chess piece!

(2) Maneuverability: We can add up the total number of moves available to each player at the given state. The idea is that having more moves gives you a better chance of finding an advantageous move, and ultimately a better chance of winning. But again we should count each move for black as a subtraction and each move for white as an addition to get the net maneuverability; and we should count each available move as only a small fraction of 1 point, since having a single extra move available is probably not worth nearly as much as having another pawn.


But of course, at any given state of the game, only one player is allowed to move; so we artificially pretend it is the other player's turn when we count the number of moves available to that player.

(3) Control of the Board: We can count the total number of distinct squares on the chess board to which each player can move from the given state. This is similar to, but not the same as, maneuverability, and similar considerations apply. Some experts advise giving higher weight to control of the squares nearer to the center of the board, with the idea that central squares are more advantageous to occupy.

(4) Exchange of Pieces: A common occurrence in advanced chess games is an exchange of pieces, where one or more pieces of each player is captured by the other player in consecutive turns. Often, a single occupied square on the chess board becomes a “focal point”, when several pieces of both players position themselves so they can move to that square in their next turn. In these situations, we say that the pieces are “attacking” the occupied square. If the piece at the occupied square is captured by its opponent, this can set off a chain reaction of captures at the same square; this is a (multiple) exchange of pieces. Typically, players begin an exchange by using their least valuable attacking pieces to perform the capture, and use more valuable pieces to capture as the exchange progresses; this is obviously to prevent the opponent from quitting the exchange while they are ahead. We can take exchanges into account in our objective function by looking at each occupied square and determining the relative advantage to each player of the exchange that could occur there. Note that since an exchange can be “interrupted” by a move unrelated to the exchange in question – typically, by a move that produces check or initiates another exchange – the computation of relative advantage in an exchange using just the current board configuration is not always accurate. But we can't expect perfection from our objective function: the whole point is that we have to settle for an approximation, since the decision tree is too big to handle in its entirety.

Finally, we note that an actual leaf node in the full decision tree needs to be treated in a way that is consistent with the objective function for other nodes. One way to do this is to count checkmate as ±∞ (or some very large value); or, more naively, to value the king itself at ±∞ or approximately so.
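The material term (1) might look like this in C#. This is a sketch of ours, not the author's implementation; the board representation is hypothetical, and we use the traditional rough values (pawn 1, knight and bishop 3, rook 5, queen 9):

using System.Collections.Generic;
using System.Linq;

// Signed material count: white pieces count positively, black negatively.
static double Material(IEnumerable<(char Kind, bool IsWhite)> pieces)
{
    var value = new Dictionary<char, double>
        { ['P'] = 1, ['N'] = 3, ['B'] = 3, ['R'] = 5, ['Q'] = 9 };
    return pieces.Where(p => value.ContainsKey(p.Kind))
                 .Sum(p => p.IsWhite ? value[p.Kind] : -value[p.Kind]);
}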

Example 7.26 (Check and Checkmate). Recall that in the game of chess, check is a state in which a king is under attack. More formally, a player Pi is in check if it is currently Pi's turn, and if Pi's king could be captured were it the opponent's turn. A player is in checkmate if they are in check and if no available move can prevent their king from being captured. Checkmates are leaves in the decision tree of chess, along with drawn games. The modern rules of chess state that if a player is in check but not in checkmate, then they must immediately move out of check, i.e., prevent their own king from being captured. The rules also forbid a player from moving “into check”, i.e., making a move that allows their king to be captured. These rules can be tricky to implement effectively. A naive implementation could run as follows:

collectLegalMoves() {
    for each potential move m {
        make move m;
        if !inCheck() then add m to list of legal moves;
    }
}

inCheck() {
    collectLegalMoves();
    for each legal move m {
        if m captures a king then return true;
    }
    return false;
}

We can see that the indirect recursion will be infinite unless we do something to prevent it. We should also remember to “un-make” each move in these loops.
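One standard way out (a sketch of ours, not the author's prescribed fix; the Board and Move types and their methods are hypothetical) is to separate pseudo-legal move generation, which performs no check test, from full legality, so that the recursion bottoms out:

using System.Collections.Generic;

static List<Move> CollectLegalMoves(Board board)
{
    var legal = new List<Move>();
    foreach (var m in board.GeneratePseudoLegalMoves())
    {
        board.Make(m);
        // After making m, it is the opponent's turn; if any of their
        // pseudo-legal replies captures our king, then m moved into check.
        if (!InCheck(board)) legal.Add(m);
        board.Unmake(m);                  // always un-make the move
    }
    return legal;
}

static bool InCheck(Board board)
{
    // Pseudo-legal moves suffice here, so no indirect recursion occurs.
    foreach (var m in board.GeneratePseudoLegalMoves())
        if (m.CapturesKing) return true;
    return false;
}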

5. Implementing Chess with the Minimax Algorithm

Since the Minimax Strategy is optimal, we want to use this strategy to decide on the best move; we need to do this using the Truncated Minimax Algorithm, since chess has a large decision tree. Although the theory of game strategy uses rooted trees for everything, we do not need or want to create an actual tree object for our chess program. Instead, we should use recursion to evaluate a given state of the chess game (or “chess position”, as we call it). The anchor state of the recursion is when the search depth is 0, or when the current player is in checkmate (or when one of the kings is gone from the chess board, if we play by the old rules). Remember that it is only in the anchor case that we use the objective function.

6. Pruning

In gardening, to “prune” a tree means to cut off some of its branches – often, dead or dying branches, or branches that we do not need or want to develop – in order to improve the tree. A similar idea exists in game strategy: if we can somehow be sure that exploring a given subtree of the decision tree will not change the Minimax label of a node x, then we can neglect to explore this subtree, in effect pruning that part of the decision tree, when computing the label of x. This can greatly improve the time-complexity of the Minimax Algorithm.

Example 7.27. Consider the decision tree T shown in Figure 7.13. Each node is labeled as xi with i ∈ {0, 1, . . . , 7}. The leaves are also labeled with the values of the objective function.

[Figure 7.13. A decision tree that can be pruned]

Suppose that the Minimax Algorithm traverses this tree using Depth-First Search, looping over nodes in order by their index. Then the search order of the nodes is x0, x1, x3, x4, x2, x5, x6, x7. Suppose that we keep track of the best value b that we have found so far for the root node x0. After we finish exploring the subtree T^x1, we have b = −2. After exploring node x5, even before exploring the other children x6 and x7 of node x2, we can say that λ(x2) ≤ −2. Therefore, the actual Minimax labels of nodes x6 and x7 do not matter, because moving to state x2 from state x0 cannot possibly be better for player P1 than moving to state x1. So we can safely break out of the loop at node x2 at this point, without exploring the remaining children of x2. We have pruned the decision tree!

More generally, we have the following algorithm:

Algorithm 7.3 (Minimax with Branch-and-Bound Pruning).
Input: A node x of a decision tree T whose leaves are already labeled; a number b (called the bound)
Output: A real number ℓ
Procedure:
double Minimax_BB(Node x, double b) {
    if x is a leaf of T, then return the label of x;
    Set i = whose turn it is at state x; // 1 or 2
    Set localBest = the worst possible value for Player i;
    for each child y of x {
        Set c = Minimax_BB(y, localBest);
        if c is better for Player i than localBest, then set localBest = c;
        if c is at least as good for Player i as b, then break;
    }
    return localBest;
}

Remark 7.28. We may take the worst possible value for Player 1 to be −∞, and the worst possible value for Player 2 to be ∞. Likewise, “c is better for Player i than localBest” means c > localBest if i = 1, but means c < localBest if i = 2.

Remark 7.29. If we remove the single line with the conditional break statement, then the value of b is never used, and we get the original Minimax procedure for the subtree of T rooted at x.
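On the GameNode type from earlier, Algorithm 7.3 can be rendered in C# as follows (our sketch; Player 1 maximizes and Player 2 minimizes, matching Remark 7.28):

static double MinimaxBB(GameNode x, double b)
{
    if (x.IsLeaf) return (int)x.Result.Value;   // leaf label is given
    bool max = (x.Player == 1);
    double localBest = max ? double.NegativeInfinity   // worst for Player 1
                           : double.PositiveInfinity;  // worst for Player 2
    foreach (var y in x.Children)
    {
        double c = MinimaxBB(y, localBest);
        if (max ? c > localBest : c < localBest) localBest = c;
        if (max ? c >= b : c <= b) break;       // prune the remaining children
    }
    return localBest;
}

The opening call of Example 7.30 below corresponds to MinimaxBB(x0, double.PositiveInfinity).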


Example 7.30. Let us use the Branch-and-Bound Minimax algorithm (Algorithm 7.3) on the decision tree from Example 7.27. We first need to choose a node and a value for the bound b to start with. It is natural to choose the root node x0. But what about b? We use the extreme value b = ∞, the best possible value for Player 1. The motivation for this value of b is that when we first enter the algorithm, we do not have any prior estimate for the Minimax label of x0, so we cannot hope to break early.

Minimax_BB(x0, INFINITY):
  x0 is not a leaf
  i = 1; localBest = -INFINITY;
  Loop: y = x1
    c = Minimax_BB(x1, -INFINITY):
      x1 is not a leaf
      i = 2; localBest = INFINITY;
      Loop: y = x3
        c = Minimax_BB(x3, INFINITY):
          x3 is a leaf: return -2;
        c = -2;
        -2 is better for Player 2 than INFINITY: Set localBest = -2;
        -2 is not at least as good for Player 2 as -INFINITY
      Loop: y = x4
        c = Minimax_BB(x4, -2):
          x4 is a leaf: return 3;
        c = 3;
        3 is not better for Player 2 than -2
        3 is not at least as good for Player 2 as -INFINITY
      End Loop
      return -2;
    c = -2;
    -2 is better for Player 1 than -INFINITY: Set localBest = -2;
    -2 is not at least as good for Player 1 as INFINITY
  Loop: y = x2
    c = Minimax_BB(x2, -2):
      x2 is not a leaf
      i = 2; localBest = INFINITY;
      Loop: y = x5
        c = Minimax_BB(x5, INFINITY):
          x5 is a leaf: return -2;
        c = -2;
        -2 is better for Player 2 than INFINITY: Set localBest = -2;
        -2 is at least as good for Player 2 as -2: break;
      End Loop
      return -2;
    c = -2;
    -2 is not better than -2 for Player 1
    -2 is not at least as good for Player 1 as INFINITY
  End Loop
  return -2;

Thus, the procedure returns the value -2, which we know is actually the correct Minimax label for node x0. But in general, what is the meaning of the return value ℓ of the Branch-and-Bound Minimax Algorithm, and will it always give the correct Minimax label for the input node x if we supply a best-possible value for b? Those questions are answered by the following result.

Proposition 7.31. Let x be a node in a decision tree T. Let i be the index (1 or 2) of the player whose turn it is at state x. Let a ≤i b denote the relation that the value a is at most as good as b for Player Pi (and similarly for the other inequality symbols). Let b be any real number, let v = Minimax_BB(x, b), and let m = λ(x), where λ is the Minimax labeling. Then we have

v = m = λ(x),    if m ≤i b;    (7.9)
b ≤i v ≤i m,     if m ≥i b.    (7.10)

Proof. We proceed by induction on the depth d of the subtree T^x.

Base case: d = 0. Then x is a leaf of T, so we have v = m. So the result holds.

Inductive step: Suppose that d ≥ 1 and our result is true for smaller values of d. We claim first that throughout the y loop, we have

localBest = maxi{λ(y)} among all children y of x considered so far.    (7.11)

Since we initialize localBest to the worst possible value for Pi, the value of localBest will change in the first iteration of the loop. Changes to the value of localBest occur when we set the new value of localBest to c = Minimax_BB(y, localBest) in case localBest >j c (that is, c is better for Player i than localBest), where j = 3 − i represents the opponent of i; so by inductive hypothesis, we must have λ(y)