Command Line Fundamentals: Learn to use the Unix command-line tools and Bash shell scripting (ISBN 178980776X, 9781789807769)


Table of Contents
Preface
Introduction to the Command Line
Introduction
Command Line: History, Shells, and Terminology
History of the Command Line
Command-Line Shells
Command-Line Terminology
Exploring the Filesystem
Filesystems
Navigating Filesystems
Exercise 1: Exploring Filesystem Contents
Manipulating a Filesystem
Exercise 2: Manipulating the Filesystem
Activity 1: Navigating the Filesystem and Viewing Files
Activity 2: Modifying the Filesystem
Shell History Recall, Editing, and Autocompletion
Command History Recall
Exercise 3: Exploring Shell History
Command-Line Shortcuts
Exercise 4: Using Shell Keyboard Shortcuts
Command-Line Autocompletion
Exercise 5: Completing a Folder Path
Exercise 6: Completing a Command
Exercise 7: Completing a Command using Options
Activity 3: Command-Line Editing
Shell Wildcards and Globbing
Wildcard Syntax and Semantics
Wildcard Expansion or Globbing
Exercise 8: Using Wildcards
Activity 4: Using Simple Wildcards
Activity 5: Using Directory Wildcards
Summary
Command-Line Building Blocks
Introduction
Redirection
Input and Output Streams
Use of Operators for Redirection
Using Multiple Redirections
Heredocs and Herestrings
Buffering
Exercise 9: Working with Command Redirection
Pipes
Exercise 10: Working with Pipes
Text-Processing Commands
Shell Input Concepts
Filtering Commands
Exercise 11: Working with Filtering Commands
Transformation Commands
Exercise 12: Working with Transformation Commands
Activity 6: Processing Tabular Data – Reordering Columns
Activity 7: Data Analysis
Summary
Advanced Command-Line Concepts
Introduction
Command Lists
Command List Operators
Using Multiple Operators
Command Grouping
Exercise 13: Using Command Lists
Job Control
Keyboard Shortcuts for Controlling Jobs
Commands for Controlling Jobs
Regular Expressions
Elements
Quantifiers
Anchoring
Subexpressions and Backreferences
Exercise 14: Using Regular Expressions
Activity 8: Word Matching with Regular Expressions
Shell Expansion
Environment Variables and Variable Expansion
Arithmetic Expansion
Brace Expansion
Recursive Expansion with eval
Command Substitution
Process Substitution
Exercise 15: Using Shell Expansions
Activity 9: String Processing with eval and Shell Expansion
Summary
Shell Scripting
Introduction
Conditionals and Loops
Conditional Expressions
Conditional Statements
Loops
Loop Control
Shell Functions
Function Definition
Function Arguments
Return Values
Local Variables, Scope, and Recursion
Exercise 16: Using Conditional Statements, Loops, and Functions
Shell Line Input
Line Input Commands
Internal Field Separator
Exercise 17: Using Shell Input Interactively
Shell Scripts
Shell Command Categories
Program Launch Process
Script Interpreters
Practical Case Study 1: Chess Game Extractor
Understanding the Problem
Exercise 18: Chess Game Extractor – Parsing a PGN File
Exercise 19: Chess Game Extractor – Extracting a Desired Game
Refining Our Script
Exercise 20: Chess Game Extractor – Handling Options
Adding Features
Exercise 21: Chess Game Extractor – Counting Game Moves
Tips and Tricks
Suppressing Command Output
Arithmetic Expansion
Declaring Typed Variables
Numeric for Loops
echo
Array Reverse Indexing
shopt
Extended Wildcards
man and info Pages
shellcheck
Activity 10: PGN Game Extractor Enhancement
Practical Case Study 2: NYC Yellow Taxi Trip Analysis
Understanding the Dataset
Exercise 22: Taxi Trip Analysis – Extracting Trip Time
Exercise 23: Taxi Trip Analysis – Calculating Average Trip Speed
Exercise 24: Taxi Trip Analysis – Calculating Average Fare
Activity 11: Shell Scripting – NYC Taxi Trip Analysis
Summary
Appendix
Index


Command Line Fundamentals
Learn to use the Unix command-line tools and Bash shell scripting

Vivek N


Command Line Fundamentals

Copyright © 2018 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Author: Vivek N
Technical Reviewer: Sundeep Agarwal
Managing Editor: Neha Nair
Acquisitions Editor: Koushik Sen
Production Editor: Samita Warang
Editorial Board: David Barnes, Ewan Buckingham, Simon Cox, Manasa Kumar, Alex Mazonowicz, Douglas Paterson, Dominic Pereira, Shiny Poojary, Saman Siddiqui, Erol Staveley, Ankita Thakur, and Mohita Vyas

First Published: December 2018
Production Reference: 1211218
ISBN: 978-1-78980-776-9




Preface

About

This section briefly introduces the author, the coverage of this book, the technical skills you'll need to get started, and the hardware and software required to complete all of the included activities and exercises.


About the Book

From the Bash shell to traditional UNIX programs, and from redirection and pipes to automating tasks, Command Line Fundamentals teaches you all you need to know about how command lines work.

The most basic interface to a computer, the command line, remains the most flexible and powerful way of processing data and performing and automating various day-to-day tasks.

Command Line Fundamentals begins by exploring the basics and then focuses on the most common tool, the Bash shell (which is standard on all Linux and macOS/iOS systems). As you make your way through the book, you'll explore the traditional UNIX command-line programs implemented by the GNU project. You'll also learn how to use redirection and pipelines to assemble these programs to solve complex problems. By the end of this book, you'll have explored the basics of shell scripting, which will allow you to easily and quickly automate tasks.

About the Author

Vivek N is a self-taught programmer who has been programming for almost 30 years now, since the age of 8, with experience in X86 Assembler, C, Delphi, Python, JavaScript, and C++. He has been working with various command-line shells since the days of DOS 4.01, and is keen to introduce the new generation of computer users to the power they hold to make their lives easier.

Objectives

• Use the Bash shell to run commands
• Utilize basic Unix utilities such as cat, tr, sort, and uniq
• Explore shell wildcards to manage groups of files
• Apply useful keyboard shortcuts in shell
• Employ redirection and pipes to process data
• Write both basic and advanced shell scripts to automate tasks

Audience

Command Line Fundamentals is for programmers who use GUIs but want to understand how to use the command line to complete tasks more quickly.


Approach

Command Line Fundamentals takes a hands-on approach to the practical aspects of exploring UNIX command-line tools. It contains multiple activities that use real-life business scenarios for you to practice and apply your new skills in a highly relevant context.

Hardware Requirements

For the optimal student experience, we recommend the following hardware configuration:
• Processor: Any modern processor manufactured after 2010
• Memory: 4 GB RAM
• Storage: 4 GB available hard disk space

Software Requirements

The ideal OS for this book is a modern Linux distribution. However, there are many dozens of flavors of Linux, with different versions, and several other OS platforms, including Windows and macOS/iOS, which are widely used. In order to make the book accessible to students using any OS platform or version, we will use a virtual machine to ensure a uniform, isolated environment. If you are not familiar with the term, a virtual machine lets an entire computer be simulated within your existing one; hence, you can use another OS (in this case, a tiny cut-down Linux distribution) as if it were running on actual hardware, completely isolated from your regular OS.

The advantage of this approach is a simple, uniform experience for all students, regardless of the system used. Another advantage is that the VM is sandboxed, and anything performed within it will not interfere in any way with the existing system. Finally, VMs allow snapshotting, which allows you to undo any serious mistakes you may make with little effort.

Once you have completed the exercises and activities in this book in the VM, you can experiment with the command-line support that is available on your individual system. Those who wish to use the commands learned in this book on their systems directly should refer to the documentation for their specific platforms, to ensure that they work as expected. For the most part, the behaviors are standard, but some platforms might only support older versions of some commands, might lack some options for some commands, or completely lack support for certain commands:

• Linux: All up-to-date Linux distributions will support all the commands and techniques taught in this book. Some may require the installation of additional packages.

• Windows: The Windows Linux Subsystem allows a few Linux distributions, such as Ubuntu and Debian, to run from within Windows. Some packages may require installation to support everything covered in this book.
• macOS and iOS: These OSes are based on FreeBSD, which is a variant of UNIX, and they include most of the GNU tools. Some packages may require installation to support everything covered in this book.

Note
If you use the VM, all the sample data required to complete the exercises and activities in this book will automatically be fetched and installed in the correct location, when the VM is started the first time. On the other hand, if you decide to use your native OS install, you will have to download the ZIP files (Lesson1.zip to Lesson4.zip) present in the code repository on GitHub and extract them into the home directory of your user account. The data consists of four folders, called Lesson1 to Lesson4, and several commands in the exercises rely on the data being in the locations ~/Lesson1 and so on. It is recommended that you stick to the VM approach unless you know what you are doing.
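If you go the native route, a short loop can extract all four archives into your home directory. This is only a sketch: it assumes the four ZIP files have already been downloaded into the current directory, that each archive contains its LessonN folder at the top level, and that the unzip utility is installed:

$ for n in 1 2 3 4; do unzip Lesson$n.zip -d ~; done

Afterward, ~/Lesson1 through ~/Lesson4 should contain the sample data where the exercises expect it.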

Installation and Setup

Before you start this book, you need to install the following software. You will find the steps to install these here:

Installing VirtualBox
Download the latest version of VirtualBox from https://www.virtualbox.org/wiki/Downloads and install it.

Setting up the VM
1. Download the VM appliance file, Packt-CLI.ova, from the Git repository here: https://github.com/TrainingByPackt/Command-Line-Fundamentals/blob/master/Packt-CLI.ova.
2. Launch VirtualBox and select File | Import Appliance:

Figure 0.1: A screenshot showing how to make the selection

The following dialog box will appear:

Figure 0.2: A screenshot displaying the dialog box

3. Browse for the Packt-CLI.ova file downloaded earlier and click Next, after which the following dialog box should be shown. The path where the Virtual Disk Image is to be saved can be changed if you wish, but the default location should be fine. Ensure there is at least 4 GB of free space available:

Figure 0.3: A screenshot showing the path where the Virtual Disk Image will be saved

4. Click Import to create the virtual machine. After the process completes, the VM name will be visible in the left-hand panel of the VirtualBox window:

Figure 0.4: A screenshot showing the successful installation of VirtualBox

5. Double-click the VM entry, Packt-CLI, to start the VM. You will see a lot of text scroll by as it boots up, and after a few seconds a GUI desktop will show up. The window may maximize to your entire screen; however, you can resize it to whatever is convenient. The desktop inside will adjust to fit in. Your system is called a host and the VM within is called a guest. VirtualBox may show a couple of information popups at the top of the VM. Read the information to understand how the VM mouse and keyboard capture works. You can click the little buttons at the extreme right of the popups that have the message Do not show this message again to prevent them from showing up again. More information can be found at https://www.virtualbox.org/manual/ch01.html#keyb_mouse_normal.

Note
In case the VM doesn't start at all, or you see a "Kernel Panic" error message in the VM window, you can usually solve this by enabling the virtualization settings in the BIOS. See https://www.howtogeek.com/213795/how-to-enable-intel-vt-x-in-your-computers-bios-or-uefi-firmware/ for an example tutorial.

When the VM starts up for the first time, it will download the sample data and snippets for this book automatically. The following window will appear:

Figure 0.5: A screenshot displaying the first-time setup script progress

There are four launcher icons in the toolbar on top, which are shown here:

Figure 0.6: A screenshot displaying the launcher icons

• The first launcher is the Root menu, which is like the Start menu of Windows. Since the guest OS is a minimal, stripped-down version, many of the programs shown there will not run. The only entry you will need to use during this book is the Log Out option.

Figure 0.7: A screenshot showing the Root menu

• The second launcher is the Thunar file manager. By default, it opens the home directory of the current user, called guest (note that the guest username has no connection to the term "guest" used in the context of virtual machines). The sample data for the chapters is in the folders Lesson1 to Lesson4. All the snippets and examples in the book material assume this location. The Snippets folder contains a subfolder for each chapter with all the exercises and activity solutions as text files.

Figure 0.8: A screenshot showing the Thunar file manager

• The third launcher is the command-line terminal application. This is what you will need to use throughout the book. Notice that it starts in the home directory of the logged-in user, guest.

Figure 0.9: A screenshot showing the command-line terminal

• The final launcher is a text editor called Mousepad, which will be useful for viewing the snippets and writing scripts during the book:

Figure 0.10: A screenshot of the text editor

Guidelines and Tips for using the VM

• The desktop environment in the guest OS is called XFCE and is very similar to Windows XP. The top toolbar shows the running tasks. The windows behave just like any other desktop environment.
• Within the console, you can select text with the mouse and paste it onto the command line with the middle mouse button (this is distinct from the clipboard). To copy selected text in the console to the clipboard, press Ctrl+Shift+C, and to paste from the clipboard into the command line, press Ctrl+Shift+V (or right-click and choose Paste). This will be useful when you try out the snippets. You can copy from the editor and paste into the command line, although it is recommended that you type them out. Be careful not to paste multiple or incomplete commands into the console, as it could lead to errors.
• To shut down the guest OS, click Log Out from the Root menu to get the following dialog:

Figure 0.11: A screenshot showing the dialogue box that appears on shut down

• To close the VM (preserving its state) and resume later, close the VM window and choose Save the machine state. The next time the VM is started, it resumes from where it was. Usually, it is preferable to use this option rather than choosing Shut down, as shown earlier.

Figure 0.12: A screenshot showing how to save your work before closing the VM

• The VM allows the guest and host OS to share the clipboard, so that text that you copy to the clipboard in the host, can be pasted into applications in the guest VM and vice versa. This is useful if you prefer to use your own editor rather than the one included in the VM. • It is strongly recommended that you close the shell window after completion of each exercise or activity, and open a fresh instance for the next.

• During the book, you may accidentally change the sample data (or the guest OS itself) in such a way that you cannot complete the exercises. To avoid starting from scratch, you are advised to create a snapshot of the VM after each exercise or activity is performed. This can be done by clicking Snapshots in the VirtualBox window:

Figure 0.13: A screenshot showing the Snapshots window

• Click Take to save the current state of the VM as a snapshot:

Figure 0.14: A screenshot showing how to take a snapshot

You can take any number of snapshots and restore them, taking the guest OS back to the exact state as when you saved it. Note that snapshots can only be restored when the guest has been shut down. Snapshots will take up some disk space. Deleting a snapshot does not affect the current state:

Figure 0.15: A screenshot showing how to restore the OS

• You are free to customize the color scheme, fonts, and preferences of the editor and console application to suit your own tastes, but be sure to take a snapshot before changing things, to avoid being left with an unusable guest OS.
• If the VM somehow becomes completely unusable (which is quite unlikely), you can always delete it and repeat the setup process.
• If you get logged out by mistake, log in as guest with the password packt.

Installing the Code Bundle

Copy the code bundle for the class to the C:/Code folder.


Conventions

Code words in text, folder names, filenames, file extensions, pathnames, user input, and example strings are shown as follows: "Navigate to the data folder inside the Lesson2 folder."

A block of code is set as follows. The text typed by the user is in bold and the output printed by the system is in regular font:

$ echo 'Hello World'
Hello World

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "Click Log Out from the Root menu."

Additional Resources

The code bundle for this book is also hosted on GitHub at https://github.com/TrainingByPackt/Command-Line-Fundamentals. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

You can also find links to the Official GNU Bash Manual and Linux man pages at https://www.gnu.org/software/bash/manual/html_node/index.html and https://linux.die.net/man/, respectively.


Introduction to the Command Line

Learning Objectives

By the end of this chapter, you will be able to:
• Describe the basics of a filesystem
• Navigate a filesystem with command-line tools
• Perform file management tasks using the command line
• Utilize shell history and tab completion to efficiently compose commands
• Utilize shell-editing shortcuts to efficiently work with the command line
• Write and use wildcard expressions to manage groups of files and folders

This chapter gives a brief history of the command line, explains filesystems, and describes how to get into and out of the command line.


Introduction

Today, with the widespread use of computing devices, graphical user interfaces (GUIs) are all-pervasive and easily learned by almost anyone. However, we should not ignore one of the most powerful tools from a bygone era, which is the command-line interface (CLI).

GUIs and CLIs approach user interaction from different angles. While GUIs emphasize user-friendliness, instant feedback, and visual aesthetics, CLIs target automation and repeatability of tasks, and composition of complicated task workflows that can be executed in one shot. These features result in the command line having widespread utility even today, nearly half a century since its invention.

For instance, it is useful for web administrators to administer a web server via a shell command-line interface: instead of running a local CLI on your machine, you remotely control one that is running thousands of miles away, as if it were right in front of you. Similarly, it is useful for developers who create the backends of websites. This role requires them to learn how to use a command line, since they often need to replicate the web server environment on their local machine for development.

Even outside the purely tech-oriented professions, almost everyone works with computers, and automation is a very helpful tool that can save a lot of time and drudgery. The CLI is specifically built to help automate things. Consider the task of a graphic designer, who downloads a hundred images from a website and resizes all of them into a standard size and creates thumbnails; a personnel manager, who takes 20 spreadsheet files with personnel data and converts all names to upper case, checking for duplicates; or a web content creator, who quickly replaces a person's name with another across an entire website's content. Using a GUI for these tasks would usually be tedious, considering that these tasks may need to be performed on a regular basis. Hence, rather than repeating these manually using specific applications, such as a download manager, photo editor, spreadsheet, and so on, or getting a custom application written, the professional in each case can use the command line to automate these jobs, consequently reducing drudgery, avoiding errors, and freeing the person to engage in the more important aspects of their job.

Besides this, every new version of a GUI invalidates a lot of what you learned earlier. Menus change, toolbars look different, things move around, and features get removed or changed. It is often a re-learning exercise filled with frustration. On the other hand, much of what we learn about the command line is almost 100% compatible with the command line of 30 years ago, and will remain so for the foreseeable future. Rarely is a feature added that will invalidate what was valid before.

Everyone should use the command line because it can make life so much easier, but there is an aura of mystery surrounding the command line. Popular depictions of command-line users are stereotypical asocial geniuses. This skewed perception makes people feel it is very arcane, complex, and difficult to learn—as if it were magic and out of the reach of mere mortals. However, just like any other thing in the world, it can be learned incrementally, step by step, and unlike learning GUI programs, which have no connection to one another, each concept or tool you learn in the command line adds up.

Command Line: History, Shells, and Terminology

It is necessary for us to explore a little bit of computing history to fully comprehend why CLIs came into being.

History of the Command Line

At the dawn of the computing age, computers were massive electro-mechanical calculators, with little or no interactivity. Stacks of data and program code in the form of punched cards would be loaded into a system, and after a lengthy execution, punched cards containing the results of the computation would be spit out by the machines. This was called batch processing (this paradigm is still used in many fields of computing even today). The essence of batch processing is to prepare the complete input dataset and the program code by hand and feed it to the machine in a batch. The computation is queued up for execution, and as soon as it finishes, the output is delivered, following which the next computation in the queue is processed.

As the field progressed, the age of the teletypewriter (TTY) arrived. Computers would take input and produce human-readable output interactively through a typewriter-like device. This was the first time that people sat at a terminal and interacted continuously with the system, looking at the results of their computations live. Eventually, TTYs with paper and mechanical keyboards were replaced by TTYs with text display screens and electronic keyboards.

This method of interaction with a computer via a keyboard and text display device is called a command-line interface (CLI), and works as follows:
1. The system prompts the user to type a sentence (a command line).
2. The system executes the command, if valid, and prints out the results.
3. This sequence repeats indefinitely, and the user conducts their work step by step.

In a more generic sense, a CLI is also called a REPL, which stands for Read, Evaluate, Print, Loop, and is defined as follows:
1. Read an input command from the user.
2. Evaluate the command.
3. Print the result.
4. Loop back to the first step.

The concept of a REPL is seen in many places—even the flight control computer on NASA's 1998 Deep Space 1 mission spacecraft had a REPL controlled from Earth, which allowed scientists to troubleshoot a failure in real time and prevent the mission from failing.
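To see how little machinery the concept requires, here is a toy REPL written in Bash itself. This is purely an illustrative sketch, not how any real shell is implemented, and running eval on untrusted input is unsafe:

#!/bin/bash
# A toy REPL: Read a line, Evaluate it as a command, Print its output, Loop.
while read -r -p 'toy> ' line; do  # Read: prompt for and read one line
  eval "$line"                     # Evaluate: run it; Print: output goes to the terminal
done                               # Loop: repeat until end-of-input (Ctrl+D)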

Command-Line Shells

CLIs that interface with the operating system are called shells. As shells evolved, they went from being able to execute just one command at a time, to executing multiple commands in sequence, repeating commands multiple times, re-invoking commands from the past, and so on. Most of this evolution happened in the UNIX world, and the UNIX CLI remains to this day the de facto standard. There are many different CLIs in UNIX itself, which are analogous to different dialects of a language—in other words, the way they interpret commands from the user varies. These CLIs are called shells because they form a shell between the internals of the operating system and the user. There are several shells that are widely used, such as the Bourne shell, Korn shell, and C shell, to name a few. Shells for other operating systems such as Windows exist too (PowerShell and DOS).

In this book, we will learn a modern reincarnation of the Bourne shell, called Bash (Bourne Again Shell), which is the most widely used, and considered the most standard. The Bash shell is part of the GNU project from the Free Software Foundation, which was founded by Richard Stallman and provides free and open source software.

During this book, we will sometimes introduce common abbreviations for lengthy terms, which the students should get accustomed to.
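If you want to confirm which shell you are using before following along, two quick checks are shown here (the version string is illustrative; yours will differ). The $SHELL variable holds the name of your login shell:

$ echo $SHELL
/bin/bash
$ bash --version
GNU bash, version 4.4.12(1)-release (x86_64-pc-linux-gnu)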


Command-Line Terminology

Before we can delve into the chapters, we will learn some introductory command-line terms that will come in handy throughout the book.

• Commands: They refer to the names that are typed to execute some function. They can be built into the shell or be external programs. Any program that's available on the system is a command.
• Arguments: The strings typed after a command are called its arguments. They tell the command how to operate or what to operate on. They are typically options or names of some data resource such as a file, URL, and so on.
• Switches/Options/Flags: These are arguments that typically start with a single or double hyphen and request a certain optional behavior from a command. Usually, an option has a short form, which is a hyphen followed by a single character, and a longer version of the same option, as a double hyphen followed by an entire word. The long option is easier to remember and often makes the command easier to read. Note that options are always case-sensitive.

The following are some examples of switches and arguments in commands:

ls -l --color --classify
grep -n --ignore-case 'needle' haystack.txt 'my data.txt'

In the preceding snippet, ls and grep are commands; -l, --color, --classify, -n, and --ignore-case are flags; and 'needle', haystack.txt, and 'my data.txt' are arguments.

Exploring the Filesystem

The space in which a command line operates is called a filesystem (FS). A lot of shell activity revolves around manipulating and organizing files; thus, learning the basics of filesystems is imperative to learning the command line. In this topic, we will learn about filesystems, and how to navigate, examine, and modify them via the shell. For regular users of computers, some of these ideas may seem familiar, but it is necessary to revisit them to have a clear and unambiguous understanding.

Filesystems

The UNIX design philosophy is to represent every object on a computer as a file; thus, the main objects that we manipulate with a command line are files. There are many different types of file-like objects under UNIX, but for our purposes, we will deal with simple data files, typically ASCII text files, that are human readable.

From this UNIX perspective, the system is accessible under what is termed a filesystem (FS). An FS is a representation of the system that's analogous to a series of nested boxes, each of which is called a directory or folder. Most of us are familiar with this folder structure, which we would have encountered when using a GUI file manager. A directory that contains another directory is called the parent of the latter. The latter is called a sub-directory of the former.

On UNIX-like systems, the outermost directory is called the root directory, and each directory can contain either files or other directories in turn. Some files are not data, but rather represent devices or other resources on the system. To be concise, we will refer to folders, regular files, and special files as FS objects.

Typically, every user of a system has their own distinct home directory, named after the user's name, where they store their own data. Various other directories used by the operating system, called system directories, exist on the filesystem, but we need not concern ourselves with them for the purposes of this book. For the sake of simplicity, we will assume that our entire filesystem resides on only a single disk or partition (although this is not true in general):

Figure 1.1: An illustration of an example structure of a typical filesystem

The notation used to refer to a location in a filesystem is called a path. A path consists of the list of directories that need to be navigated to reach some FS object. The list is separated by a forward slash, which is called a path separator. The complete location of an FS object, including its path from the root directory onward, is called a fully qualified pathname.

Paths can be absolute or relative. An absolute path starts at the root directory, whereas a relative path starts at what is called the current working directory (CWD). Every process that runs on a system is started with its CWD set to some location. This includes the command-line process itself. When an FS object is accessed within the CWD, the name of the object alone is enough to refer to it. The root directory itself is represented by a single forward slash; thus, any absolute path starts with a single forward slash. The following is an example of an absolute path starting at the root directory:

/home/robin/Lesson1/data/cupressaceae/juniperus/indica

Special syntax is used to refer to the current, parent, and user's home directories:

• ./ refers to the current directory explicitly. The CWD is implicit in many cases, but this is useful when the current directory needs to be explicitly specified as an argument to some commands. For instance, the same directory that we've just seen can be expressed relative to the CWD (/home/robin, in this case) as follows: one pathname specifying ./ explicitly and one without:

./Lesson1/data/cupressaceae/juniperus/indica
Lesson1/data/cupressaceae/juniperus/indica

• ../ refers to the parent directory. This can be extended further, such as ../../../, and so on. For instance, the preceding directory can be expressed relative to the parent of the CWD, as follows:

../robin/Lesson1/data/cupressaceae/juniperus/indica

The ../ takes us one level up to the parent of all the user home directories, and then we go back down to robin and the rest of the path.

• ~/ refers to the home directory of the current user.

• ~robin/ refers to the home directory of a user called "robin". This is a useful shorthand, because the home directory of a user could be configured to be anywhere in the filesystem. For example, macOS keeps the users' home directories in /Users, whereas Linux systems keep them in /home.

Note
The trailing slash symbol at the end of a directory pathname is optional. The shell does not mandate this. It is usually typed only to make it obvious that it is the name of a directory rather than a file.
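The following short session illustrates these shorthands in practice (a sketch, assuming the user robin starts in /home/robin with the Lesson1 sample data in place):

$ cd Lesson1/data            # relative to the CWD
$ cd ./cupressaceae          # the same idea, with an explicit ./ prefix
$ cd ../../..                # up three levels, back to /home/robin
$ cd ~/Lesson1               # into robin's home directory, from anywhere
$ cd /home/robin/Lesson1     # the equivalent absolute path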

Navigating Filesystems

We will now look briefly at the most common commands for moving around the filesystem and examining its contents:

• The cd (change directory) command changes the CWD to the path specified as its argument—if the path is non-existent, it prints an error message. Specifying just a single hyphen as the argument to cd changes the CWD to the last directory that was navigated from.
• The pwd (print working directory) command simply displays the absolute path of the CWD.
• The pushd and popd (push directory and pop directory) commands are used to bookmark the CWD and return to it later, respectively. They work by pushing and popping entries on to an internal directory stack, hence the names pushd and popd. Since they use a stack, you can push multiple values and pop them later in reverse order.
• The tree command displays the hierarchical structure of a directory as a text-based diagram.
• The ls (list) command displays the contents of one or more specified directories (by default, the CWD) in various formats.

• The cat (concatenate) command outputs the concatenation of the contents of the files specified to it. If only one file is specified, it simply displays the file. This is a quick way to look at a file's content, if the files are small. cat can also apply some transformations on its output, such as numbering the lines or suppressing multiple blank lines.
• The less command can be used to interactively scroll through one or more files easily, search for a string, and so on. This command is called a pager (it lets text content be viewed page by page). On most systems, less is configured to be the default pager. Other commands that require a pager interface will request the default pager from the system for this purpose.

Here are some of the most useful keyboard shortcuts for less:
(a) The up or down and Page Up or Page Down keys scroll vertically.
(b) The Enter and spacebar keys scroll down by one line and one screenful, respectively.
(c) The < and > or g and G characters scroll to the beginning and end of the file, respectively.
(d) / followed by a string and then Enter searches for the specified string. The occurrences are also highlighted.
(e) n and N jump to the next or previous match, respectively.
(f) Esc followed by u turns off the highlights.
(g) h shows a help screen, with the list of shortcuts and commands that are supported.
(h) q exits the application, or exits the help screen if it is being shown.

There are many more features for navigating, searching, and editing that less provides, which we will not cover in this basic introduction.

Commonly Used Options for the Commands

The following options are used with the ls command:
• The -l option (which stands for long list) shows the contents with one entry per line—each column in the listing shows some specific information, namely permissions, link count, owner, group, size, and modification time, followed by the name. For the purposes of this book, we will only consider the size and the name. Information about the type of each FS object is indicated in the first character of the permissions field. For example, - for a file, and d for a directory.

• The --reverse option sorts the entries in reverse order. This is an example of a long option, where the option is a complete word, which is easy to remember. Long options are usually aliases for short options—in this case, the corresponding short option is -r.
• The --color option is used to make different kinds of FS objects display in different colors—there is no corresponding short option for this.

The following options are used with the tree command:
• The -d option prints only directories and skips files
• The -o option writes the output to a file rather than the display
• The -H option generates a formatted HTML output, and typically would be used along with -o to generate an HTML listing to serve on a website (see the sample invocations after the next paragraph)

Before going ahead with the exercises, let's establish some conventions for the rest of this book. Each chapter of this book includes some test data to practice on. Throughout this book, we will assume that each chapter's data is in its own folder called Lesson1, Lesson2, and so on. In all of the exercises that follow, it is assumed that the work is in the home directory of the logged-in user (here, the user is called robin).
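To make these options concrete, here are a few sample invocations. Treat this as a sketch: the cat flags shown (-n to number lines, -s to squeeze repeated blank lines) are the GNU coreutils spellings of the transformations mentioned earlier, and GNU tree's -H flag takes a base URL (here, .) to use for the links in the generated HTML:

$ ls -l --reverse                # long listing, reverse-sorted; same as ls -l -r
$ ls -l --color                  # long listing with entries colored by type
$ cat -n data.txt                # display a file with line numbers
$ cat -s data.txt                # collapse runs of blank lines into one
$ tree -d                        # show only the directory hierarchy
$ tree -d -H . -o listing.html   # the same hierarchy as HTML, written to a file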

Exercise 1: Exploring Filesystem Contents

In this exercise, we will navigate through a complex directory structure and view files using the commands learned so far. The sample data used here is a dataset of conifer trees, hierarchically structured as per botanic classification, which will be used in future activities and exercises too.

1. Open the command-line shell.
2. Navigate to the Lesson1 directory and examine the contents of the folder with the ls command:

robin ~ $ cd Lesson1
robin ~/Lesson1 $ ls
data  data1

In the preceding code snippet, the part of the first line up to the $ symbol is called a prompt. The system is prompting for a command to be typed. The prompt shows the current user, in this case robin, followed by the CWD ~/Lesson1. The text shown after the command is what the command itself prints as output.

Note
Recall that ~ means the home directory of the current user.

3. Use the cd command to navigate to the data directory and examine its contents with ls:

robin ~/Lesson1 $ cd data
robin ~/Lesson1/data $ ls
cupressaceae  pinaceae  podocarpaceae  taxaceae

Note
Notice that the prompt shown afterward displays the new CWD. This is not always true. Depending on the configuration of the system, the prompt may vary, and may even be a simple $ symbol with no other information shown.

4. The ls command can be provided with one or more arguments, which are the names of files and folders to list. By default, it lists only the CWD. The following snippet can be used to view the subdirectories within the taxaceae and podocarpaceae directories:

robin ~/Lesson1/data $ ls taxaceae podocarpaceae
podocarpaceae/:
acmopyle     dacrydium      lagarostrobos  margbensonia  podocarpus    saxegothaea
afrocarpus   falcatifolium  lepidothamnus  microcachrys  prumnopitys   stachycarpus
dacrycarpus  halocarpus     manoao         nageia        retrophyllum  sundacarpus
parasitaxus  pherosphaera   phyllocladus

taxaceae/:
amentotaxus  austrotaxus  cephalotaxus  pseudotaxus  taxus  torreya

The dataset contains a directory for every member of the botanical families of coniferous trees. Here, we can see the top-level directories for each botanical family. Each of these has subdirectories for the genera, and those in turn for the species.

5. You can also use ls to request a long output in color, as follows:

robin ~/Lesson1/data $ ls -l --color
total 16
drwxr-xr-x 36 robin robin 4096 Aug 20 14:01 cupressaceae
drwxr-xr-x 15 robin robin 4096 Aug 20 14:01 pinaceae
drwxr-xr-x 23 robin robin 4096 Aug 20 14:01 podocarpaceae
drwxr-xr-x  8 robin robin 4096 Aug 20 14:01 taxaceae

6. Navigate into the taxaceae folder, and then use the tree command to visualize the directory structure at this point. For clarity, specify the -d option, which instructs it to display only directories and exclude files:

robin ~/Lesson1/data $ cd taxaceae
robin ~/Lesson1/data/taxaceae $ tree -d

You should get the following output on running the preceding command:

Figure 1.2: The directory structure of the taxaceae folder (not shown entirely)

7. cd can be given a single hyphen as an argument to jump back to the last directory that was navigated from:

robin ~/Lesson1/data/taxaceae $ cd taxus
robin ~/Lesson1/data/taxaceae/taxus $ cd -
/home/robin/Lesson1/data/taxaceae

Observe that it prints out the absolute path of the directory it is changing to.

Note
The home directory is stored in /home on UNIX-based systems. Other operating systems such as macOS may place them in other locations, so the output of some of the following commands may differ slightly from that shown here.

8. We can move upwards in the hierarchy by using .. any number of times. Type the first command that follows to reach the home directory, which is three levels up. Then, use cd - to return to the previous location:

robin ~/Lesson1/data/taxaceae $ cd ../../..
robin ~ $ cd -
/home/robin/Lesson1/data/taxaceae
robin ~/Lesson1/data/taxaceae $

9. Use cd without any arguments to go to the home directory. Then, once again, use cd - to return to the previous location:

robin ~/Lesson1/data/taxaceae $ cd
robin ~ $ cd -
/home/robin/Lesson1/data/taxaceae
robin ~/Lesson1/data/taxaceae $

10. Now, we will explore commands that help us navigate the folder structure, such as pwd, pushd, and popd. Use the pwd command to display the path of the CWD, as follows:

robin ~/Lesson1/data/taxaceae $ pwd
/home/robin/Lesson1/data/taxaceae

The pwd command may not seem very useful when the CWD is being displayed in the prompt, but it is useful in some situations, for example, to copy the path to the clipboard for use in another command, or to share it with someone.

11. Use the pushd command to navigate into a folder, while remembering the CWD:

robin ~/Lesson1/data/taxaceae $ pushd taxus/baccata/
~/Lesson1/data/taxaceae/taxus/baccata ~/Lesson1/data/taxaceae

Use it once again, saving this location to the stack too:

robin ~/Lesson1/data/taxaceae/taxus/baccata $ pushd ../sumatrana/
~/Lesson1/data/taxaceae/taxus/sumatrana ~/Lesson1/data/taxaceae/taxus/baccata ~/Lesson1/data/taxaceae

Using it yet again, now we have three folders on the stack:

robin ~/Lesson1/data/taxaceae/taxus/sumatrana $ pushd ../../../pinaceae/cedrus/deodara/
~/Lesson1/data/pinaceae/cedrus/deodara ~/Lesson1/data/taxaceae/taxus/sumatrana ~/Lesson1/data/taxaceae/taxus/baccata ~/Lesson1/data/taxaceae
robin ~/Lesson1/data/pinaceae/cedrus/deodara $

Notice that it prints out the list of directories that have been saved so far. Since it is a stack, the list is ordered according to recency, with the first entry being the one we just changed into.

12. Use popd to walk back down the directory stack, successively visiting the folders we saved earlier. Notice the error message when the stack is empty:

robin ~/Lesson1/data/pinaceae/cedrus/deodara $ popd
~/Lesson1/data/taxaceae/taxus/sumatrana ~/Lesson1/data/taxaceae/taxus/baccata ~/Lesson1/data/taxaceae
robin ~/Lesson1/data/taxaceae/taxus/sumatrana $ popd
~/Lesson1/data/taxaceae/taxus/baccata ~/Lesson1/data/taxaceae
robin ~/Lesson1/data/taxaceae/taxus/baccata $ popd
~/Lesson1/data/taxaceae
robin ~/Lesson1/data/taxaceae $ popd
bash: popd: directory stack empty

The entries on the directory stack are added and removed from the top of the stack as pushd and popd are used, respectively.

13. Each of the folders for a species has a text file called data.txt that contains data about that tree from Wikipedia, which we can view with cat. Use the cat command to view the file's content, after navigating into the taxus/baccata directory:
robin ~/Lesson1/data/taxaceae $ cd taxus/baccata
robin ~/Lesson1/data/taxaceae/taxus/baccata $ cat data.txt
The output will look as follows:

Figure 1.3: A screenshot showing a partial output of the data.txt file

Notice that the output from the last command scrolled out of view rapidly. cat is not ideal for viewing large files. You can scroll back through the window manually to see the contents, but the scrollback may not hold the whole output. To view files in a more user-friendly, interactive fashion, we can use the less command.
14. Use ls to see that there is a file called data.txt, and then use the less command to view it:
robin ~/Lesson1/data/taxaceae/taxus/baccata $ ls -l
total 40
-rw-r--r-- 1 robin robin 38260 Aug 16 01:08 data.txt
robin ~/Lesson1/data/taxaceae/taxus/baccata $ less data.txt
The output is shown here:

Figure 1.4: A screenshot showing the output of the less command
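A few keystrokes cover most of what you need inside less: the arrow keys and Page Up/Page Down scroll through the file, typing / followed by a phrase (for example, /yew) searches forward for it, n jumps to the next match, and q quits back to the shell prompt.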

In this exercise, we have practiced the basic commands used to view directories and files. We have not covered all of the options available with these commands in detail, but what we have learned so far will serve for most of our needs. Given this basic knowledge, we should be able to find our way around the entire filesystem and examine any file that we wish.

Manipulating a Filesystem
So far, we have looked at commands that only examine directories and files. Now we will learn how to manipulate filesystem objects. We will not be manipulating the contents of files yet, but only their locations in the filesystem.

Here are the most common commands that are used to modify a filesystem. The commonly used options for some of these commands are also mentioned:
• mkdir (make directory) creates the directory specified as its argument. It can also create a hierarchy of directories in one shot. The -p or --parents flag tells mkdir to create all the parent directories for the path if they do not exist. This is useful when creating a nested path in one shot.
• rmdir (remove directory) is used to remove a directory. It only works if the directory is empty. The -p or --parents flag works similarly to how it does in mkdir: all the directories along the specified path are deleted if they are empty.
• touch is used to create an empty file or update an existing file's timestamp.
• cp (copy) is used to copy files or folders between directories. When copying directories, it can recursively copy all subdirectories, too. The syntax for this command is as follows:
cp <source> <destination>
Here, <source> is the path of one or more files and folders to be copied, and <destination> is the path of the folder into which they are copied. <destination> can be a filename, if <source> is a single filename. The following options can be used with this command:
The -r or --recursive flag is necessary when copying folders. It recursively copies all of the folder's contents to the destination.
The -v or --verbose flag makes cp print out the source and destination pathname of every file it copies.
• mv (move) can be used to rename an object and/or move it to another directory, as illustrated in the sketch after this list.
Note
The mv command performs both renaming and moving, but these are not two distinct functions. If you think about it, renaming a file and moving it to a different path on the same disk are the same thing. Inherently, a file's content is not related to its name, and a change to its name does not affect its contents. In a sense, a pathname is also a part of a file's name.

• rm (remove) deletes a file permanently, and can also be used to delete a directory, recursively deleting all of its subdirectories. Unlike sending files to the Trashcan or Recycle Bin in a GUI, files deleted with rm cannot be recovered. This command has the following options:
The -r or --recursive flag deletes folders recursively.
The -v or --verbose flag makes rm print out the pathname of every file it deletes.
The -i or --interactive=always option lets you review and confirm each entry before it is deleted. Answering n rather than y to the prompts (Enter must be pressed after y or n) will skip deleting some files or skip entire directories.
The -I or --interactive=once option prompts only once, before removing more than three files or when removing recursively, whereas -i prompts for each and every file or directory.
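As a quick sketch of the mv behaviors mentioned above (the filenames here are hypothetical, purely for illustration):
robin ~ $ mv notes.txt ideas.txt                   # rename within the same directory
robin ~ $ mv ideas.txt archive/                    # move into archive/, keeping the name
robin ~ $ mv archive/ideas.txt archive/plans.txt   # rename in place inside archive/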

Exercise 2: Manipulating the Filesystem
In this exercise, we will learn how to manipulate the FS and the files within it. We will modify the directories in the Lesson1 folder by creating, copying, and deleting files/folders using the commands that we learned about previously:
1. Open a command-line shell and navigate to the directory for this lesson:
robin ~ $ cd Lesson1/
robin ~/Lesson1 $
2. Create some directories, using mkdir, that classify animals zoologically. Type the commands shown in the following snippet:
robin ~/Lesson1 $ mkdir animals
robin ~/Lesson1 $ cd animals
robin ~/Lesson1/animals $ mkdir canis
robin ~/Lesson1/animals $ mkdir canis/familiaris
robin ~/Lesson1/animals $ mkdir canis/lupus
robin ~/Lesson1/animals $ mkdir canis/lupus/lupus
robin ~/Lesson1/animals $ mkdir leopardus/colocolo/pajeros
mkdir: cannot create directory 'leopardus/colocolo/pajeros': No such file or directory

3. Notice that mkdir normally creates a subdirectory only within an already-existing directory, so it raises an error when we try to make leopardus/colocolo/pajeros. Use the --parents or -p switch to overcome this error:
robin ~/Lesson1/animals $ mkdir -p leopardus/colocolo/pajeros
robin ~/Lesson1/animals $ mkdir --parents panthera/tigris
robin ~/Lesson1/animals $ mkdir panthera/leo
4. Now, use tree to view and verify the directory structure we created:
robin ~/Lesson1/animals $ tree
The directory structure is shown here:

Figure 1.5: The directory structure of the animals folder

5. Now use the rmdir command to delete the directories. Try the following code snippets:
robin ~/Lesson1/animals $ rmdir canis/familiaris/
robin ~/Lesson1/animals $ rmdir canis
rmdir: failed to remove 'canis': Directory not empty
robin ~/Lesson1/animals $ rmdir canis/lupus
rmdir: failed to remove 'canis/lupus': Directory not empty

6. Notice that it raises an error when trying to remove a directory that is not empty. You need to empty the directory first: remove canis/lupus/lupus, and then use the -p option to remove both canis/lupus and its parent, canis:
robin ~/Lesson1/animals $ rmdir canis/lupus/lupus
robin ~/Lesson1/animals $ rmdir -p canis/lupus

7. Now, use tree to view the modified directory structure, as follows:
robin ~/Lesson1/animals $ tree
The directory structure is shown here:

Figure 1.6: A screenshot of the output displaying the modified folder structure of the animals folder

8. Create some directories with the following commands:
robin ~/Lesson1/animals $ mkdir -p canis/lupus/lupus
robin ~/Lesson1/animals $ mkdir -p canis/lupus/familiaris
robin ~/Lesson1/animals $ ls
canis leopardus panthera
9. Create some dummy files with the touch command, and then view the entire tree again:
robin ~/Lesson1/animals $ touch canis/lupus/familiaris/dog.txt
robin ~/Lesson1/animals $ touch panthera/leo/lion.txt
robin ~/Lesson1/animals $ touch canis/lupus/lupus/wolf.txt
robin ~/Lesson1/animals $ touch panthera/tigris/tiger.txt
robin ~/Lesson1/animals $ touch leopardus/colocolo/pajeros/colocolo.txt
robin ~/Lesson1/animals $ tree

The output will look as follows:

Figure 1.7: A screenshot of the output displaying the revised folder structure of the animals folder

10. Use cp to copy the dog.txt and wolf.txt files from the familiaris and lupus directories into a new directory called dogs, as follows:
robin ~/Lesson1/animals $ mkdir dogs
robin ~/Lesson1/animals $ cp canis/lupus/familiaris/dog.txt dogs/
robin ~/Lesson1/animals $ cp canis/lupus/lupus/wolf.txt dogs/
robin ~/Lesson1/animals $ tree

The output will look as follows:

Figure 1.8: A screenshot of the output displaying the revised folder structure of the animals folder, along with the newly copied files

11. Now clone the entire panthera directory into a new directory called cats using cp:
robin ~/Lesson1/animals $ mkdir cats
robin ~/Lesson1/animals $ cp -r panthera cats
robin ~/Lesson1/animals $ tree
The output will look as follows:

Figure 1.9: A screenshot of the output displaying the revised folder structure of the animals folder

12. Now use the --verbose option with cp to copy the files with verbose progress displayed, and print the result using the tree command:
robin ~/Lesson1/animals $ mkdir bigcats
robin ~/Lesson1/animals $ cp -r --verbose leopardus/ panthera/ bigcats
'leopardus/' -> 'bigcats/leopardus'
'leopardus/colocolo' -> 'bigcats/leopardus/colocolo'
'leopardus/colocolo/pajeros' -> 'bigcats/leopardus/colocolo/pajeros'
'leopardus/colocolo/pajeros/colocolo.txt' -> 'bigcats/leopardus/colocolo/pajeros/colocolo.txt'
'panthera/' -> 'bigcats/panthera'
'panthera/tigris' -> 'bigcats/panthera/tigris'
'panthera/tigris/tiger.txt' -> 'bigcats/panthera/tigris/tiger.txt'
'panthera/leo' -> 'bigcats/panthera/leo'
'panthera/leo/lion.txt' -> 'bigcats/panthera/leo/lion.txt'
robin ~/Lesson1/animals $ tree bigcats
The output of the tree command is shown here:

Figure 1.10: A screenshot of the output displaying the folder structure of the animals folder after a recursive directory copy

13. Now use mv to rename the animals folder to beasts:
robin ~/Lesson1/animals $ cd ..
robin ~/Lesson1 $ mv animals beasts
robin ~/Lesson1 $ cd beasts
robin ~/Lesson1/beasts $ ls
bigcats canis cats dogs leopardus panthera

14. Use mv to move an individual file to a different path. We move dogs/dog.txt to the CWD as fido.txt and then move it back again:
robin ~/Lesson1/beasts $ mv dogs/dog.txt fido.txt
robin ~/Lesson1/beasts $ ls
bigcats canis cats dogs fido.txt leopardus panthera
robin ~/Lesson1/beasts $ mv fido.txt dogs/
15. Use mv to relocate an entire folder. Move the whole canis folder into dogs:
robin ~/Lesson1/beasts $ mv canis dogs
robin ~/Lesson1/beasts $ tree dogs
The revised folder structure is shown here:

Figure 1.11: A screenshot of the output displaying the folder structure of the animals folder after relocating a folder

16. Use the -v or --verbose option with mv to make it report each item being moved. In this case, only one item was moved, but this can be a long list:
robin ~/Lesson1/beasts $ mkdir panthers
robin ~/Lesson1/beasts $ mv --verbose panthera panthers
renamed 'panthera' -> 'panthers/panthera'
robin ~/Lesson1/beasts $ tree panthers
The output is shown here:

Figure 1.12: A screenshot of the output displaying the folder structure of the animals folder after moving a folder

17. Use tree to view the dogs folder (before we use rm to delete it):
robin ~/Lesson1/beasts $ tree dogs
The output is shown here:

Figure 1.13: A screenshot of the output displaying the folder structure of the animals folder before the deletion of files

18. Delete the files one by one with rm:
robin ~/Lesson1/beasts $ rm dogs/fido.txt
robin ~/Lesson1/beasts $ rm dogs/wolf.txt
robin ~/Lesson1/beasts $ rm dogs/canis/lupus/familiaris/dog.txt
robin ~/Lesson1/beasts $ rm dogs/canis/lupus/lupus/wolf.txt
robin ~/Lesson1/beasts $ tree dogs

The output is shown here:

Figure 1.14: The folder structure of the animals folder after the deletion of files

19. Remove the complete directory structure with the -r or --recursive switch of rm:
robin ~/Lesson1/beasts $ ls
bigcats cats dogs leopardus panthers
robin ~/Lesson1/beasts $ rm -r dogs
robin ~/Lesson1/beasts $ ls
bigcats cats leopardus panthers
As we can see, the entire dogs directory was silently removed, without warning.

20. Use the -i flag to remove items interactively. Each individual operation is prompted for confirmation:
Note
Depending on your system configuration, the prompts you see for the following command and the one in step 21 may be in a different order or worded differently. The system will prompt you for every deletion to be performed, regardless.

robin ~/Lesson1/beasts $ rm -r -i panthers
rm: descend into directory 'panthers'? y
rm: descend into directory 'panthers/panthera'? y
rm: descend into directory 'panthers/panthera/leo'? y
rm: remove regular empty file 'panthers/panthera/leo/lion.txt'? n
rm: remove directory 'panthers/panthera/leo'? n
rm: descend into directory 'panthers/panthera/tigris'? n
robin ~/Lesson1/beasts $ ls
bigcats cats leopardus panthers
Now use the -I flag to remove items interactively. Confirmation is asked for only a few times, not for each file:
robin ~/Lesson1/beasts $ rm -r -I bigcats
rm: remove 1 argument recursively? y
robin ~/Lesson1/beasts $ ls
cats leopardus panthers
21. Use the -v or --verbose option to make rm report each file or directory that is removed:
robin ~/Lesson1/beasts $ rm -r -v panthers/
removed 'panthers/panthera/leo/lion.txt'
removed directory 'panthers/panthera/leo'
removed 'panthers/panthera/tigris/tiger.txt'
removed directory 'panthers/panthera/tigris'
removed directory 'panthers/panthera'
removed directory 'panthers/'

22. Now clear out the entire folder we used for this exercise so that we can move on to the next lesson with a blank slate:
robin ~/Lesson1/beasts $ cd ..
robin ~/Lesson1 $ ls
beasts data data1
robin ~/Lesson1 $ rm -r beasts
robin ~/Lesson1 $ ls
data data1
In this exercise, we learned how to change or extend the structure of the filesystem tree. We have yet to learn how to create and manipulate the content within files, which will be covered in future chapters.

Activity 1: Navigating the Filesystem and Viewing Files
For this activity, use the conifer tree dataset that has been supplied as a hierarchy of folders representing each tree's Family, Genus, and Species. Every species has an associated text file called data.txt containing information about the species, mined from its Wikipedia page. Your aim is to navigate this hierarchy via the command line and answer basic questions about certain species by looking up the data in those text files. Navigate through the directories within the example dataset provided for this lesson and answer the following questions:
1. Provide two common names for the species Cedrus Deodara, which belongs to the Pinaceae family.
2. Look up information about Abies Pindrow in the Pinaceae family and fill in the following blank: "The name pindrow derives from the tree's name in _______".
3. How many species of the Taxus genus in the family Taxaceae are documented in this dataset?
4. How many species in total are documented in this dataset?
Follow these steps to complete this activity:
1. Use the cd command to navigate to the appropriate folder and use less to read the relevant information.
2. Use the cd command to navigate to the appropriate folder and view the file with less. Use the / command to search for the phrase "derives from" and read the rest of the sentence to get the answer.

3. Navigate to the right folder and run the tree command, which reports the number of directories in it. Each directory is a species.
4. Navigate to the top-level data folder and run the tree command, which reports the number of files. Each file is associated with one species.
The expected answers for the preceding questions are as follows:
1. Any two of the following: deodar cedar, Himalayan cedar, deodar, devdar, devadar, devadaru
2. Nepali
3. 12
4. 770
Note
The solution for this activity can be found on page 270.

Activity 2: Modifying the Filesystem
For this activity, you will be using the conifer tree sample dataset that is in the ~/Lesson1/data folder. You need to collect the data for all trees from the family taxaceae and the genus torreya into one folder. Each file should be named <species>.txt, where <species> is the name of the species/folder. Execute the following steps to complete this objective:
1. Use the cd command to go into the Lesson1 folder and create a new folder called activity2.
2. Navigate to the folder for the genus specified and view the subfolders, which represent each species.
3. Use the cp command to copy a data file from one subdirectory of the data/taxaceae/torreya folder into the output folder.
4. Use the mv command to rename the file as per the species name.
5. Repeat steps 3 and 4 for all the species that are requested.

The expected listing of the activity2 folder is as follows:

Figure 1.15: A screenshot of the expected listing of the activity2 folder

Note The solution for this activity can be found on page 270.

So far, we have explored the space in which a command-line shell operates. In a GUI, we deal with an abstract space of windows, menus, applications, and so on. In contrast, a CLI is based on a lower layer of the operating system: the filesystem. In this topic, we have learned what a filesystem is, how to navigate it, and how to examine its structure and the contents of its files using the command line. We also learned how to modify the FS structure and perform simple file management tasks. We learned how the shell is a way to provide precise, unambiguous, and repeatable instructions to the computer. You may have noticed that most command-line tools perform just one simple function. This stems from one of the UNIX design philosophies: do only one thing, but do it well. These small commands can be combined like the parts of a machine into constructs that can automate tasks and process data in complex ways. The focus of this topic was mainly to get familiar with the FS, the arena where most command-line work happens. In the next topic, we will learn how to reduce effort when composing commands, making use of several convenience features in Bash.


Shell History Recall, Editing, and Autocompletion
In the previous section, we experienced the fact that we need to repeatedly type some commands, and often type out the pathnames of files and folders. Indeed, this can get quite tedious when we work with long or hard-to-spell pathnames (both of which are present in our tree dataset). To counter this, we can use a few convenient features of modern command-line shells to reduce typing effort. We will explore these useful keyboard shortcuts for the command line in this section. The GNU Bash shell uses an interface library called readline. This same interface is used by several other programs (for example, gdb, python, and Node.js); hence, what you learn now applies to the CLIs of all of those, too. The readline interface supports emacs and vi modes, whose keyboard shortcuts are derived from the iconic editors of those names. Since the default is the emacs mode, we will study only that.
Note
When indicating shortcuts, the convention is to show a combination of the Ctrl key and another key using the caret symbol '^' with the key. For example, Ctrl + C is indicated by ^C.
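That said, if you ever want to try the vi-style keybindings, readline can be switched at any time; these two commands toggle between the modes (everything in this section assumes the default emacs mode):
robin ~ $ set -o vi
robin ~ $ set -o emacs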

Command History Recall The Bash shell retains a history of the past commands that were typed. Depending on the system configuration, anywhere from a few hundred to a few thousand commands could be maintained in the history log. Any command from the history can be brought back and re-executed (after optionally modifying it). Basic History Navigation Shortcuts History is accessed by using the following shortcuts: • The up and down arrow keys move through the command history. • Esc + < and Esc + > or Page Up and Page Down or Alt + < and Alt + > move to the first and last command in the history. The other shortcuts listed may or may not work depending on the system's configuration. • Ctrl + S and Ctrl + R let you incrementally search for a command in the history forward and backward, respectively, by typing any substring of the command.

Navigating through the history of past commands with the up and down arrow keys or with Esc + < and Esc + > is quite straightforward. As you navigate, the command appears on the prompt and can be executed by pressing Enter, either immediately or after editing it.
Note
In the aforementioned shortcuts, remember that < and > imply that the Shift key is held down, since these are the secondary symbols on the keyboard.

To view the entire history, we can use the history command:
robin ~ $ history
An example output is shown here:

Figure 1.16: A screenshot of the output displaying the shell command history

This command can perform other tasks related to history management as well, but we will not concern ourselves with that for this book.

Incremental Search
This feature lets you find a command in the history that matches a few characters that you type. To perform a forward incremental search, press Ctrl + S, upon which the shell prompt changes to something like this:
robin ~ $ ^S
(i-search)`':
When we press Ctrl + R instead, we see the following prompt:
robin ~ $ ^R
(reverse-i-search)`':
i-search stands for incremental search. When these prompts are displayed, the shell expects a few characters that appear within a command to be typed. As they are typed, the command that matches those characters as a substring is displayed. If more than one command matches the input, the list of matches can be iterated backward and forward with Ctrl + R and Ctrl + S, respectively. The incremental search happens from the point where you have currently navigated to in the history (with the arrow keys and so on). If there are no more matches in the given direction, the prompt changes to something similar to what is shown here:
(failed reverse-i-search)`john': man join
At this point, we can do the following:
• Backspace the search string that was typed, to widen the range of matches and find one.
• Change the search direction. We can press Ctrl + S if we were searching backward, or Ctrl + R if we were searching forward, to return to any previous match that was crossed over.
• Press Esc to exit the search and accept whatever match was shown last.
• Press Ctrl + G to exit the search and leave the command line empty.
Note
On some systems, Ctrl + S does not activate incremental search. Instead, it performs an unrelated function. To make sure it works as we require it to, type the following command once in the console before the exercises here: stty -ixon.

Remember that the search happens relative to the current location in history, so if you start a search without navigating upward in the history, searching forward will have no effect, since there are no commands after the current history location (that is, the present). This means that searching backward with Ctrl + R is generally the more frequently used and useful feature. Most of the time, a history search comes in handy for retyping a long command from the recent past, or for retrieving a complex command typed long ago, whose details have been forgotten. As you progress in your command-line knowledge and experience, you will find that although it is easy to compose complicated command lines when you have a certain problem to solve, it is not easy to recollect them after a long period of time has passed. Keeping this in mind, it makes sense to conserve your mental energy and reuse old commands from history, rather than try to remember or recreate them from scratch. Indeed, it is possible to configure Bash to save your entire history indefinitely, so that you never lose any command that you ever typed in the shell.
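One common way to do this (a sketch, assuming Bash 4.3 or later, where a value of -1 means "unlimited") is to set the history size variables in your ~/.bashrc:
HISTSIZE=-1        # no limit on the in-memory history list
HISTFILESIZE=-1    # no limit on the history file on disk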

Exercise 3: Exploring Shell History
In this exercise, we will use the history search feature to repeat some commands from an earlier exercise. Make sure that you are in the Lesson1 directory before starting:
1. Create a temporary directory called data2 to work with:
robin ~/Lesson1 $ mkdir data2
robin ~/Lesson1 $ cd data2
robin ~/Lesson1/data2 $
2. Press Ctrl + R to start a reverse incremental search, and then type "animals". The most recent command with that string will be shown.
3. Press Ctrl + R two times to search backward until we get the command we need, and then press Enter to execute it:
(reverse-i-search)`animals': mkdir animals
robin ~/Lesson1/data2 $ mkdir animals
robin ~/Lesson1/data2 $ cd animals
4. Find the command that created the directory for the species of the domestic dog, canis/lupus/familiaris. The string familiaris is quite unique, so we can use that as a search pattern. Press Esc + < to reach the start of the history and Ctrl + S to start searching forward from that point. Type "fa" and press Ctrl + S two more times to get the command we are searching for. Finally, press Enter to execute it:
(i-search)`fa': mkdir -p canis/lupus/familiaris
robin ~/Lesson1/data2/animals $ mkdir -p canis/lupus/familiaris

5. Repeat the same command, except change the directory to create canis/lupus/lupus. Press the up arrow to get the same command again. Change the last word to lupus and press Enter to create the new directory:
robin ~/Lesson1/data2/animals $ mkdir -p canis/lupus/lupus
In this brief exercise, we have seen how to retrieve commands that we typed previously. We can move through the history linearly or search for a command, saving ourselves a lot of retyping.

Command-Line Shortcuts
There are many keyboard shortcuts on Bash that let you modify an already typed command. Usually, it is more convenient to take an existing command from the history and edit it to form a new one, rather than retype everything.
Navigation Shortcuts
The following are some navigation shortcuts:
• The left and right arrow keys, as well as Home and End, work as per standard conventions. Ctrl + A and Ctrl + E are alternatives for Home and End.
• Alt + F and Alt + B jump by one word forward and backward, a word being a contiguous string of numbers and letters.
Clipboard Shortcuts
The following are some clipboard shortcuts:
• Alt + Backspace cuts the word to the left of the cursor
• Alt + D cuts the word to the right of the cursor, including the character under the cursor
• Ctrl + W cuts everything to the left of the cursor until a whitespace character is encountered
• Ctrl + K cuts everything from the cursor to the end of the line
• Ctrl + U cuts everything from the cursor to the start of the line, excluding the character under the cursor
• Ctrl + Y pastes what was just cut
• Alt + Y cycles through the previously cut entries one by one (works only after pasting with Ctrl + Y)

Other Shortcuts
The following are some other shortcuts that may come in useful:
• Alt + \ deletes all whitespace characters at the cursor, that is, it joins two words that are separated by whitespace.
• Ctrl + T swaps the current and previous character. This is useful for correcting typos.
• Alt + T swaps the current and previous word.
• Ctrl + Shift + _ undoes the last keypress.
• Alt + R reverts all changes to a line. This is useful for reverting a command from history back to what it was originally.
• Alt + U converts the characters from the cursor position until the next word boundary to uppercase.
• Alt + L converts the characters from the cursor position until the next word boundary to lowercase.
• Alt + C capitalizes the first letter of the word under the cursor and moves to the next word.
There are several other shortcuts, but these are the most useful. It is not necessary to memorize all of them, but the navigation and cut/paste shortcuts are certainly worth learning by heart.
Note
The clipboard that the readline interface in Bash uses is distinct from the clipboard provided in the GUI. The two are independent mechanisms and should not be confused with each other. When you use any other command-line interface that uses readline, for example, the Python shell, it gets its own independent clipboard.


Exercise 4: Using Shell Keyboard Shortcuts
In this exercise, we will try out some of the command-line shortcuts. For simplicity, we will introduce the echo command to help with this exercise. This command merely prints out its arguments without causing any side effects. The examples here are contrived to help illustrate the editing shortcuts:
1. Run the following command:
robin ~/Lesson1/data2/animals $ echo one two three four five/six/seven
one two three four five/six/seven
2. Press the up arrow key to get the same command again. Press Alt + B three times. The cursor ends up at five. Type "thousand" followed by a space, and press Enter to execute the edited command:
robin ~/Lesson1/data2/animals $ echo one two three four thousand five/six/seven
one two three four thousand five/six/seven
3. Now use the cut and paste shortcuts as follows: press the up arrow key to get the previous command, press Alt + Backspace to cut the last word seven into the clipboard, press Alt + B twice (the cursor ends up at five), use Ctrl + Y to paste the word that we cut, type a forward slash, and finally press Enter:
robin ~/Lesson1/data2/animals $ echo one two three four thousand seven/five/six/
one two three four thousand seven/five/six/
4. Press the up arrow key to get the previous command, press Alt + B four times (the cursor ends up at thousand), press Alt + D to cut that word (notice that an extra space was left behind), press End to go to the end of the line, use Ctrl + Y to paste the word that we cut, and press Enter to execute the command:
robin ~/Lesson1/data2/animals $ echo one two three four  seven/five/six/thousand
one two three four seven/five/six/thousand

5. Press the up arrow key to get the previous command again, press Alt + B three times (the cursor ends up at five), press Ctrl + K to cut to the end of the line, press Alt + B to go back one word (the cursor ends up at seven), use Ctrl + Y to paste the text that we cut, type a forward slash, and press Enter to execute the command:
robin ~/Lesson1/data2/animals $ echo one two three four five/six/thousand/seven/
one two three four five/six/thousand/seven/

6. Press the up arrow key to get the previous command once more, press Alt + B three times (the cursor ends up at six), press Ctrl + U to cut to the start of the line, press Alt + F to move forward one word (the cursor ends up at /thousand), press Ctrl + Y to paste the content we cut earlier, press Home and type echo, and then press the spacebar and then Enter to execute the command:
robin ~/Lesson1/data2/animals $ echo sixecho one two three four five//thousand/seven/
sixecho one two three four five//thousand/seven/

In this exercise, we have explored how to use the editing shortcuts to efficiently construct commands. With some practice, it becomes quite unnecessary to compose a command from scratch. Instead, we compose them from older ones.

Command-Line Autocompletion
We all use auto-suggest on our mobile devices, but surprisingly, this feature has existed in Bash for decades. Bash provides the following context-sensitive completion when you type commands:
• File and pathname completion
• Command completion, which suggests the names of programs and commands
• Username completion
• Options completion
• Customized completion for any program (many programs, such as Git, add their own completion logic)

Completion is invoked on Bash by entering a few characters and pressing the Tab key. If there is only one possible completion, it is immediately inserted into the command line; otherwise, the system beeps. Then, if Tab is pressed again, all the possible completions are shown. If the possible completions are too numerous, a confirmation prompt is shown before displaying them.
Note
Depending on the system's configuration, the number of possible command completions seen will vary, since different programs may be installed on different systems.

Exercise 5: Completing a Folder Path
In this exercise, we will explore hands-on how the shell autocompletes folder paths for us:
1. Open a new command shell and return to the directory that we recreated from history in the earlier exercise:
robin ~ $ cd Lesson1/data2/animals
robin ~/Lesson1/data2/animals $
2. Type cd canis/ and press Tab three times. It completes the command to cd canis/lupus/ and shows two possible completions:
robin ~/Lesson1/data2/animals $ cd canis/lupus/
familiaris/ lupus/
robin ~/Lesson1/data2/animals $ cd canis/lupus/
3. Type f and press Tab to choose the completion familiaris:
robin ~/Lesson1/data2/animals $ cd canis/lupus/familiaris/


Exercise 6: Completing a Command
In this exercise, we will use command completion to suggest commands (after each sequence here, clear the command line with Ctrl + U or Alt + Backspace):
1. Type "les" and press Tab to produce the completion:
robin ~/Lesson1/data2/animals $ less
2. Type "rmd" and press Tab to produce the completion:
robin ~/Lesson1/data2/animals $ rmdir
3. If we do not type enough characters, the number of possible completions may be large. For instance, type "g" and press Tab twice (it beeps the first time to indicate that there is no single completion). The shell shows a confirmation prompt before showing all possible commands that start with "g", since there are too many:
robin ~/Lesson1/data2/animals $ g
Display all 184 possibilities? (y or n)
In such cases, it is more practical to say n, because poring over so many possibilities is time-consuming and defeats the purpose of completion.

Exercise 7: Completing a Command using Options
In this exercise, we will use completion to suggest the long options for commands (after each sequence here, clear the command line with Ctrl + U):
1. Type "ls --col" and press Tab to produce the completion:
robin ~/Lesson1/data2/animals $ ls --color
2. Type "ls --re" and press Tab twice to produce the list of two possible completions:
robin ~/Lesson1/data2/animals $ ls --re
--recursive --reverse
3. Then, type "c" and press Tab to select --recursive as the completion:
robin ~/Lesson1/data2/animals $ ls --recursive
After performing these exercises, we have learned how the shell autocompletes text for us based on the context. The autocompletion is extensible, and many programs, such as docker and git, install completions for their commands, too.


Activity 3: Command-Line Editing
You are provided with the following list of tree species' names:
1. Pinaceae Cedrus Deodara
2. Cupressaceae Thuja Aphylla
3. Taxaceae Taxus Baccata
4. Podocarpaceae Podocarpus Alba
Each line has the family, genus, and species written like this: Podocarpaceae Lepidothamnus Intermedius. You need to type out each of these entries and use command-line shortcuts to convert them into a command that prints out the path of the data.txt file associated with the species. You need to work out the most efficient way to compose each command, reducing typing effort and errors. Use the conifer tree sample data for this chapter that is in the ~/Lesson1/data folder and follow these steps to complete this activity:
1. Navigate to the data folder.
2. Type out a line from the list manually, for example, Podocarpaceae Lepidothamnus Intermedius.
3. Use as few keystrokes as possible to generate a command that prints out the name of the file associated with that species, in this case: echo podocarpaceae/lepidothamnus/intermedius/data.txt.
4. Repeat steps 2 and 3 for all the entries.

You should obtain the following paths for the data.txt files for the given species:
pinaceae/cedrus/deodara/data.txt
cupressaceae/thuja/aphylla/data.txt
taxaceae/taxus/baccata/data.txt
podocarpaceae/podocarpus/alba/data.txt
Note
If you are typing any piece of text multiple times, you can save time by typing it only once and then using the cut and paste functionality. You might want to experiment with the behavior of the two "cut word" shortcuts for this particular case. The solution for this activity can be found on page 272.

In this topic, we have examined the more hands-on interactive facilities that commandline shells provide. Without the time-saving features of history, completion, and editing shortcuts, the command line would be very cumbersome. Indeed, some old primitive command shells from the 1980s such as MS-DOS lacked most, if not all, of these features, making it quite a challenge to use them effectively. Going forward, we will delve deeper into file management operations by utilizing a powerful concept called wildcard expansion, also known as shell globbing.

Shell Wildcards and Globbing
In the preceding exercises and activities, notice that we often perform the same operation on multiple files or folders. The whole point of a computer is that we should never have to instruct it to do the same thing twice by hand. If we perform any repeated action using a computer, there is usually some way that it can be automated to reduce the drudgery. Hence, in the context of the shell too, we need an abstraction that lets us handle a bunch of files together. This abstraction is called a wildcard.

The term wildcard originates from card games, where a certain card can substitute for whatever card the player wishes. When any command is sent to the shell, before it is executed, the shell performs an operation called wildcard expansion or globbing on each of the strings that make up the command line. The process of globbing replaces a wildcard expression with all the file or pathnames that match it.
Note
This wildcard expansion is not performed on any strings that are quoted with single or double quotes. Quoted arguments will be discussed in detail in a future chapter.
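A quick way to see this for yourself: quoting suppresses the expansion, so the quoted pattern below is printed verbatim, whereas an unquoted echo *.gif would print the names of all the matching files instead:
robin ~/Lesson1/data1 $ echo '*.gif'
*.gif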

Wildcard Syntax and Semantics
A wildcard is any string that contains any of the following special characters:
• A ? matches one occurrence of any character. For example, ?at matches cat, bat, and rat, and every other three-letter string that ends with "at".
• A * matches zero or more occurrences of any character. For example, image.* matches image.png, image.jpg, image.bmp.zip, and so on.
• A ! followed by a pair of parentheses containing another wildcard expands to strings that do not match the contained expression.
Note
The exclamation operator is an "extended glob" syntax and may not be enabled by default on your system. To enable it, the following command needs to be executed: shopt -s extglob.

There are a few more advanced shell glob expressions, but we will restrict ourselves to these most commonly used ones for now.

Wildcard Expansion or Globbing
When the shell encounters a wildcard expression on the command line, it is internally expanded to all the files or pathnames that match it. This process is called globbing. Even though it looks as though one wildcard argument is present, the shell has converted it into multiple arguments before the command runs.

Note that a wildcard can match paths across the whole filesystem:
• * matches all the directories and files in the current directory
• /* matches everything in the root directory
• /*/* matches everything exactly two levels deep from the root directory
• /home/*/.bashrc matches a file named .bashrc in every user's home directory
At this point, a warning is due: this powerful matching mechanism can end up matching files that the user never intended if the wildcard is not specified correctly. Hence, you must exercise great care when running commands that use wildcards to modify or delete files. For safety, run echo with the glob expression first, to view what files it gets expanded to, as sketched below. Once we are sure that the wildcard is correct, we can run the actual command that affects the files.
Note
Since the shell expands wildcards into individual arguments, we can run into a situation where the number of arguments exceeds the limit that the system supports. We should be aware of this limitation when using wildcards.
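A minimal sketch of this habit (the .tmp files here are hypothetical): preview the expansion first, and only run the destructive command once the list looks right:
robin ~ $ echo *.tmp
draft.tmp scratch.tmp
robin ~ $ rm *.tmp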

Let's dive into an exercise and see how we can use wildcards.

Exercise 8: Using Wildcards
In this exercise, we will practice the use of wildcards for file management by creating folders and moving files with specific file formats into those folders.
Note
Some of the commands used in this exercise produce many screenfuls of output, so we show them only partially, or not at all.

1. Open the command-line shell and navigate to the ~/Lesson1/data1 folder:
robin ~ $ cd Lesson1/data1
There are over 11,000 files in this folder, all of which are empty dummy files, but their names come from a set of real-world files.
2. Use a wildcard to list all the GIF files: *.gif matches every file that ends with .gif:
robin ~/Lesson1/data1 $ ls *.gif
The output is shown here:

Figure 1.17: A screenshot of the output displaying a list of all GIF files within the folder

3. Create a new folder named gif, and use the wildcard representing all GIF files to move all of them into that folder:
robin ~/Lesson1/data1 $ mkdir gif
robin ~/Lesson1/data1 $ mv *.gif gif
4. Verify that there are no GIF files left in the CWD:
robin ~/Lesson1/data1 $ ls *.gif
ls: cannot access '*.gif': No such file or directory
5. Verify that all of the GIFs are in the gif folder:
robin ~/Lesson1/data1 $ ls gif/

The output is shown here:

Figure 1.18: A screenshot of a partial output of the gif files within the folder

6. Make a new folder called jpeg and use multiple wildcard arguments with mv to move all JPEG files into that folder:
robin ~/Lesson1/data1 $ mkdir jpeg
robin ~/Lesson1/data1 $ mv *.jpeg *.jpg jpeg
7. Verify with ls that no JPEG files remain in the CWD:
robin ~/Lesson1/data1 $ ls *.jpeg *.jpg
ls: cannot access '*.jpeg': No such file or directory
ls: cannot access '*.jpg': No such file or directory
8. List the jpeg folder to verify that all the JPEGs are in it:
robin ~/Lesson1/data1 $ ls jpeg
The output is shown here:

Figure 1.19: A screenshot of a partial output of the .jpeg files within the folder

9. List all .so (shared object library) files that have only a single digit as the trailing version number:
robin ~/Lesson1/data1 $ ls *.so.?
The output is shown here:

Figure 1.20: A screenshot of a partial output of the .so files ending with a dot, followed by a one-character version number

10. List all files that start with "google" and have an extension:
robin ~/Lesson1/data1 $ ls google*.*
google_analytics.png google_cloud_dataflow.png google_drive.png
google_fusion_tables.png google_maps.png google.png
11. List all files that start with "a", have the third character "c", and have an extension:
robin ~/Lesson1/data1 $ ls a?c*.*
archer.png archive_entry.h archive.h archlinux.png avcart.png
12. List all of the files that do not have the .jpg extension:
robin ~/Lesson1/data1 $ ls !(*.jpg)

The output is shown here:

Figure 1.21: A screenshot of a partial output of the non-.jpeg files in the folder

13. Before we conclude this exercise, get the sample data back to how it was, in preparation for the next activity. First, move the files within the jpeg and gif folders back to the current directory:
robin ~/Lesson1/data1 $ mv gif/* .
robin ~/Lesson1/data1 $ mv jpeg/* .
Then, delete the empty folders:
robin ~/Lesson1/data1 $ rm -r gif jpeg
Now, having learned the basic syntax, we can write wildcards to match almost any group of files and paths, so we rarely ever need to specify filenames individually. Even in a GUI file manager, it takes more effort than this to select groups of files (for example, all .gifs), and doing so can be error-prone or frustrating when hundreds or thousands of files are involved.


Activity 4: Using Simple Wildcards
The supplied sample data in the Lesson1/data1 folder has about 11,000 empty files of various types. Use wildcards to move each file into a directory representing its category, namely images, binaries, and misc, and count how many files of each category exist. Through this activity, you will get familiar with using simple wildcards for file management. Follow these steps to complete this activity:
1. Create the three directories representing the categories specified.
2. Move all of the files with the extensions .jpg, .jpeg, .gif, and .png into the images folder.
3. Move all of the files with the extensions .a, .so, and .so followed by a period and a version number into the binaries folder.
4. Move the remaining files with any extension into the misc folder.
5. Count the files in each folder using a shell command.
You should get the following answers: 3,674 images, 5,368 binaries, and 1,665 misc.
Note
The solution for this activity can be found on page 273.

Activity 5: Using Directory Wildcards
The supplied sample data inside the Lesson1/data folder has a taxonomy of tree species. Use wildcards to get the count of the following:
1. The species whose family starts with the character p, and whose genus has a as the second character.
2. The species whose family starts with the character p, whose genus has i as the second character, and whose species has u as the second character.
3. The species whose family as well as genus starts with the character t.

This activity will help you get familiar with using simple wildcards that match directories. Follow these steps to complete this activity:
1. Navigate to the data folder.
2. Use the tree command with a wildcard for each of the three conditions to get the count of species.
You should get the following answers: 83 species, 26 species, and 19 species.
Note
The solution for this activity can be found on page 273.

Summary
We have introduced a lot of material in this first chapter, which is probably quite novel to anyone approaching the command line for the first time. Even in this brief exploration, we can start to see how seemingly complicated filesystem tasks can be completed with minimal effort. In the coming chapter, we will add to our toolbox of useful shell programs that process text data. In later chapters, we will learn about the mechanisms that tie these commands together, such as piping and redirection, to perform complex data-processing tasks. We will also learn about regular expressions and shell expansion constructs that let us manipulate textual data in powerful ways.

2

Command-Line Building Blocks
Learning Objectives
By the end of this chapter, you will be able to:
• Use redirection to control command input and output
• Construct pipelines between commands
• Use commands for text filtering
• Use text transformation commands
• Analyze tabular data using data-processing commands
This chapter introduces you to the two main composing mechanisms of command lines: redirection and piping. You will also expand your vocabulary of commands to be able to perform a wide variety of data-processing tasks.


Introduction
So far, we have learned the basics of how to work with the filesystem using the shell. We have also looked at some shell mechanisms, such as wildcards and completion, that simplify life on the command line. In this chapter, we will examine the building blocks that are used to perform data-processing tasks in the shell. The Unix approach is to favor small, single-purpose utilities with very well-defined interfaces. Redirection and pipes let us connect these small commands and files together, so that we can compose them like the elements of an electronic circuit to perform complex tasks. This concept of joining together small units into a more complex mechanism is a very powerful technique. Most of the data that we typically work with is textual in nature, so we will study the most useful text-oriented commands in this chapter, along with various practical examples of their usage.

Redirection
Redirection is a method of connecting files to a command. This mechanism is used to capture the output of a command or to feed input to it.
Note
During this section, we will introduce a few commands briefly, in order to illustrate some concepts. The commands are only used as examples, and their usage does not have any connection to the main topics being covered here. The detailed descriptions of all the features and uses of those commands will be covered in the topic on text-processing commands.

Input and Output Streams
Every command that is run has a channel for data input, termed standard input (stdin); a channel for data output, termed standard output (stdout); and a channel for error reporting, termed standard error (stderr). A command reads data from stdin and writes its results to stdout. If any error occurs, the error messages are written to stderr. These channels can also be thought of as streams through which data flows. By convention, stdin, stdout, and stderr are assigned the numbers 0, 1, and 2, which are called file descriptors (FDs). We will not go into the technical details of these, but remember the association between these streams and their FD numbers.

When a command is run interactively, the shell attaches the input and output streams to the console input and output, respectively. Note that, by default, both stdout and stderr go to the console display.
Note
The terms console, terminal, and TTY are often used interchangeably. In essence, they refer to the interface where commands are typed and output is produced as text. Console or terminal output refers to what the command prints out. Console or terminal input refers to what the user types in.

For instance, when we use ls to list an existing and a non-existing folder from the dataset we used in the previous chapter, we get the following output:
robin ~/Lesson1/data $ ls podocarpaceae/ nonexistent
ls: cannot access 'nonexistent': No such file or directory
podocarpaceae/:
acmopyle      dacrydium       lagarostrobos   margbensonia   parasitaxus    podocarpus    saxegothaea
afrocarpus    falcatifolium   lepidothamnus   microcachrys   pherosphaera   prumnopitys   stachycarpus
dacrycarpus   halocarpus      manoao          nageia         phyllocladus   retrophyllum  sundacarpus

Note the error message that appears on the first line of the output. This is due to the stderr stream reaching the console, whereas the remaining output is from stdout. The outputs from both channels are combined.

Use of Operators for Redirection
We can tell the shell to connect any of the aforementioned streams to a file using the following operators:
• The > or greater-than symbol is used to specify output redirection to a file, and is used as follows:
command >file.txt
This instructs the shell to redirect the stdout of a command into a file. If the file already exists, its content is overwritten.

• The >> symbol is used to specify output redirection with append, appending data to a file, as follows:
command >>file.txt
This instructs the shell to redirect the stdout of command into a file. If the file does not exist, it is created, but if it does exist, it gets appended to, rather than overwritten, unlike in the previous case.
• The < or less-than symbol is used to specify input redirection from a file. The syntax is as follows:
command <file.txt
This instructs the shell to feed the contents of the file into the stdin of the command.
• Any of these operators can be prefixed with a file descriptor number, as in N< or N>, which lets us redirect the stream corresponding to that FD. If no FD is specified, the defaults are stdin (FD 0) and stdout (FD 1) for input and output, respectively. Typically, the FD prefix 2 is used to redirect stderr.
Note
To redirect both stdout and stderr to the same file, the special operator &> is used.

When the shell runs one of the preceding commands, it first opens the files to/from which redirection is requested, attaches the streams to the command, and then runs it. Even if a command produces no output, if output redirection is requested, then the file is created (or truncated).
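A small illustration of these operators, using ls with a deliberately non-existent name: the error message lands in err.txt; out.txt is created empty even though nothing was written to stdout; and &> captures both streams in one file:
robin ~/Lesson1/data $ ls nonexistent >out.txt 2>err.txt
robin ~/Lesson1/data $ cat err.txt
ls: cannot access 'nonexistent': No such file or directory
robin ~/Lesson1/data $ cat out.txt
robin ~/Lesson1/data $ ls nonexistent &>both.txt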

Using Multiple Redirections
Both input and output redirection can be specified for the same command, so a command can be considered as consisting of the following parts (of which the redirections are optional):
• stdin redirection
• stdout redirection
• stderr redirection
• The command itself, along with its options and other arguments

A key insight is that the order in which these parts appear in the command line does not matter. For example, consider the following sort commands (the sort command reads a file via stdin, sorts the lines, and writes the result to stdout):
sort <data.txt >sorted.txt
sort >sorted.txt <data.txt
Both are perfectly valid. However, from a conceptual level, it is convenient to think of the redirection symbol and the file as a single command-line element, and to write the command as follows:
sort <data.txt >sorted.txt
This is to emphasize that the filenames data.txt and sorted.txt are attached to the respective streams, that is, stdin and stdout. Remember that the redirection symbol is always written first, followed by the filename. The symbol points in the direction of the data flow, which is either from or into the file.

Heredocs and Herestrings
A useful convention that most commands follow is that they accept a command-line argument for the input file, but if the argument is omitted, the input data is read from stdin instead. This lets commands easily be used both with redirection (and piping), as well as in a standalone manner. For example, consider the following:
less file.txt

This less command gets the filename file.txt passed as an argument, which it then opens and displays. Now, consider the following:
less <file.txt
Here, less is given no filename argument, so it reads the same data from its stdin, to which the shell has attached the file. A heredoc provides another way to feed literal lines into a command's stdin, terminated by a limit string (DONE in the following example):
robin ~ $ cat <<DONE
> This is some text
> Some more text
> OK, enough
> DONE
This is some text
Some more text
OK, enough
Note
Observe the difference between steps 9 and 10. In step 9, cat processes each line that is typed and prints it back immediately. This is because the TTY (which is connected to the stdin of cat) waits for the Enter key to be pressed before it writes the complete line into the stdin of cat. Thereupon, cat outputs that line, emptying its input buffer, and goes to sleep until the next line arrives. In step 10, the TTY is connected to the shell itself, rather than to the cat process. The shell is, in turn, connected to cat and does not send any data to it until the limit string is encountered, after which the entire text that was typed goes into cat at once.

11. The bc command is an interactive calculator. We can use a herestring to make it do a simple calculation. Type the following to get the seventh power of 1234:
robin ~/Lesson1/data $ bc <<< "1234^7"
4357186184021382204544
A long command can also be continued across lines by ending each line with a backslash:
robin ~ $ echo once again this is a very long command, let me extend \
> to the next line \
> and then once again
once again this is a very long command, let me extend to the next line and then once again
The backslash must be the last character of the line for this to work, and a command can be divided into as many lines as desired, with the line breaks having no effect on the command. In a similar fashion, we can enter a literal multiline string containing newlines, simply by using quotes. Although they appear like multiline commands, multiline strings do not ignore the newlines that are typed. The rules for single and double quoted strings described earlier apply to multiline strings as well. For example:
robin ~ $ echo 'First line
> Second line
> Last line'
First line
Second line
Last line

Filtering Commands
Commands of this category operate by reading the input line by line, transforming it, and (optionally) producing an output line for each input line. They can be considered analogous to a filtering process.

Concatenate Files: cat
The cat command is primarily meant for concatenating files and for viewing small files, but it can also perform some useful line-oriented transformations on the input data. We have used cat before, but there are some options it provides that are quite useful. The long and short versions of some of these options are as follows:
• -n, --number: Numbers output lines. The numbers are padded with spaces and followed by a tab.
• -b, --number-nonblank: Numbers nonempty output lines, and overrides -n.
• -s, --squeeze-blank: Removes repeated empty output lines.
• -E, --show-ends: Displays $ at the end of each line.
• -T, --show-tabs: Displays tab characters as ^I.
Among the preceding options, the numbering options are particularly useful.
Translate: tr
The tr command works like a translator, reading the input stream and producing a translated output according to the rules specified in the arguments. The basic syntax is as follows:
tr SET1 SET2
This translates characters from SET1 into corresponding ones from SET2.
Note
The tr command always works only on its standard input and does not take an argument for an input file.

There are three basic modes of tr that can be selected with a command-line flag, plus a modifier:
• No flag: Replaces any character that belongs to one set with the corresponding one from another set.
• -d or --delete: Removes all characters that belong to a given set.
• -s or --squeeze-repeats: Elides repeated occurrences of any character that belongs to a given set, leaving only one occurrence.
• -c: Uses the complement of the first set; this modifies any of the preceding modes.
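As a quick illustration of these modes (the sample strings here are our own, not from the chapter's data):

robin ~ $ echo 'aabbcc 123' | tr -d '0-9'
aabbcc
robin ~ $ echo 'aabbcc' | tr -s 'a-z'
abc
robin ~ $ echo 'aabbcc 123' | tr -cd '0-9\n'
123

The first command deletes all digits, the second squeezes repeated letters, and the third deletes the complement of the digit set (we keep \n in the set so that the trailing newline survives).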

The character sets passed to tr can be specified in various ways:
• With a list of characters written as a string, such as abcde
• As a range such as a-z or 0-9
• As multiple ranges such as a-zA-Z
• As one of the following special character classes, which consist of an expression in square brackets (only the most common are listed here):
(a) [:alnum:] for all letters and digits
(b) [:alpha:] for all letters
(c) [:blank:] for all horizontal whitespace
(d) [:cntrl:] for all control characters
(e) [:digit:] for all digits
(f) [:graph:] for all printable characters, not including space
(g) [:lower:] for all lowercase letters
(h) [:print:] for all printable characters, including space
(i) [:punct:] for all punctuation characters
(j) [:space:] for all horizontal or vertical whitespace
(k) [:upper:] for all uppercase letters
(l) [:xdigit:] for all hexadecimal digits

Note
Character classes are used in many commands, so it's useful to remember the common ones.
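For instance, a character class can be used to strip all punctuation from a line (sample string our own):

robin ~ $ echo 'Hello, World! 42' | tr -d '[:punct:]'
Hello World 42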

Stream Editor: sed

The sed command is a very comprehensive tool that can transform text in various ways. It could be considered a mini programming language in itself. However, we will restrict ourselves to using it for its most common function: search and replace.

sed reads from stdin and writes transformed output to stdout based on the rules passed to it as an argument. In its basic form for replacing text in the stream, the syntax used is shown here:

sed 'pattern'

Here, pattern is a string such as s/day/night/FLAGS, which consists of several parts. In the preceding code, for example:
• s is the operation that sed is to perform. s stands for substitute.
• / is the delimiter, which indicates that everything after this until the next delimiter is to be treated as one string.
• day is the string that sed searches for.
• / again is a delimiter, indicating the end of the search string.
• night is the string that sed should replace the search string with.
• / is again a delimiter, indicating the end of the replacement string.
• FLAGS is an optional list of characters that modify how the search and replace is done. The most common characters are as follows:
(a) g stands for global, which tells sed to replace all matches of the search string (the default behavior is to replace only the first).
(b) i stands for case-insensitive, which tells sed to ignore case when matching.
(c) A number, N, specifies that the Nth match alone should be replaced. Combining the g flag with this specifies that all matches including and after the Nth one are to be replaced.
The delimiter is not mandated to be the / character. Any character can be used, as long as the same one is used at all three locations. Thus, all the following patterns are equivalent:

's#day#night#'
's1day1night1'
's:day:night:'
's day night '
'sAdayAnightA'
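To make this concrete, here is a small example of our own showing the substitute operation and a numeric flag:

robin ~ $ echo 'a long day' | sed 's/day/night/'
a long night
robin ~ $ echo 'day day day' | sed 's/day/night/2'
day night day

The second command replaces only the second match on the line, leaving the first and third untouched.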

Multiple patterns can be combined by separating them with a semicolon. For instance, the following pattern tells sed to replace day with night and long with short:

's/day/night/ ; s/long/short/'

Character classes can be used for the search string, but they need to be enclosed in an extra pair of square brackets. The reason for this will become apparent when we learn regular expressions in a later chapter. The following pattern tells sed to replace all alphanumeric characters with an asterisk symbol:

's/[[:alnum:]]/*/g'

Cut Columns: cut

The cut command interprets each line of its input as a series of fields and prints out a subset of those fields based on the specified flags. The effect of this is to select a certain set of columns from a file containing columnar data. The following is a partial list of the flags that can be used with cut:
• -d DELIM, --delimiter=DELIM: Uses DELIM as the field delimiter (the default is the TAB character).
• -b LIST, --bytes=LIST: Selects only the specified bytes.
• -f LIST, --fields=LIST: Selects only the fields specified by LIST, and prints any line that contains no delimiter character, unless the -s option is specified.
• -s, --only-delimited: Does not print lines not containing delimiters.
• --complement: Complements the set of selected bytes, characters, or fields.
• --output-delimiter=DELIM: When printing the output, DELIM is used as the field delimiter. By default, the input delimiter is used.
Here, the syntax of LIST is a comma-separated list of one or more of the following expressions (M and N are numbers):
• N: The Nth element is selected
• M-N: Elements starting from the Mth up to the Nth inclusive are selected
• M-: Elements starting from the Mth up to the last element are selected
• -N: Elements from the beginning up to the Nth inclusive are selected
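For instance, LIST selection works like this on an inline sample of our own:

robin ~ $ echo 'a,b,c,d' | cut -d',' -f2-3
b,c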

Let's look at an example of using cut. The sample data for this chapter has a file called pinaceae.csv, which contains a list of tree species with comma-separated fields, some of which are empty. It looks like this (only a few lines are shown):

Figure 2.8: View of the first few lines of the data from the pinaceae.csv file

Here, cut is used to extract data from the third column onward, using the comma character as the input delimiter, and display the output with tabs as the delimiter (only a few lines are shown):

robin ~/Lesson2 $ cut -s -d',' -f 3- --output-delimiter=$'\t' pinaceae.csv | less

The output is as follows:

Figure 2.9: Partial output of the cut command

Note the usage of dollar-single-quotes to pass the tab character to cut as a delimiter.

Paste Columns from Files Together: paste

paste works like the opposite of cut. While cut can extract one or more columns from a file, paste combines files that have columnar data. It does the equivalent of pasting a set of columns of data side by side in the output. The basic syntax of paste is as follows:

paste filenames

This instructs paste to read a line from each file specified and produce a line of output that combines those lines, delimited by a tab character. Think of it like pasting files side by side in columns. The paste command has one option that is commonly used:
• -d DELIMS, --delimiters=DELIMS: Uses DELIMS as field delimiters (the default is the tab character)
DELIMS specifies individual delimiters for each field. For example, if it is set to XYZ, then X, Y, and Z are used as the delimiters after each column, respectively. Since paste works with multiple input files, it is typically used on its own without pipes, because we can only pipe one stream of data into a command. A combination of cut and paste can be used to reorder the columns of a file: first extract the columns to separate files with cut, and then recombine them with paste.

Globally Search a Regular Expression and Print: grep

grep is one of the most useful and versatile tools on UNIX-like systems. Its basic purpose is to search for a pattern within a file. This command is so widely used that grep is listed as a verb meaning "to search" in the Oxford dictionary.

Note
A complete description of grep would be quite overwhelming. In this book, we will instead focus on the smallest useful subset of its features.

The basic syntax of grep is as follows:

grep pattern filenames

This instructs grep to search for the specified pattern within the files listed as arguments. The pattern can be any string or a regular expression, and multiple files can be specified as arguments. Omitting the filename argument(s) makes grep read from stdin, as with most commands. The default action of grep is to print the lines that contain the pattern. Here is a list of the most commonly used flags for grep:
• -i, --ignore-case: Matches lines case-insensitively
• -v, --invert-match: Selects non-matching lines
• -n, --line-number: For every match, shows the line number in the file as a prefix
• -c, --count: Only prints the number of matches per file
• -w, --word-regexp: Only matches a pattern if it appears as a complete word
• -x, --line-regexp: Only matches a pattern if it appears as a complete line
• --color, --colour: Displays results in color on the terminal (no effect will be observed if the output is not to a TTY console)
• -L, --files-without-match: Only shows the names of files that do not have a match
• -l, --files-with-matches: Only shows the names of files that have a match
• -m NUM, --max-count=NUM: Stops after NUM matching lines
• -A NUM, --after-context=NUM: Prints NUM lines that succeed each matching line
• -B NUM, --before-context=NUM: Prints NUM lines that precede each matching line
• -C NUM, --context=NUM: Prints NUM lines that precede as well as NUM lines that succeed each matching line
• --group-separator=STRING: When -A, -B, or -C are used, prints STRING instead of --- between groups of lines
• --no-group-separator: When -A, -B, or -C are in use, does not print a separator between groups of lines
• -R: Searches all files within a folder recursively
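To see a couple of these flags in isolation (input supplied via a herestring; the sample text is our own):

robin ~ $ grep -i -w 'the' <<< $'The cat\nsat there'
The cat
robin ~ $ grep -i -c 'the' <<< $'The cat\nsat there'
2

With -w, the second line does not match because "the" appears there only inside the word "there"; without -w, the count includes both lines.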

For an example of how grep works, we will use the man command (which stands for manual), since it's a handy source of English text to use as test data. The man command outputs the built-in documentation for any command or common terminology. Try the following command:

man ascii | grep -n --color 'the'

Here, we ask man to show the manual page for ascii, which includes the ASCII code table and some supplementary information. That output is piped to grep, which searches for the string "the" and prints the matching lines, numbered and colorized:

Figure 2.10: A screenshot displaying the output of the grep command

man uses the system pager (which is less) to display the manual, so the keyboard shortcuts are the same as less. The output that man provides for a command is called a man page. Note Students are encouraged to read man pages to learn more about any command; however, the material is written in a style more suited for people who are already quite used to the command line, so watch out for unfamiliar or complex material.

Print Unique Lines: uniq

The basic function of the uniq command is to remove duplicate lines in a file. In other words, all the lines in the output are unique. The commonly used options of uniq are as follows:
• -d, --repeated: Prints the lines that occur more than once, but only prints those lines once.
• -D: Prints every occurrence of a line that occurs more than once.
• -u, --unique: Only prints unique lines; does not print lines that have any duplicates.

• -c, --count: Shows the number of occurrences for each line at the start of the line.
• -i, --ignore-case: Compares lines case-insensitively.
• -f N, --skip-fields=N: Avoids comparing the first N fields.
• -s N, --skip-chars=N: Avoids comparing the first N characters.
• -w N, --check-chars=N: Compares only N characters in lines.
As you can see, uniq has several modes of operation apart from the default, and can be used in many ways to analyze data.

Note
uniq only compares adjacent lines, so the input must be sorted for it to detect all duplicates correctly.
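For instance, combining sort and uniq -c gives a frequency count of lines (sample input our own):

robin ~ $ sort <<< $'b\na\nb\na\nc' | uniq -c
      2 a
      2 b
      1 c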

Exercise 11: Working with Filtering Commands

In this exercise, we will walk through some text-processing tasks using the commands we learned previously. The test data for this chapter contains three main datasets (available publicly on the internet):
• Records of the percentage of land area that was agricultural, in every country (and region) for 1961-2015, with about 12,000 rows
• Records of the population of every country (and region) for 1961-2015, with about 14,000 rows
• Payroll records of public workers in NYC for the year 2017, with about 560,000 rows
These datasets are large enough to demonstrate how well the shell can deal with big data. It is possible to efficiently process files of many gigabytes on the shell, even on limited hardware such as a small laptop. We will first do some simple tasks with the data from earlier chapters and then try some more complex commands to filter the aforementioned data.

Note
Many commands in this exercise and the ones to follow print many lines of data, but we will only show a few lines here for brevity's sake.

1. Open the command-line shell and navigate to the data folder from the first exercise:

robin ~ $ cd Lesson1/data/
robin ~/Lesson1/data $

2. Use cat to number the lines of ls output, as follows:

robin ~/Lesson1/data $ ls -l | cat -n
     1	total 16
     2	drwxr-xr-x 36 robin robin 4096 Sep  5 15:49 cupressaceae
     3	drwxr-xr-x 15 robin robin 4096 Sep  5 15:49 pinaceae
     4	drwxr-xr-x 23 robin robin 4096 Sep  5 15:49 podocarpaceae
     5	drwxr-xr-x  8 robin robin 4096 Sep  5 15:49 taxaceae

3. Use tr on the output of ls to transform it into uppercase using the range syntax:

robin ~/Lesson1/data $ ls | tr 'a-z' 'A-Z'
CUPRESSACEAE
PINACEAE
PODOCARPACEAE
TAXACEAE

4. Use tr to convert only vowels to their uppercase form:

robin ~/Lesson1/data $ ls | tr 'aeiou' 'AEIOU'
cUprEssAcEAE
pInAcEAE
pOdOcArpAcEAE
tAxAcEAE

5. Navigate to the folder ~/Lesson2, which contains the test data for this chapter:

robin ~/Lesson1/data $ cd
robin ~ $ cd Lesson2

6. The land.csv file contains the historical records we mentioned previously. View this file with less to understand its format:

robin ~/Lesson2 $ less land.csv

The file is in CSV format. The first line describes the field names, and the remaining lines contain data. Here is what the file looks like:

Country Name,Country Code,Year,Value
Arab World,ARB,1961,30.9442924784889
Arab World,ARB,1962,30.9441456790578
Arab World,ARB,1963,30.967119790024

7. Use grep to select the data for Austria, as follows:

robin ~/Lesson2 $ grep -w 'Austria' land.csv

2. Extract the country name, year, and value columns of population.csv into separate files with cut:

robin ~/Lesson2 $ cut -f1 -d, population.csv >p1.txt
robin ~/Lesson2 $ cut -f3 -d, population.csv >p3.txt
robin ~/Lesson2 $ cut -f4 -d, population.csv >p4.txt

Note that we type three commands that are almost the same. In a future chapter, we will learn how to do these kinds of repetitive operations with less effort.

3. Next, let's paste them back together with two different delimiters, the forward slash and tab:

robin ~/Lesson2 $ paste -d$'/\t' p1.txt p3.txt p4.txt >p134.txt
robin ~/Lesson2 $ head -n5 p134.txt
Country Name/Year	Value
Arab World/1960	92490932
Arab World/1961	95044497
Arab World/1962	97682294
Arab World/1963	100411076

4. Repeat the same steps for the land.csv file:

robin ~/Lesson2 $ cut -f1 -d, land.csv >l1.txt
robin ~/Lesson2 $ cut -f3 -d, land.csv >l3.txt
robin ~/Lesson2 $ cut -f4 -d, land.csv >l4.txt
robin ~/Lesson2 $ paste -d$'/\t' l1.txt l3.txt l4.txt >l134.txt
robin ~/Lesson2 $ head -n5 l134.txt
Country Name/Year	Value
Arab World/1961	30.9442924784889
Arab World/1962	30.9441456790578
Arab World/1963	30.967119790024
Arab World/1964	30.9765883533295

5. Now, we have two files where the country and year have been combined into a single field that can be used as the join key. Let's sort these files in preparation for join, but cut off the first line, which contains the header, using tail with +N:

robin ~/Lesson2 $ tail -n+2 l134.txt | sort >land.txt
robin ~/Lesson2 $ tail -n+2 p134.txt | sort >pop.txt

6. Now, let's join these two tables to get the population and agricultural land percentage matched on each country per year. Use -o, -e, and -a to get an outer join on the data, since the data is not complete (rows are missing for some combinations of countries and years). Also, tell join to ensure that the files are ordered; this helps us catch errors if we forgot to sort:

robin ~/Lesson2 $ join -t$'\t' --check-order -o auto -e 'UNKNOWN' -a1 -a2 land.txt pop.txt | less

Values where the data is not present are set to 'UNKNOWN'. The output will look as follows:

Figure 2.13: A screenshot displaying the matched data

7. Let's move on to the payroll data again. Recall that we had extracted the names of all the workers to names.tsv earlier. Let's find out the most common names in the payroll. Use uniq to count each name, sort in reverse with numeric sort, and view the first 10 lines of the result with head:

robin ~/Lesson2 $ sort names.tsv | uniq -c | sort -nr | head

robin ~/Lesson2 $ sort payroll.tsv >sorted.txt &
[1] 5347
robin ~/Lesson2 $ ls
land.csv  names.tsv  payroll.tsv  population.csv  sorted.txt
[1]+  Done                    sort payroll.tsv >sorted.txt

In this case, after launching sort in the background, we ran ls, and the shell informed us that Job #1 was done before the prompt appeared. If the background command had printed anything to its stdout or stderr, that output would have overwritten and intermingled with the shell prompt display and with the output of any other commands we executed afterward.

Note
The shell has only one output device: the console display. This means that if multiple commands are running simultaneously, their output will get intermingled unpredictably. This includes the output of both stdout and stderr. Hence, it almost always makes sense to use output redirection when launching commands in the background.

The & operator can also be used with multiple commands, as shown here:

sort file.txt >sorted.txt & sort file2.txt >sorted2.txt & sort file3.txt >sorted3.txt &

This makes the shell launch all three sort commands concurrently, and then immediately display the prompt. Also, notice that there is an extra trailing & symbol at the end. Without it, the shell would wait for the third command to complete before showing its prompt. & should not be considered a separator between commands, but a suffix after each one that signifies background launching, similar to a full stop or exclamation mark in normal text.

Logical Operators && and ||

The logical AND and logical OR operators are used to chain commands sequentially and control the execution of each command based on the exit code of the previous one. For example, look at the following command:

grep 'unwanted' file && rm file

Here, the grep command looks for the string "unwanted" within file. If it succeeds, it exits with a zero exit code; otherwise, it returns a non-zero value. The && operator that follows tells the shell to execute the next command (removing the file) only if the exit code of the previous command is zero. You can interpret && in plain language as the phrase "and if it succeeds, then." && can be used to chain multiple dependent commands together, with each command executing only if the previous one succeeded. The commands themselves can be entire pipelines and include redirections, as shown here:

grep -q 'useful' file && sort file | uniq >sorted.txt && echo 'File was useful'

Here, the following three commands are being chained:

grep -q 'useful' file
sort file | uniq >sorted.txt
echo 'File was useful'

The second command is a pipeline of two commands. Although each command has its own exit code, the shell considers the entire pipeline's exit code to be that of the rightmost command in it. Hence, the exit code of this entire pipeline is just the exit code of uniq.

The purpose of && is to make sure that errors in any command abort the whole chain. When we type commands manually, one after the other, we can check whether each command had an error and stop. The && operator performs a similar function, stopping execution as soon as one command returns an error. The || operator works in the same manner, except that a command must fail for the next command in the chain to execute. You can interpret || in plain language as the phrase "or, if it fails, then." A chain of commands with && stops at the first command that fails, whereas a chain of commands with || stops at the first command that succeeds.

Note
The two operators are sometimes read as "and then" for && and "or else" for ||.
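This behavior is easy to verify with the true and false commands, which simply succeed and fail, respectively (a quick illustration of our own):

robin ~ $ false && echo yes || echo no
no
robin ~ $ true && echo yes || echo no
yes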

Using Multiple Operators

When mixing these various operators together, the & is applied last. For example, consider the following example, which uses the sleep command. This command simply waits for the number of seconds specified as its argument (it is an easy way to simulate a long-running command):

robin ~ $ sleep 5 && echo Done &
[1] 1678
robin ~ $ Done
echo 'Hello'
Hello
[1]+  Done                    sleep 5 && echo Done

Here, the shell treats everything before the trailing & as a single unit that is launched in the background. Hence, any combination of commands with &&, ||, or ; is treated as a single unit as far as the & operator is concerned.

Notice what happened to the output of the background command (the input that we typed is in bold): we see the prompt appear immediately; the sleep command is running in the background. After 5 seconds, the sleep command exits, after which the echo command runs. Since the cursor is on the shell prompt, it prints Done at the current cursor position on the screen. It looks as if we typed Done, but actually, it is the output of echo that overwrote the screen at that point, and the cursor moved to the next line. If we type another command and execute it, the shell shows us the status of the last command before executing the new one. Therefore, as mentioned earlier, it is always recommended to redirect output streams to files for commands executed in the background, to avoid garbled or confusing output. A major branch of computer science relates to the problem of making multiple independent and concurrent processes share a common resource (such as an output device) without interfering with each other. This mixing up of command output is merely one manifestation of that basic problem. The easiest resolution is to not require a common resource (such as the console) and make the processes independent, by redirecting their output to distinct files.

Command Grouping

Commands can be grouped together using curly braces {} or parentheses (). This is useful when && and || appear together. Grouped commands are treated as one unit. For example, consider the following:

command1 && command2 && command3 || echo "command3 failed"

The shell processes each command in the list based on the return value of the previous command. Hence, if command2 fails, command3 is skipped, but the command3 failed message is still printed. To work around such ambiguities, we can group the commands as follows:

command1 && command2 && ( command3 || echo "command3 failed" )

Alternatively, we can group the commands like so:

command1 && command2 && { command3 || echo "command3 failed" ;}

Note the semicolon before the closing brace. It is required, as per the syntax rules. Also, there has to be a space before and after the braces (except when the brace is the last character on the line).

When commands are grouped within parentheses, a separate instance of the shell, called a subshell, is launched to run them. When using braces, they are run under the same shell context. There are two main differences between running within the same shell and running in a subshell:
• Changing the directory within a subshell does not affect the CWD of the main shell that launched it
• Environment variables created inside a subshell are not visible in the parent shell, but the converse is true (environment variables will be covered in a later section)

Note
In general, using || and && together without braces or parentheses should be avoided, as it can lead to unexpected behavior.
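To see the first difference in action, compare the two grouping styles (a small sketch of our own; the home directory is assumed to be /home/robin):

robin ~ $ ( cd /tmp ; pwd ) ; pwd
/tmp
/home/robin
robin ~ $ { cd /tmp ; pwd ;} ; pwd
/tmp
/tmp

With parentheses, the cd happens in a subshell and the main shell's directory is unchanged; with braces, the cd affects the main shell itself.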

Exercise 13: Using Command Lists

In this exercise, we will use command lists to time certain commands and search for strings in the payroll file that we used in the previous chapter:

1. Navigate to the Lesson2 folder:

robin ~ $ cd ~/Lesson2

2. Let's measure the time it takes to sort a large file using the time command. This command measures the time taken by any task to complete execution. Just prefix time before any command to time it:

robin ~/Lesson2 $ time sort payroll.tsv >sorted.txt

real 0m1.899s
user 0m7.701s
sys 0m0.295s

We need to consider the time under the real label, which measures the actual elapsed time, in other words, the time we would measure on a stopwatch. This is about 1.9 seconds in this case; the value will vary depending on the speed of your computer.

The user time is a measure of how much computation was performed. It is similar to how we measure man-hours in the real world: it is the sum total of the time spent by each CPU core on this task. In this case (on a machine with 8 CPU cores), the total work done by all the cores is as much as if a single core had worked for 7.7 seconds. Since the CPU cores work in parallel, that work was actually completed in a much shorter time, namely about 1.9 seconds. Note that the real time represents how much time elapsed in the real world. If the process spends a long time waiting for an external event such as user input, disk input, or network data, it is possible for the real time to exceed the user time. The last label, sys time, is the amount of time the operating system was running on behalf of this task. No matter what program is running, it has to hand control to the operating system for certain operations. For example, if something is written to a disk, the program requests a disk write from the OS; until it is finished, the program sleeps, and the OS does the work on behalf of the program. Usually, we need not concern ourselves with sys time unless we notice that it is a very significant fraction of the real time.

Note
The time taken to run a command usually reduces if it is run multiple times. This is due to disk caching. Any data read from a disk is cached in memory by the OS in case a program requests it again. If the same command is run multiple times, the disk (which is hundreds of times slower than RAM) doesn't have to be read again, thus speeding up the execution. For accurate timing, you should run a command multiple times and take the average, discarding the times for the first few runs.
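To see the distinction between real and user time in isolation, try timing a command that waits without computing (the exact figures shown here are approximate and will differ per machine):

robin ~ $ time sleep 1

real 0m1.003s
user 0m0.001s
sys 0m0.002s

sleep does almost no computation, so nearly all of the one second of real time is spent waiting, and the user and sys times are close to zero.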

3. Next, time a command list that sorts the file twice, once directly and once based on the second column in the file, using the commands shown here:

robin ~/Lesson2 $ time (sort payroll.tsv > sorted.txt ; sort -k2 payroll.tsv > sorted2.txt)

real 0m3.753s
user 0m14.793s
sys 0m0.418s

Note that it takes about twice as long as the previous command, which makes sense, since it is doing twice as much work. Also, notice that we placed the command list within parentheses so that the entire list is timed, rather than just the first command.

4. Now, run the two sort commands concurrently and time the result, as follows:

robin ~/Lesson2 $ time (sort payroll.tsv > sorted.txt & sort -k2 payroll.tsv > sorted2.txt)

real 0m2.961s
user 0m7.078s
sys 0m0.226s

As you can see from the preceding output, the two sort commands executed concurrently took slightly less time than before. The entry under the user label specifies the cumulative time taken by all CPU cores. The quotient of the user time by the real time gives a rough estimate of how many cores were being used. For the command list in step 3, the ratio is close to 4. This is because sort internally uses a parallel algorithm, using as many CPU cores as it thinks best. In step 4, however, the ratio is closer to 2. Even though the time taken was slightly lower, it looks like only 2 cores were used. The exact reason for this kind of performance depends on the way sort works internally.

5. Let's try the same commands as before, but with three sorts instead:

robin ~/Lesson2 $ time (sort payroll.tsv > sorted.txt ; sort -k2 payroll.tsv > sorted2.txt ; sort -k3 payroll.tsv > sorted3.txt)

real 0m5.341s
user 0m20.453s
sys 0m0.717s

The sequential version has become about 33% slower, since it's doing 33% more work than before:

robin ~/Lesson2 $ time (sort payroll.tsv > sorted.txt & sort -k2 payroll.tsv > sorted2.txt & sort -k3 payroll.tsv > sorted3.txt)

real 0m2.913s
user 0m6.026s
sys 0m0.218s

The concurrent version, on the other hand, took approximately the same time as before, despite doing 33% more work.

6. Use the && operator with grep, using the -q option, which prints nothing and just sets the exit code. Use this to search for a string in the payroll.tsv file:

robin ~/Lesson2 $ grep -q 'WALDO' payroll.tsv && echo 'Found'
Found

7. Use the || operator in a similar fashion:

robin ~/Lesson2 $ grep -q 'PINOCCHIO' payroll.tsv || echo 'Not Found'
Not Found

Note
If two commands are launched concurrently, they will not necessarily take 50% of the time they would take if performed sequentially. Even though the CPU of your system has multiple cores that run independently, the memory, disk, and some operating system mechanisms cannot be used at the same time by two processes. There is almost always some serial execution in a program or task, which increases the execution time. Very few computing tasks can be parallelized completely. You can refer to Amdahl's law, which explains why this happens.

Job Control Most of the commands we have tried so far take only a few seconds, at most, to complete, but in the real world, it is not uncommon to have tasks that run for long periods of time, which could be anywhere from minutes to hours or even weeks. In fact, some tasks never exit. They run forever, reading live data and processing it. A task or command that is running either in the foreground (the default) or background (when launched with &) is called a job. In this topic, we will learn about the mechanisms that are used to control jobs.


Keyboard Shortcuts for Controlling Jobs

Various keyboard shortcuts can be used to control jobs. These shortcuts send a signal to the program; signals come in various types and have conventional names. Some of these shortcuts and their respective functionalities are discussed in the following table:

Figure 3.1: A table showing the shortcuts and their functionalities

Look at the following examples:

robin ~ $ sleep 100
^C
robin ~ $ sleep 100
^\Quit (core dumped)
robin ~ $ sleep 100
^Z
[1]+  Stopped                 sleep 100

We can think of Ctrl + C as a request to the program to shut down in an orderly way, and Ctrl + \ as a request to pull the plug immediately. Some commands perform some book-keeping before terminating in response to SIGINT, to ensure that the data they output is in a consistent state.


Commands for Controlling Jobs

Now, let's look at some commands that can be used to control jobs.

fg (foreground) and bg (background)

When a command is suspended with Ctrl + Z, we can resume its execution with either of these commands. fg resumes the command in the foreground, and bg resumes it in the background. When a program is resumed with bg, the effect is as if we had launched it with &: we immediately get back to the shell prompt, and the command continues in the background. Resuming with fg just continues the program; we will only return to the shell prompt after it finishes its execution. bg is typically used when a command seems to take too long, and we relegate it to the background. For example:

robin ~ $ sleep 100
^Z
[1]+  Stopped                 sleep 100
robin ~ $ bg
[1]+ sleep 100 &
robin ~ $ echo This is in the foreground
This is in the foreground

Note that bg prints out the job number and command of the job it acts on. A typical use case of fg is to pause a long-running command, run some other commands, and then resume the original one. For example:

robin ~ $ sleep 100
^Z
[2]+  Stopped                 sleep 100
robin ~ $ echo This is in the foreground
This is in the foreground
robin ~ $ fg
sleep 100

By default, fg and bg operate on the last suspended job, but we can pass an argument with the job number of the process that we wish to resume.

jobs

This command lists all running jobs. For example, consider the following commands and their output:

robin ~ $ sleep 100 & sleep 100 & sleep 100 &
[1] 7678
[2] 7679
[3] 7680

This launches the sleep command thrice in the background (as we discussed before, sleep is an easy way to simulate a long-running command). As seen before, the shell prints out the job numbers and PIDs of each. We can look at the status of the preceding commands using the jobs command:

robin ~ $ jobs
[1]   Running                 sleep 100 &
[2]-  Running                 sleep 100 &
[3]+  Running                 sleep 100 &

The + and – signs displayed after the job number refer to the last and second last jobs. Note When one or more processes are suspended, the shell is in the foreground, so we can use either fg or bg to resume them. In that case, the default job that fg and bg apply to is the job associated with the + sign (as per what the jobs command shows). Similarly, you can use - with fg or bg to resume the second last job listed by jobs. When any process is running in the background, once again, we have the shell active, so the fg command can be used and the same rules for '+' and '-' apply. Obviously, we cannot use bg if any process is in the foreground, because the shell is not available to type a command when a foreground process is running. In this case, we have to first use Ctrl + Z to suspend the process and get the shell back, and then use bg.

The jobs command takes two options:
• -p: This option displays only the PIDs of jobs. For the preceding example, the following output is obtained:

robin ~ $ jobs -p
7678
7679
7680

• -l: This option displays the PIDs of jobs, as well as other information, including the full command line that launched the job:

robin ~ $ jobs -l
[1]   7678 Running                 sleep 100 &
[2]-  7679 Running                 sleep 100 &
[3]+  7680 Running                 sleep 100 &

kill

kill sends a signal to the process with the specified PID. Usually, it is used to terminate a process, but we can send any other signal as well. By default, it sends SIGTERM, which requests the process to terminate. Look at the following example:

robin ~ $ sleep 100 &
[1] 24288
robin ~ $ kill 24288
robin ~ $ jobs
[1]+  Terminated              sleep 100

kill can be used to send any signal, including SIGSTOP, which suspends the process, and SIGCONT, which resumes a process in the background (like bg). If a process is in the background, you can no longer suspend it with Ctrl + Z, because the shell is in the foreground, not the program. In that case, you can use SIGSTOP. The following example should make this clearer (bold text is what was typed):

robin ~ $ sleep 100
^Z
[1]+  Stopped                 sleep 100
robin ~ $ jobs -l
[1]+ 16961 Stopped                 sleep 100
robin ~ $ kill -s SIGCONT 16961
robin ~ $ jobs -l
[1]+ 16961 Running                 sleep 100 &
robin ~ $ kill -s SIGSTOP 16961
[1]+  Stopped                 sleep 100

Here, we started a process, and suspended it with Ctrl + Z. Then, we used jobs to see the PID and sent SIGCONT to resume it in the background (we could have used bg, too). At this point, Ctrl + Z would not work to suspend it, since Ctrl + Z works only on foreground processes, but we can send SIGSTOP instead to get the same effect. Note Logically, you can only use kill if the command line is in the foreground, and you are able to type the command, meaning that the process you affect has to be in the background. However, you can open multiple shell windows on any operating system, and so you can use kill from one shell to affect a job or process inside another shell instance if you know the PID. Remember that PIDs are unique across the entire set of processes running on a system. In such a case, we can terminate a process, irrespective of whether it was in the foreground.

To end a process, we can use any of the following signals with the -s option: SIGINT, SIGTERM, SIGQUIT, or SIGKILL. A program can choose to ignore any of these signals except for SIGKILL. Typically, Ctrl + C is used to terminate a command with SIGINT, and if that doesn't work, the kill command is used. Note that Ctrl + C cannot be used with a background process. If a program doesn't terminate with either Ctrl + C (SIGINT) or even SIGTERM, we can use SIGKILL, as follows:

robin ~ $ sleep 100 &
[1] 23993
robin ~ $ kill -s SIGKILL 23993
robin ~ $ jobs
[1]+  Killed                  sleep 100

Every signal also has a number associated with it. Usually, it's easier to remember the name, but it is worth remembering that SIGTERM is 15 and SIGKILL is 9. When specifying a signal by its number, we type the number itself as an option, for example, kill -9. We can use the signal number 9 for SIGKILL instead of typing the name:

robin ~ $ sleep 100 &
[1] 23996
robin ~ $ kill -9 23996
robin ~ $ jobs
[1]+  Killed                  sleep 100

Remember, SIGKILL is the last resort, when nothing else works. If a suspended process is killed, the process dies as soon as it is resumed:

robin ~ $ sleep 100
^Z
[1]+  Stopped                 sleep 100
robin ~ $ jobs -l
[1]+   772 Stopped                 sleep 100
robin ~ $ kill 772
robin ~ $ fg
sleep 100
Terminated


Regular Expressions

A regular expression, also called a regex (plural regexes), is a kind of pattern-matching syntax, similar to wildcards, but much more powerful. A complete description of regexes would fill many chapters, so we will restrict ourselves to a reasonable subset in this chapter. The most common use case of regexes is with the grep and sed commands, which we studied in the previous chapter. The basic operation we perform with a regex is to match it against some text:
• grep can search for text matching a regex
• sed can search for text matching a regex and replace it with a specified replacement string

Note
Since the special characters in regex syntax overlap with those that the shell uses, always pass regexes in single quotes to ensure that the shell passes them literally to the command, without interpretation. Commands that accept regexes will handle escape sequences by themselves.

There are two kinds of regex syntax: basic and extended. The only difference between the two is that basic syntax requires escaping many of the special characters, which is less readable. Hence, we will learn the extended syntax here. A regex can be separated into a series of elements, each followed by an optional quantifier. Each element and quantifier is matched against the target text. A regex can also contain anchoring symbols and syntax called backreferences and subexpressions. We will learn more about these in the sections that follow. We will also look at a few examples to make the concepts clearer. We will use grep with a herestring and the --color and -E options. The -E flag specifies extended regex support; without it, grep uses basic regex syntax. For clarity, the matching characters in the output here are shown in bold. On screen, they would appear in color.


Elements

Elements, formally referred to as atoms, are analogous to a noun in natural language. An element refers to a set of characters, any of which can match. These can be any of the following:

A Literal Character

A literal character matches just itself. You can match any single character in a regex; however, punctuation and other symbols need to be escaped. Although some symbols strictly do not need escaping, it's simpler to just escape everything unless specifically mentioned not to.

A Period

A single period symbol matches any possible character, with no restrictions.

Character Sets

To match character sets, enclose them in square brackets. These can be defined in many ways. For instance, look at the following:
• [aeiou] matches any of the vowels
• [a-z] matches a range, everything from a to z
• [[:alnum:]] matches the specified character class (recall character classes from the previous chapter)
Multiple ranges can also be combined as follows:
• [a-z1-9] matches everything from a to z or 1 to 9.
• [^abc] matches any character but a, b, and c. If the first character after the [ is a ^ (caret), the set is inverted.
• [^[:alnum:]] matches any character that is not a member of the [:alnum:] character class.
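Here is a quick demonstration of character sets with grep (sample strings our own); recall that grep prints a line if any part of it matches:

robin ~ $ grep -E '[a-z1-9]' <<< 'XYZ a 5'
XYZ a 5
robin ~ $ grep -E '[^abc]' <<< 'abc'
robin ~ $

The second command prints nothing, since every character of the input belongs to the set [abc], so the inverted set matches nowhere.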

Special Backslash Sequences
• \s matches any whitespace character.
• \S matches any non-whitespace character.
• \w matches a "word character", that is, alphanumeric characters or underscores, the same as [_[:alnum:]].
• \W matches a non-word character. This is the inverse of \w and is the same as [^_[:alnum:]].
• \< matches the beginning of a word.
• \> matches the end of a word.
• \b matches the beginning or end of a word.
• \B is the inverse of \b; it matches a non-word boundary.
Look at the following example, which matches multiple whitespace characters:

robin ~ $ grep -E --color 'final\s+frontier' <<< 'the final   frontier'
the final   frontier

Note that in the sed patterns that follow, _ is an ordinary symbol, but we must escape *.

15. sed can be passed the -e option multiple times. Now, write the entire replacement expression for both bold and italic:

robin ~/Lesson3 $ sed -E -e 's#\*\*([a-z]+)\*\*#<b>\1</b>#g' -e 's#_([a-z]+)_#<i>\1</i>#g'

robin ~ $ for i in 1.0 0.9 0.8 0.7 0.6 0.5
> do
> echo version-$i
> done
version-1.0
version-0.9
version-0.8
version-0.7
version-0.6
version-0.5

The while Loop

The while loop has the following syntax:

while CONDITION
do
  COMMAND1
  COMMAND2
done

Here, CONDITION is a conditional expression that is tested on each iteration of the loop. As long as it evaluates to true, the loop body is executed. If the condition is initially false, the body is never executed. As with the for loop, the loop body can contain multiple commands, command lists, or pipelines. Here is a practical example:

robin ~ $ COUNTER=0
robin ~ $ while [[ $COUNTER -lt 5 ]]
> do
> echo The counter is $COUNTER
> COUNTER=$((COUNTER + 1))
> done
The counter is 0
The counter is 1
The counter is 2
The counter is 3
The counter is 4

The same loop can be written as a single-line construct, too:

robin ~ $ while [[ $COUNTER -lt 5 ]]; do echo The counter is $COUNTER; COUNTER=$((COUNTER + 1)); done

As with the other constructs, it is stored in history as a single line.

The until Loop

The syntax for until is almost the same as for while:

until CONDITION
do
  COMMAND1
  COMMAND2
done

Unlike a while loop, an until loop executes its body as long as the CONDITION is false. Here's an example:

robin ~ $ COUNTER=5
robin ~ $ until [[ $COUNTER -lt 0 ]]
> do
> echo Counter is $COUNTER

> COUNTER=$((COUNTER-1))
> done
Counter is 5
Counter is 4
Counter is 3
Counter is 2
Counter is 1
Counter is 0

As with the other constructs, this one can also be written as a single-line command, as follows:

robin ~ $ until [[ $COUNTER -lt 0 ]]; do echo Counter is $COUNTER; COUNTER=$((COUNTER-1)); done

We can use while and until interchangeably by negating the conditional. For instance, for the preceding example, we can negate the conditional expression used with until and use while instead, obtaining the same output:

robin ~ $ COUNTER=5
robin ~ $ while [[ ! $COUNTER -lt 0 ]]; do echo Counter is $COUNTER; COUNTER=$((COUNTER-1)); done
Counter is 5
Counter is 4
Counter is 3
Counter is 2
Counter is 1
Counter is 0

The choice of whether to use while or until depends on readability. If you are familiar with Boolean algebra, De Morgan's theorem can sometimes simplify an expression, and in such cases, switching from while to until or vice versa may result in a more readable conditional expression.
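For a more practical sketch of our own (file names depend on the current directory), a loop can apply several commands to each file matching a wildcard; this prints one line per CSV file with its line count:

robin ~/Lesson2 $ for f in *.csv
> do
>   echo "$f: $(wc -l < "$f") lines"
> done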


Loop Control

Sometimes, we need a loop to end as soon as some complex condition is fulfilled. At other times, we may want to skip the current iteration of the loop and move on to the next one. For these kinds of use cases, the following two keywords are provided:
• break: This breaks out of the loop (exits the loop immediately). For instance, take a look at the following snippet, which sums up numbers in a series until the total exceeds 50:

robin ~ $ SUM=0
robin ~ $ for i in {1..10}
> do
>  SUM=$((SUM + i))
>  if [[ $SUM -gt 50 ]]
>  then
>   echo Done
>   break
>  fi
>  echo Iteration $i: Sum = $SUM
> done

Here, we use break to print Done and break out of the loop as soon as the if conditional is true (SUM exceeds 50). The following is the output that's obtained:

Iteration 1: Sum = 1
Iteration 2: Sum = 3
Iteration 3: Sum = 6
Iteration 4: Sum = 10
Iteration 5: Sum = 15
Iteration 6: Sum = 21
Iteration 7: Sum = 28
Iteration 8: Sum = 36
Iteration 9: Sum = 45
Done

• continue: This causes the loop to skip the remaining commands in the loop body and move on to the next iteration. Look at the following example:

robin ~ $ for i in {1..10}
> do
>  if [[ $i -lt 5 ]]
>  then
>   echo Skipped $i
>   continue
>  fi
>  echo Iteration $i
> done

Here, the part of the loop body that is after continue is skipped for the first four iterations. Therefore, the following output is obtained:

Skipped 1
Skipped 2
Skipped 3
Skipped 4
Iteration 5
Iteration 6
Iteration 7
Iteration 8
Iteration 9
Iteration 10

In general, constructs that use break and continue are considered not very desirable, as they can lead to less readable code (this is true for all programming languages that support them), but they come in handy occasionally. For instance, they are useful when the conditions to exit the loop are too complex to express in the loop's conditional expression. Generally, tasks of a user that need automation are those that involve repeatedly doing the same thing for different items. In this section, we learned about the constructs that are necessary to let us write entire programs in the shell. A lot of repetitive tasks we did in the exercises in previous chapters could be made much simpler by using loops and conditionals. For instance, in the first chapter, we saw how wildcards let you perform one command on a set of filesystem objects. With loops and conditionals, we can repeatedly perform a complex set of commands on any set of objects that we can name. Using a computer effectively means not having to do the same thing twice manually, and in the context of the command line, the concepts we learned in this topic will help us with that. In the next topic, we will explore shell functions.


Shell Functions

Shell functions are very similar to functions in most programming languages. They allow us to group commands into a unit and provide them with a name. We can later execute the commands in the function by invoking its name, just like any other command. In essence, shell functions let us define our own commands that are indistinguishable from the built-in ones.

Function Definition

Functions can be created with this basic syntax:

function name()
{
  COMMANDS
}

Here, COMMANDS may be one or more commands, lists, or pipelines, and represents the function body. The braces must be separated from the rest of the syntax with whitespace, typically with newlines. When the function name is typed on the command line as if it were a command, we say that the function has been invoked or called, and the commands in the function body are executed.

Note
The function keyword is optional and can be omitted when defining a function. You may encounter some shell scripts in the wild where this alternate style is used.

For example, let's create a function called stuff that prints some statements, as follows:

robin ~ $ function stuff()
> {
> echo Doing something
> echo Doing something more
> echo Done
> }
robin ~ $ stuff
Doing something
Doing something more
Done

A redirection can also be added to the function definition after the closing brace, on the same line, as shown here:

robin ~ $ function stuff()
> {
> echo Doing something
> echo Doing something more
> echo Done
> } >test.txt
robin ~ $ stuff
robin ~ $ cat test.txt
Doing something
Doing something more
Done

Whenever this function is invoked, the output of all its commands will be automatically redirected to test.txt.


Function Arguments

While invoking or calling a function, arguments can be passed to it, just like with any shell command, using a syntax called positional parameters; these are special expansions that work only with functions. The following is a list of the positional parameters that can be used with functions:
• $#: Expands to the number of arguments passed. This can be used to obtain the argument count, as shown here:

robin ~ $ function argue()
> {
> echo "I have $# arguments"
> }
robin ~ $ argue "modus ponens" "modus tollens"
I have 2 arguments

In the preceding example, we used $# to count the number of arguments passed to the function while calling it, and printed the result.
• $N or ${N}: For N greater than 0, this expands to the Nth argument. Braces are mandatory if N has more than one digit. Look at the following example:

robin ~ $ function greet()
> {
> echo "Greetings, $1!"
> }
robin ~ $ greet "Folks"
Greetings, Folks!

Here, $1 expands to the first argument of the function greet, resulting in the output shown in the preceding code.
• ${*:START:COUNT}: Expands to COUNT positional parameters beginning at index START, concatenated.
• ${@:START:COUNT}: Expands to COUNT positional parameters beginning at index START, as individual words.

• $*: Expands to the entire list of arguments as a single string or word. The arguments are combined with the first character of the IFS (internal field separator) variable as the delimiter (the default value is space). For instance, if IFS had a comma as its first character, "$*" would expand to a single word of the form "$1,$2,$3", and so on. This can be used for argument concatenation, as follows:

robin ~ $ function print()
> {
> IFS=,
> echo "$*"
> }
robin ~ $ print 1 2 3 4 hello
1,2,3,4,hello

We will discuss the IFS variable in greater detail in a forthcoming section.
• $@: Expands to the entire list of arguments as separate words. Adding double quotes around it, like "$@", results in expansion to a list with each element quoted, that is, "$@" expands to the separate words "$1" "$2" "$3" and so on. This form is essential if you need to pass the arguments to another command correctly, since it preserves arguments that have spaces in them as single arguments. Look at the following example:

robin ~ $ function show() { for i in "$@" ; do echo "|$i|"; done; }
robin ~ $ show a b "c d" e
|a|
|b|
|c d|
|e|

"$@" in this case expands to "a" "b" "c d" "e", that is, a list of quoted words, each exactly the argument the user passed in. $@ without quotes, however, behaves differently, as shown here:

robin ~ $ function show() { for i in $@ ; do echo "|$i|"; done; }
robin ~ $ show a b "c d" e
|a|
|b|
|c|
|d|
|e|

In this case, $@ expanded to $1 $2 $3 and so on when passed to the for command; the argument that had a space got split because there were no quotes (remember that quotes never get passed into a command unless escaped). This resulted in "c" and "d" becoming individual arguments. Therefore, in general, unless there is a very specific reason not to, always use $@ with double quotes rather than any other form. In the preceding example, if we used quotes with $*, we would get the following output:

robin ~ $ function show() { for i in "$*" ; do echo "|$i|"; done; }
robin ~ $ show a b "c d" e
|a b c d e|

Notice that $* expanded to one single quoted word with all the arguments concatenated, not multiple words like $@.
• $FUNCNAME: Expands to the name of the currently executing function. It can be used to get the function name, as shown here:

robin ~ $ function groot() { echo "I am ${FUNCNAME}!" ; }
robin ~ $ groot
I am groot!

Note
Notice that we can define functions on one line, just like any other construct. The semicolon before the closing brace is mandatory, just as with braced command lists.

The shell also provides a command called shift to manipulate positional parameters. The syntax is as follows:

shift N

Here, N is a number (defaulting to 1 if unspecified). Conceptually, it "shifts" out the first N arguments from the left, after which the arguments at index 1 onward get the values of arguments N+1 onward. For example, here is a function that wraps the given text in a specified HTML tag:

robin ~ $ function format()
> {
>   tag=$1
>   echo "<$tag>"
>   shift
>   echo "$@"
>   echo "</$tag>"
> }

This takes the first argument as the tag and uses shift to remove it, so that $@ expands to the remaining arguments, which become the content of the tag. Let's see how this works:

robin ~ $ format i Hello
<i>
Hello
</i>
robin ~ $ format b Hello World
<b>
Hello World
</b>

This is the most common use case of shift: a few arguments are processed individually and shifted out, and the rest are processed as one chunk with $@.


Return Values

Shell commands return an integer exit code, usually to signify an error condition. The general term in computing for a value returned by a function is return value. A function returns a value by using the return statement. Here is an example of a function that converts Fahrenheit to Centigrade:

robin ~ $ function f_to_c()
> {
> f=$1
> f=$(( f - 32 ))
> c=$(( (f * 5) / 9 ))
> return $c
> }

A return statement can be executed from anywhere within a function and may optionally include the return value. We use $? to get the return value of the last function that was run, just like the exit codes of commands, as shown here:

robin ~ $ f_to_c 100
robin ~ $ echo $?
37
robin ~ $ f_to_c 212
robin ~ $ echo $?
100

Unlike in general programming languages, the return value of a Bash function is limited to the range 0 to 255, since the mechanism was only intended to pass exit codes (each non-zero exit code represents a particular error condition). Hence, return values cannot be used to return arbitrary results, such as those of mathematical functions that compute numbers. For instance, in the following snippet, our function fails to produce the right answer when the return value exceeds 255:

robin ~ $ f_to_c 1000
robin ~ $ echo $?
25

Shell Functions | 193 The usual workaround is to assign the result to a variable, or directly print it, so that we can capture it via redirection or command substitution: robin ~ $ function f_to_c() > { > f=$1 > f=$(( f - 32 )) > c=$(( (f * 5) / 9 )) > echo "$c" > } Now, we can see the right result and store it with command substitution: robin ~ $ f_to_c 1000 537 robin ~ $ temp=$(f_to_c 2000) robin ~ $ echo $temp 1093

Local Variables, Scope, and Recursion

Shell functions provide the notion of scope, which means that within the function body, we can declare variables that are visible to that function alone, called local variables. A local variable can be declared using the local command, as follows:

robin ~ $ function foo()
> {
>   local foovar=200
>   echo $foovar
> }
robin ~ $ foo
200
robin ~ $ echo $foovar

As we can see, foovar existed only within the foo function. After the function exits, it is gone. We would say that foovar is within the scope of foo, or that foovar is a local variable in foo.

Note
The variables we have used so far are called global variables and are visible to all commands and functions that we invoke from the shell. In general, always declare variables as local unless they need to be visible to the caller; this prevents subtle errors. It is also good practice to initialize all the variables you need at the top of a function.

Since functions work just like any other command, we can call functions from within other functions. When a function calls another, the former is termed the caller and the latter the callee. It is possible that a callee itself invokes another function, and so on, in a chain. In such a case, a function may need to look up variables in another function, if the said variable is not defined within its own scope. The rules for looking up a variable from within a function follow a system called dynamic scoping. When a variable is expanded within a function, its value is looked up as follows:
1. If there is a local variable of that name assigned, then that value is used.
2. If not, the caller of the function is checked, and if there is a local variable of that name assigned there, then that value is used.
3. If not, the caller of the caller, and that function's caller, and so on are checked, until the variable is found or the topmost caller is reached.
4. If the variable is still not found, the global variables are checked for a definition.
An example will make this clearer. Let's create a function called baz() that prints three variables:

robin ~ $ function baz()
> {
>   echo $var1 $var2 $var3
> }

We have not defined any of these variables within baz(). Hence, let's define two other functions, bar() and foo(), that contain these variables' definitions. We will define the third variable globally. For illustration purposes, each variable has, as its value, the name of the scope we defined the variable in:

robin ~ $ function bar()
> {
>   local var3='bar'
>   baz
> }
robin ~ $ function foo()
> {
>   local var2='foo'
>   bar
> }
robin ~ $ var1='global'

Now, let's see what happens when we call the foo function:

robin ~ $ foo
global foo bar

As we can see, the baz function got var1 from the global scope, var2 from the scope of foo, and var3 from the scope of bar. The chain of calls and variable scopes is as follows:
• The shell calls foo (var1 is defined in the global scope)
• foo calls bar (var2 is defined in the scope of foo)
• bar calls baz (var3 is defined in the scope of bar)

We can have a variable with the same name in an inner scope as in an outer scope. The inner variable hides, or shadows, the outer one:

robin ~ $ function bar()
> {
>   local var='bar'
>   echo "In bar var is $var"
> }
robin ~ $ function foo()
> {
>   local var='foo'
>   bar
>   echo "In foo var is $var"
> }

We get the following output when we call the foo function:

robin ~ $ foo
In bar var is bar
In foo var is foo

Here, the var inside the bar function (the callee) hid the var that was in the scope of foo (the caller). The mechanism of Bash's scoping may seem somewhat complicated to a person encountering it for the first time. Also, it is not very likely that we would need to write functions that make a call chain many levels deep like these examples. Nevertheless, scoping is a concept that is necessary to understand if we use shell functions.

Note
Dynamic scoping is used by only a few well-known programming languages (for example, Emacs Lisp, and Perl's local variables). Mainstream programming languages generally use lexical scoping, which is a completely different mechanism. Students who come from a background of other languages such as Java or Swift should be very careful not to confuse the two schemes.

Shell functions also support recursion. This means that you can invoke a function from within itself. This is a common idiom in many programming languages. However, the utility of recursion on the shell is quite limited. It introduces unnecessary complexity and performance penalties. The shell was not initially designed to be a full-scale programming language, and although recursion works, there is rarely a real justification to use it in the shell. Every recursive computation can be achieved using loops in a way that is much simpler to understand. We will skip any further discussion of recursion for this reason.
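For completeness only, here is a minimal sketch of what recursion looks like in Bash (a hypothetical fact function, not used elsewhere in this chapter); a loop would be the preferred way to write this:

robin ~ $ function fact()
> {
>   local n=$1
>   # base case: 0! and 1! are both 1
>   if [[ $n -le 1 ]]; then echo 1; return; fi
>   # recursive case: n * (n-1)!, captured via command substitution
>   echo $(( n * $(fact $(( n - 1 ))) ))
> }
robin ~ $ fact 5
120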

Exercise 16: Using Conditional Statements, Loops, and Functions

In this exercise, we will be writing some functions and conditional constructs. We will not be using any data files, so the CWD where these are practiced is not relevant:

1. We will begin by writing a function called ucase() that prints its arguments in uppercase (we can use tr for this). Use the following snippet:

robin ~/Lesson4 $ function ucase()
> {
>   echo "$*" | tr "[a-z]" "[A-Z]"
> }

2. Next, we will write a function called hf() that prints the Fibonacci sequence: given a count n as its argument, it prints the initial 0 and 1, followed by n more terms. We start by declaring the variables and printing the first two numbers:

robin ~/Lesson4 $ function hf()
> {
>   local a=0
>   local b=1
>   local n=$1
>   local i=0
>   echo 0
>   echo 1

3. Next, we will use a while loop to iterate n times. Each time, we calculate the next number of the sequence by adding the last two:

>   while [[ $i -lt $n ]]
>   do
>     c=$(( a + b ))
>     echo $c
>     a=$b
>     b=$c
>     i=$(( i + 1 ))
>   done
> }

Initially, a and b, which always represent the last two numbers of the sequence, are assigned the values 0 and 1 at the start of the sequence. During each iteration of the loop, c, which is the next number in the sequence, is assigned a value equal to a+b. At this point, the last two numbers of the sequence that we know are b and c, which we transfer to a and b, respectively, setting up the correct state for the next iteration around the loop. The i variable serves as a loop counter to keep track of how many times we have looped. When it reaches n, the loop exits.

4. Try out this function by passing an arbitrary number as the argument, as follows:

robin ~/Lesson4 $ hf 8
0
1
1
2
3
5
8
13
21
34

5. Now, we will write a simple function called greet() that greets a user, taking the user's name and the current hour in 24-hour format as arguments. The function will print Good morning, Good afternoon, Good evening, or Good night, depending on the following ranges of hours: 5 to 11: Morning, 12 to 15: Afternoon, 16 to 21: Evening, and 22 to 4 (next day): Night.

Note
Note that when we say 5 to 11, we mean the entire time between 0500 and 1059 hours inclusive.

Use conditionals to apply the required logic in the function:

robin ~/Lesson4 $ function greet()
> {
>   local timestring='night'
>   [[ $2 -ge 5  && $2 -le 11 ]] && timestring='morning'
>   [[ $2 -ge 12 && $2 -le 15 ]] && timestring='afternoon'
>   [[ $2 -ge 16 && $2 -le 21 ]] && timestring='evening'
>   echo "Good ${timestring}, $1!"
> }

6. You can test how this function works by passing times belonging to the different ranges that were mentioned previously:

robin ~/Lesson4 $ greet Jack 5
Good morning, Jack!
robin ~/Lesson4 $ greet Jill 12
Good afternoon, Jill!
robin ~/Lesson4 $ greet Tom 16
Good evening, Tom!
robin ~/Lesson4 $ greet Mary 22
Good night, Mary!

We avoid handling the "night" case specifically: if none of the other conditions are fulfilled, it must be night. The condition for night is more complex than the others, since it involves two ranges: 2200 to 2359 hours and 0000 to 0459 hours. We avoid needing to write that condition by using night as the default case.
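Had we written the night test explicitly, it would have needed a compound condition spanning both ranges, along these lines:

[[ $2 -ge 22 || $2 -le 4 ]] && timestring='night'

Initializing timestring to 'night' up front removes the need for this line entirely.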

Shell functions, along with conditional statements, loops, scoped variables, shell arithmetic, and other external tools, give us enough power to write large and complicated programs. In the following section, we will explore how to deal with user and file input in a line-oriented manner using the shell, and then learn how to write shell scripts.

Shell Line Input

In the previous chapters, we learned how to process files line by line using predefined commands such as cut or tr. However, we are often limited by the fact that one command can do only one operation at a time. The shell provides some facilities to allow processing a file or typed input line by line. Some of these are discussed in the following section.

Line Input Commands

These commands allow us to write scripts that work with input data line by line and process it.

The read Command

The shell provides the read command to process input in a line-by-line fashion. This command has two main uses:

• To read input from the user interactively in scripts
• To read input from a file and process it

The read command accepts any number of variable names as arguments. When executed, it attempts to read text from its stdin and assign the input to the variables. For example, look at the following snippet:

robin ~ $ read COLOR THING
red apple
robin ~ $ echo $COLOR $THING
red apple

Notice that read parsed the line and separated it into fields based on whitespace. The extra whitespace was ignored. Therefore, if we need to enter a space as part of a field value, we can escape it, as follows:

robin ~ $ read COLOR THING
dark\ green avocado
robin ~ $ echo $COLOR
dark green
robin ~ $ echo $THING
avocado

After filling all but the last variable, read assigns the entire remaining text to the last variable, regardless of whether it is a single word or not, as shown here:

robin ~ $ read SUBJECT OBJECT
Jack went up the hill
robin ~ $ echo $SUBJECT
Jack
robin ~ $ echo $OBJECT
went up the hill

If no variables are supplied as arguments, the line read is assigned to the REPLY variable by default:

robin ~ $ read
This is my reply
robin ~ $ echo $REPLY
This is my reply

The following options can be passed to read:

• -a ARR: Assigns the words read to the array variable ARR, starting at index 0. ARR is emptied first. All other arguments that are not options are ignored when this option is used.
• -d DELIM: Stops reading the input as soon as the character DELIM is typed (rather than newline). This also means that the command exits as soon as that character is typed. Also, newlines get treated as whitespace.

• -e: Uses the readline editing behavior if the stdin is an interactive console terminal. By default, read uses a very basic editing mode. Specifying this flag allows us to use the same keyboard editing shortcuts as when we type commands in the shell.
• -i TEXT: Pre-fills the input with TEXT before editing begins. This option only works if -e is also specified.
• -n NUM: Reads a maximum of NUM characters. As soon as NUM characters are typed, the command exits.
• -N NUM: Reads exactly NUM characters. This option ignores -d. The command also does not split the input and assigns the entire input to a single variable.
• -p PROMPT: Displays PROMPT on stderr, without printing a newline, before reading input. Only applies if the input is from the console.
• -r: Does not handle escape sequences. The backslash is treated like any other character, which is almost always what you need when reading actual files.
• -s (silent mode): Does not show the characters as they are typed if the input is from a console.
• -t TIMEOUT: Waits for TIMEOUT seconds at most to read the input. If the time limit is exceeded, the command exits and assigns whatever input was available. TIMEOUT can be a floating-point number. The exit code of the command is non-zero if it timed out.

read can also be used with a loop to process data line by line. Let's understand this with an example:

robin ~ $ while read
> do
> echo The line was ${#REPLY} characters long
> done
This is the first line
The line was 22 characters long
This is the next
The line was 16 characters long

Since no variable was passed as an argument to read, the default variable, REPLY, was assigned. The loop ends when we press Ctrl + D. Unlike the shell, read will not print ^D to symbolize that the keystroke was pressed. This example would be very difficult, if not impossible, to write as a pipeline using the wc command.

We can perform the same action for file input too. Let's try this using the markdown.txt file inside the Lesson3 folder:

robin ~/Lesson3 $ while read -r
> do
>   echo ${#REPLY}
> done < markdown.txt

Later in this section, these pieces come together in an interactive guessing game. The game relies on two helper functions, random_digit and read_digit, which are set up in the exercise's earlier steps (see the sketch after Figure 4.1). The main function begins by picking a secret digit:

robin ~/Lesson4 $ function guess_it()
> {
>   echo "I have thought of a random number between 0 and 9, try to guess it in 5 tries"
>   local answer=$(random_digit)

With the count variable, we will keep track of how many times the user guessed:

>   local count=1

We loop until 5 tries are done and read a digit from the user:

>   while [[ $count -le 5 ]]
>   do

If the user pressed a digit key, then we can check whether they guessed right. If they guessed right, then the user wins and we exit:

>     read_digit
>     if [[ $guess =~ [0-9] ]]
>     then
>       if [[ $guess -eq $answer ]]
>       then
>         echo Correct answer
>         return 0
>       fi

If the guess was high or low, give a hint, and increase the count variable:

>       [[ $guess -lt $answer ]] && echo "Your guess is lower than what I thought of, try again!"
>       [[ $guess -gt $answer ]] && echo "Your guess is higher than what I thought of, try again!"
>       count=$(( count + 1 ))

If the user had not pressed a digit key, inform them (but we do not treat it as a guess):

>     else
>       echo "Please enter a digit!"
>     fi
>   done

Once all five iterations complete, the game is over:

>   echo "Sorry, you used up 5 guesses!"
>   echo "The number I had thought of was $answer"
> }

5. Now, test the game's script, as follows:

robin ~/Lesson4 $ guess_it

The output for some game interactions is shown in the following screenshot:

Figure 4.1: Output of example game interactions
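The helper functions that guess_it calls are defined in earlier steps of this exercise. As minimal sketches of what the game code requires (assumed implementations; the originals may differ), random_digit prints a random digit, and read_digit silently reads a single keypress into the guess variable:

robin ~/Lesson4 $ function random_digit()
> {
>   # RANDOM is a Bash built-in variable; modulo 10 yields a digit 0-9
>   echo $(( RANDOM % 10 ))
> }
robin ~/Lesson4 $ function read_digit()
> {
>   # -s suppresses echo, -n 1 returns after one character; 'guess' stays global
>   read -s -n 1 guess
> }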

Line input provides the final piece that allows Bash to be a general text-processing programming language. With all the constructs we have learned so far, we can start writing shell scripts.

Shell Scripts

Shell scripts are text files that contain shell commands. Such files can be executed as if they were programs. A script can be in any language for which an interpreter exists, for example, Python, PHP, Perl, and so on. Similar to how a Python or Perl interpreter loads a program file and executes it, the Bash shell can load and execute shell scripts. However, before we address shell scripts, we need to visit some concepts about how programs are executed.

Shell Command Categories

We will now learn about the various categories of shell commands and how scripts work like any other command. There are four types of commands that can be invoked by name from the shell. These are listed as follows:

• Binary Executables: Also called executable files or binaries, these contain machine code and provide most of the functionality of a system, for example, GUI programs such as a web browser, or CLI-based programs such as grep. The Bash shell itself is an executable. The process of loading and running executables is part of the OS functionality, and not dependent on the shell. Executables that we use primarily from within the shell are called external commands.
• Internal Commands: While many commands we use on the shell are independent executable files, other commands such as echo and cd do not have any binary executable behind them. The Bash shell itself performs the action involved. These are called internal commands or shell built-ins.
• Shell Functions: We examined these in the previous sections. To all intents and purposes, they behave like temporary shell built-ins. They are gone if the shell window is closed.
• Scripts: A script is a file containing code written for some programming language. Every script file contains a line explaining which external program is the interpreter that is to be used.
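One way to check which category a given name falls into is the type built-in with its -t flag. A quick illustration (output will vary by system; this assumes the greet function from the previous exercise is still defined in the session):

robin ~ $ type -t cd
builtin
robin ~ $ type -t grep
file
robin ~ $ type -t greet
function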


Program Launch Process

A simple description of how an OS launches a program is as follows:

• To launch an executable, the OS is provided with either (a) the absolute path of an executable file or (b) the name of an executable.
• Apart from these, a list of environment variables and their values is passed to it. This is called the environment block.
• If only a name is given, the environment variable called PATH is checked. This contains a list of directories called a search path. The OS searches each of these directories for the executable file that is named. If PATH is not specified, some system directories are checked by default. If the file is not found, this process stops here.
• The OS checks whether the file specified is a binary or a script.
• If it is a binary, the machine code from the executable is loaded into memory and executed.
• If it is a script, the interpreter for that script is loaded and the script's absolute file path is passed as its first argument.
• Both binaries and scripts can access the variables in their environment block.

We can examine the PATH variable in the command line (the result will vary depending on the system), as follows:

robin ~ $ echo $PATH
/usr/local/sbin:/usr/local/bin:/usr/bin:/usr/lib/jvm/default/bin:/usr/bin/site_perl

This same search path will be passed to the OS when an external command is launched by Bash.

Script Interpreters

For a script to specify the interpreter that it requires, it has to start with a line called a shebang or a hashbang. It consists of a line beginning with the sequence #!. The remainder of the line contains information about the interpreter to be used. This is usually the full path of the interpreter. For example, a Python script may have the following shebang:

#!/usr/bin/python

This means that when the OS launches this script, it will invoke the interpreter /usr/bin/python and pass this script's filename to it as an argument. Most scripting languages use the # character as a comment, so they ignore this line. The hashbang line with #! is only treated specially if it is the very first line of the script. Any other line starting with a # is ignored by Bash as a comment.

For shell scripts, we will use the following shebang line:

#!/usr/bin/env bash

Usually, the shebang contains a path, but we use this instead. The env program locates where Bash is installed on the system and invokes it. We will not go into further details of how env works and why the location of Bash may vary on different systems. For the purposes of this book, all the scripts will use this shebang line.

Typically, shell scripts use the file extension .sh, for example, script.sh. In this case, the script is executed as follows:

/usr/bin/env bash script.sh

Let's look at a simple script called test.sh that simply prints out all its arguments:

robin ~ $ cat test.sh
#!/usr/bin/env bash
echo $*

Note that blank lines are ignored in a script. For a file to be treated as an executable, it requires a file permission attribute to be set on it. Note that a complete description of file permission attributes and their uses is beyond the scope of this book. Here, we will just explore how to add this so that our script can execute:

robin ~ $ ls -l test.sh
-rw-r--r-- 1 robin robin 29 Oct 27 18:39 test.sh
robin ~ $ chmod u+x test.sh
robin ~ $ ls -l test.sh
-rwxr--r-- 1 robin robin 29 Oct 27 18:39 test.sh

Notice the initial attribute string: -rw-r--r--. Look at the three characters starting at the second position: rw-. These indicate that the owner of this file (the current user, robin) has permission to read and write this file. The chmod command can be used by a file's owner or a system administrator to change permissions. Here, we specify that we want the executable attribute x to be added for u, the user (the owner of the file). The chmod command has more to do with system administration than scripting, so we will not go into further details about it. The final ls command shows that the owner now has rwx permissions, so we can read, write, and execute this file.

Now that the file is executable, we can invoke it by specifying its path:

robin ~ $ ./test.sh hello from script
hello from script

Since it was in the current directory, we specified ./. If we merely mention the name of the script file, we get an error:

robin ~ $ test.sh
bash: test.sh: command not found

This error occurs because the current directory is not within the path. When we write scripts that we intend to use repeatedly, we would keep them in a directory and add that directory to the PATH variable permanently, so that we can run them like any other command.

The shebang-based script execution mechanism has the main advantage that the OS need not maintain a list of file types and associated interpreters. Any program can be used as an interpreter by any user. For example, consider the following:

robin ~ $ cat test.cat
#!/usr/bin/env cat
Hello world
robin ~ $ chmod u+x test.cat
robin ~ $ ./test.cat
#!/usr/bin/env cat
Hello world

We created a script specifying the cat command as the interpreter. The shell invokes cat on the script file when it runs.

This particular example is not very useful. We created a file that ends up displaying itself when executed, but it illustrates the flexibility of this mechanism.

The shell also allows us to bypass this mechanism entirely and launch scripts directly using the . built-in command, as follows:

robin ~ $ ls -l arg.sh
-rw-r--r-- 1 robin robin 9 Oct 28 00:30 arg.sh
robin ~ $ cat arg.sh
echo $@
robin ~ $ ./arg.sh Hello world
bash: ./arg.sh: Permission denied
robin ~ $ . arg.sh Hello world
Hello world

We created a file without the shebang line, and did not set the executable permissions. It could not be invoked via the regular method. However, the . command allows us to execute it anyway. This command applies only to Bash scripts, and when executed in this fashion, the commands in the script are run in the current shell rather than as a separate shell process. This command is most useful when you have a script that needs to set environment variables. Normally, scripts are run in a new shell, so any variables that they set are not retained when the script exits. However, when launched with ., any variables initialized in the script will be retained after it exits. Bash also provides the source command, which is an alias for . and behaves the same.

A function can stop its execution and return its exit code using the return statement. Similarly, the exit command causes a script to end and pass its exit code to its caller. The exit command actually terminates the shell process that is interpreting the script.

Note
Using the exit command directly on the Bash command line will cause the shell to terminate, also closing the window itself.
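Returning to the . command: to see the variable-retention difference in action, consider a one-line script (the filename setvars.sh and the variable are invented for this illustration):

robin ~ $ cat setvars.sh
MYVAR=42
robin ~ $ bash setvars.sh
robin ~ $ echo $MYVAR

robin ~ $ . setvars.sh
robin ~ $ echo $MYVAR
42

Run as a separate process, the assignment vanishes with the child shell; sourced with ., it persists in the current shell.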

When functions are used in scripts, the function definition must appear before the point of its use.


Practical Case Study 1: Chess Game Extractor

In this section, we will incrementally develop a shell script to perform a data-processing task. We have done some data crunching in the previous chapters using pipelines in a limited fashion. Here, we will attempt a more complex task.

Depending on your taste, there are a number of editors available to write a script with. You may be familiar with GUI editors such as SublimeText or Notepad++, but there are several editors that work in the console itself without a GUI. A few are complex and very powerful, such as emacs or vim, and some are simple, such as nano. One of these is usually available on most systems. The editor can be launched right from the command line, without needing to navigate the GUI desktop with the mouse or trackpad, by just typing its name like any other command.

Understanding the Problem

The functionality we want this script to have is that it can take a text file containing thousands of chess games in PGN (portable game notation) textual format and extract a desired set of games from it. Generally, the way to go about this would be to create a database, import the data into it, write SQL queries, and so on. Instead, we will simply use the shell to do the job for us with a few lines of code. The advantage of this is that we do not need to do any setup or initialization, and can directly use the script on the PGN file itself. Before we start writing the script, however, let's examine the format of a PGN file. The file format was designed to be easily readable by humans. A PGN file consists of multiple chess games, each of which looks like this:

Figure 4.2: Example of a game in PGN text format
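A representative (invented) game in this format, with tag pairs in square brackets, a blank line, the movetext, and another blank line, would look like this:

[Event "Example Tournament"]
[Site "Example City"]
[Date "2018.01.01"]
[White "Player, One"]
[Black "Player, Two"]
[Result "1-0"]

1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. Ba4 Nf6 1-0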

Each game section is followed by a newline and then another game. The game details in square brackets are not always in the same order, and the complete data for many old games is not recorded. The only information that is guaranteed to be present is the names of the players and the result.

Upon observing this file, the first thing to think about is this: what information in this file is relevant to extracting one game from it? If we think a little, it should be obvious that none of the attributes really matter, nor does the data regarding the list of moves in the game. The only salient structure we need to consider to extract the Nth game from such a file is that a game consists of several non-blank lines containing attributes, a blank line, another set of non-blank lines containing moves, followed by another blank line. The actual content of the lines is irrelevant for this initial task.

Now, we can describe the logic required to solve this problem. Let N be the number of the game we want to extract:

1. We start with a count variable set to 1.
2. Then, we try to read a line of text. If we are unable to do so, we exit the script.
3. If the count is equal to N, then the line we read should be printed.
4. If the line was not blank, we go back to step 2 (we are reading through the attribute lines).
5. We read a blank line.
6. If the count is equal to N, the blank line is printed.
7. Next, we read a line of text.
8. If the count is equal to N, the line we read is printed.
9. If the line was not blank, we go back to step 7 (we are reading through the list of moves).
10. We read a blank line.
11. If the count is equal to N, the blank line is printed and we exit the script.
12. The count is incremented.
13. We go back to step 2.

In the following exercises, we will implement the preceding logic incrementally.


Exercise 18: Chess Game Extractor – Parsing a PGN File

In the logic described in the previous section, we have a common pattern repeated twice: read a set of non-blank lines followed by a blank line, and then print them if a certain condition is true. In this exercise, we will implement that common pattern as a script and test it:

1. Open your text editor and add the mandatory hashbang, as shown here:

#!/usr/bin/env bash

2. Next, we need to define a regular expression that can match blank lines (a blank line is a line containing 0 or more whitespace characters). Use the following snippet:

regex_blank="^[[:space:]]*$"

3. Now, use the following while loop to read each line of the file, and print the line if the first parameter equals 1. If the line read is a blank line, we need to break out of the loop:

while read -r line
do
  [[ $1 -eq 1 ]] && echo "$line"
  [[ "$line" =~ $regex_blank ]] && break
done

The default behavior of the read command is to remove any leading and trailing whitespace from the lines. This behavior can be overridden by setting IFS to a blank value (which is not the same as unsetting it). This may be significant if you process a file where leading or trailing whitespace has some significance. Save the file as pgn_extract1.sh.

Note
You can find files with the same name along with the code bundle for this book and on GitHub. The script files include several comments. Any line that starts with # is ignored by the shell (and most other scripting languages). Note that adding comments is a good practice, but the comments should be more about the higher-level meaning than just describing what the code literally does at each point. It is also recommended to use indentation and blank lines to make the code more readable. You are the person most likely to be reading your own code in the future (usually after you have forgotten everything about it), so be kind to your future self and write code neatly and carefully.

4. Let's now move forward to testing our script on the command line. Once again, we should analyze what aspects need to be tested before diving in. The basic test that this script has to pass is the following: keep reading non-blank lines until a blank line is read, and print them all if an argument of 1 is specified. Clearly, the number of non-blank lines is irrelevant to the test. If it works for 1 line and 2 lines, it has to work for N lines, since the while loop does the same thing every time. Hence, we can try three test cases, passing 0 or 1 as the first argument for each case. When we pass 1, we expect the script to just print out each line that we type, but when we pass 0, the script should just silently ignore it.

Let's test this by entering a single blank line first. Launch the script with 0 as an argument and just press Enter to input a blank line:

robin ~/Lesson4 $ ./pgn_extract1.sh 0

When 0 is passed, we expect the script to not print anything.

Now, launch the script with 1 as an argument and input a blank line again. Now, the script should just echo the blank line you typed:

robin ~/Lesson4 $ ./pgn_extract1.sh 1

5. Repeat the same two tests, but this time, instead of just one blank line, type one non-blank line, followed by one blank line. We expect the script to be silent when 0 is passed and to just echo what we typed when 1 is passed:

robin ~/Lesson4 $ ./pgn_extract1.sh 0
Line 1

robin ~/Lesson4 $ ./pgn_extract1.sh 1
Line 1
Line 1

6. Repeat the same thing once more, this time with two non-blank lines followed by one blank line:

robin ~/Lesson4 $ ./pgn_extract1.sh 0
Line 1
Line 2

robin ~/Lesson4 $ ./pgn_extract1.sh 1
Line 1
Line 1
Line 2
Line 2

The script will never exit unless a blank line is encountered, so that part of the test is implicit. We need not test various types of blank lines, because the regex definition is correct by observation: (a) match the start of the line, (b) match 0 or more whitespace characters, and (c) match the end of the line.

For this simple case, manual testing is enough, but in general, we should also automate the testing itself. Ideally, we would create three test files, each one testing one of the preceding cases, and write another script to call this one for each input. We would also need to make sure the output was what we expected, typically by creating three files corresponding to the output we expect, and then comparing them with what the script generates as output. Paradoxically, the testing process is sometimes more complicated than the code being tested. Also, if the test script itself is wrong, we have a "turtles all the way down" situation where we would need tests for tests, and tests for those tests, and so on ad infinitum.

Testing and verification of correctness is an unsolved problem in computer science. There is no recipe for always writing correct programs, nor a recipe to detect all incorrect ones. Only a few general rules of thumb exist, which are not specific to any particular flavor of programming:

• Think about the problem before writing code.
• Write code in small modular chunks that can be verified independently.
• Comment your code. Often, the act of describing the intended functionality in plain language will highlight errors in the mind.
• Have your code reviewed by others. Many heads are better than one.

Exercise 19: Chess Game Extractor – Extracting a Desired Game

In this exercise, we will convert the code we wrote into a function and then write a complete script that can extract the Nth game in a file. The value of N is passed as the argument to the script:

1. Our game extraction process involves doing what the code we wrote earlier does, but twice per game, as per our original plan of action. If we ever need to do anything twice, it should be a function. Hence, let's define a function called read_chunk containing the same code as in the previous exercise, as shown in the following code. Open your text editor and input the following lines of code:

function read_chunk()
{
  while read -r line
  do
    [[ $1 -eq 1 ]] && echo "$line"
    [[ $line =~ $regex_blank ]] && return 0

  done
  return 1
}

Note that we made a small change to the code from the previous exercise. Our original logic requires us to know whether this function succeeded in reading the first line or not. Hence, instead of using break, we will use a successful return code to indicate that the final blank line was read. Also, if the while loop exits, it means that the file has been read fully, and we need to return a non-zero error code, so we have added the line return 1 before the closing brace.

Note
Sometimes, it makes sense to split a larger piece of code into functions, even if those functions are not used more than once. The longer a function, the higher the chance that some bug exists in that code. The rule of thumb is that a function should fit on an 80 x 25 screen.

2. Now, we can move on to implementing the actual game extractor. We start by declaring two variables: count represents the number of the game currently being read, and should_print is a flag telling us whether the current game being read is the one desired by the user:

count=1
should_print=0

3. We loop through the lines of the PGN file as long as count has not exceeded the argument passed by the user:

while [[ $count -le $1 ]]
do

4. If the current game is the one requested by the user, set the should_print flag to 1:

  [[ $count -eq $1 ]] && should_print=1

5. Read the first chunk of data (the game attributes), passing the should_print flag. If it is 1, then this game's data must be printed. If the read_chunk function fails, it means that we do not have any more data in the file, and we exit the script. If it succeeds, we need to read through and print the second chunk of the game (the moves list):

  read_chunk $should_print || exit
  read_chunk $should_print

6. Finally, we increment the count, and we exit the script if the game that was desired was just printed. We do not have to read any more data, and we are done:

  count=$(( count + 1 ))
  [[ $should_print -eq 1 ]] && exit
done

Save the complete script to pgn_extract2.sh.

7. We can test our script using a smaller input file, test.pgn (provided within the Lesson4 folder), as follows:

robin ~/Lesson4 $ ./pgn_extract2.sh 2 < test.pgn

Tips and Tricks

In the shell, an exit code of 0 means a command succeeded, and any non-zero value signals failure:

robin ~/Lesson4 $ ls &>/dev/null
robin ~/Lesson4 $ echo $?
0
robin ~/Lesson4 $ ls nonexistent &>/dev/null
robin ~/Lesson4 $ echo $?
2
robin ~/Lesson4 $ if ls nonexistent &>/dev/null; then echo Succeeded; else echo Failed; fi
Failed
robin ~/Lesson4 $ if ls &>/dev/null; then echo Succeeded; else echo Failed; fi
Succeeded

Note the use of &>/dev/null to make ls completely silent. The exit code 2 signifies a "File or object not found" error according to the UNIX error codes list. On the other hand, in most programming languages that originate from C (including the arithmetic expansion syntax), false is zero and true is non-zero:

robin ~/Lesson4 $ echo $(( 1 < 2 ))
1
robin ~/Lesson4 $ echo $(( 1 < 0 ))
0

In this case, the logic is the opposite of what we saw with the command's exit codes. However, if and the other conditional statements will do the right thing if (( EXPR )) or let is used:

robin ~/Lesson4 $ (( 1 < 2 )) && echo "One is less than Two"
One is less than Two
robin ~/Lesson4 $ (( 2 < 1 )) && echo "Two is less than One"

robin ~/Lesson4 $
robin ~/Lesson4 $ if (( 1 < 2 )) ; then echo "One is less than Two"; fi
One is less than Two

Using (( EXPR )) for arithmetic operations as well as tests can sometimes be preferable to using the [[ $VAR1 -op $VAR2 ]] syntax.

Octal, Hexadecimal, and Other Bases

In the C language, a number that starts with "0x" is treated as a hexadecimal (base 16) number, which uses the characters 0-9 and a-f to represent 0 to 15. Numbers that start with a leading "0" are treated as octal (base 8), which uses only the digits 0 to 7. Bash arithmetic follows the same convention. We can force a number to be interpreted in a particular base by prefixing it with N#, where N is the base. For example:

robin ~/Lesson4 $ echo $(( 09 * 09 ))
bash: 09: value too great for base (error token is "09")
robin ~/Lesson4 $ echo $(( 10#09 * 10#09 ))
81

In the first case, 09 was treated as an octal number, so 9 is an invalid digit. In the second case, we force base 10 evaluation.
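The same N# prefix works for other bases too (Bash accepts bases 2 through 64); for instance, hexadecimal and binary constants can be written as follows:

robin ~/Lesson4 $ echo $(( 16#ff ))
255
robin ~/Lesson4 $ echo $(( 2#1010 ))
10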

Declaring Typed Variables

We can set the type of a variable to be an integer, read-only, or uppercase by default. Integer variables can be initialized using an arithmetic expression (note that the * is not interpreted as a wildcard):

robin ~/Lesson4 $ declare -i num
robin ~/Lesson4 $ num=9*8*7
robin ~/Lesson4 $ echo $num
504

Uppercase variables always get converted to capitals when expanded:

robin ~/Lesson4 $ declare -u name
robin ~/Lesson4 $ name=robin
robin ~/Lesson4 $ echo $name
ROBIN

Read-only variables can only be assigned once and cannot be changed or unset:

robin ~/Lesson4 $ declare -r C=300000000
robin ~/Lesson4 $ echo $C
300000000
robin ~/Lesson4 $ C=0
bash: C: readonly variable
robin ~/Lesson4 $ unset C
bash: unset: C: cannot unset: readonly variable

There are some more aspects of the declare command; however, these are not covered in this book.

Numeric for Loops

The for loop allows an alternate syntax and semantics very similar to the ones found in C, C++, C#, Java, JavaScript, and other "curly brace" languages:

for ((INIT; TEST; INCREMENT))
do
  COMMANDS
done

Here, INIT, TEST, and INCREMENT are arithmetic expressions. The for loop works as follows:

1. It evaluates INIT.
2. It evaluates TEST as a Boolean. If it evaluates to non-zero, the commands in the loop body are executed; if not, the loop ends immediately.
3. It evaluates INCREMENT and repeats from step 2.

The following is an example of the multiline form of this loop:

robin ~/Lesson4 $ for ((x = 0 ; x < 5 ; x++))
> do
>   echo "Counter: $x"
> done
Counter: 0
Counter: 1
Counter: 2
Counter: 3
Counter: 4

The single-line form of this loop is shown here:

robin ~/Lesson4 $ for ((x = 0 ; x < 5 ; x++)) ; do echo "Counter: $x"; done
Counter: 0
Counter: 1
Counter: 2
Counter: 3
Counter: 4

We can initialize and use multiple variables too, as shown here:

robin ~/Lesson4 $ for ((x=0,y=5; x < 5; x++, y--)) ; do echo "Counter: $x $y"; done
Counter: 0 5
Counter: 1 4
Counter: 2 3
Counter: 3 2
Counter: 4 1

Remember that the three parts of this construct only accept arithmetic expressions, and the TEST expression is true if it evaluates to non-zero (as we discussed previously).


echo

So far, we have only used the echo command to display simple text strings. However, echo has a couple of flags that are useful:

• -n: Does not print a newline at the end. For example, look at the following snippet:

robin ~/Lesson4 $ echo -n Hello && echo " world"
Hello world

• -e: Enables backslash escape characters, as shown here:

robin ~/Lesson4 $ echo -e "\t\tHello"
		Hello

Array Reverse Indexing

The shell provides a way to specify array indices from the end instead of the beginning. The regular way to slice arrays is ${arr[@]:IDX:LEN}, which returns a sub-array of length LEN from the (zero-based) position IDX onwards. We can specify a negative index to get the elements relative to the end. This is very convenient for getting the last element easily, without dealing with the length of the array. For example:

robin ~/Lesson4 $ arr=(0 1 2 3 4 5 6 7 8 9 10)
robin ~/Lesson4 $ echo ${arr[@]}
0 1 2 3 4 5 6 7 8 9 10
robin ~/Lesson4 $ echo ${arr[@]:1:3}
1 2 3
robin ~/Lesson4 $ echo ${arr[@]: -3: 2}
8 9
robin ~/Lesson4 $ echo ${arr[@]: -5}
6 7 8 9 10

If a negative index is used, it is mandatory that the minus sign be preceded by whitespace. Skipping the LEN part gets all the elements until the end.


shopt

The shopt command changes the way the shell behaves. It can be useful in many situations, especially in scripts. It sets a shell option to on or off. This command takes one or more shell option names as arguments, with the following flags:

• -s OPTNAME: Enables (sets) each OPTNAME.
• -u OPTNAME: Disables (unsets) each OPTNAME.
• -q OPTNAME1 OPTNAME2 ...: Queries the list of options and returns a success (zero) exit code if all of them are enabled.

A brief list of the Bash shell options that can be set is as follows:

• autocd: If set, typing a directory name changes the CWD to it, like the cd command.
• dotglob: If set, filenames beginning with a . will be matched by wildcards. Usually, such files are not displayed (you can use the -a flag of ls to list such files).
• extglob: If set, extended wildcard expansion is enabled (see the next section).
• interactive_comments: If a # character appears on a line typed in a command line, everything after it is ignored. This option is enabled by default.
• nocaseglob: If set, Bash performs filename completion ignoring case.
• nocasematch: If set, Bash performs case-insensitive pattern matching for case statements and [[ conditional commands (see the example after this list).
• shift_verbose: If this option is set, the built-in shift command will print an error message if there are not enough positional parameters to shift by the specified number.

There are many more shell options that we will not cover here. Curious students can refer to the man pages for more details.
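For instance, nocasematch in action (a quick sketch):

robin ~/Lesson4 $ shopt -s nocasematch
robin ~/Lesson4 $ [[ "HELLO" == hello ]] && echo matched
matched
robin ~/Lesson4 $ shopt -u nocasematch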


Extended Wildcards

When the shell's extglob option is not set, the following constructs are available:

• The ? and * symbols, which we learned about in the first chapter
• [abc]: Matches any of a, b, c
• [!abc]: Matches anything that is not a, b, or c
• [a-f]: Matches any of a to f
• The [[:CLASS:]] syntax for matching character classes

Once extglob is enabled, the following constructs become possible:

• ?(PATTERNS): Matches PATTERNS zero times or once.
• *(PATTERNS): Matches any one of the expressions in PATTERNS any number of times (including zero times).
• +(PATTERNS): Matches any one of the expressions in PATTERNS one or more times.
• @(PATTERNS): Matches any one of the expressions in PATTERNS exactly once.
• !(PATTERNS): Matches anything except one of the expressions in PATTERNS.

Here, PATTERNS is a list of wildcard expressions separated by a | symbol. For example, !(*.jpg|*.gif) matches any file that does not have a .jpg or .gif extension, and @(ba*(na)|a+(p)le) matches "ba," "bana," "banana," "bananana," and so on, or "aple," "apple," "apppple," and so on. These are similar to regexes but have a different syntax, so be careful not to confuse the two. For the most part, you will rarely need extended globbing, except for the ! operator, shown in the example below.
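To illustrate the ! operator (a sketch; the directory contents here are hypothetical):

robin ~/Lesson4 $ shopt -s extglob
robin ~/Lesson4 $ ls
image1.jpg  image2.gif  notes.txt  script.sh
robin ~/Lesson4 $ echo !(*.jpg|*.gif)
notes.txt script.sh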

man and info Pages

We have mentioned the man (manual) pages before. They contain comprehensive documentation about all the standard commands. New programs that are installed on a system install their own manual pages, too.

To view the help for any command, just run man with one argument specifying the item for which you need the manual page. info is a more comprehensive help system available with GNU. It has an interactive, hyperlink-based interface and will automatically show man pages too. The user interface of man is exactly the same as that of less. The best way to learn about these commands is to use info info and man man to let the two commands themselves describe how you should use them. Note that the shell built-ins do not have their own man pages, but are described under the man pages for Bash and can be accessed with man bash. You can also use the help command to get usage information for built-in commands individually.

shellcheck

shellcheck is a tool that's available in both online and offline versions to check your scripts for possible errors and bad practices. When developing scripts, use this tool and try to follow all the suggestions it gives to ensure that your scripts do not have potential failure cases.

Activity 10: PGN Game Extractor Enhancement

In the previous exercises, we incrementally developed a shell script to extract chess games from a PGN file. This script has a few shortcomings, though. In this activity, you need to address those shortcomings:

1. Think of a better (and faster) way to count moves, rather than using wc.
2. Think of a better way to detect blank lines than using a regex. Assume that the blank lines have no whitespace and are empty.
3. Make the script more readable.
4. Change the script to detect whether no arguments are specified, and print a simple help text describing what the script can do and what the options are.
5. If there are no games to show for -m, show the message No games. If the game index passed to -n is more than the last game in the file, show Invalid game index. Maximum is N, where N represents the count of games in the file.

Follow the steps given here to complete this activity, creating five scripts in the files pgn_extract_act1.sh to pgn_extract_act5.sh:

1. For the first problem, consider starting by splitting the moves list with IFS='.' and extracting the last move number from that array.
2. For the second problem, detect blank lines by testing for an empty string.
3. To make the script more readable, use (( EXPR ))-based syntax for numerical tests and increments.
4. For the fourth problem, add a default case to the case statement that parses the options, and print the command usage help there.
5. For the last problem, maintain a count of games that passed the filter in a variable called game_count.

Also, measure the time taken by the script at each of the stages. The final script should be at least 3 to 5 times as fast as the initial version for both operations.

Note
The solution for this activity can be found on page 279.

Practical Case Study 2: NYC Yellow Taxi Trip Analysis

In this case study, we will incrementally develop another script to process data. For this example, we will deal with a much larger dataset than the previous one.

Note
The kinds of operations we will attempt on the data here are more complex than those in the previous study. In particular, we will process every line of the file individually in complex ways. Sometimes, it is better to use an external tool such as awk, or even a Python script, for this kind of processing, since the shell has its limits in terms of performance, especially when we do not use pipelines. This example demonstrates how to program with the shell; it does not suggest that you should always use only the shell.


Understanding the Dataset

The dataset we will use for this is a text file that contains public data about yellow taxi trips in New York City for 2017. We will use a subset of 200,000 lines of that data for this book. The file is in CSV format and contains one line of data for every trip. The following fields are present:

• Pickup time: Lists the pickup date and time in the format YYYY-MM-DD HH:MM:SS. For example, 2017-01-09 11:38:20.
• Dropoff time: Lists the drop-off time in the same format as the pickup time.
• Passenger count: Lists the number of passengers who took that trip.
• Trip distance in miles: Lists the total distance covered.
• Total fare amount: Lists the total amount charged for the trip.

We will develop scripts that help us extract some statistics about this data in the following exercises.
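For reference, a single line of this file looks like this (one of the records from the dataset):

2017-01-09 11:13:28,2017-01-09 11:25:45,1,3.30,15.30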

Exercise 22: Taxi Trip Analysis – Extracting Trip Time

As the first task, let's process the CSV file to add another field that contains the time taken by the trip. We need to subtract the time values and convert the result into a number of seconds:

1. Let's write the first version of a function called trip_duration, which can take the pickup time and dropoff time of a trip as arguments and calculate the number of seconds between the two time values. For convenience, you can type it in an editor first and then paste the whole thing carefully into the command line. First, we will split the dates and times into arrays using a space as the IFS, and store them in two variables, dt_start and dt_stop (each of which holds the date followed by the time value as HH:MM:SS), as follows:

function trip_duration()
{
  IFS=' '
  local dt_start=( $1 )
  local dt_stop=( $2 )

2. Then, we will split the times into hours, minutes, and seconds using the colon as the delimiter, and store the resultant values in t_start and t_stop for dt_start and dt_stop, respectively, as shown here:

  IFS=':'
  local t_start=( ${dt_start[1]} )
  local t_stop=( ${dt_stop[1]} )

3. Next, we will convert the hours and minutes to absolute seconds and sum all three values, storing the results in the n_start and n_stop variables:

  local n_start=$(( t_start[0] * 3600 + t_start[1] * 60 + t_start[2] ))
  local n_stop=$(( t_stop[0] * 3600 + t_stop[1] * 60 + t_stop[2] ))

4. Finally, we will print the difference between the two input times and close the function:

  echo $(( n_stop - n_start ))
}

5. Now, we will test the preceding function using the following arguments:

robin ~/Lesson4 $ trip_duration '2017-10-10 12:00:00' '2017-10-10 16:00:00'
14400
robin ~/Lesson4 $ trip_duration '2017-10-10 12:00:00' '2017-10-10 13:59:00'
7140
robin ~/Lesson4 $ trip_duration '2017-10-10 12:00:00' '2017-10-10 00:00:00'
-43200

6. From the last output in the previous step, you will observe that we have a problem if the stop time is earlier than the start time. This can happen if a trip crosses midnight. If we can assume that no trip ever exceeds 24 hours (which is a given), then we can fix this problem very easily by making the following modification to the last line of the function:

echo $(( ((n_stop - n_start) + 86400) % 86400 ))

We add 86,400 (the number of seconds in a day) and apply the modulus with the same value. This converts negative values to the right answer. You can save the updated version of trip_duration as a script file named taxi1.sh.

Note
Refer to the taxi1.sh to taxi9.sh files that are supplied with this code bundle for the set of scripts that were used in this case study.

7. Now, if we test with the last example, we get the right answer:

robin ~/Lesson4 $ ./taxi1.sh '2017-10-10 12:00:00' '2017-10-10 00:00:00'
43200

8. Next, let's extend this script to read lines of data and print the duration for each trip. For this, we will use a while loop that reads each line and splits it using the comma as a delimiter. We pass the first two fields to trip_duration, which prints the duration of each trip:

while read -r line
do
  IFS=','
  fields=( $line )
  trip_duration "${fields[0]}" "${fields[1]}"
done

Save this script as taxi2.sh.

9. Test this script for the first five values of the dataset, as follows:

robin ~/Lesson4 $ head -n5 nyc_taxi.csv | ./taxi2.sh

A later, optimized revision of this script (see the taxi script series in the code bundle) times as follows against a test file:

real	0m0.549s
user	0m0.450s
sys	0m0.098s

We gained some performance by avoiding local variables. Declaring a local variable has some penalty, because it is created and destroyed every time the function is called. The function logic is simpler now, and in the main code, we no longer need to reset IFS within the loop, thus giving a small speedup.

19. Now, let's go ahead and generate the new data that contains this "calculated field" of trip duration:

robin ~/Lesson4 $ ./taxi6.sh nyc_taxi.csv > nyc_taxi2.csv
robin ~/Lesson4 $ head -n5 nyc_taxi2.csv
2017-01-09 11:13:28,2017-01-09 11:25:45,1,3.30,15.30,737
2017-01-09 11:32:27,2017-01-09 11:36:01,1,0.90,7.25,214
2017-01-09 11:38:20,2017-01-09 11:42:05,1,1.10,7.30,225
2017-01-09 11:52:13,2017-01-09 11:57:36,1,1.10,8.50,323
2017-01-01 00:00:02,2017-01-01 00:03:50,1,0.50,5.30,228

20. Let's also see how many invalid rows were eliminated:

robin ~/Lesson4 $ wc -l nyc_taxi.csv
200000 nyc_taxi.csv
robin ~/Lesson4 $ wc -l nyc_taxi2.csv
193305 nyc_taxi2.csv

Exercise 23: Taxi Trip Analysis – Calculating Average Trip Speed

Now, we can try to run some statistics on the data. For example, we can sum the distances and durations, and then get the average speed:

1. The first problem we face is that the fields for fare and distance have decimal numbers, whereas Bash only performs integer arithmetic. We can get around this problem by noticing that the distances in the data file always have two decimal places of precision. We can change the units to 1/100 of a mile by getting rid of the . symbol. This is a simple string operation, wherein we can use a pipeline with cut and tr to create a temporary file called test.txt with just the two columns:

robin ~/Lesson4 $ cut -d, -f4,6 < nyc_taxi2.csv | tr -d '.' > test.txt

Print the first five instances to verify that the operation works as desired:

robin ~/Lesson4 $ head -n5 test.txt
330,737
090,214
110,225
110,323
050,228

2. Now that we can calculate our distance (in units of 100ths of a mile) and time in seconds, let's write a script that can calculate the average speed of a trip. We will merge the previous cut command into this script and make the script take the data filename as an argument. The script will expect the input data in the format of nyc_taxi2.csv. First, we will create a temporary file to store the data for the fare and distance fields (this will be deleted at the end of the script). Open your text editor and write the following lines of code:

temp_file=temp${RANDOM}

# cut the 4th and 6th columns - distance and duration - get rid of
# the decimal point, and put the result in the temp file
cut -d, -f4,6 "$1" | tr -d '.' >$temp_file

Note
For simplicity, we created a temporary filename based on $RANDOM. This is not ideal because, if the file already exists, it gets overwritten. In these examples, we do not care, but ideally, we would use the mktemp command, which creates a randomly named file that definitely does not exist on the filesystem.
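With mktemp, the first line would instead look something like this (mktemp creates a unique file, typically under /tmp, and prints its path):

temp_file=$(mktemp)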

3. Next, we will read through each line, and sum the distances of each trip as well as the durations. The total distance divided by the total time is the average speed:

total_duration=0
total_distance=0
IFS=','
while read distance duration
do
  ((total_duration += 10#${duration}))
  ((total_distance += 10#${distance}))
done
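The exercise continues beyond this excerpt. For orientation, a plausible end-to-end version of the script so far (an assumed completion, not the book's exact code; it presumes the loop reads from the temp file and that we report an integer speed) might read:

#!/usr/bin/env bash
# Sketch: average trip speed from a data file in nyc_taxi2.csv format ($1).
temp_file=$(mktemp)
cut -d, -f4,6 "$1" | tr -d '.' > "$temp_file"

total_duration=0
total_distance=0
IFS=','
while read distance duration
do
  # 10# forces base-10 so leading zeros (e.g. 090) are not read as octal
  ((total_duration += 10#${duration}))
  ((total_distance += 10#${distance}))
done < "$temp_file"

# distance is in 1/100 mi and duration in seconds:
# (d/100 miles) / (t/3600 hours) simplifies to 36*d/t miles per hour
echo "Average speed: $(( (total_distance * 36) / total_duration )) mph"
rm "$temp_file"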