{ "nbformat": 4, "nbformat_minor": 5, "metadata": { "celltoolbar": "Slideshow", "interpreter": { "hash": "eea1ea0ecaf1b6c9c37bdc3c13e4a6538d638c8bc02013b9db3c15d084051188" }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "papermill": { "default_parameters": {}, "duration": 81.080472, "end_time": "2021-05-19T14:09:09.666398", "environment_variables": {}, "exception": null, "input_path": "__notebook__.ipynb", "output_path": "__notebook__.ipynb", "parameters": {}, "start_time": "2021-05-19T14:07:48.585926", "version": "2.3.3" }, "colab": { "name": "FDP_DataScience.ipynb", "provenance": [], "collapsed_sections": [ "nxvEkGXPM3Xh", "2Mekz3nHf5AA", "67VQ-Sec5DPV", "w727DnK4-DOc", "v7kM_SZV-DOc", "RkjWVTHu-DO6", "iHMqBtt8-DPd", "bccDMQzq-DPo", "hto5zbSL-DPr", "tEINf4bEL9jR", "U5Z_oMoLL9jV", "R5IeAY03L9ja", "oQOE7l55CjDb", "9XgpYu5cCDpw", "RWMXqvfuCDp5", "oHSffVJsCDp_", "lJYf3H2NKXWi", "D33vfL2xKXWn", "260I-jj1N_jN" ] }, "language_info": { "name": "python" } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "a34617df" }, "source": [ "

Practical Introduction to Data Science with Python



\n", "
\n", "

\n", "

Anand S Menon, Kannan K, 06-07-2021

" ], "id": "a34617df" }, { "cell_type": "markdown", "metadata": { "id": "42f285bf" }, "source": [ " " ], "id": "42f285bf" }, { "cell_type": "markdown", "metadata": { "id": "891471fd" }, "source": [ " " ], "id": "891471fd" }, { "cell_type": "markdown", "metadata": { "id": "6d884875" }, "source": [ "

Agenda

\n", "\n", "### 1. Introduction to Data Science\n", " 1.1 What is Data Science?\n", " 1.2 Why do we need Data Science?\n", " 1.3 Brief Overview of Topics\n", " 1.3.1 Big Data Analytics\n", " 1.3.2 Machine Learning & Deep Learning\n" ], "id": "6d884875" }, { "cell_type": "markdown", "metadata": { "id": "e9f055e9" }, "source": [ "### 2. Pythonic way of Data Science\n", " 2.1 Brief Intro to Python Programming Language\n", " 2.2 Python for Data Science\n", " 2.3 Data Processing, Statistical analysis and Visualization\n", " 2.3.1 numpy, pandas, scipy\n", " 2.3.2 matplotlib, seaborn, plotly\n", " 2.4 Model Building Frameworks\n", " 2.4.1 scikit-learn, TensorFlow, PyTorch" ], "id": "e9f055e9" }, { "cell_type": "markdown", "metadata": { "id": "b553ff9b" }, "source": [ "### 3. Approaching a Tabular (Structured) Problem \n", " 3.1 Understanding the Problem\n", " 3.2 Exploratory Data Analysis\n", " 3.3 Data Preprocessing\n", " 3.4 Feature Engineering\n", " 3.5 Model Building and Inference\n" ], "id": "b553ff9b" }, { "cell_type": "markdown", "metadata": { "id": "6c5cc163" }, "source": [ "### 4. Approaching a Text (NLP) Problem \n", " 4.1 Importance of solving NLP problems\n", " 4.2 Applications of NLP\n", " 4.3 Intro to NLP\n", " 4.4 NLP using Python\n", " 4.5 Approaching real life NLP problem" ], "id": "6c5cc163" }, { "cell_type": "markdown", "metadata": { "id": "okrqD3E-hjpi" }, "source": [ "### 5. Approaching a Vision Problem (Hands On)\n", " 5.1 An introduction to computer vision\n", " 5.1.1 What is Computer Vision?\n", " 5.1.2 How is computer vision used today?\n", " 5.2 Image Processing\n", " 5.2.1 Point Operators\n", " 5.2.1.1 Pixel Transforms\n", " 5.2.1.2 Color Transforms\n", " 5.2.1.3 Compositing and matting\n", " 5.2.1.4 Histogram Equalization\n", " 5.2.2 Linear Filtering\n", " 5.2.2.1 Separable Filtering\n", " 5.2.2.2 Band Pass and Steerable Filters\n", " 5.2.3 More neighborhood operators\n", " 5.2.3.1 Non-linear filtering\n", " 5.2.3.2 Bilateral filtering\n", " 5.2.3.3 Binary Image processing\n", " 5.2.4 Fourier Transforms\n", " 5.2.4.1 Two-dimensional Fourier Transforms\n", " 5.2.5 Pyramids and wavelets\n", " 5.2.5.1 Interpolation\n", " 5.2.5.2 Decimation\n", " 5.2.5.3 Multi-resolution representations\n", " 5.2.5.4 Wavelets\n", " 5.2.6 Geometric transformations\n", " 5.2.6.1 Parametric transformations\n", " 5.2.6.2 Mesh-based warping\n", " 5.3 OpenCV Library [Hands On]\n", " 5.3.1 Introduction\n", " 5.3.2 Changing colorspaces\n", " 5.3.3 Geometric transformations of Images\n", " 5.3.4 Image thresholding\n", " 5.3.5 Smoothing Images\n", " 5.3.6 Morphological Transformations\n", " 5.3.7 Image Gradients\n", " 5.3.8 Canny Edge Detection\n", " 5.3.9 Image Pyramids\n", " 5.3.10 Contours\n", " 5.3.11 Histograms\n", " 5.3.12 Image Transforms" ], "id": "okrqD3E-hjpi" }, { "cell_type": "markdown", "metadata": { "id": "82d016bb" }, "source": [ "

1. Introduction to Data Science


\n" ], "id": "82d016bb" }, { "cell_type": "markdown", "metadata": { "id": "4c9f5e7f" }, "source": [ "

1.1 What is Data Science?


\n", "\n", "\n", "

\"Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data, and apply knowledge and actionable insights from data across a broad range of application domains. Data science is related to data mining, machine learning and big data.\" - Wikipedia


\n" ], "id": "4c9f5e7f" }, { "cell_type": "markdown", "metadata": { "id": "b1474823" }, "source": [ "

In layman's terms, it is a combination of fields such as computer science, mathematics, statistics, and domain knowledge that work together to fetch, process, and evaluate huge amounts of data for better business decision-making.


\n", "\n", "" ], "id": "b1474823" }, { "cell_type": "markdown", "metadata": { "id": "15402732" }, "source": [ "

1.2 Why do we need Data Science?


\n", "\n", "\n", "

The answer to that question is pretty simple. Data science can help companies save billions of dollars by making the right decisions at the right time with the assistance of data.
Data science is all about data, and making sense of that data reduces uncertainty for any organization.


\n", "\n", "

Did you know that Southwest Airlines, at one point, was able to save $100 million by leveraging data? By reducing the time its planes spent idling on the tarmac, the airline changed how it used its resources. In short, it is hard for any business today to imagine a world without data.


\n", "\n", "\n" ], "id": "15402732" }, { "cell_type": "markdown", "metadata": { "id": "6da530e8" }, "source": [ "

Data statistics facts as of 2021

\n", "\n", "

\"Google gets over 3.5 billion searches daily.\"

\n", "\n", "

\"WhatsApp users exchange up to 65 billion messages daily.\"

\n", "\n", "

\"In 2020, every person generated 1.7 megabytes per second\"

\n", "\n", "

\"80-90% of the data we generate today is unstructured.\"

\n", "\n", "

\"On average, 500 million tweets are shared every day. This can be further broken down to 6,000 tweets per second, 350,000 tweets per minute, and around 200 billion tweets every year.\"


\n", "\n", "\n", "

The facts above clearly indicate a data explosion, and we need ways to ingest, validate, process, and draw inferences from data, all of which fall under the data science umbrella.

\n" ], "id": "6da530e8" }, { "cell_type": "markdown", "metadata": { "id": "86b00819" }, "source": [ "

Data Explosion


\n", "

Data is growing at a pace we have not imagined before.

\n", "\n", " " ], "id": "86b00819" }, { "cell_type": "markdown", "metadata": { "id": "e03cd9e6" }, "source": [ "

1.3.1 Big Data Analytics


\n", "

What is big data exactly? It can be defined as data sets whose size or type is beyond the ability of traditional relational databases to capture, manage, and process with low latency. Characteristics of big data include high volume, high velocity, and high variety.

Big data analytics is the use of advanced analytic techniques against very large, diverse big data sets that include structured, semi-structured and unstructured data, from different sources, and in different sizes from terabytes to zettabytes.

With big data analytics, you can ultimately fuel better and faster decision-making, modelling and prediction of future outcomes, and enhanced business intelligence.

Some commonly used frameworks are Apache Hadoop, Apache Spark, and Apache Cassandra.

" ], "id": "e03cd9e6" }, { "cell_type": "markdown", "metadata": { "id": "7d2f77c2" }, "source": [ "

1.3.2 Machine Learning & Deep Learning


\n", "\n", "

* Machine Learning involves algorithms that learn patterns from data and then apply them to decision-making.

\n", "\n", "

* Deep Learning learns by processing data on its own using neural networks, loosely analogous to how the human brain identifies something, analyses it, and makes a decision.
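
As a rough illustration of the first bullet, the sketch below fits a straight line to a handful of made-up points with NumPy and then uses the fitted line to predict an unseen value. The data, the variable names, and the choice of a simple linear model are assumptions made for this example only; real problems usually need more data and a proper train/test split.

```python
import numpy as np

# Toy data (made up for illustration): hours studied vs. exam score
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
scores = np.array([52.0, 54.0, 56.0, 58.0, 60.0])

# Learn the pattern: least-squares fit of a degree-1 polynomial (a line)
slope, intercept = np.polyfit(hours, scores, deg=1)

# Apply the learned pattern to a new, unseen input
predicted = slope * 6.0 + intercept
print(slope, intercept, predicted)
```

Because the toy scores were generated as 50 + 2 * hours, the fitted slope and intercept should come out very close to 2 and 50.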


\n", "\n", "\n", "\n" ], "id": "7d2f77c2" }, { "cell_type": "markdown", "metadata": { "id": "562393ef" }, "source": [ "## 2. Pythonic way of Data Science" ], "id": "562393ef" }, { "cell_type": "markdown", "metadata": { "id": "kK70oQo5Uzqx" }, "source": [ "## **Brief Intro to Python Programming Language**\n", "\n", "**Python is an interpreted, high-level, general-purpose programming language. It is dynamically typed and garbage-collected.**\n", "\n", "It is one of the most popular and fastest-growing programming languages in the world (Stack Overflow survey 2020). It first appeared in February 1991 (30 years ago). \n", "\n", "It is highly efficient to write, and programs tend to be clear and readable. It is well suited to applications like data science and web development.\n", "\n", "It is used in a wide range of fields, from healthcare to finance to VFX and AI.\n", "\n", "The father of Python: Guido van Rossum \n", "\n" ], "id": "kK70oQo5Uzqx" }, { "cell_type": "markdown", "metadata": { "id": "BrQydd54dW33" }, "source": [ "\n", "**Internal working of Python**\n", "
" ], "id": "BrQydd54dW33" }, { "cell_type": "markdown", "metadata": { "id": "gyWhzgOCuhDK" }, "source": [ "## **Python way of Data Science**\n", "\n", "Python provides great functionality for mathematics, statistics and scientific computing. It also provides great libraries for data science applications.\n", "\n" ], "id": "gyWhzgOCuhDK" }, { "cell_type": "markdown", "metadata": { "id": "tB-mDACGumyw" }, "source": [ "* It’s Flexible\n", "* It’s Easy to Learn\n", "* It’s Open Source\n", "* It’s Well-Supported\n", "\n", "\n", "As is the case with many other programming languages, it’s the available libraries that lead to Python’s success: some 72,000 of them in the Python Package Index (PyPI) and growing constantly.\n" ], "id": "tB-mDACGumyw" }, { "cell_type": "markdown", "metadata": { "id": "XLOAbjeFXIir" }, "source": [ "### **Who is it for?**\n", "\n", "#### **Students**\n", "The use of short English keywords instead of symbols makes Python particularly friendly for beginners.\n" ], "id": "XLOAbjeFXIir" }, { "cell_type": "markdown", "metadata": { "id": "CNvhRs9RYJLk" }, "source": [ "Python\n", "\n", "```python\n", "# print the integers from 1 to 9\n", "for i in range(1, 10):\n", "    print(i)\n", "```\n", "\n", "Java\n", "\n", "```java\n", "// print the integers from 1 to 9\n", "for(int i=1;i<10;i++){\n", " System.out.println(i);\n", "}\n", "```\n", "\n", "Python handles a lot of memory management automatically, so you spend less time dealing with pointers and references than in languages like C++.\n", "\n" ], "id": "CNvhRs9RYJLk" }, { "cell_type": "markdown", "metadata": { "id": "ut1XxPaOYzFz" }, "source": [ "### **Developers who prioritize ease of programming over execution speed**\n", "\n", "Python runs slower than other common languages like C++ or Java, but it's often quicker to write (three to five times faster than Java and five to ten times faster than C++).\n", "\n", "This is one of the main reasons Python is used as a \"support language\" for tasks like build control, 
automated testing and bug tracking. For the same reason, it is often used for prototyping." ], "id": "ut1XxPaOYzFz" }, { "cell_type": "markdown", "metadata": { "id": "OejznBlDZ6m-" }, "source": [ "### **Developers working with text and numbers**\n", "\n", "Python has powerful string and list manipulation functions. It's a great tool for exploring data.\n", "\n", "Avoiding the need for compilation makes it easy to run Python interactively on data for step-by-step manipulation (just like in this Jupyter notebook).\n", "\n", "Python's core functionality can be extended with a huge number of packages which are mainly used by the scientific and academic communities as well as engineering. \n", "\n", "These include NumPy (mathematical functions and data analysis), SciPy (scientific computing), Matplotlib (data visualization) and pandas (data analysis), and more recently TensorFlow and Keras (machine learning/AI), along with tens of thousands of others." ], "id": "OejznBlDZ6m-" }, { "cell_type": "markdown", "metadata": { "id": "oB2GwvzWbdi3" }, "source": [ "### **Web Developers**\n", "\n", "Python is widely used for backend development of web applications. It has built-in support for common formats like HTML, XML and JSON.\n", "\n", "Frameworks like Django, Flask and Bottle are widely used. 
For example, Reddit's backend is written in Python.\n" ], "id": "oB2GwvzWbdi3" }, { "cell_type": "markdown", "metadata": { "id": "v7YzmVyPcIKf" }, "source": [ "### **The Zen of Python**\n", "\n", "The Zen of Python is a collection of 19 \"guiding principles\".\n", "\n", "Some are:\n", "* Readability counts.\n", "* Simple is better than complex.\n", "* There should be one - and preferably only one - obvious way to do it.\n", "\n", "\n", "\n", "\n" ], "id": "v7YzmVyPcIKf" }, { "cell_type": "markdown", "metadata": { "id": "S7ZPPpOLfu3s" }, "source": [ "## **Intro to Data Processing, Statistical analysis and Visualization libraries**" ], "id": "S7ZPPpOLfu3s" }, { "cell_type": "markdown", "metadata": { "id": "nxvEkGXPM3Xh" }, "source": [ "## A Brief Note on Python Versions\n", "\n", "As of January 1, 2020, Python has [officially dropped support](https://www.python.org/doc/sunset-python-2/) for `python2`. " ], "id": "nxvEkGXPM3Xh" }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "1L4Am0QATgOc", "outputId": "4c759d29-d563-44e4-f2b0-ae1eabc5f90c" }, "source": [ "!python --version" ], "id": "1L4Am0QATgOc", "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "Python 3.7.11\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "2Mekz3nHf5AA" }, "source": [ "## **Numpy**\n", "\n", "Reference: [Stanford CS class CS231n: Convolutional Neural Networks for Visual Recognition Python Numpy Tutorial](https://colab.research.google.com/github/cs231n/cs231n.github.io/blob/master/python-colab.ipynb)" ], "id": "2Mekz3nHf5AA" }, { "cell_type": "markdown", "metadata": { "id": "fY12nHhyL9hX" }, "source": [ "Numpy is the core library for scientific computing in Python. It provides a high-performance multidimensional array object, and tools for working with these arrays." 
], "id": "fY12nHhyL9hX" }, { "cell_type": "markdown", "metadata": { "id": "lZMyAdqhL9hY" }, "source": [ "To use Numpy, we first need to import the `numpy` package:" ], "id": "lZMyAdqhL9hY" }, { "cell_type": "code", "metadata": { "id": "58QdX8BLL9hZ" }, "source": [ "import numpy as np" ], "id": "58QdX8BLL9hZ", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "DDx6v1EdL9hb" }, "source": [ "###Arrays" ], "id": "DDx6v1EdL9hb" }, { "cell_type": "markdown", "metadata": { "id": "f-Zv3f7LL9hc" }, "source": [ "A numpy array is a grid of values, all of the same type, and is indexed by a tuple of nonnegative integers. The number of dimensions is the rank of the array; the shape of an array is a tuple of integers giving the size of the array along each dimension." ], "id": "f-Zv3f7LL9hc" }, { "cell_type": "markdown", "metadata": { "id": "_eMTRnZRL9hc" }, "source": [ "We can initialize numpy arrays from nested Python lists, and access elements using square brackets:" ], "id": "_eMTRnZRL9hc" }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "-l3JrGxCL9hc", "outputId": "a556282a-2546-4fbc-94f6-606e124600e0" }, "source": [ "a = np.array([1, 2, 3]) # Create a rank 1 array\n", "print(type(a), a.shape, a[0], a[1], a[2])\n", "a[0] = 5 # Change an element of the array\n", "print(a) " ], "id": "-l3JrGxCL9hc", "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "<class 'numpy.ndarray'> (3,) 1 2 3\n", "[5 2 3]\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "ma6mk-kdL9hh", "outputId": "027b98e9-06d9-4cbb-efe7-62a618069d55" }, "source": [ "b = np.array([[1,2,3],[4,5,6]]) # Create a rank 2 array\n", "print(b)" ], "id": "ma6mk-kdL9hh", "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "[[1 2 3]\n", " [4 5 6]]\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": 
"https://localhost:8080/" }, "id": "ymfSHAwtL9hj", "outputId": "16b5f2d4-a194-4921-8e63-096335b4a8ed" }, "source": [ "print(b.shape)\n", "print(b[0, 0], b[0, 1], b[1, 0])" ], "id": "ymfSHAwtL9hj", "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "(2, 3)\n", "1 2 4\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "F2qwdyvuL9hn" }, "source": [ "Numpy also provides many functions to create arrays:" ], "id": "F2qwdyvuL9hn" }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "mVTN_EBqL9hn", "outputId": "7c57cf44-1dc4-4994-f292-b0042afd3549" }, "source": [ "a = np.zeros((2,2)) # Create an array of all zeros\n", "print(a)" ], "id": "mVTN_EBqL9hn", "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "[[0. 0.]\n", " [0. 0.]]\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "skiKlNmlL9h5", "outputId": "e0d7e060-31c5-4366-938e-9db4da1ceeab" }, "source": [ "b = np.ones((1,2)) # Create an array of all ones\n", "print(b)" ], "id": "skiKlNmlL9h5", "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "[[1. 1.]]\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "HtFsr03bL9h7", "outputId": "3a894ac6-3fcd-448d-86ac-8ffba2dddc6b" }, "source": [ "c = np.full((2,2), 7) # Create a constant array\n", "print(c)" ], "id": "HtFsr03bL9h7", "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "[[7 7]\n", " [7 7]]\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "-QcALHvkL9h9", "outputId": "6f0d66c7-6916-4020-d796-015b6102fc3f" }, "source": [ "d = np.eye(2) # Create a 2x2 identity matrix\n", "print(d)" ], "id": "-QcALHvkL9h9", "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "[[1. 
0.]\n", " [0. 1.]]\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "RCpaYg9qL9iA", "outputId": "4f6fe37c-9471-4803-e28d-578e6e75b8a0" }, "source": [ "e = np.random.random((2,2)) # Create an array filled with random values\n", "print(e)" ], "id": "RCpaYg9qL9iA", "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "[[0.18008187 0.07831069]\n", " [0.3829961 0.69663983]]\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "jI5qcSDfL9iC" }, "source": [ "###Array indexing" ], "id": "jI5qcSDfL9iC" }, { "cell_type": "markdown", "metadata": { "id": "M-E4MUeVL9iC" }, "source": [ "Numpy offers several ways to index into arrays." ], "id": "M-E4MUeVL9iC" }, { "cell_type": "markdown", "metadata": { "id": "QYv4JyIEL9iD" }, "source": [ "Slicing: Similar to Python lists, numpy arrays can be sliced. Since arrays may be multidimensional, you must specify a slice for each dimension of the array:" ], "id": "QYv4JyIEL9iD" }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "wLWA0udwL9iD", "outputId": "db217881-ea42-4666-8f94-d18362543b48" }, "source": [ "import numpy as np\n", "\n", "# Create the following rank 2 array with shape (3, 4)\n", "# [[ 1 2 3 4]\n", "# [ 5 6 7 8]\n", "# [ 9 10 11 12]]\n", "a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])\n", "\n", "# Use slicing to pull out the subarray consisting of the first 2 rows\n", "# and columns 1 and 2; b is the following array of shape (2, 2):\n", "# [[2 3]\n", "# [6 7]]\n", "b = a[:2, 1:3]\n", "print(b)" ], "id": "wLWA0udwL9iD", "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "[[2 3]\n", " [6 7]]\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "KahhtZKYL9iF" }, "source": [ "A slice of an array is a view into the same data, so modifying it will modify the original array." 
], "id": "KahhtZKYL9iF" }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "1kmtaFHuL9iG", "outputId": "a1904677-760b-4b06-98a4-e40c1166a4f9" }, "source": [ "print(a[0, 1])\n", "b[0, 0] = 77 # b[0, 0] is the same piece of data as a[0, 1]\n", "print(a[0, 1]) " ], "id": "1kmtaFHuL9iG", "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "2\n", "77\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "_Zcf3zi-L9iI" }, "source": [ "You can also mix integer indexing with slice indexing. However, doing so will yield an array of lower rank than the original array." ], "id": "_Zcf3zi-L9iI" }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "G6lfbPuxL9iJ", "outputId": "ed574c53-725e-4aab-ea77-559030ca0c4f" }, "source": [ "# Create the following rank 2 array with shape (3, 4)\n", "a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])\n", "print(a)" ], "id": "G6lfbPuxL9iJ", "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "[[ 1 2 3 4]\n", " [ 5 6 7 8]\n", " [ 9 10 11 12]]\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "NCye3NXhL9iL" }, "source": [ "Two ways of accessing the data in the middle row of the array.\n", "Mixing integer indexing with slices yields an array of lower rank,\n", "while using only slices yields an array of the same rank as the\n", "original array:" ], "id": "NCye3NXhL9iL" }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "EOiEMsmNL9iL", "outputId": "14aeab20-fc8c-4d3b-9d82-02867f7b94b7" }, "source": [ "row_r1 = a[1, :] # Rank 1 view of the second row of a \n", "row_r2 = a[1:2, :] # Rank 2 view of the second row of a\n", "row_r3 = a[[1], :] # Rank 2 view of the second row of a\n", "print(row_r1, row_r1.shape)\n", "print(row_r2, row_r2.shape)\n", "print(row_r3, row_r3.shape)" ], "id": "EOiEMsmNL9iL", "execution_count": 
null, "outputs": [ { "output_type": "stream", "text": [ "[5 6 7 8] (4,)\n", "[[5 6 7 8]] (1, 4)\n", "[[5 6 7 8]] (1, 4)\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "JXu73pfDL9iN", "outputId": "13f315ca-dc9d-416a-d2d7-781a2d981417" }, "source": [ "# We can make the same distinction when accessing columns of an array:\n", "col_r1 = a[:, 1]\n", "col_r2 = a[:, 1:2]\n", "print(col_r1, col_r1.shape)\n", "print()\n", "print(col_r2, col_r2.shape)" ], "id": "JXu73pfDL9iN", "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "[ 2 6 10] (3,)\n", "\n", "[[ 2]\n", " [ 6]\n", " [10]] (3, 1)\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "kaE8dBGgL9id" }, "source": [ "Boolean array indexing: Boolean array indexing lets you pick out arbitrary elements of an array. Frequently this type of indexing is used to select the elements of an array that satisfy some condition. Here is an example:" ], "id": "kaE8dBGgL9id" }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "32PusjtKL9id", "outputId": "62637a7f-9cea-4f78-c304-72299fb0354f" }, "source": [ "import numpy as np\n", "\n", "a = np.array([[1,2], [3, 4], [5, 6]])\n", "\n", "bool_idx = (a > 2) # Find the elements of a that are bigger than 2;\n", " # this returns a numpy array of Booleans of the same\n", " # shape as a, where each slot of bool_idx tells\n", " # whether that element of a is > 2.\n", "\n", "print(bool_idx)" ], "id": "32PusjtKL9id", "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "[[False False]\n", " [ True True]\n", " [ True True]]\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "cb2IRMXaL9if", "outputId": "15622e1e-987e-4471-c11b-0dd9d2cff1a7" }, "source": [ "# We use boolean array indexing to construct a rank 1 array\n", "# consisting of 
the elements of a corresponding to the True values\n", "# of bool_idx\n", "print(a[bool_idx])\n", "\n", "# We can do all of the above in a single concise statement:\n", "print(a[a > 2])" ], "id": "cb2IRMXaL9if", "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "[3 4 5 6]\n", "[3 4 5 6]\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "jTctwqdQL9ih" }, "source": [ "###Datatypes" ], "id": "jTctwqdQL9ih" }, { "cell_type": "markdown", "metadata": { "id": "kSZQ1WkIL9ih" }, "source": [ "Every numpy array is a grid of elements of the same type. Numpy provides a large set of numeric datatypes that you can use to construct arrays. Numpy tries to guess a datatype when you create an array, but functions that construct arrays usually also include an optional argument to explicitly specify the datatype. Here is an example:" ], "id": "kSZQ1WkIL9ih" }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "4za4O0m5L9ih", "outputId": "6a16e744-fd97-4fb4-c87e-13680897b8f9" }, "source": [ "x = np.array([1, 2]) # Let numpy choose the datatype\n", "y = np.array([1.0, 2.0]) # Let numpy choose the datatype\n", "z = np.array([1, 2], dtype=np.int64) # Force a particular datatype\n", "\n", "print(x.dtype, y.dtype, z.dtype)" ], "id": "4za4O0m5L9ih", "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "int64 float64 int64\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "RLVIsZQpL9ik" }, "source": [ "You can read all about numpy datatypes in the [documentation](http://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html)." 
], "id": "RLVIsZQpL9ik" }, { "cell_type": "markdown", "metadata": { "id": "TuB-fdhIL9ik" }, "source": [ "###Array math" ], "id": "TuB-fdhIL9ik" }, { "cell_type": "markdown", "metadata": { "id": "18e8V8elL9ik" }, "source": [ "Basic mathematical functions operate elementwise on arrays, and are available both as operator overloads and as functions in the numpy module:" ], "id": "18e8V8elL9ik" }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "gHKvBrSKL9il", "outputId": "520eb959-d29f-4999-e556-2df5637dfcb5" }, "source": [ "x = np.array([[1,2],[3,4]], dtype=np.float64)\n", "y = np.array([[5,6],[7,8]], dtype=np.float64)\n", "\n", "# Elementwise sum; both produce the array\n", "print(x + y)\n", "print(np.add(x, y))" ], "id": "gHKvBrSKL9il", "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "[[ 6. 8.]\n", " [10. 12.]]\n", "[[ 6. 8.]\n", " [10. 12.]]\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "1fZtIAMxL9in", "outputId": "b0733bb2-e448-4ca5-c184-8c7d6edf6847" }, "source": [ "# Elementwise difference; both produce the array\n", "print(x - y)\n", "print(np.subtract(x, y))" ], "id": "1fZtIAMxL9in", "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "[[-4. -4.]\n", " [-4. -4.]]\n", "[[-4. -4.]\n", " [-4. -4.]]\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "nil4AScML9io", "outputId": "86bb0e16-18f5-4be1-800e-1faa84beaa0d" }, "source": [ "# Elementwise product; both produce the array\n", "print(x * y)\n", "print(np.multiply(x, y))" ], "id": "nil4AScML9io", "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "[[ 5. 12.]\n", " [21. 32.]]\n", "[[ 5. 12.]\n", " [21. 
32.]]\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "0JoA4lH6L9ip", "outputId": "3d391d97-d00c-43b7-8a75-519a01213eb0" }, "source": [ "# Elementwise division; both produce the array\n", "# [[ 0.2 0.33333333]\n", "# [ 0.42857143 0.5 ]]\n", "print(x / y)\n", "print(np.divide(x, y))" ], "id": "0JoA4lH6L9ip", "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "[[0.2 0.33333333]\n", " [0.42857143 0.5 ]]\n", "[[0.2 0.33333333]\n", " [0.42857143 0.5 ]]\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "g0iZuA6bL9ir", "outputId": "39e81644-613d-45e7-d682-c57a4a0ddd60" }, "source": [ "# Elementwise square root; produces the array\n", "# [[ 1. 1.41421356]\n", "# [ 1.73205081 2. ]]\n", "print(np.sqrt(x))" ], "id": "g0iZuA6bL9ir", "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "[[1. 1.41421356]\n", " [1.73205081 2. ]]\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "I3FnmoSeL9iu", "outputId": "de799a09-265c-4368-f2e4-51d87c204069" }, "source": [ "x = np.array([[1,2],[3,4]])\n", "y = np.array([[5,6],[7,8]])\n", "\n", "v = np.array([9,10])\n", "w = np.array([11, 12])\n", "\n", "# Inner product of vectors; both produce 219\n", "print(v.dot(w))\n", "print(np.dot(v, w))" ], "id": "I3FnmoSeL9iu", "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "219\n", "219\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "vmxPbrHASVeA" }, "source": [ "You can also use the `@` operator which is equivalent to numpy's `dot` operator." 
], "id": "vmxPbrHASVeA" }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "vyrWA-mXSdtt", "outputId": "bb77dbb2-4820-4cab-d451-a1235b74d949" }, "source": [ "print(v @ w)" ], "id": "vyrWA-mXSdtt", "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "219\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "zvUODeTxL9iw", "outputId": "c030a4a6-2950-437e-da51-2b8cd334a73c" }, "source": [ "# Matrix / vector product; both produce the rank 1 array [29 67]\n", "print(x.dot(v))\n", "print(np.dot(x, v))\n", "print(x @ v)" ], "id": "zvUODeTxL9iw", "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "[29 67]\n", "[29 67]\n", "[29 67]\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "3V_3NzNEL9iy", "outputId": "8c37cf2b-5d13-411e-ba89-5257a7bbdfdf" }, "source": [ "# Matrix / matrix product; both produce the rank 2 array\n", "# [[19 22]\n", "# [43 50]]\n", "print(x.dot(y))\n", "print(np.dot(x, y))\n", "print(x @ y)" ], "id": "3V_3NzNEL9iy", "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "[[19 22]\n", " [43 50]]\n", "[[19 22]\n", " [43 50]]\n", "[[19 22]\n", " [43 50]]\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "FbE-1If_L9i0" }, "source": [ "Numpy provides many useful functions for performing computations on arrays; one of the most useful is `sum`:" ], "id": "FbE-1If_L9i0" }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "DZUdZvPrL9i0", "outputId": "9a865f4e-2b61-48cf-8496-db5ce6be11f4" }, "source": [ "x = np.array([[1,2],[3,4]])\n", "\n", "print(np.sum(x)) # Compute sum of all elements; prints \"10\"\n", "print(np.sum(x, axis=0)) # Compute sum of each column; prints \"[4 6]\"\n", "print(np.sum(x, axis=1)) # Compute sum of each row; 
prints \"[3 7]\"" ], "id": "DZUdZvPrL9i0", "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "10\n", "[4 6]\n", "[3 7]\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "ahdVW4iUL9i3" }, "source": [ "You can find the full list of mathematical functions provided by numpy in the [documentation](http://docs.scipy.org/doc/numpy/reference/routines.math.html).\n", "\n", "Apart from computing mathematical functions using arrays, we frequently need to reshape or otherwise manipulate data in arrays. The simplest example of this type of operation is transposing a matrix; to transpose a matrix, simply use the T attribute of an array object:" ], "id": "ahdVW4iUL9i3" }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "63Yl1f3oL9i3", "outputId": "d0f5fa2c-b40d-4acf-afcd-e37fadf366eb" }, "source": [ "print(x)\n", "print(\"transpose\\n\", x.T)" ], "id": "63Yl1f3oL9i3", "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "[[1 2]\n", " [3 4]]\n", "transpose\n", " [[1 3]\n", " [2 4]]\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "mkk03eNIL9i4", "outputId": "37577142-1e97-4606-96b8-d26d42418406" }, "source": [ "v = np.array([[1,2,3]])\n", "print(v )\n", "print(\"transpose\\n\", v.T)" ], "id": "mkk03eNIL9i4", "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "[[1 2 3]]\n", "transpose\n", " [[1]\n", " [2]\n", " [3]]\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "67VQ-Sec5DPV" }, "source": [ "## Pandas\n", "\n", "[Reference: Google Machine learning Crash Course Colab Notebooks](https://colab.research.google.com/notebooks/mlcc/intro_to_pandas.ipynb)" ], "id": "67VQ-Sec5DPV" }, { "cell_type": "markdown", "metadata": { "id": "TIFJ83ZTBctl" }, "source": [ "[*pandas*](http://pandas.pydata.org/) is a column-oriented data analysis API. 
It's a great tool for handling and analyzing input data, and many ML frameworks support *pandas* data structures as inputs. For a more complete reference, the [*pandas* docs site](http://pandas.pydata.org/pandas-docs/stable/index.html) contains extensive documentation and many tutorials." ], "id": "TIFJ83ZTBctl" }, { "cell_type": "markdown", "metadata": { "id": "s_JOISVgmn9v" }, "source": [ "### Basic Concepts\n", "\n", "The following line imports the *pandas* API and prints the API version:" ], "id": "s_JOISVgmn9v" }, { "cell_type": "code", "metadata": { "id": "aSRYu62xUi3g", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "496bebc9-7df0-4521-812d-33a40fc6cf0c" }, "source": [ "from __future__ import print_function\n", "\n", "import pandas as pd\n", "pd.__version__" ], "id": "aSRYu62xUi3g", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" }, "text/plain": [ "'1.1.5'" ] }, "metadata": { "tags": [] }, "execution_count": 31 } ] }, { "cell_type": "markdown", "metadata": { "id": "daQreKXIUslr" }, "source": [ "The primary data structures in *pandas* are implemented as two classes:\n", "\n", " * **`DataFrame`**, which you can imagine as a relational data table, with rows and named columns.\n", " * **`Series`**, which is a single column. A `DataFrame` contains one or more `Series` and a name for each `Series`.\n", "\n", "The data frame is a commonly used abstraction for data manipulation. Similar implementations exist in [Spark](https://spark.apache.org/) and [R](https://www.r-project.org/about.html)." ], "id": "daQreKXIUslr" }, { "cell_type": "markdown", "metadata": { "id": "fjnAk1xcU0yc" }, "source": [ "One way to create a `Series` is to construct a `Series` object. 
For example:" ], "id": "fjnAk1xcU0yc" }, { "cell_type": "code", "metadata": { "id": "DFZ42Uq7UFDj", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "79a0d94d-bd1f-4fe5-a63e-2ace0a24202f" }, "source": [ "pd.Series(['San Francisco', 'San Jose', 'Sacramento'])" ], "id": "DFZ42Uq7UFDj", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "0 San Francisco\n", "1 San Jose\n", "2 Sacramento\n", "dtype: object" ] }, "metadata": { "tags": [] }, "execution_count": 32 } ] }, { "cell_type": "markdown", "metadata": { "id": "U5ouUp1cU6pC" }, "source": [ "`DataFrame` objects can be created by passing a `dict` mapping `string` column names to their respective `Series`. If the `Series` don't match in length, missing values are filled with special [NA/NaN](http://pandas.pydata.org/pandas-docs/stable/missing_data.html) values. Example:" ], "id": "U5ouUp1cU6pC" }, { "cell_type": "code", "metadata": { "id": "avgr6GfiUh8t", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "93460921-0d30-4a2d-b218-a044231d0d7f" }, "source": [ "city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])\n", "population = pd.Series([852469, 1015785, 485199])\n", "\n", "pd.DataFrame({ 'City name': city_names, 'Population': population })" ], "id": "avgr6GfiUh8t", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
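<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>City name</th><th>Population</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>San Francisco</td><td>852469</td></tr>\n",
"    <tr><th>1</th><td>San Jose</td><td>1015785</td></tr>\n",
"    <tr><th>2</th><td>Sacramento</td><td>485199</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>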
\n", "
" ], "text/plain": [ " City name Population\n", "0 San Francisco 852469\n", "1 San Jose 1015785\n", "2 Sacramento 485199" ] }, "metadata": { "tags": [] }, "execution_count": 33 } ] }, { "cell_type": "markdown", "metadata": { "id": "L82F07sU6kxP" }, "source": [ "But most of the time, you load an entire file into a `DataFrame`. The following example loads a file with California housing data." ], "id": "L82F07sU6kxP" }, { "cell_type": "code", "metadata": { "id": "av6RYOraVG1V", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "714a3628-1a54-45fb-fb0b-4a7faedc3643" }, "source": [ "california_housing_dataframe = pd.read_csv(\"https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv\", sep=\",\")\n", "california_housing_dataframe.describe()" ], "id": "av6RYOraVG1V", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>longitude</th><th>latitude</th><th>housing_median_age</th><th>total_rooms</th><th>total_bedrooms</th><th>population</th><th>households</th><th>median_income</th><th>median_house_value</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>count</th><td>17000.000000</td><td>17000.000000</td><td>17000.000000</td><td>17000.000000</td><td>17000.000000</td><td>17000.000000</td><td>17000.000000</td><td>17000.000000</td><td>17000.000000</td></tr>\n",
"    <tr><th>mean</th><td>-119.562108</td><td>35.625225</td><td>28.589353</td><td>2643.664412</td><td>539.410824</td><td>1429.573941</td><td>501.221941</td><td>3.883578</td><td>207300.912353</td></tr>\n",
"    <tr><th>std</th><td>2.005166</td><td>2.137340</td><td>12.586937</td><td>2179.947071</td><td>421.499452</td><td>1147.852959</td><td>384.520841</td><td>1.908157</td><td>115983.764387</td></tr>\n",
"    <tr><th>min</th><td>-124.350000</td><td>32.540000</td><td>1.000000</td><td>2.000000</td><td>1.000000</td><td>3.000000</td><td>1.000000</td><td>0.499900</td><td>14999.000000</td></tr>\n",
"    <tr><th>25%</th><td>-121.790000</td><td>33.930000</td><td>18.000000</td><td>1462.000000</td><td>297.000000</td><td>790.000000</td><td>282.000000</td><td>2.566375</td><td>119400.000000</td></tr>\n",
"    <tr><th>50%</th><td>-118.490000</td><td>34.250000</td><td>29.000000</td><td>2127.000000</td><td>434.000000</td><td>1167.000000</td><td>409.000000</td><td>3.544600</td><td>180400.000000</td></tr>\n",
"    <tr><th>75%</th><td>-118.000000</td><td>37.720000</td><td>37.000000</td><td>3151.250000</td><td>648.250000</td><td>1721.000000</td><td>605.250000</td><td>4.767000</td><td>265000.000000</td></tr>\n",
"    <tr><th>max</th><td>-114.310000</td><td>41.950000</td><td>52.000000</td><td>37937.000000</td><td>6445.000000</td><td>35682.000000</td><td>6082.000000</td><td>15.000100</td><td>500001.000000</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
\n", "
" ], "text/plain": [ " longitude latitude ... median_income median_house_value\n", "count 17000.000000 17000.000000 ... 17000.000000 17000.000000\n", "mean -119.562108 35.625225 ... 3.883578 207300.912353\n", "std 2.005166 2.137340 ... 1.908157 115983.764387\n", "min -124.350000 32.540000 ... 0.499900 14999.000000\n", "25% -121.790000 33.930000 ... 2.566375 119400.000000\n", "50% -118.490000 34.250000 ... 3.544600 180400.000000\n", "75% -118.000000 37.720000 ... 4.767000 265000.000000\n", "max -114.310000 41.950000 ... 15.000100 500001.000000\n", "\n", "[8 rows x 9 columns]" ] }, "metadata": { "tags": [] }, "execution_count": 36 } ] }, { "cell_type": "markdown", "metadata": { "id": "WrkBjfz5kEQu" }, "source": [ "The example above used `DataFrame.describe` to show interesting statistics about a `DataFrame`. Another useful function is `DataFrame.head`, which displays the first few records of a `DataFrame`:" ], "id": "WrkBjfz5kEQu" }, { "cell_type": "code", "metadata": { "id": "s3ND3bgOkB5k", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "0c9393a4-6514-40c4-b49a-30167cd9f39b" }, "source": [ "california_housing_dataframe.head()" ], "id": "s3ND3bgOkB5k", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>longitude</th><th>latitude</th><th>housing_median_age</th><th>total_rooms</th><th>total_bedrooms</th><th>population</th><th>households</th><th>median_income</th><th>median_house_value</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>-114.31</td><td>34.19</td><td>15.0</td><td>5612.0</td><td>1283.0</td><td>1015.0</td><td>472.0</td><td>1.4936</td><td>66900.0</td></tr>\n",
"    <tr><th>1</th><td>-114.47</td><td>34.40</td><td>19.0</td><td>7650.0</td><td>1901.0</td><td>1129.0</td><td>463.0</td><td>1.8200</td><td>80100.0</td></tr>\n",
"    <tr><th>2</th><td>-114.56</td><td>33.69</td><td>17.0</td><td>720.0</td><td>174.0</td><td>333.0</td><td>117.0</td><td>1.6509</td><td>85700.0</td></tr>\n",
"    <tr><th>3</th><td>-114.57</td><td>33.64</td><td>14.0</td><td>1501.0</td><td>337.0</td><td>515.0</td><td>226.0</td><td>3.1917</td><td>73400.0</td></tr>\n",
"    <tr><th>4</th><td>-114.57</td><td>33.57</td><td>20.0</td><td>1454.0</td><td>326.0</td><td>624.0</td><td>262.0</td><td>1.9250</td><td>65500.0</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
\n", "
" ], "text/plain": [ " longitude latitude ... median_income median_house_value\n", "0 -114.31 34.19 ... 1.4936 66900.0\n", "1 -114.47 34.40 ... 1.8200 80100.0\n", "2 -114.56 33.69 ... 1.6509 85700.0\n", "3 -114.57 33.64 ... 3.1917 73400.0\n", "4 -114.57 33.57 ... 1.9250 65500.0\n", "\n", "[5 rows x 9 columns]" ] }, "metadata": { "tags": [] }, "execution_count": 37 } ] }, { "cell_type": "markdown", "metadata": { "id": "w9-Es5Y6laGd" }, "source": [ "Another powerful feature of *pandas* is graphing. For example, `DataFrame.hist` lets you quickly study the distribution of values in a column:" ], "id": "w9-Es5Y6laGd" }, { "cell_type": "code", "metadata": { "id": "nqndFVXVlbPN", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "2f7aaae6-d2e5-4426-afd1-2f75e86cbcfd" }, "source": [ "california_housing_dataframe.hist('housing_median_age')" ], "id": "nqndFVXVlbPN", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "array([[]],\n", " dtype=object)" ] }, "metadata": { "tags": [] }, "execution_count": 38 }, { "output_type": "display_data", "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "tags": [], "needs_background": "light" } } ] }, { "cell_type": "markdown", "metadata": { "id": "XtYZ7114n3b-" }, "source": [ "### Accessing Data\n", "\n", "You can access `DataFrame` data using familiar Python dict/list operations:" ], "id": "XtYZ7114n3b-" }, { "cell_type": "code", "metadata": { "id": "_TFm7-looBFF", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "92873b3d-7aee-4b88-c480-345700a69497" }, "source": [ "cities = pd.DataFrame({ 'City name': city_names, 'Population': population })\n", "print(type(cities['City name']))\n", "cities['City name']" ], "id": "_TFm7-looBFF", "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "<class 'pandas.core.series.Series'>\n" ], "name": "stdout" }, { "output_type": "execute_result", "data": { "text/plain": [ "0 San Francisco\n", "1 San Jose\n", "2 Sacramento\n", "Name: City name, dtype: object" ] }, "metadata": { "tags": [] }, "execution_count": 39 } ] }, { "cell_type": "code", "metadata": { "id": "V5L6xacLoxyv", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "0960eed7-a819-4601-b589-05a0d7f57f83" }, "source": [ "print(type(cities['City name'][1]))\n", "cities['City name'][1]" ], "id": "V5L6xacLoxyv", "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "<class 'str'>\n" ], "name": "stdout" }, { "output_type": "execute_result", "data": { "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" }, "text/plain": [ "'San Jose'" ] }, "metadata": { "tags": [] }, "execution_count": 40 } ] }, { "cell_type": "code", "metadata": { "id": "gcYX1tBPugZl", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "4f1a671f-fa08-4a8e-f40d-fbbc9da5cbd3" }, "source": [ "print(type(cities[0:2]))\n", "cities[0:2]" ], "id": "gcYX1tBPugZl", "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "<class 'pandas.core.frame.DataFrame'>\n" ], "name": "stdout" }, { "output_type": "execute_result", "data": { "text/html": [ "
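<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>City name</th><th>Population</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>San Francisco</td><td>852469</td></tr>\n",
"    <tr><th>1</th><td>San Jose</td><td>1015785</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>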
\n", "
" ], "text/plain": [ " City name Population\n", "0 San Francisco 852469\n", "1 San Jose 1015785" ] }, "metadata": { "tags": [] }, "execution_count": 41 } ] }, { "cell_type": "markdown", "metadata": { "id": "65g1ZdGVjXsQ" }, "source": [ "In addition, *pandas* provides an extremely rich API for advanced [indexing and selection](http://pandas.pydata.org/pandas-docs/stable/indexing.html) that is too extensive to be covered here." ], "id": "65g1ZdGVjXsQ" }, { "cell_type": "markdown", "metadata": { "id": "RM1iaD-ka3Y1" }, "source": [ "### Manipulating Data\n", "\n", "You may apply Python's basic arithmetic operations to `Series`. For example:" ], "id": "RM1iaD-ka3Y1" }, { "cell_type": "code", "metadata": { "id": "XWmyCFJ5bOv-", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "c9beb5c6-5a6b-4c28-dd32-a25db90718df" }, "source": [ "population / 1000." ], "id": "XWmyCFJ5bOv-", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "0 852.469\n", "1 1015.785\n", "2 485.199\n", "dtype: float64" ] }, "metadata": { "tags": [] }, "execution_count": 42 } ] }, { "cell_type": "markdown", "metadata": { "id": "ZeYYLoV9b9fB" }, "source": [ "\n", "Modifying `DataFrames` is also straightforward. For example, the following code adds two `Series` to an existing `DataFrame`:" ], "id": "ZeYYLoV9b9fB" }, { "cell_type": "code", "metadata": { "id": "0gCEX99Hb8LR", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "d4bbdcf1-e2d6-412c-bbed-19ff303e1119" }, "source": [ "cities['Area square miles'] = pd.Series([46.87, 176.53, 97.92])\n", "cities['Population density'] = cities['Population'] / cities['Area square miles']\n", "cities" ], "id": "0gCEX99Hb8LR", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>City name</th><th>Population</th><th>Area square miles</th><th>Population density</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>San Francisco</td><td>852469</td><td>46.87</td><td>18187.945381</td></tr>\n",
"    <tr><th>1</th><td>San Jose</td><td>1015785</td><td>176.53</td><td>5754.177760</td></tr>\n",
"    <tr><th>2</th><td>Sacramento</td><td>485199</td><td>97.92</td><td>4955.055147</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
\n", "
" ], "text/plain": [ " City name Population Area square miles Population density\n", "0 San Francisco 852469 46.87 18187.945381\n", "1 San Jose 1015785 176.53 5754.177760\n", "2 Sacramento 485199 97.92 4955.055147" ] }, "metadata": { "tags": [] }, "execution_count": 43 } ] }, { "cell_type": "markdown", "metadata": { "id": "s6HmFYSM8_tP" }, "source": [ "### Indexes\n", "Both `Series` and `DataFrame` objects also define an `index` property that assigns an identifier value to each `Series` item or `DataFrame` row. \n", "\n", "By default, at construction, *pandas* assigns index values that reflect the ordering of the source data. Once created, the index values are stable; that is, they do not change when data is reordered." ], "id": "s6HmFYSM8_tP" }, { "cell_type": "code", "metadata": { "id": "2684gsWNinq9", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "f0a638fe-de3e-4072-b43b-d70919d99f9a" }, "source": [ "city_names.index" ], "id": "2684gsWNinq9", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "RangeIndex(start=0, stop=3, step=1)" ] }, "metadata": { "tags": [] }, "execution_count": 44 } ] }, { "cell_type": "code", "metadata": { "id": "F_qPe2TBjfWd", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "f7320007-88c8-4484-8e6f-4c544d7a430a" }, "source": [ "cities.index" ], "id": "F_qPe2TBjfWd", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "RangeIndex(start=0, stop=3, step=1)" ] }, "metadata": { "tags": [] }, "execution_count": 45 } ] }, { "cell_type": "markdown", "metadata": { "id": "hp2oWY9Slo_h" }, "source": [ "Call `DataFrame.reindex` to manually reorder the rows. 
For example, the following has the same effect as sorting by city name:" ], "id": "hp2oWY9Slo_h" }, { "cell_type": "code", "metadata": { "id": "sN0zUzSAj-U1", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "66ff6fbe-5891-4e46-d18e-f862a64d718c" }, "source": [ "cities.reindex([2, 0, 1])" ], "id": "sN0zUzSAj-U1", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>City name</th><th>Population</th><th>Area square miles</th><th>Population density</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>2</th><td>Sacramento</td><td>485199</td><td>97.92</td><td>4955.055147</td></tr>\n",
"    <tr><th>0</th><td>San Francisco</td><td>852469</td><td>46.87</td><td>18187.945381</td></tr>\n",
"    <tr><th>1</th><td>San Jose</td><td>1015785</td><td>176.53</td><td>5754.177760</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
\n", "
" ], "text/plain": [ " City name Population Area square miles Population density\n", "2 Sacramento 485199 97.92 4955.055147\n", "0 San Francisco 852469 46.87 18187.945381\n", "1 San Jose 1015785 176.53 5754.177760" ] }, "metadata": { "tags": [] }, "execution_count": 46 } ] }, { "cell_type": "markdown", "metadata": { "id": "U0ynHLEv-PfU" }, "source": [ "## **SciPy**\n", "\n", "Reference [J.R. Johansson scientific-python-lectures](https://colab.research.google.com/github/jrjohansson/scientific-python-lectures/blob/master/Lecture-3-Scipy.ipynb#scrollTo=UNBywy94-DOE)" ], "id": "U0ynHLEv-PfU" }, { "cell_type": "code", "metadata": { "id": "ot1L-Sn9-DOD" }, "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "from IPython.display import Image" ], "id": "ot1L-Sn9-DOD", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "lhheU7xc-DOF" }, "source": [ "The SciPy framework builds on top of the low-level NumPy framework for multidimensional arrays, and provides a large number of higher-level scientific algorithms. 
Some of the topics that SciPy covers are:\n", "\n", "* Special functions ([scipy.special](http://docs.scipy.org/doc/scipy/reference/special.html))\n", "* Integration ([scipy.integrate](http://docs.scipy.org/doc/scipy/reference/integrate.html))\n", "* Optimization ([scipy.optimize](http://docs.scipy.org/doc/scipy/reference/optimize.html))\n", "* Interpolation ([scipy.interpolate](http://docs.scipy.org/doc/scipy/reference/interpolate.html))\n", "* Fourier Transforms ([scipy.fftpack](http://docs.scipy.org/doc/scipy/reference/fftpack.html))\n", "* Signal Processing ([scipy.signal](http://docs.scipy.org/doc/scipy/reference/signal.html))\n", "* Linear Algebra ([scipy.linalg](http://docs.scipy.org/doc/scipy/reference/linalg.html))\n", "* Sparse Eigenvalue Problems ([scipy.sparse](http://docs.scipy.org/doc/scipy/reference/sparse.html))\n", "* Statistics ([scipy.stats](http://docs.scipy.org/doc/scipy/reference/stats.html))\n", "* Multi-dimensional image processing ([scipy.ndimage](http://docs.scipy.org/doc/scipy/reference/ndimage.html))\n", "* File IO ([scipy.io](http://docs.scipy.org/doc/scipy/reference/io.html))\n", "\n", "Each of these submodules provides a number of functions and classes that can be used to solve problems in their respective topics.\n", "\n", "To access the SciPy package in a Python program, we start by importing everything from the `scipy` module." ], "id": "lhheU7xc-DOF" }, { "cell_type": "code", "metadata": { "id": "t3Ooh82T-DOQ" }, "source": [ "from scipy import *" ], "id": "t3Ooh82T-DOQ", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "C-FKfjEv-DOR" }, "source": [ "If we only need to use part of the SciPy framework we can selectively include only those modules we are interested in. 
For example, to include the linear algebra package under the name `la`, we can do:" ], "id": "C-FKfjEv-DOR" }, { "cell_type": "code", "metadata": { "id": "hQb054p4-DOS" }, "source": [ "import scipy.linalg as la" ], "id": "hQb054p4-DOS", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "8Or8fRUY_gPI" }, "source": [ "### Special functions" ], "id": "8Or8fRUY_gPI" }, { "cell_type": "markdown", "metadata": { "id": "_p-FPWMR_gPN" }, "source": [ "A large number of mathematical special functions are important for many computational physics problems. SciPy provides implementations of a very extensive set of special functions. For details, see the list of functions in the reference documentation at http://docs.scipy.org/doc/scipy/reference/special.html#module-scipy.special. \n", "\n", "A typical example of these special functions is the family of Bessel functions." ], "id": "_p-FPWMR_gPN" }, { "cell_type": "markdown", "metadata": { "id": "w727DnK4-DOc" }, "source": [ "### Integration" ], "id": "w727DnK4-DOc" }, { "cell_type": "markdown", "metadata": { "id": "v7kM_SZV-DOc" }, "source": [ "#### Numerical integration: quadrature" ], "id": "v7kM_SZV-DOc" }, { "cell_type": "markdown", "metadata": { "id": "SSPId-X3-DOd" }, "source": [ "Numerical evaluation of an integral of the form\n", "\n", "$\\displaystyle \\int_a^b f(x) dx$\n", "\n", "is called *numerical quadrature*, or simply *quadrature*. 
SciPy provides a series of functions for different kinds of quadrature, for example `quad`, `dblquad` and `tplquad` for single, double and triple integrals, respectively.\n", "\n" ], "id": "SSPId-X3-DOd" }, { "cell_type": "code", "metadata": { "id": "MhE0AQil-DOd" }, "source": [ "from scipy.integrate import quad, dblquad, tplquad" ], "id": "MhE0AQil-DOd", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "F-qZ5KpI-DOn" }, "source": [ "### Ordinary differential equations (ODEs)" ], "id": "F-qZ5KpI-DOn" }, { "cell_type": "markdown", "metadata": { "id": "6RoNvCaY-DOo" }, "source": [ "SciPy provides two different ways to solve ODEs: an API based on the function `odeint`, and an object-oriented API based on the class `ode`. Usually `odeint` is easier to get started with, but the `ode` class offers a finer level of control.\n", "\n", "Here we will use the `odeint` function. For more information about the class `ode`, try `help(ode)`. It does pretty much the same thing as `odeint`, but in an object-oriented fashion.\n", "\n", "For example, to use `odeint`, import it from the `scipy.integrate` module:" ], "id": "6RoNvCaY-DOo" }, { "cell_type": "code", "metadata": { "id": "W3dCJkmt-DOo" }, "source": [ "from scipy.integrate import odeint, ode" ], "id": "W3dCJkmt-DOo", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "jF4qyrWx-DO1" }, "source": [ "### Fourier transform" ], "id": "jF4qyrWx-DO1" }, { "cell_type": "markdown", "metadata": { "id": "S6n0KaUp-DO1" }, "source": [ "Fourier transforms are one of the universal tools in computational physics; they appear over and over again in different contexts. SciPy provides functions for accessing the classic [FFTPACK](http://www.netlib.org/fftpack/) library from NetLib, which is an efficient and well-tested FFT library written in FORTRAN. 
The SciPy API has a few additional convenience functions, but overall the API is closely related to the original FORTRAN library.\n", "\n", "To use the `fftpack` module in a Python program, include it using:" ], "id": "S6n0KaUp-DO1" }, { "cell_type": "code", "metadata": { "id": "QNDx_xga-DO2" }, "source": [ "from numpy.fft import fftfreq\n", "from scipy.fftpack import *" ], "id": "QNDx_xga-DO2", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "RkjWVTHu-DO6" }, "source": [ "### Linear algebra" ], "id": "RkjWVTHu-DO6" }, { "cell_type": "markdown", "metadata": { "id": "zgYRfZTS-DO6" }, "source": [ "The linear algebra module contains a lot of matrix-related functions, including linear equation solving, eigenvalue solvers, matrix functions (for example matrix exponentiation), a number of different decompositions (SVD, LU, Cholesky), etc. \n", "\n", "Detailed documentation is available at: http://docs.scipy.org/doc/scipy/reference/linalg.html\n", "\n" ], "id": "zgYRfZTS-DO6" }, { "cell_type": "markdown", "metadata": { "id": "iHMqBtt8-DPd" }, "source": [ "### Optimization" ], "id": "iHMqBtt8-DPd" }, { "cell_type": "markdown", "metadata": { "id": "Mrytp8a--DPd" }, "source": [ "Optimization (finding minima or maxima of a function) is a large field in mathematics, and optimization of complicated functions or in many variables can be rather involved. Here we will only look at a few very simple cases. 
For a more detailed introduction to optimization with SciPy see: http://scipy-lectures.github.com/advanced/mathematical_optimization/index.html\n", "\n", "To use the optimization module in SciPy, first import the `optimize` module:" ], "id": "Mrytp8a--DPd" }, { "cell_type": "code", "metadata": { "id": "WFMKeEXI-DPe" }, "source": [ "from scipy import optimize" ], "id": "WFMKeEXI-DPe", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "bccDMQzq-DPo" }, "source": [ "### Interpolation" ], "id": "bccDMQzq-DPo" }, { "cell_type": "markdown", "metadata": { "id": "VANZkmg6-DPp" }, "source": [ "Interpolation is simple and convenient in SciPy: the `interp1d` function, when given arrays describing X and Y data, returns an object that behaves like a function. It can be called for an arbitrary value of x (in the range covered by X) and returns the corresponding interpolated y value." ], "id": "VANZkmg6-DPp" }, { "cell_type": "markdown", "metadata": { "id": "hto5zbSL-DPr" }, "source": [ "### Statistics" ], "id": "hto5zbSL-DPr" }, { "cell_type": "markdown", "metadata": { "id": "DLxyl93J-DPr" }, "source": [ "The `scipy.stats` module contains a large number of statistical distributions, statistical functions and tests. For complete documentation of its features, see http://docs.scipy.org/doc/scipy/reference/stats.html.\n", "\n", "There is also a very powerful Python package for statistical modelling called statsmodels. See http://statsmodels.sourceforge.net for more details." ], "id": "DLxyl93J-DPr" }, { "cell_type": "markdown", "metadata": { "id": "tEINf4bEL9jR" }, "source": [ "## Matplotlib" ], "id": "tEINf4bEL9jR" }, { "cell_type": "markdown", "metadata": { "id": "0hgVWLaXL9jR" }, "source": [ "Matplotlib is a plotting library. This section gives a brief introduction to the `matplotlib.pyplot` module, which provides a plotting system similar to that of MATLAB."
], "id": "0hgVWLaXL9jR" }, { "cell_type": "code", "metadata": { "id": "cmh_7c6KL9jR" }, "source": [ "import matplotlib.pyplot as plt" ], "id": "cmh_7c6KL9jR", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "jOsaA5hGL9jS" }, "source": [ "Running this special IPython command displays plots inline:" ], "id": "jOsaA5hGL9jS" }, { "cell_type": "code", "metadata": { "id": "ijpsmwGnL9jT" }, "source": [ "%matplotlib inline" ], "id": "ijpsmwGnL9jT", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "U5Z_oMoLL9jV" }, "source": [ "### Plotting" ], "id": "U5Z_oMoLL9jV" }, { "cell_type": "markdown", "metadata": { "id": "6QyFJ7dhL9jV" }, "source": [ "The most important function in `matplotlib` is `plot`, which allows you to plot 2D data. Here is a simple example:" ], "id": "6QyFJ7dhL9jV" }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "pua52BGeL9jW", "outputId": "02d2486a-b148-4557-84a3-e558c0c2e7b3" }, "source": [ "# Compute the x and y coordinates for points on a sine curve\n", "x = np.arange(0, 3 * np.pi, 0.1)\n", "y = np.sin(x)\n", "\n", "# Plot the points using matplotlib\n", "plt.plot(x, y)" ], "id": "pua52BGeL9jW", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "[]" ] }, "metadata": { "tags": [] }, "execution_count": 56 }, { "output_type": "display_data", "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "tags": [], "needs_background": "light" } } ] }, { "cell_type": "markdown", "metadata": { "id": "9W2VAcLiL9jX" }, "source": [ "With just a little bit of extra work we can easily plot multiple lines at once, and add a title, legend, and axis labels:" ], "id": "9W2VAcLiL9jX" }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "TfCQHJ5AL9jY", "outputId": "33fea410-20b1-4612-bba5-07e4f93d8dde" }, "source": [ "y_sin = np.sin(x)\n", "y_cos = np.cos(x)\n", "\n", "# Plot the points using matplotlib\n", "plt.plot(x, y_sin)\n", "plt.plot(x, y_cos)\n", "plt.xlabel('x axis label')\n", "plt.ylabel('y axis label')\n", "plt.title('Sine and Cosine')\n", "plt.legend(['Sine', 'Cosine'])" ], "id": "TfCQHJ5AL9jY", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "" ] }, "metadata": { "tags": [] }, "execution_count": 57 }, { "output_type": "display_data", "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "tags": [], "needs_background": "light" } } ] }, { "cell_type": "markdown", "metadata": { "id": "R5IeAY03L9ja" }, "source": [ "### Subplots" ], "id": "R5IeAY03L9ja" }, { "cell_type": "markdown", "metadata": { "id": "CfUzwJg0L9ja" }, "source": [ "You can plot different things in the same figure using the `subplot` function. Here is an example:" ], "id": "CfUzwJg0L9ja" }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "dM23yGH9L9ja", "outputId": "d864071a-f186-4b19-aec6-90d0b801ad2a" }, "source": [ "# Compute the x and y coordinates for points on sine and cosine curves\n", "x = np.arange(0, 3 * np.pi, 0.1)\n", "y_sin = np.sin(x)\n", "y_cos = np.cos(x)\n", "\n", "# Set up a subplot grid that has height 2 and width 1,\n", "# and set the first such subplot as active.\n", "plt.subplot(2, 1, 1)\n", "\n", "# Make the first plot\n", "plt.plot(x, y_sin)\n", "plt.title('Sine')\n", "\n", "# Set the second subplot as active, and make the second plot.\n", "plt.subplot(2, 1, 2)\n", "plt.plot(x, y_cos)\n", "plt.title('Cosine')\n", "\n", "# Show the figure.\n", "plt.show()" ], "id": "dM23yGH9L9ja", "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "tags": [], "needs_background": "light" } } ] }, { "cell_type": "markdown", "metadata": { "id": "gLtsST5SL9jc" }, "source": [ "You can read much more about the `subplot` function in the [documentation](http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.subplot)." ], "id": "gLtsST5SL9jc" }, { "cell_type": "markdown", "metadata": { "id": "oQOE7l55CjDb" }, "source": [ "## Seaborn\n", "\n", "Reference: [Visualization-With-Seaborn](https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/04.14-Visualization-With-Seaborn.ipynb)\n", "\n", "[Seaborn](http://seaborn.pydata.org/) provides an API on top of Matplotlib that offers sane choices for plot style and color defaults, defines simple high-level functions for common statistical plot types, and integrates with the functionality provided by Pandas ``DataFrame``s." ], "id": "oQOE7l55CjDb" }, { "cell_type": "markdown", "metadata": { "id": "9XgpYu5cCDpw" }, "source": [ "### Seaborn Versus Matplotlib\n", "\n", "Here is an example of a simple random-walk plot in Matplotlib, using its classic plot formatting and colors.\n", "We start with the typical imports:" ], "id": "9XgpYu5cCDpw" }, { "cell_type": "code", "metadata": { "collapsed": true, "id": "9EI1yPflCDpx" }, "source": [ "import matplotlib.pyplot as plt\n", "plt.style.use('classic')\n", "%matplotlib inline\n", "import numpy as np\n", "import pandas as pd" ], "id": "9EI1yPflCDpx", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "3LtrvkccCDpy" }, "source": [ "Now we create some random walk data:" ], "id": "3LtrvkccCDpy" }, { "cell_type": "code", "metadata": { "collapsed": true, "id": "PqcvRhU0CDpy" }, "source": [ "# Create some data\n", "rng = np.random.RandomState(0)\n", "x = np.linspace(0, 10, 500)\n", "y = np.cumsum(rng.randn(500, 6), 0)" ], "id": "PqcvRhU0CDpy", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": 
"mfLYZpVBCDpz" }, "source": [ "And do a simple plot:" ], "id": "mfLYZpVBCDpz" }, { "cell_type": "code", "metadata": { "id": "1_9vMwoHCDp0", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "2302119d-54da-41e4-ead5-041be7de8da9" }, "source": [ "# Plot the data with Matplotlib defaults\n", "plt.plot(x, y)\n", "plt.legend('ABCDEF', ncol=2, loc='upper left');" ], "id": "1_9vMwoHCDp0", "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "tags": [], "needs_background": "light" } } ] }, { "cell_type": "markdown", "metadata": { "id": "233BUwDDCDp2" }, "source": [ "Although the result contains all the information we'd like it to convey, it does so in a way that is not all that aesthetically pleasing, and even looks a bit old-fashioned in the context of 21st-century data visualization.\n", "\n", "Now let's take a look at how it works with Seaborn.\n", "As we will see, Seaborn has many of its own high-level plotting routines, but it can also overwrite Matplotlib's default parameters and in turn get even simple Matplotlib scripts to produce vastly superior output.\n", "We can set the style by calling Seaborn's ``set()`` method.\n", "By convention, Seaborn is imported as ``sns``:" ], "id": "233BUwDDCDp2" }, { "cell_type": "code", "metadata": { "id": "8xeodUGhCDp2" }, "source": [ "import seaborn as sns\n", "sns.set()" ], "id": "8xeodUGhCDp2", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "Pqu1xVyZCDp3" }, "source": [ "Now let's rerun the same two lines as before:" ], "id": "Pqu1xVyZCDp3" }, { "cell_type": "code", "metadata": { "id": "wF3FRlJXCDp4", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "47df5495-5119-47f1-b5d9-024db8c51d48" }, "source": [ "# same plotting code as above!\n", "plt.plot(x, y)\n", "plt.legend('ABCDEF', ncol=2, loc='upper left');" ], "id": "wF3FRlJXCDp4", "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "tags": [] } } ] }, { "cell_type": "markdown", "metadata": { "id": "BFWhdAC2D1XY" }, "source": [ "### Exploring Seaborn Plots\n", "\n", "The main idea of Seaborn is that it provides high-level commands to create a variety of plot types useful for statistical data exploration, and even some statistical model fitting.\n", "\n", "Let's take a look at a few of the datasets and plot types available in Seaborn. Note that all of the following *could* be done using raw Matplotlib commands (this is, in fact, what Seaborn does under the hood) but the Seaborn API is much more convenient." ], "id": "BFWhdAC2D1XY" }, { "cell_type": "markdown", "metadata": { "id": "RWMXqvfuCDp5" }, "source": [ "#### Histograms, KDE, and densities\n", "\n", "Often in statistical data visualization, all you want is to plot histograms and joint distributions of variables.\n", "We have seen that this is relatively straightforward in Matplotlib:" ], "id": "RWMXqvfuCDp5" }, { "cell_type": "code", "metadata": { "id": "dGpJyFVKCDp6", "colab": { "base_uri": "https://localhost:8080/", "height": 613 }, "outputId": "9ff73500-88f9-4cc3-c687-ca7a8bc0d338" }, "source": [ "data = np.random.multivariate_normal([0, 0], [[5, 2], [2, 2]], size=2000)\n", "data = pd.DataFrame(data, columns=['x', 'y'])\n", "\n", "for col in 'xy':\n", " plt.hist(data[col], normed=True, alpha=0.5)" ], "id": "dGpJyFVKCDp6", "execution_count": null, "outputs": [ { "output_type": "error", "ename": "AttributeError", "evalue": "ignored", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mAttributeError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 3\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mcol\u001b[0m \u001b[0;32min\u001b[0m 
\u001b[0;34m'xy'\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 5\u001b[0;31m \u001b[0mplt\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mhist\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdata\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mcol\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnormed\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0malpha\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m0.5\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m/usr/local/lib/python3.7/dist-packages/matplotlib/pyplot.py\u001b[0m in \u001b[0;36mhist\u001b[0;34m(x, bins, range, density, weights, cumulative, bottom, histtype, align, orientation, rwidth, log, color, label, stacked, data, **kwargs)\u001b[0m\n\u001b[1;32m 2608\u001b[0m \u001b[0malign\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0malign\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0morientation\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0morientation\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mrwidth\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mrwidth\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlog\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mlog\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2609\u001b[0m color=color, label=label, stacked=stacked, **({\"data\": data}\n\u001b[0;32m-> 2610\u001b[0;31m if data is not None else {}), **kwargs)\n\u001b[0m\u001b[1;32m 2611\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2612\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/usr/local/lib/python3.7/dist-packages/matplotlib/__init__.py\u001b[0m in \u001b[0;36minner\u001b[0;34m(ax, data, *args, **kwargs)\u001b[0m\n\u001b[1;32m 1563\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0minner\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0max\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0mdata\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1564\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mdata\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1565\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mfunc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0max\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0mmap\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msanitize_sequence\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0margs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1566\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1567\u001b[0m \u001b[0mbound\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnew_sig\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mbind\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0max\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/usr/local/lib/python3.7/dist-packages/matplotlib/axes/_axes.py\u001b[0m in \u001b[0;36mhist\u001b[0;34m(self, x, bins, range, density, weights, cumulative, bottom, histtype, align, orientation, rwidth, log, color, label, stacked, **kwargs)\u001b[0m\n\u001b[1;32m 6817\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mpatch\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 6818\u001b[0m \u001b[0mp\u001b[0m \u001b[0;34m=\u001b[0m 
\u001b[0mpatch\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 6819\u001b[0;31m \u001b[0mp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mupdate\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 6820\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mlbl\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 6821\u001b[0m \u001b[0mp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mset_label\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mlbl\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/usr/local/lib/python3.7/dist-packages/matplotlib/artist.py\u001b[0m in \u001b[0;36mupdate\u001b[0;34m(self, props)\u001b[0m\n\u001b[1;32m 1004\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1005\u001b[0m \u001b[0;32mwith\u001b[0m \u001b[0mcbook\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_setattr_cm\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0meventson\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mFalse\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1006\u001b[0;31m \u001b[0mret\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0m_update_property\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mk\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mv\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mk\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mv\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mprops\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1007\u001b[0m 
\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1008\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mret\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/usr/local/lib/python3.7/dist-packages/matplotlib/artist.py\u001b[0m in \u001b[0;36m\u001b[0;34m(.0)\u001b[0m\n\u001b[1;32m 1004\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1005\u001b[0m \u001b[0;32mwith\u001b[0m \u001b[0mcbook\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_setattr_cm\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0meventson\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mFalse\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1006\u001b[0;31m \u001b[0mret\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0m_update_property\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mk\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mv\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mk\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mv\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mprops\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1007\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1008\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mret\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/usr/local/lib/python3.7/dist-packages/matplotlib/artist.py\u001b[0m in \u001b[0;36m_update_property\u001b[0;34m(self, k, v)\u001b[0m\n\u001b[1;32m 1000\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m 
\u001b[0mcallable\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfunc\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1001\u001b[0m raise AttributeError('{!r} object has no property {!r}'\n\u001b[0;32m-> 1002\u001b[0;31m .format(type(self).__name__, k))\n\u001b[0m\u001b[1;32m 1003\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mfunc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mv\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1004\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mAttributeError\u001b[0m: 'Rectangle' object has no property 'normed'" ] }, { "output_type": "display_data", "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "tags": [] } } ] }, { "cell_type": "markdown", "metadata": { "id": "XfYxJhYdCDp6" }, "source": [ "Rather than a histogram, we can get a smooth estimate of the distribution using a kernel density estimation, which Seaborn does with ``sns.kdeplot``:" ], "id": "XfYxJhYdCDp6" }, { "cell_type": "code", "metadata": { "id": "LM6phBMbCDp7" }, "source": [ "for col in 'xy':\n", " sns.kdeplot(data[col], shade=True)" ], "id": "LM6phBMbCDp7", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "WrHeHhiECDp7" }, "source": [ "Histograms and KDE can be combined using ``distplot``:" ], "id": "WrHeHhiECDp7" }, { "cell_type": "code", "metadata": { "id": "SpdBAq5dCDp8" }, "source": [ "sns.distplot(data['x'])\n", "sns.distplot(data['y']);" ], "id": "SpdBAq5dCDp8", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "aUh_AYukCDp8" }, "source": [ "If we pass the full two-dimensional dataset to ``kdeplot``, we will get a two-dimensional visualization of the data:" ], "id": "aUh_AYukCDp8" }, { "cell_type": "code", "metadata": { "id": "gxBhJMl0CDp9" }, "source": [ "sns.kdeplot(data);" ], "id": "gxBhJMl0CDp9", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "g1O0n3wUCDp9" }, "source": [ "We can see the joint distribution and the marginal distributions together using ``sns.jointplot``.\n", "For this plot, we'll set the style to a white background:" ], "id": "g1O0n3wUCDp9" }, { "cell_type": "code", "metadata": { "id": "k2gzsdSACDp-" }, "source": [ "with sns.axes_style('white'):\n", " sns.jointplot(\"x\", \"y\", data, kind='kde');" ], "id": "k2gzsdSACDp-", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "tkyWoazwCDp-" }, "source": [ "There are other parameters that can be passed to ``jointplot``—for example, we can use a hexagonally based histogram instead:" ], "id": "tkyWoazwCDp-" }, { "cell_type": "code", "metadata": { 
"id": "dKRVv5lwCDp_" }, "source": [ "with sns.axes_style('white'):\n", " sns.jointplot(\"x\", \"y\", data, kind='hex')" ], "id": "dKRVv5lwCDp_", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "oHSffVJsCDp_" }, "source": [ "#### Pair plots\n", "\n", "When you generalize joint plots to datasets of larger dimensions, you end up with *pair plots*. This is very useful for exploring correlations between multidimensional data, when you'd like to plot all pairs of values against each other.\n", "\n", "We'll demo this with the well-known Iris dataset, which lists measurements of petals and sepals of three iris species:" ], "id": "oHSffVJsCDp_" }, { "cell_type": "code", "metadata": { "id": "95rKSsXVCDqA" }, "source": [ "iris = sns.load_dataset(\"iris\")\n", "iris.head()" ], "id": "95rKSsXVCDqA", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "lgPWAzZiCDqA" }, "source": [ "Visualizing the multidimensional relationships among the samples is as easy as calling ``sns.pairplot``:" ], "id": "lgPWAzZiCDqA" }, { "cell_type": "code", "metadata": { "id": "oSPxC2V7CDqB" }, "source": [ "sns.pairplot(iris, hue='species', size=2.5);" ], "id": "oSPxC2V7CDqB", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "CIA5C-CNFxf3" }, "source": [ "## Plotly\n", "\n", "Plotly Python is an open-source library built on top of plotly.js which allows users to create professional quality, interactive, web-based or standalone visualizations or applications.
\n", "Visualizations can be displayed in Jupyter notebooks, standalone HTML files, or integrated into web-applications via the Dash framework.\n", "Over 40 unique charts and limitless customization options exist across statistical, financial, geographic, scientific, and 3-D plot types. (from plotly website)" ], "id": "CIA5C-CNFxf3" }, { "cell_type": "markdown", "metadata": { "id": "qs9Dbvw7GdBT" }, "source": [ "There are a couple more notes in regards to Plotly that is needed to be shared:\n", "\n", "Plotly - Online
\n", "Plotly - Enterprise\n" ], "id": "qs9Dbvw7GdBT" }, { "cell_type": "markdown", "metadata": { "id": "YQoCBrr1G1Db" }, "source": [ "## Intro to Model Building and inference frameworks" ], "id": "YQoCBrr1G1Db" }, { "cell_type": "markdown", "metadata": { "id": "_kK4RhV_IVn5" }, "source": [ "### scikit-learn\n", "\n", "Reference: [Scientific Python Stanford University](https://colab.research.google.com/drive/188vQVvDIH-9kbd4Un6BcPERKRztxIORe)" ], "id": "_kK4RhV_IVn5" }, { "cell_type": "markdown", "metadata": { "id": "JAZVbXHdIpCT" }, "source": [ "[Scikit-learn](https://scikit-learn.org/stable/) is a library that allows you to do machine learning, that is, make predictions from data, in Python. There are four basic machine learning tasks:\n", "\n", " 1. Regression: predict a number from datapoints, given datapoints and corresponding numbers\n", " 2. Classification: predict a category from datapoints, given datapoints and corresponding numbers\n", " 3. Clustering: predict a category from datapoints, given only datapoints\n", " 4. Dimensionality reduction: make datapoints lower-dimensional so that we can visualize the data\n", "\n", "Here is a [handy flowchart](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html) of when to use each technique." ], "id": "JAZVbXHdIpCT" }, { "cell_type": "markdown", "metadata": { "id": "iCkDyIHpJF3-" }, "source": [ "![](https://scikit-learn.org/stable/_static/ml_map.png)" ], "id": "iCkDyIHpJF3-" }, { "cell_type": "code", "metadata": { "scrolled": true, "id": "UvOyrbRBIOix" }, "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt" ], "id": "UvOyrbRBIOix", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "j6k2NWBgIOiy" }, "source": [ "#### Regression\n", "Abalone are a type of edible marine snail, and they have internal rings that correspond to their age (like trees). 
In the following, we will use a dataset of [abalone measurements](https://archive.ics.uci.edu/ml/datasets/abalone). It has the following fields:\n", "\n", " Sex / nominal / -- / M, F, and I (infant) \n", " Length / continuous / mm / Longest shell measurement \n", " Diameter\t/ continuous / mm / perpendicular to length \n", " Height / continuous / mm / with meat in shell \n", " Whole weight / continuous / grams / whole abalone \n", " Shucked weight / continuous\t/ grams / weight of meat \n", " Viscera weight / continuous / grams / gut weight (after bleeding) \n", " Shell weight / continuous / grams / after being dried \n", " Rings / integer / -- / +1.5 gives the age in years \n", "\n", "Suppose we are interested in predicting the age of the abalone given their measurements. This is an example of a regression problem." ], "id": "j6k2NWBgIOiy" }, { "cell_type": "code", "metadata": { "scrolled": true, "id": "QYFz8g72IOi1" }, "source": [ "df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data',\n", " header=None, names=['sex', 'length', 'diameter', 'height', 'weight', 'shucked_weight',\n", " 'viscera_weight', 'shell_weight', 'rings'])" ], "id": "QYFz8g72IOi1", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "scrolled": false, "id": "LBuF5ZJ8IOi1" }, "source": [ "df.head()" ], "id": "LBuF5ZJ8IOi1", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "scrolled": true, "id": "ahefjnACIOi2" }, "source": [ "df.describe()" ], "id": "ahefjnACIOi2", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "scrolled": true, "id": "Q3o2m9qmIOi3" }, "source": [ "df['rings'].plot(kind='hist')" ], "id": "Q3o2m9qmIOi3", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "scrolled": false, "id": "P_G245IbIOi3" }, "source": [ "df.plot('weight', 'rings', kind='scatter')" ], "id": "P_G245IbIOi3", "execution_count": null, "outputs": [] }, { 
"cell_type": "code", "metadata": { "scrolled": true, "id": "BDgvb4nlIOi4" }, "source": [ "X = df[['weight']].to_numpy()\n", "y = df['rings'].to_numpy()" ], "id": "BDgvb4nlIOi4", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "scrolled": true, "id": "mDk7dTMJIOi4" }, "source": [ "X" ], "id": "mDk7dTMJIOi4", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "scrolled": true, "id": "wOEmW5I8IOi5" }, "source": [ "y" ], "id": "wOEmW5I8IOi5", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "scrolled": true, "id": "eyxohibHIOi5" }, "source": [ "from sklearn import linear_model\n", "model = linear_model.LinearRegression()\n", "model.fit(X, y)\n", "print(model.coef_, model.intercept_)" ], "id": "eyxohibHIOi5", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "scrolled": true, "id": "l5FPsT5aIOi6" }, "source": [ "print(model.score(X, y))" ], "id": "l5FPsT5aIOi6", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "scrolled": true, "id": "92R_i04dIOi6" }, "source": [ "model.predict(np.array([[1.5], [2.2]]))" ], "id": "92R_i04dIOi6", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "scrolled": true, "id": "fgM_9fdIIOi7" }, "source": [ "df.plot('weight', 'rings', kind='scatter')\n", "\n", "weight = np.linspace(0, 3, 10).reshape(-1, 1)\n", "plt.plot(weight, model.predict(weight), 'r')" ], "id": "fgM_9fdIIOi7", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "BzvMWHgmIOi7" }, "source": [ "df['root_weight'] = np.sqrt(df['weight'])" ], "id": "BzvMWHgmIOi7", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "-4pmQ1XQIOi8" }, "source": [ "X = df[['weight','root_weight']].to_numpy()\n", "y = df['rings'].to_numpy()\n", "model = linear_model.LinearRegression()\n", "model.fit(X, y)" ], "id": "-4pmQ1XQIOi8", "execution_count": null, 
"outputs": [] }, { "cell_type": "code", "metadata": { "id": "xI1bvJryIOi8" }, "source": [ "weight = np.linspace(0, 3, 100).reshape(-1, 1)\n", "root_weight = np.sqrt(weight)\n", "\n", "features = np.hstack((weight,root_weight))" ], "id": "xI1bvJryIOi8", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "Wk3lrU6qIOi8" }, "source": [ "df.plot('weight', 'rings', kind='scatter')\n", "\n", "plt.plot(weight, model.predict(features), 'r')" ], "id": "Wk3lrU6qIOi8", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "h_bz_h27IOi9" }, "source": [ "model.score(X,y)" ], "id": "h_bz_h27IOi9", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "Aqszb3RuIOi9" }, "source": [ "plt.hist2d(df['weight'],df['rings'],bins=(50,30));" ], "id": "Aqszb3RuIOi9", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "lJYf3H2NKXWi" }, "source": [ "#### Classification\n", "\n", "Another example of a machine learning problem is classification. Here we will use a dataset of flower measurements from three different flower species of *Iris* (*Iris setosa*, *Iris virginica*, and *Iris versicolor*). We aim to predict the species of the flower. Because the species is not a numerical output, it is not a regression problem, but a classification problem." 
], "id": "lJYf3H2NKXWi" }, { "cell_type": "code", "metadata": { "id": "nv_ER6FDKXWi" }, "source": [ "from sklearn import datasets\n", "iris = datasets.load_iris()\n", "print(iris.DESCR)" ], "id": "nv_ER6FDKXWi", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "jDkdMhtdKXWj" }, "source": [ "X = iris.data[:, 2:]\n", "y = iris.target_names[iris.target]\n", "for name in iris.target_names:\n", " plt.scatter(X[y == name, 0], X[y == name, 1], label=name)\n", "plt.xlabel('Petal length')\n", "plt.ylabel('Petal width')\n", "plt.legend();" ], "id": "jDkdMhtdKXWj", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "scrolled": true, "id": "PfJrRE_UKXWj" }, "source": [ "from sklearn.model_selection import train_test_split\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)\n", "print(X_train.shape, y_train.shape)\n", "print(X_test.shape, y_test.shape)" ], "id": "PfJrRE_UKXWj", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "scrolled": false, "id": "kiTMK74oKXWk" }, "source": [ "from sklearn.neighbors import KNeighborsClassifier\n", "model = KNeighborsClassifier()\n", "model.fit(X_train, y_train)" ], "id": "kiTMK74oKXWk", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "scrolled": true, "id": "XX1bV5TsKXWk" }, "source": [ "X_test" ], "id": "XX1bV5TsKXWk", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "scrolled": true, "id": "PIe9aQKMKXWk" }, "source": [ "model.predict(X_test)" ], "id": "PIe9aQKMKXWk", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "scrolled": true, "id": "vilx5oh3KXWl" }, "source": [ "import sklearn.metrics as metrics\n", "metrics.accuracy_score(model.predict(X_test), y_test)" ], "id": "vilx5oh3KXWl", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "scrolled": true, "id": "649AqvvYKXWl" }, "source": [ 
"print(metrics.classification_report(model.predict(X_test), y_test))" ], "id": "649AqvvYKXWl", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "D33vfL2xKXWn" }, "source": [ "#### Clustering\n", "\n", "Clustering is useful if we don't have a dataset labelled with the categories we want to predict, but we nevertheless expect there to be a certain number of categories. For example, suppose we have the previous dataset, but we are missing the labels. We can use a clustering algorithm like k-means to *cluster* the datapoints. Because we don't have labels, clustering is what is called an **unsupervised learning** algorithm." ], "id": "D33vfL2xKXWn" }, { "cell_type": "code", "metadata": { "id": "lf7UIAXoMjPM" }, "source": [ "from sklearn.datasets import make_blobs\n", "X,y = make_blobs(centers=4, n_samples=200, random_state=0, cluster_std=0.7)\n", "print(X[:10],y[:10])" ], "id": "lf7UIAXoMjPM", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "Lbo31hO8MjPN" }, "source": [ "Now we plot these points, but without coloring the points using the labels:" ], "id": "Lbo31hO8MjPN" }, { "cell_type": "code", "metadata": { "id": "WalBVWi0MjPN" }, "source": [ "plt.scatter(X[:,0],X[:,1]);" ], "id": "WalBVWi0MjPN", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "4CV0HTccMjPO" }, "source": [ "We can still discern four clusters in the data set. Let's see if the k-means algorithm can recover these clusters. First we create the instance of the k-means model by giving it the number of clusters 4 as a hyperparameter." 
], "id": "4CV0HTccMjPO" }, { "cell_type": "code", "metadata": { "id": "dSC5t9sJMjPO" }, "source": [ "from sklearn.cluster import KMeans\n", "model = KMeans(4)\n", "model.fit(X)\n", "print(model.cluster_centers_)" ], "id": "dSC5t9sJMjPO", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "dqqVQU2eMjPO" }, "source": [ "plt.scatter(X[:,0],X[:,1], c=model.labels_);\n", "plt.scatter(model.cluster_centers_[:,0], model.cluster_centers_[:,1], s=100, color=\"red\"); # Show the centres" ], "id": "dqqVQU2eMjPO", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "Qhr6EyGuKXWs" }, "source": [ "#### Dimensionality reduction\n", "\n", "Dimensionality reduction is another unsupervised learning problem (that is, it does not require labels). It aims to project datapoints into a lower dimensional space while preserving distances between datapoints." ], "id": "Qhr6EyGuKXWs" }, { "cell_type": "markdown", "metadata": { "id": "tQ446lmW3kLM" }, "source": [ "---" ], "id": "tQ446lmW3kLM" }, { "cell_type": "markdown", "metadata": { "id": "260I-jj1N_jN" }, "source": [ "### Tensorflow\n", "\n", "TensorFlow is an end-to-end open source platform for machine learning.\n", "\n", "TensorFlow was originally created by Google as an internal machine learning tool, but an implementation of it was open sourced under the Apache 2.0 License in November 2015." 
], "id": "260I-jj1N_jN" }, { "cell_type": "markdown", "metadata": { "id": "baA63HqQSDXE" }, "source": [ "Few reason for the popluarity of tensorflow\n", "\n", "* Python API\n", "* Portability: deploy computation to one or more CPUs or GPUs in a esktop, server, or mobile device with a single API\n", "* Flexibility: from Raspberry Pi, Android, Windows, iOS, Linux to server farms\n", "* Visualization\n", "* Checkpoints (for managing experiments)\n", "* Large community (> 10,000 commits and > 3000 TF-related repos in one year)\n", "* Awesome projects already using TensorFlow" ], "id": "baA63HqQSDXE" }, { "cell_type": "markdown", "metadata": { "id": "F9QR6fyVQ-wl" }, "source": [ "Reference: [Tensorflow beginner](https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/quickstart/beginner.ipynb)" ], "id": "F9QR6fyVQ-wl" }, { "cell_type": "code", "metadata": { "id": "0trJmd6DjqBZ" }, "source": [ "import tensorflow as tf" ], "id": "0trJmd6DjqBZ", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "7NAbSZiaoJ4z" }, "source": [ "Load and prepare the [MNIST dataset](http://yann.lecun.com/exdb/mnist/). Convert the samples from integers to floating-point numbers:" ], "id": "7NAbSZiaoJ4z" }, { "cell_type": "code", "metadata": { "id": "7FP5258xjs-v" }, "source": [ "mnist = tf.keras.datasets.mnist\n", "\n", "(x_train, y_train), (x_test, y_test) = mnist.load_data()\n", "x_train, x_test = x_train / 255.0, x_test / 255.0" ], "id": "7FP5258xjs-v", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "BPZ68wASog_I" }, "source": [ "Build the `tf.keras.Sequential` model by stacking layers. 
Choose an optimizer and loss function for training:" ], "id": "BPZ68wASog_I" }, { "cell_type": "code", "metadata": { "id": "h3IKyzTCDNGo" }, "source": [ "model = tf.keras.models.Sequential([\n", "    tf.keras.layers.Flatten(input_shape=(28, 28)),\n", "    tf.keras.layers.Dense(128, activation='relu'),\n", "    tf.keras.layers.Dropout(0.2),\n", "    tf.keras.layers.Dense(10)\n", "])" ], "id": "h3IKyzTCDNGo", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "9foNKHzTD2Vo" }, "source": [ "# The final Dense layer outputs raw logits (no softmax), so the loss needs from_logits=True\n", "model.compile(optimizer='adam',\n", "              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),\n", "              metrics=['accuracy'])" ], "id": "9foNKHzTD2Vo", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "ix4mEL65on-w" }, "source": [ "The `Model.fit` method adjusts the model parameters to minimize the loss: " ], "id": "ix4mEL65on-w" }, { "cell_type": "code", "metadata": { "id": "y7suUbJXVLqP" }, "source": [ "model.fit(x_train, y_train, epochs=5)" ], "id": "y7suUbJXVLqP", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "4mDAAPFqVVgn" }, "source": [ "The `Model.evaluate` method checks the model's performance, usually on a \"[Validation-set](https://developers.google.com/machine-learning/glossary#validation-set)\" or \"[Test-set](https://developers.google.com/machine-learning/glossary#test-set)\"." ], "id": "4mDAAPFqVVgn" }, { "cell_type": "code", "metadata": { "id": "F7dTAzgHDUh7" }, "source": [ "model.evaluate(x_test, y_test, verbose=2)" ], "id": "F7dTAzgHDUh7", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "T4JfEh7kvx6m" }, "source": [ "The image classifier is now trained to ~98% accuracy on this dataset. To learn more, read the [TensorFlow tutorials](https://www.tensorflow.org/tutorials/)." 
], "id": "T4JfEh7kvx6m" }, { "cell_type": "markdown", "metadata": { "id": "V9NZH-k4TwGz" }, "source": [ "### **PyTorch**\n", "\n", "PyTorch is a machine learning framework that is used in both academia and industry for various applications. It is a widely used alternative to TensorFlow." ], "id": "V9NZH-k4TwGz" }, { "cell_type": "markdown", "metadata": { "id": "2KdKKG8qEp4P" }, "source": [ "# **---**" ], "id": "2KdKKG8qEp4P" }, { "cell_type": "markdown", "metadata": { "id": "d94d2871" }, "source": [ "

3. Approaching a Tabular Problem

\n", "

Titanic Survivor Prediction Challenge

\n", "\n", "" ], "id": "d94d2871" }, { "cell_type": "markdown", "metadata": { "id": "1f9dca13" }, "source": [ "

3.1 Understanding the Problem


\n", "\n", "

This challenge is something of a \"hello world\" exercise for the data science and machine learning community.
The task is simple: use machine learning to create a model that predicts which passengers survived the Titanic shipwreck. In this challenge, we perform a basic EDA and build a predictive model that answers the question “what sorts of people were more likely to survive?” using passenger data (i.e. name, age, gender, socio-economic class, etc.).

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
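
The idea that some groups were more likely to survive can be made concrete as a survival rate per group. The sketch below uses plain Python and a handful of hypothetical records (not the real Titanic data); the names `passengers` and `survival_rate_by` are illustrative assumptions. With the real DataFrame, an expression along the lines of `train_df.groupby('Sex')['Survived'].mean()` would express the same idea.

```python
from collections import defaultdict

# Hypothetical toy records (NOT the real Titanic data): (sex, pclass, survived)
passengers = [
    ('female', 1, 1),
    ('female', 3, 1),
    ('female', 3, 0),
    ('male', 1, 1),
    ('male', 2, 0),
    ('male', 3, 0),
    ('male', 3, 0),
    ('male', 1, 0),
]

def survival_rate_by(records, key_index):
    '''Return {group: fraction of survivors} for the column at key_index.'''
    totals = defaultdict(int)
    survivors = defaultdict(int)
    for record in records:
        group = record[key_index]
        totals[group] += 1
        survivors[group] += record[2]  # survived is coded 0 or 1
    return {group: survivors[group] / totals[group] for group in totals}

rate_by_sex = survival_rate_by(passengers, 0)    # group by sex
rate_by_class = survival_rate_by(passengers, 1)  # group by ticket class
print(rate_by_sex)
print(rate_by_class)
```

Because survival is coded as 0 or 1, the mean of the column within a group is exactly the fraction of that group that survived.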

" ], "id": "1f9dca13" }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.098282, "end_time": "2021-05-19T14:07:57.257046", "exception": false, "start_time": "2021-05-19T14:07:57.158764", "status": "completed" }, "tags": [], "id": "exact-peninsula" }, "source": [ "

Contents

\n", "\n", "

1. Data Description

\n", "

2. Exploratory Data Analysis

\n", "

3. Findings From EDA

\n", "

4. Data Preprocessing

\n", "

5. Feature Engineering and Feature Selection

\n", "

6. Data Modeling and Evaluation

\n", "

* Logistic Regression
* Gradient Boosting Classifier
* XGBoost
* SGB Classifier

\n", "

7. Models Comparison

\n" ], "id": "exact-peninsula" }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.09873, "end_time": "2021-05-19T14:07:57.455074", "exception": false, "start_time": "2021-05-19T14:07:57.356344", "status": "completed" }, "tags": [], "id": "legal-network" }, "source": [ "

Data Description


\n", "\n", "
    \n", "
  • Survival : 0 = No, 1 = Yes
  • \n", "
  • pclass(Ticket Class) : 1 = 1st, 2 = 2nd, 3 = 3rd
  • \n", "
  • Sex(Gender) : Male, Female
  • \n", "
  • Age : Age in years
  • \n", "
  • SibSp : Number of siblings/spouses aboard the Titanic
  • \n", "
  • Parch : Number of parents/children aboard the Titanic
  • \n", "
  • Ticket : Ticket Number
  • \n", "
  • Fare : Passenger fare
  • \n", "
  • Cabin : Cabin Number
  • \n", "
  • Embarked : Port of Embarkation, C = Cherbourg, Q = Queenstown, S = Southampton
  • \n", "
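
The codes listed above can be turned into readable labels with small lookup tables. The sketch below is plain Python applied to one hypothetical record (not taken from the dataset); the dictionary names and the helper `describe_passenger` are illustrative assumptions, not part of the challenge.

```python
# Code-to-label lookups, copied from the data description
SURVIVAL = {0: 'No', 1: 'Yes'}
TICKET_CLASS = {1: '1st', 2: '2nd', 3: '3rd'}
PORT = {'C': 'Cherbourg', 'Q': 'Queenstown', 'S': 'Southampton'}

def describe_passenger(row):
    '''Turn one coded record into a readable summary string.'''
    ticket_class = TICKET_CLASS[row['Pclass']]
    # Fall back to a placeholder in case the port code is missing
    port = PORT.get(row['Embarked'], 'unknown port')
    # SibSp and Parch both count relatives travelling with the passenger
    relatives = row['SibSp'] + row['Parch']
    survived = SURVIVAL[row['Survived']]
    return f'{ticket_class} class, embarked at {port}, {relatives} relatives aboard, survived: {survived}'

# Hypothetical record for illustration only
example = {'Survived': 1, 'Pclass': 3, 'Sex': 'female', 'Age': 26.0,
           'SibSp': 0, 'Parch': 0, 'Embarked': 'S'}
summary = describe_passenger(example)
print(summary)
```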
" ], "id": "legal-network" }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:07:57.663692Z", "iopub.status.busy": "2021-05-19T14:07:57.663108Z", "iopub.status.idle": "2021-05-19T14:07:58.439813Z", "shell.execute_reply": "2021-05-19T14:07:58.439195Z" }, "papermill": { "duration": 0.884349, "end_time": "2021-05-19T14:07:58.439994", "exception": false, "start_time": "2021-05-19T14:07:57.555645", "status": "completed" }, "tags": [], "id": "cardiac-environment" }, "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "import warnings\n", "warnings.filterwarnings('ignore')\n", "\n", "plt.style.use('fivethirtyeight')\n", "%matplotlib inline" ], "id": "cardiac-environment", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:07:58.654824Z", "iopub.status.busy": "2021-05-19T14:07:58.654308Z", "iopub.status.idle": "2021-05-19T14:07:58.668669Z", "shell.execute_reply": "2021-05-19T14:07:58.668231Z" }, "papermill": { "duration": 0.12592, "end_time": "2021-05-19T14:07:58.668772", "exception": false, "start_time": "2021-05-19T14:07:58.542852", "status": "completed" }, "tags": [], "id": "crude-humanitarian" }, "source": [ "#loading a CSV file using pandas dataframe\n", "train_df = pd.read_csv('./inputs/tab/train.csv')\n", "train_df.head()" ], "id": "crude-humanitarian", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:07:58.969748Z", "iopub.status.busy": "2021-05-19T14:07:58.969314Z", "iopub.status.idle": "2021-05-19T14:07:59.002179Z", "shell.execute_reply": "2021-05-19T14:07:59.001772Z" }, "papermill": { "duration": 0.102772, "end_time": "2021-05-19T14:07:59.002280", "exception": false, "start_time": "2021-05-19T14:07:58.899508", "status": "completed" }, "tags": [], "id": "growing-feelings" }, 
"source": [ "#Fetching some statistical information of the data\n", "train_df.describe()" ], "id": "growing-feelings", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:07:59.283610Z", "iopub.status.busy": "2021-05-19T14:07:59.283183Z", "iopub.status.idle": "2021-05-19T14:07:59.299024Z", "shell.execute_reply": "2021-05-19T14:07:59.299406Z" }, "papermill": { "duration": 0.085522, "end_time": "2021-05-19T14:07:59.299547", "exception": false, "start_time": "2021-05-19T14:07:59.214025", "status": "completed" }, "tags": [], "id": "least-native" }, "source": [ "#Information about a DataFrame including the index dtype and columns, non-null values and memory usage\n", "train_df.info()" ], "id": "least-native", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:07:59.437540Z", "iopub.status.busy": "2021-05-19T14:07:59.437091Z", "iopub.status.idle": "2021-05-19T14:07:59.445367Z", "shell.execute_reply": "2021-05-19T14:07:59.445002Z" }, "papermill": { "duration": 0.077892, "end_time": "2021-05-19T14:07:59.445467", "exception": false, "start_time": "2021-05-19T14:07:59.367575", "status": "completed" }, "tags": [], "id": "difficult-graduation" }, "source": [ "# Checking for null values\n", "train_df.isna().sum()" ], "id": "difficult-graduation", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.066897, "end_time": "2021-05-19T14:07:59.580266", "exception": false, "start_time": "2021-05-19T14:07:59.513369", "status": "completed" }, "tags": [], "id": "abroad-planner" }, "source": [ "

Exploratory Data Analysis (EDA)


\n", "\n", "

Exploratory Data Analysis (EDA) is the process of performing initial investigations on data in order to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.
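
As a minimal sketch of what summary statistics and anomaly spotting can look like, the snippet below counts missing entries, compares mean and median, and flags outliers with the 1.5 * IQR rule that box plots commonly use. The fare values are hypothetical and the variable names are assumptions made for illustration; it uses only the Python standard library (Python 3.8 or later for `statistics.quantiles`).

```python
import statistics

# Hypothetical fare values for illustration; None marks a missing entry
fares = [7.25, 8.05, 7.9, 13.0, 26.0, 8.05, 512.33, None, 10.5, 7.75]

# Separate present values from missing ones
present = [value for value in fares if value is not None]
missing_count = len(fares) - len(present)

# Summary statistics: a mean far above the median hints at a skewed column
mean_fare = statistics.mean(present)
median_fare = statistics.median(present)

# Quartiles and the 1.5 * IQR fences
q1, _, q3 = statistics.quantiles(present, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [value for value in present if value < lower_fence or value > upper_fence]

print('missing:', missing_count)
print('mean:', round(mean_fare, 2), 'median:', median_fare)
print('outliers:', outliers)
```

A flagged value is only a candidate anomaly; whether it is an error or a genuine extreme is a judgment that needs domain knowledge.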

\n" ], "id": "abroad-planner" }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:07:59.718588Z", "iopub.status.busy": "2021-05-19T14:07:59.718152Z", "iopub.status.idle": "2021-05-19T14:08:00.528732Z", "shell.execute_reply": "2021-05-19T14:08:00.529096Z" }, "papermill": { "duration": 0.881894, "end_time": "2021-05-19T14:08:00.529225", "exception": false, "start_time": "2021-05-19T14:07:59.647331", "status": "completed" }, "tags": [], "id": "muslim-choice" }, "source": [ "# visualizing null values\n", "import missingno as msno\n", "\n", "msno.bar(train_df)\n", "plt.show()" ], "id": "muslim-choice", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:00.667813Z", "iopub.status.busy": "2021-05-19T14:08:00.667394Z", "iopub.status.idle": "2021-05-19T14:08:00.975656Z", "shell.execute_reply": "2021-05-19T14:08:00.976126Z" }, "papermill": { "duration": 0.378582, "end_time": "2021-05-19T14:08:00.976255", "exception": false, "start_time": "2021-05-19T14:08:00.597673", "status": "completed" }, "tags": [], "id": "contemporary-baker" }, "source": [ "# Correlation and multicollinearity\n", "# Only numeric columns are used; recent pandas versions raise an error on text columns\n", "plt.figure(figsize = (18, 8))\n", "corr = train_df.select_dtypes(include = 'number').corr()\n", "\n", "sns.heatmap(corr, annot = True, fmt = '.2f', linewidths = 1, annot_kws = {'size' : 15})\n", "plt.show()" ], "id": "contemporary-baker", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.069434, "end_time": "2021-05-19T14:08:01.256086", "exception": false, "start_time": "2021-05-19T14:08:01.186652", "status": "completed" }, "tags": [], "id": "functional-republican" }, "source": [ "

Survived Column - Target


" ], "id": "functional-republican" }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:01.398982Z", "iopub.status.busy": "2021-05-19T14:08:01.398534Z", "iopub.status.idle": "2021-05-19T14:08:01.516340Z", "shell.execute_reply": "2021-05-19T14:08:01.516769Z" }, "papermill": { "duration": 0.190554, "end_time": "2021-05-19T14:08:01.516919", "exception": false, "start_time": "2021-05-19T14:08:01.326365", "status": "completed" }, "tags": [], "id": "caring-irish" }, "source": [ "plt.figure(figsize = (12, 7))\n", "\n", "sns.countplot(y = 'Survived', data = train_df)\n", "plt.show()" ], "id": "caring-irish", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:01.660335Z", "iopub.status.busy": "2021-05-19T14:08:01.659911Z", "iopub.status.idle": "2021-05-19T14:08:01.786932Z", "shell.execute_reply": "2021-05-19T14:08:01.786549Z" }, "papermill": { "duration": 0.199188, "end_time": "2021-05-19T14:08:01.787035", "exception": false, "start_time": "2021-05-19T14:08:01.587847", "status": "completed" }, "tags": [], "id": "renewable-apparel" }, "source": [ "#pie chart to show Survived vs Not in percentage\n", "values = train_df['Survived'].value_counts()\n", "labels = ['Not Survived', 'Survived']\n", "\n", "fig, ax = plt.subplots(figsize = (5, 5), dpi = 100)\n", "explode = (0, 0.06)\n", "\n", "patches, texts, autotexts = ax.pie(values, labels = labels, autopct = '%1.2f%%', shadow = True,\n", " startangle = 90, explode = explode)\n", "\n", "plt.setp(texts, color = 'grey')\n", "plt.setp(autotexts, size = 12, color = 'white')\n", "autotexts[1].set_color('black')\n", "plt.show()" ], "id": "renewable-apparel", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.07013, "end_time": "2021-05-19T14:08:01.927411", "exception": false, "start_time": "2021-05-19T14:08:01.857281", "status": "completed" }, "tags": 
[], "id": "selective-investing" }, "source": [ "

Pclass Column


" ], "id": "selective-investing" }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:02.077802Z", "iopub.status.busy": "2021-05-19T14:08:02.077135Z", "iopub.status.idle": "2021-05-19T14:08:02.079866Z", "shell.execute_reply": "2021-05-19T14:08:02.080375Z" }, "papermill": { "duration": 0.081457, "end_time": "2021-05-19T14:08:02.080524", "exception": false, "start_time": "2021-05-19T14:08:01.999067", "status": "completed" }, "tags": [], "id": "narrative-validation" }, "source": [ "#individual count of each class\n", "train_df.Pclass.value_counts()" ], "id": "narrative-validation", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:02.298537Z", "iopub.status.busy": "2021-05-19T14:08:02.297748Z", "iopub.status.idle": "2021-05-19T14:08:02.305355Z", "shell.execute_reply": "2021-05-19T14:08:02.304839Z" }, "papermill": { "duration": 0.117418, "end_time": "2021-05-19T14:08:02.305476", "exception": false, "start_time": "2021-05-19T14:08:02.188058", "status": "completed" }, "tags": [], "id": "accredited-garlic" }, "source": [ "#Individual counts of Pclass based on Target\n", "train_df.groupby(['Pclass', 'Survived'])['Survived'].count()" ], "id": "accredited-garlic", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:02.517775Z", "iopub.status.busy": "2021-05-19T14:08:02.517163Z", "iopub.status.idle": "2021-05-19T14:08:02.749045Z", "shell.execute_reply": "2021-05-19T14:08:02.748420Z" }, "papermill": { "duration": 0.341842, "end_time": "2021-05-19T14:08:02.749169", "exception": false, "start_time": "2021-05-19T14:08:02.407327", "status": "completed" }, "tags": [], "id": "administrative-blogger" }, "source": [ "#bar chart showing Pclass counts based on Target\n", "plt.figure(figsize = (16, 8))\n", "\n", "sns.countplot(x = 'Pclass', hue = 'Survived', data = train_df)\n", 
"plt.show()" ], "id": "administrative-blogger", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:02.987043Z", "iopub.status.busy": "2021-05-19T14:08:02.985056Z", "iopub.status.idle": "2021-05-19T14:08:03.080904Z", "shell.execute_reply": "2021-05-19T14:08:03.081279Z" }, "papermill": { "duration": 0.223682, "end_time": "2021-05-19T14:08:03.081435", "exception": false, "start_time": "2021-05-19T14:08:02.857753", "status": "completed" }, "tags": [], "id": "associate-colorado" }, "source": [ "#Pie chart showing the percentage of passengers in each Pclass\n", "#Sort by class (3, 2, 1) so that the counts line up with the labels below\n", "values = train_df['Pclass'].value_counts().sort_index(ascending = False)\n", "labels = ['Third Class', 'Second Class', 'First Class']\n", "explode = (0, 0, 0.08)\n", "fig, ax = plt.subplots(figsize = (5, 6), dpi = 100)\n", "patches, texts, autotexts = ax.pie(values, labels = labels, autopct = '%1.2f%%', shadow = True,\n", " startangle = 90, explode = explode)\n", "plt.setp(texts, color = 'grey')\n", "plt.setp(autotexts, size = 13, color = 'white')\n", "autotexts[2].set_color('black')\n", "plt.show()" ], "id": "associate-colorado", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.074364, "end_time": "2021-05-19T14:08:03.704338", "exception": false, "start_time": "2021-05-19T14:08:03.629974", "status": "completed" }, "tags": [], "id": "wired-breach" }, "source": [ "

Name Column


" ], "id": "wired-breach" }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:03.859190Z", "iopub.status.busy": "2021-05-19T14:08:03.858702Z", "iopub.status.idle": "2021-05-19T14:08:03.861999Z", "shell.execute_reply": "2021-05-19T14:08:03.862375Z" }, "papermill": { "duration": 0.08402, "end_time": "2021-05-19T14:08:03.862517", "exception": false, "start_time": "2021-05-19T14:08:03.778497", "status": "completed" }, "tags": [], "id": "blank-somewhere" }, "source": [ "#Count of each unique name\n", "train_df.Name.value_counts()" ], "id": "blank-somewhere", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:04.017576Z", "iopub.status.busy": "2021-05-19T14:08:04.017123Z", "iopub.status.idle": "2021-05-19T14:08:04.020719Z", "shell.execute_reply": "2021-05-19T14:08:04.021142Z" }, "papermill": { "duration": 0.082248, "end_time": "2021-05-19T14:08:04.021269", "exception": false, "start_time": "2021-05-19T14:08:03.939021", "status": "completed" }, "tags": [], "id": "otherwise-decline" }, "source": [ "#Number of unique names compared with the shape of the DataFrame\n", "len(train_df.Name.unique()), train_df.shape" ], "id": "otherwise-decline", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.075265, "end_time": "2021-05-19T14:08:04.173597", "exception": false, "start_time": "2021-05-19T14:08:04.098332", "status": "completed" }, "tags": [], "id": "normal-breast" }, "source": [ "

Gender Column


" ], "id": "normal-breast" }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:04.362959Z", "iopub.status.busy": "2021-05-19T14:08:04.362484Z", "iopub.status.idle": "2021-05-19T14:08:04.365310Z", "shell.execute_reply": "2021-05-19T14:08:04.364950Z" }, "papermill": { "duration": 0.084562, "end_time": "2021-05-19T14:08:04.365425", "exception": false, "start_time": "2021-05-19T14:08:04.280863", "status": "completed" }, "tags": [], "id": "naval-helena" }, "source": [ "#Individual count of Male vs Female\n", "train_df.Sex.value_counts()" ], "id": "naval-helena", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:04.520314Z", "iopub.status.busy": "2021-05-19T14:08:04.519862Z", "iopub.status.idle": "2021-05-19T14:08:04.525904Z", "shell.execute_reply": "2021-05-19T14:08:04.526259Z" }, "papermill": { "duration": 0.084636, "end_time": "2021-05-19T14:08:04.526378", "exception": false, "start_time": "2021-05-19T14:08:04.441742", "status": "completed" }, "tags": [], "id": "moral-mount" }, "source": [ "#Count of Male vs Female grouped on Target\n", "train_df.groupby(['Sex', 'Survived'])['Survived'].count()" ], "id": "moral-mount", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:04.680592Z", "iopub.status.busy": "2021-05-19T14:08:04.680157Z", "iopub.status.idle": "2021-05-19T14:08:04.801786Z", "shell.execute_reply": "2021-05-19T14:08:04.802135Z" }, "papermill": { "duration": 0.199724, "end_time": "2021-05-19T14:08:04.802266", "exception": false, "start_time": "2021-05-19T14:08:04.602542", "status": "completed" }, "tags": [], "id": "third-hawaiian" }, "source": [ "#Individual counts of Gender based on Target\n", "plt.figure(figsize = (16, 7))\n", "\n", "sns.countplot(x = 'Sex', hue = 'Survived', data = train_df)\n", "plt.show()" ], "id": "third-hawaiian", "execution_count": 
null, "outputs": [] }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:04.957938Z", "iopub.status.busy": "2021-05-19T14:08:04.957474Z", "iopub.status.idle": "2021-05-19T14:08:05.566849Z", "shell.execute_reply": "2021-05-19T14:08:05.567369Z" }, "papermill": { "duration": 0.688623, "end_time": "2021-05-19T14:08:05.567525", "exception": false, "start_time": "2021-05-19T14:08:04.878902", "status": "completed" }, "tags": [], "id": "victorian-sunday" }, "source": [ "#Individual counts of Gender based on Target for each Pclass Category\n", "sns.catplot(x = 'Sex', y = 'Survived', data = train_df, kind = 'bar', col = 'Pclass')\n", "plt.show()" ], "id": "victorian-sunday", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.079778, "end_time": "2021-05-19T14:08:06.827068", "exception": false, "start_time": "2021-05-19T14:08:06.747290", "status": "completed" }, "tags": [], "id": "extended-summary" }, "source": [ "

Age Column


" ], "id": "extended-summary" }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:07.001326Z", "iopub.status.busy": "2021-05-19T14:08:07.000866Z", "iopub.status.idle": "2021-05-19T14:08:07.201761Z", "shell.execute_reply": "2021-05-19T14:08:07.201309Z" }, "papermill": { "duration": 0.295487, "end_time": "2021-05-19T14:08:07.201868", "exception": false, "start_time": "2021-05-19T14:08:06.906381", "status": "completed" }, "tags": [], "id": "charitable-investigation" }, "source": [ "#Age distribution\n", "plt.figure(figsize = (15, 6))\n", "plt.style.use('ggplot')\n", "\n", "sns.distplot(train_df['Age'])\n", "plt.show()" ], "id": "charitable-investigation", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:07.371299Z", "iopub.status.busy": "2021-05-19T14:08:07.370810Z", "iopub.status.idle": "2021-05-19T14:08:07.592930Z", "shell.execute_reply": "2021-05-19T14:08:07.592448Z" }, "papermill": { "duration": 0.3099, "end_time": "2021-05-19T14:08:07.593034", "exception": false, "start_time": "2021-05-19T14:08:07.283134", "status": "completed" }, "tags": [], "id": "northern-offset" }, "source": [ "#Checking outliers in Age grouped by Gender\n", "sns.catplot(x = 'Sex', y = 'Age', kind = 'box', data = train_df, height = 5, aspect = 2)\n", "plt.show()" ], "id": "northern-offset", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:07.762989Z", "iopub.status.busy": "2021-05-19T14:08:07.762510Z", "iopub.status.idle": "2021-05-19T14:08:08.271980Z", "shell.execute_reply": "2021-05-19T14:08:08.271484Z" }, "papermill": { "duration": 0.598278, "end_time": "2021-05-19T14:08:08.272090", "exception": false, "start_time": "2021-05-19T14:08:07.673812", "status": "completed" }, "tags": [], "id": "silver-commitment" }, "source": [ "#Checking outliers in Age grouped by Gender and 
Pclass\n", "sns.catplot(x = 'Sex', y = 'Age', kind = 'box', data = train_df, col = 'Pclass')\n", "plt.show()" ], "id": "silver-commitment", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "8dcfff41" }, "source": [ "

Cabin Column


\n", "\n", "\n", "\n" ], "id": "8dcfff41" }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.080796, "end_time": "2021-05-19T14:08:08.435523", "exception": false, "start_time": "2021-05-19T14:08:08.354727", "status": "completed" }, "tags": [], "id": "confirmed-italy" }, "source": [ "

Fare Column


" ], "id": "confirmed-italy" }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:08.618643Z", "iopub.status.busy": "2021-05-19T14:08:08.607416Z", "iopub.status.idle": "2021-05-19T14:08:08.854098Z", "shell.execute_reply": "2021-05-19T14:08:08.853658Z" }, "papermill": { "duration": 0.33511, "end_time": "2021-05-19T14:08:08.854203", "exception": false, "start_time": "2021-05-19T14:08:08.519093", "status": "completed" }, "tags": [], "id": "rolled-coverage" }, "source": [ "#Histogram showing Fare counts\n", "plt.figure(figsize = (14, 6))\n", "\n", "plt.hist(train_df.Fare, bins = 60, color = 'orange')\n", "plt.xlabel('Fare')\n", "plt.show()" ], "id": "rolled-coverage", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:09.192273Z", "iopub.status.busy": "2021-05-19T14:08:09.190393Z", "iopub.status.idle": "2021-05-19T14:08:09.859094Z", "shell.execute_reply": "2021-05-19T14:08:09.858623Z" }, "papermill": { "duration": 0.756902, "end_time": "2021-05-19T14:08:09.859198", "exception": false, "start_time": "2021-05-19T14:08:09.102296", "status": "completed" }, "tags": [], "id": "powerful-diesel" }, "source": [ "#outliers in Fare based on Gender and Pclass\n", "sns.catplot(x = 'Sex', y = 'Fare', data = train_df, kind = 'box', col = 'Pclass')\n", "plt.show()" ], "id": "powerful-diesel", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.076815, "end_time": "2021-05-19T14:08:10.013683", "exception": false, "start_time": "2021-05-19T14:08:09.936868", "status": "completed" }, "tags": [], "id": "based-specific" }, "source": [ "

SibSp Column


" ], "id": "based-specific" }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:10.174969Z", "iopub.status.busy": "2021-05-19T14:08:10.174369Z", "iopub.status.idle": "2021-05-19T14:08:10.176899Z", "shell.execute_reply": "2021-05-19T14:08:10.177365Z" }, "papermill": { "duration": 0.086483, "end_time": "2021-05-19T14:08:10.177502", "exception": false, "start_time": "2021-05-19T14:08:10.091019", "status": "completed" }, "tags": [], "id": "brazilian-proof" }, "source": [ "#Count of people by number of siblings/spouses aboard\n", "train_df['SibSp'].value_counts()" ], "id": "brazilian-proof", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:10.415478Z", "iopub.status.busy": "2021-05-19T14:08:10.414985Z", "iopub.status.idle": "2021-05-19T14:08:10.589368Z", "shell.execute_reply": "2021-05-19T14:08:10.588937Z" }, "papermill": { "duration": 0.295952, "end_time": "2021-05-19T14:08:10.589471", "exception": false, "start_time": "2021-05-19T14:08:10.293519", "status": "completed" }, "tags": [], "id": "stuck-contributor" }, "source": [ "#Count of people by number of siblings/spouses, split by Target\n", "plt.figure(figsize = (16, 5))\n", "\n", "sns.countplot(x = 'SibSp', data = train_df, hue = 'Survived')\n", "plt.show()" ], "id": "stuck-contributor", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:10.751377Z", "iopub.status.busy": "2021-05-19T14:08:10.750773Z", "iopub.status.idle": "2021-05-19T14:08:11.087843Z", "shell.execute_reply": "2021-05-19T14:08:11.088226Z" }, "papermill": { "duration": 0.421161, "end_time": "2021-05-19T14:08:11.088352", "exception": false, "start_time": "2021-05-19T14:08:10.667191", "status": "completed" }, "tags": [], "id": "weird-italy" }, "source": [ "#Survival rate (mean of Survived) for each SibSp value\n", "sns.catplot(x = 'SibSp', y = 'Survived', kind = 
'bar', data = train_df, height = 5, aspect =2)\n", "plt.show()" ], "id": "weird-italy", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:11.263589Z", "iopub.status.busy": "2021-05-19T14:08:11.262899Z", "iopub.status.idle": "2021-05-19T14:08:11.812454Z", "shell.execute_reply": "2021-05-19T14:08:11.812803Z" }, "papermill": { "duration": 0.639754, "end_time": "2021-05-19T14:08:11.812958", "exception": false, "start_time": "2021-05-19T14:08:11.173204", "status": "completed" }, "tags": [], "id": "magnetic-laundry" }, "source": [ "#Count of people who survived based on SibSp counts Gender wise\n", "sns.catplot(x = 'SibSp', y = 'Survived', kind = 'bar', hue = 'Sex', data = train_df, height = 6, aspect = 2)\n", "plt.show()" ], "id": "magnetic-laundry", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:11.991611Z", "iopub.status.busy": "2021-05-19T14:08:11.991129Z", "iopub.status.idle": "2021-05-19T14:08:12.665522Z", "shell.execute_reply": "2021-05-19T14:08:12.665114Z" }, "papermill": { "duration": 0.767073, "end_time": "2021-05-19T14:08:12.665626", "exception": false, "start_time": "2021-05-19T14:08:11.898553", "status": "completed" }, "tags": [], "id": "appointed-housing" }, "source": [ "sns.catplot(x = 'SibSp', y = 'Survived', kind = 'bar', col = 'Sex', data = train_df)\n", "plt.show()" ], "id": "appointed-housing", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:12.843748Z", "iopub.status.busy": "2021-05-19T14:08:12.842459Z", "iopub.status.idle": "2021-05-19T14:08:13.546099Z", "shell.execute_reply": "2021-05-19T14:08:13.545548Z" }, "papermill": { "duration": 0.793747, "end_time": "2021-05-19T14:08:13.546200", "exception": false, "start_time": "2021-05-19T14:08:12.752453", "status": "completed" }, "tags": [], "id": 
"independent-ranch" }, "source": [ "#Count of people who survived based on SibSp counts Gender and Pclass wise\n", "sns.catplot(x = 'SibSp', y = 'Survived', col = 'Pclass', kind = 'bar', data = train_df)\n", "plt.show()" ], "id": "independent-ranch", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.086248, "end_time": "2021-05-19T14:08:14.483835", "exception": false, "start_time": "2021-05-19T14:08:14.397587", "status": "completed" }, "tags": [], "id": "superior-corner" }, "source": [ "

Parch Column


" ], "id": "superior-corner" }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:14.664702Z", "iopub.status.busy": "2021-05-19T14:08:14.664121Z", "iopub.status.idle": "2021-05-19T14:08:14.666585Z", "shell.execute_reply": "2021-05-19T14:08:14.666975Z" }, "papermill": { "duration": 0.094913, "end_time": "2021-05-19T14:08:14.667096", "exception": false, "start_time": "2021-05-19T14:08:14.572183", "status": "completed" }, "tags": [], "id": "beneficial-distance" }, "source": [ "#Count of people by number of parents/children aboard\n", "train_df.Parch.value_counts()" ], "id": "beneficial-distance", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:14.847921Z", "iopub.status.busy": "2021-05-19T14:08:14.847267Z", "iopub.status.idle": "2021-05-19T14:08:15.322517Z", "shell.execute_reply": "2021-05-19T14:08:15.322924Z" }, "papermill": { "duration": 0.567021, "end_time": "2021-05-19T14:08:15.323053", "exception": false, "start_time": "2021-05-19T14:08:14.756032", "status": "completed" }, "tags": [], "id": "strange-blake" }, "source": [ "#Survival rate for each Parch value, split by Gender\n", "sns.catplot(x = 'Parch', y = 'Survived', data = train_df, hue = 'Sex', kind = 'bar', height = 6, aspect = 2)\n", "plt.show()" ], "id": "strange-blake", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.087479, "end_time": "2021-05-19T14:08:15.501074", "exception": false, "start_time": "2021-05-19T14:08:15.413595", "status": "completed" }, "tags": [], "id": "familiar-advocacy" }, "source": [ "

Ticket Column


" ], "id": "familiar-advocacy" }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:15.685683Z", "iopub.status.busy": "2021-05-19T14:08:15.682848Z", "iopub.status.idle": "2021-05-19T14:08:15.688739Z", "shell.execute_reply": "2021-05-19T14:08:15.689241Z" }, "papermill": { "duration": 0.098836, "end_time": "2021-05-19T14:08:15.689404", "exception": false, "start_time": "2021-05-19T14:08:15.590568", "status": "completed" }, "tags": [], "id": "acting-venture" }, "source": [ "#count based on ticket types\n", "train_df.Ticket.value_counts()" ], "id": "acting-venture", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:15.960680Z", "iopub.status.busy": "2021-05-19T14:08:15.959860Z", "iopub.status.idle": "2021-05-19T14:08:15.963096Z", "shell.execute_reply": "2021-05-19T14:08:15.963551Z" }, "papermill": { "duration": 0.139887, "end_time": "2021-05-19T14:08:15.963709", "exception": false, "start_time": "2021-05-19T14:08:15.823822", "status": "completed" }, "tags": [], "id": "unauthorized-southwest" }, "source": [ "len(train_df.Ticket.unique())" ], "id": "unauthorized-southwest", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.132593, "end_time": "2021-05-19T14:08:16.228860", "exception": false, "start_time": "2021-05-19T14:08:16.096267", "status": "completed" }, "tags": [], "id": "exclusive-netscape" }, "source": [ "

Embarked Column


" ], "id": "exclusive-netscape" }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:16.502417Z", "iopub.status.busy": "2021-05-19T14:08:16.501652Z", "iopub.status.idle": "2021-05-19T14:08:16.506262Z", "shell.execute_reply": "2021-05-19T14:08:16.505647Z" }, "papermill": { "duration": 0.143835, "end_time": "2021-05-19T14:08:16.506389", "exception": false, "start_time": "2021-05-19T14:08:16.362554", "status": "completed" }, "tags": [], "id": "listed-pocket" }, "source": [ "train_df['Embarked'].value_counts()" ], "id": "listed-pocket", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:16.699174Z", "iopub.status.busy": "2021-05-19T14:08:16.698675Z", "iopub.status.idle": "2021-05-19T14:08:16.832975Z", "shell.execute_reply": "2021-05-19T14:08:16.832317Z" }, "papermill": { "duration": 0.231563, "end_time": "2021-05-19T14:08:16.833160", "exception": false, "start_time": "2021-05-19T14:08:16.601597", "status": "completed" }, "tags": [], "id": "given-parameter" }, "source": [ "#Count of people based on Port Embarked Gender wise\n", "plt.figure(figsize = (14, 6))\n", "\n", "sns.countplot('Embarked', hue = 'Survived', data = train_df)\n", "plt.show()" ], "id": "given-parameter", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:17.109835Z", "iopub.status.busy": "2021-05-19T14:08:17.109022Z", "iopub.status.idle": "2021-05-19T14:08:17.491860Z", "shell.execute_reply": "2021-05-19T14:08:17.491166Z" }, "papermill": { "duration": 0.523038, "end_time": "2021-05-19T14:08:17.491999", "exception": false, "start_time": "2021-05-19T14:08:16.968961", "status": "completed" }, "tags": [], "id": "welsh-lunch" }, "source": [ "sns.catplot(x = 'Embarked', y = 'Survived', kind = 'bar', data = train_df, col = 'Sex')\n", "plt.show()" ], "id": "welsh-lunch", "execution_count": null, 
"outputs": [] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.091561, "end_time": "2021-05-19T14:08:17.676978", "exception": false, "start_time": "2021-05-19T14:08:17.585417", "status": "completed" }, "tags": [], "id": "ahead-teacher" }, "source": [ "\n", "

Findings from EDA:

\n", "\n", "
    \n", "
  • Females survived at a higher rate than males.
  • \n", "
  • Passengers travelling in a higher class survived at a higher rate than passengers travelling in a lower class.
  • \n", "
  • The Name column contains only unique values, so it is not suitable for prediction as-is and is dropped.
  • \n", "
  • A larger share of female passengers than of male passengers travelled in first class, which helps explain why female passengers paid higher fares on average.
  • \n", "
  • The survival rate is higher for passengers who were travelling with siblings or a spouse.
  • \n", "
  • Passengers travelling with parents or children have a higher survival rate.
  • \n", "
  • The Ticket column has too many distinct values to be useful and shows no clear impact on survival.
  • \n", "
  • The Cabin column has a lot of null values, so it is better to drop this column.
  • \n", "
  • Passengers who embarked at Cherbourg survived at a higher rate than passengers from the other two ports.
  • \n", "
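
Findings like the ones above can be checked numerically as well as visually. The sketch below is a minimal illustration on a made-up toy frame (the name `toy_df` and its values are assumptions, not the Titanic data): because `Survived` is coded 0/1, the group mean is the survival rate of that group.

```python
import pandas as pd

# Toy stand-in for train_df; the values are made up for illustration only.
toy_df = pd.DataFrame({
    'Sex': ['female', 'female', 'male', 'male', 'male', 'female'],
    'Pclass': [1, 3, 1, 3, 3, 2],
    'Survived': [1, 1, 0, 0, 1, 1],
})

# The mean of a 0/1 column is the share of survivors in each group.
rate_by_sex = toy_df.groupby('Sex')['Survived'].mean()
rate_by_class = toy_df.groupby('Pclass')['Survived'].mean()

print(rate_by_sex)
print(rate_by_class)
```

The same two `groupby` lines should work on `train_df` unchanged, since it has the same column names.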
" ], "id": "ahead-teacher" }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.136998, "end_time": "2021-05-19T14:08:17.906660", "exception": false, "start_time": "2021-05-19T14:08:17.769662", "status": "completed" }, "tags": [], "id": "polished-spoke" }, "source": [ "

3.2 Data Preprocessing


" ], "id": "polished-spoke" }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:18.180601Z", "iopub.status.busy": "2021-05-19T14:08:18.180027Z", "iopub.status.idle": "2021-05-19T14:08:18.181997Z", "shell.execute_reply": "2021-05-19T14:08:18.181508Z" }, "papermill": { "duration": 0.141156, "end_time": "2021-05-19T14:08:18.182109", "exception": false, "start_time": "2021-05-19T14:08:18.040953", "status": "completed" }, "tags": [], "id": "incredible-orientation" }, "source": [ "# dropping useless columns with noise, missing data ..etc\n", "train_df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis = 1, inplace = True)" ], "id": "incredible-orientation", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:18.464810Z", "iopub.status.busy": "2021-05-19T14:08:18.461226Z", "iopub.status.idle": "2021-05-19T14:08:18.468929Z", "shell.execute_reply": "2021-05-19T14:08:18.468533Z" }, "papermill": { "duration": 0.152173, "end_time": "2021-05-19T14:08:18.469039", "exception": false, "start_time": "2021-05-19T14:08:18.316866", "status": "completed" }, "tags": [], "id": "classical-julian" }, "source": [ "train_df.head()" ], "id": "classical-julian", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:18.663050Z", "iopub.status.busy": "2021-05-19T14:08:18.662614Z", "iopub.status.idle": "2021-05-19T14:08:18.665092Z", "shell.execute_reply": "2021-05-19T14:08:18.665490Z" }, "papermill": { "duration": 0.101591, "end_time": "2021-05-19T14:08:18.665593", "exception": false, "start_time": "2021-05-19T14:08:18.564002", "status": "completed" }, "tags": [], "id": "killing-hands" }, "source": [ "train_df.isna().sum()" ], "id": "killing-hands", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": 
"2021-05-19T14:08:18.870155Z", "iopub.status.busy": "2021-05-19T14:08:18.869679Z", "iopub.status.idle": "2021-05-19T14:08:18.876665Z", "shell.execute_reply": "2021-05-19T14:08:18.876021Z" }, "papermill": { "duration": 0.108189, "end_time": "2021-05-19T14:08:18.876837", "exception": false, "start_time": "2021-05-19T14:08:18.768648", "status": "completed" }, "tags": [], "id": "french-resistance" }, "source": [ "# replacing Zero values of \"Fare\" column with mean of column\n", "train_df['Fare'] = train_df['Fare'].replace(0, train_df['Fare'].mean())" ], "id": "french-resistance", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:19.154160Z", "iopub.status.busy": "2021-05-19T14:08:19.153539Z", "iopub.status.idle": "2021-05-19T14:08:19.158789Z", "shell.execute_reply": "2021-05-19T14:08:19.158271Z" }, "papermill": { "duration": 0.139395, "end_time": "2021-05-19T14:08:19.158933", "exception": false, "start_time": "2021-05-19T14:08:19.019538", "status": "completed" }, "tags": [], "id": "included-dayton" }, "source": [ "# filling null values of \"Age\" column with mean value of the column\n", "\n", "train_df['Age'].fillna(train_df['Age'].mean(), inplace = True)" ], "id": "included-dayton", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:19.440186Z", "iopub.status.busy": "2021-05-19T14:08:19.439725Z", "iopub.status.idle": "2021-05-19T14:08:19.441366Z", "shell.execute_reply": "2021-05-19T14:08:19.441730Z" }, "papermill": { "duration": 0.143508, "end_time": "2021-05-19T14:08:19.441853", "exception": false, "start_time": "2021-05-19T14:08:19.298345", "status": "completed" }, "tags": [], "id": "social-least" }, "source": [ "# filling null values of \"Embarked\" column with mode value of the column\n", "\n", "train_df['Embarked'].fillna(train_df['Embarked'].mode()[0], inplace = True)" ], "id": "social-least", 
"execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:19.631739Z", "iopub.status.busy": "2021-05-19T14:08:19.631315Z", "iopub.status.idle": "2021-05-19T14:08:19.639022Z", "shell.execute_reply": "2021-05-19T14:08:19.638621Z" }, "papermill": { "duration": 0.103346, "end_time": "2021-05-19T14:08:19.639127", "exception": false, "start_time": "2021-05-19T14:08:19.535781", "status": "completed" }, "tags": [], "id": "modified-implement" }, "source": [ "# checking for null values after filling null values\n", "\n", "train_df.isna().sum()" ], "id": "modified-implement", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:19.838717Z", "iopub.status.busy": "2021-05-19T14:08:19.838251Z", "iopub.status.idle": "2021-05-19T14:08:19.842184Z", "shell.execute_reply": "2021-05-19T14:08:19.841777Z" }, "papermill": { "duration": 0.109337, "end_time": "2021-05-19T14:08:19.842281", "exception": false, "start_time": "2021-05-19T14:08:19.732944", "status": "completed" }, "tags": [], "id": "opposed-bolivia" }, "source": [ "train_df.head()" ], "id": "opposed-bolivia", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "8af949f1" }, "source": [ "

3.3 Feature Engineering and Feature Selection


" ], "id": "8af949f1" }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:20.035372Z", "iopub.status.busy": "2021-05-19T14:08:20.034831Z", "iopub.status.idle": "2021-05-19T14:08:20.037832Z", "shell.execute_reply": "2021-05-19T14:08:20.037398Z" }, "papermill": { "duration": 0.102114, "end_time": "2021-05-19T14:08:20.037956", "exception": false, "start_time": "2021-05-19T14:08:19.935842", "status": "completed" }, "tags": [], "id": "coordinate-stretch" }, "source": [ "'''\n", "Not all Machine Leanrning Algorithms like categorical fearures as strings, \n", "so they need to me encoded to numeric formats so they these alorithms can ingest this data without throwing error \n", "labelEncoding : Converting String to numeric format'''\n", "train_df['Sex'] = train_df['Sex'].apply(lambda val: 1 if val == 'male' else 0)\n", "train_df.head()" ], "id": "coordinate-stretch", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "2d85e3c1" }, "source": [ "

One-Hot Encoding: the representation of categorical variables as binary vectors
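
As a minimal sketch of the idea, assuming a made-up toy column (the name `ports` and its values are illustrative, not the real data), each distinct category becomes its own indicator column and every row has exactly one indicator set. The `.astype(int)` call is added here only so the result prints as 0/1, since newer pandas versions return booleans from `pd.get_dummies`.

```python
import pandas as pd

# Toy column of embarkation ports; the values are illustrative only.
ports = pd.Series(['S', 'C', 'Q', 'S'], name='Embarked')

# Each distinct category becomes its own indicator column (sorted: C, Q, S).
one_hot = pd.get_dummies(ports).astype(int)

print(one_hot)
```

For linear models, `pd.get_dummies` also accepts `drop_first = True`, which drops one indicator column so the remaining columns are not perfectly redundant.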


\n", "\n", "\n", "\n", "\n" ], "id": "2d85e3c1" }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:20.230040Z", "iopub.status.busy": "2021-05-19T14:08:20.229556Z", "iopub.status.idle": "2021-05-19T14:08:20.231842Z", "shell.execute_reply": "2021-05-19T14:08:20.231489Z" }, "papermill": { "duration": 0.100606, "end_time": "2021-05-19T14:08:20.231980", "exception": false, "start_time": "2021-05-19T14:08:20.131374", "status": "completed" }, "tags": [], "id": "forced-thong" }, "source": [ "# Get one hot encoding of columns 'Embarked'\n", "one_hot = pd.get_dummies(train_df['Embarked'])\n", "# Drop column 'Embarked' as it is now encoded\n", "train_df = train_df.drop('Embarked',axis = 1)\n", "# Join the encoded train_df\n", "train_df = train_df.join(one_hot)" ], "id": "forced-thong", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:20.428374Z", "iopub.status.busy": "2021-05-19T14:08:20.427750Z", "iopub.status.idle": "2021-05-19T14:08:20.431280Z", "shell.execute_reply": "2021-05-19T14:08:20.430917Z" }, "papermill": { "duration": 0.105792, "end_time": "2021-05-19T14:08:20.431378", "exception": false, "start_time": "2021-05-19T14:08:20.325586", "status": "completed" }, "tags": [], "id": "several-dragon" }, "source": [ "train_df.head()" ], "id": "several-dragon", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:20.640766Z", "iopub.status.busy": "2021-05-19T14:08:20.639039Z", "iopub.status.idle": "2021-05-19T14:08:20.664309Z", "shell.execute_reply": "2021-05-19T14:08:20.663919Z" }, "papermill": { "duration": 0.135244, "end_time": "2021-05-19T14:08:20.664415", "exception": false, "start_time": "2021-05-19T14:08:20.529171", "status": "completed" }, "tags": [], "id": "chronic-senator" }, "source": [ "train_df.describe()" ], "id": "chronic-senator", "execution_count": 
null, "outputs": [] }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:20.859452Z", "iopub.status.busy": "2021-05-19T14:08:20.858851Z", "iopub.status.idle": "2021-05-19T14:08:20.862394Z", "shell.execute_reply": "2021-05-19T14:08:20.861945Z" }, "papermill": { "duration": 0.103584, "end_time": "2021-05-19T14:08:20.862506", "exception": false, "start_time": "2021-05-19T14:08:20.758922", "status": "completed" }, "tags": [], "id": "considerable-wrapping" }, "source": [ "# Checking feature variance is important:\n", "# high-variance features can hurt model performance, especially for linear models,\n", "# so we need to normalize them\n", "train_df.var()" ], "id": "considerable-wrapping", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:21.248518Z", "iopub.status.busy": "2021-05-19T14:08:21.247996Z", "iopub.status.idle": "2021-05-19T14:08:21.250726Z", "shell.execute_reply": "2021-05-19T14:08:21.249803Z" }, "papermill": { "duration": 0.102114, "end_time": "2021-05-19T14:08:21.250864", "exception": false, "start_time": "2021-05-19T14:08:21.148750", "status": "completed" }, "tags": [], "id": "minor-audio" }, "source": [ "# log transform to reduce the variance of Age and Fare\n", "train_df['Age'] = np.log(train_df['Age'])\n", "train_df['Fare'] = np.log(train_df['Fare'])\n" ], "id": "minor-audio", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:21.453576Z", "iopub.status.busy": "2021-05-19T14:08:21.453089Z", "iopub.status.idle": "2021-05-19T14:08:21.456130Z", "shell.execute_reply": "2021-05-19T14:08:21.455646Z" }, "papermill": { "duration": 0.109325, "end_time": "2021-05-19T14:08:21.456237", "exception": false, "start_time": "2021-05-19T14:08:21.346912", "status": "completed" }, "tags": [], "id": "naval-milwaukee" }, "source": [ "train_df.var()" ], "id": "naval-milwaukee", 
"execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:21.846619Z", "iopub.status.busy": "2021-05-19T14:08:21.846167Z", "iopub.status.idle": "2021-05-19T14:08:21.858729Z", "shell.execute_reply": "2021-05-19T14:08:21.859119Z" }, "papermill": { "duration": 0.110671, "end_time": "2021-05-19T14:08:21.859254", "exception": false, "start_time": "2021-05-19T14:08:21.748583", "status": "completed" }, "tags": [], "id": "inside-fight" }, "source": [ "test_df = pd.read_csv('./inputs/tab/test.csv')\n", "test_df.head()" ], "id": "inside-fight", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:22.452860Z", "iopub.status.busy": "2021-05-19T14:08:22.452437Z", "iopub.status.idle": "2021-05-19T14:08:22.458183Z", "shell.execute_reply": "2021-05-19T14:08:22.457771Z" }, "papermill": { "duration": 0.104201, "end_time": "2021-05-19T14:08:22.458288", "exception": false, "start_time": "2021-05-19T14:08:22.354087", "status": "completed" }, "tags": [], "id": "micro-velvet" }, "source": [ "# dropping useless columns\n", "test_df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis = 1, inplace = True)\n", "# replacing Zero values of \"Fare\" column with mean of column\n", "test_df['Fare'] = test_df['Fare'].replace(0, test_df['Fare'].mean())\n", "# filling null values of \"Age\" column with mean value of the column\n", "test_df['Age'].fillna(test_df['Age'].mean(), inplace = True)\n", "# filling null values of \"Embarked\" column with mode value of the column\n", "test_df['Embarked'].fillna(test_df['Embarked'].mode()[0], inplace = True)" ], "id": "micro-velvet", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:22.655972Z", "iopub.status.busy": "2021-05-19T14:08:22.655464Z", "iopub.status.idle": "2021-05-19T14:08:22.658269Z", 
"shell.execute_reply": "2021-05-19T14:08:22.657866Z" }, "papermill": { "duration": 0.104291, "end_time": "2021-05-19T14:08:22.658370", "exception": false, "start_time": "2021-05-19T14:08:22.554079", "status": "completed" }, "tags": [], "id": "fitted-corporation" }, "source": [ "test_df.isna().sum()" ], "id": "fitted-corporation", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:23.460928Z", "iopub.status.busy": "2021-05-19T14:08:23.460290Z", "iopub.status.idle": "2021-05-19T14:08:23.464053Z", "shell.execute_reply": "2021-05-19T14:08:23.464413Z" }, "papermill": { "duration": 0.102972, "end_time": "2021-05-19T14:08:23.464538", "exception": false, "start_time": "2021-05-19T14:08:23.361566", "status": "completed" }, "tags": [], "id": "nervous-nelson" }, "source": [ "# filling null values of \"Fare\" column with mean value of the column\n", "test_df['Fare'].fillna(test_df['Fare'].mean(), inplace = True)\n", "test_df['Sex'] = test_df['Sex'].apply(lambda val: 1 if val == 'male' else 0)\n", "\n", "# Get one hot encoding of columns 'Embarked'\n", "one_hot_test = pd.get_dummies(test_df['Embarked'])\n", "# Drop column 'Embarked' as it is now encoded\n", "test_df = test_df.drop('Embarked',axis = 1)\n", "# Join the encoded train_df\n", "test_df = test_df.join(one_hot_test)\n", "# Log Normalization\n", "test_df['Age'] = np.log(test_df['Age'])\n", "test_df['Fare'] = np.log(test_df['Fare'])\n", "test_df.head()" ], "id": "nervous-nelson", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:24.487582Z", "iopub.status.busy": "2021-05-19T14:08:24.485670Z", "iopub.status.idle": "2021-05-19T14:08:24.490531Z", "shell.execute_reply": "2021-05-19T14:08:24.490868Z" }, "papermill": { "duration": 0.10742, "end_time": "2021-05-19T14:08:24.491029", "exception": false, "start_time": "2021-05-19T14:08:24.383609", "status": 
"completed" }, "tags": [], "id": "exempt-toyota" }, "source": [ "test_df.var()" ], "id": "exempt-toyota", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:24.696985Z", "iopub.status.busy": "2021-05-19T14:08:24.696440Z", "iopub.status.idle": "2021-05-19T14:08:24.700799Z", "shell.execute_reply": "2021-05-19T14:08:24.700398Z" }, "papermill": { "duration": 0.111132, "end_time": "2021-05-19T14:08:24.700934", "exception": false, "start_time": "2021-05-19T14:08:24.589802", "status": "completed" }, "tags": [], "id": "mysterious-convert" }, "source": [ "test_df.isna().any()" ], "id": "mysterious-convert", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:24.901650Z", "iopub.status.busy": "2021-05-19T14:08:24.901020Z", "iopub.status.idle": "2021-05-19T14:08:24.910365Z", "shell.execute_reply": "2021-05-19T14:08:24.910812Z" }, "papermill": { "duration": 0.109853, "end_time": "2021-05-19T14:08:24.910949", "exception": false, "start_time": "2021-05-19T14:08:24.801096", "status": "completed" }, "tags": [], "id": "circular-venue" }, "source": [ "test_df.head()" ], "id": "circular-venue", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:25.310340Z", "iopub.status.busy": "2021-05-19T14:08:25.309841Z", "iopub.status.idle": "2021-05-19T14:08:25.312407Z", "shell.execute_reply": "2021-05-19T14:08:25.312021Z" }, "papermill": { "duration": 0.106042, "end_time": "2021-05-19T14:08:25.312510", "exception": false, "start_time": "2021-05-19T14:08:25.206468", "status": "completed" }, "tags": [], "id": "arbitrary-external" }, "source": [ "# Dividing data to features and targets\n", "X = train_df.drop('Survived', axis = 1)\n", "y = train_df['Survived']" ], "id": "arbitrary-external", "execution_count": null, "outputs": [] }, { "cell_type": 
"code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:25.514022Z", "iopub.status.busy": "2021-05-19T14:08:25.513575Z", "iopub.status.idle": "2021-05-19T14:08:25.883200Z", "shell.execute_reply": "2021-05-19T14:08:25.883570Z" }, "papermill": { "duration": 0.471693, "end_time": "2021-05-19T14:08:25.883709", "exception": false, "start_time": "2021-05-19T14:08:25.412016", "status": "completed" }, "tags": [], "id": "ecological-honolulu" }, "source": [ "# splitting data intp training and test set\n", "\n", "from sklearn.model_selection import train_test_split\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 0)" ], "id": "ecological-honolulu", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.099773, "end_time": "2021-05-19T14:08:26.084073", "exception": false, "start_time": "2021-05-19T14:08:25.984300", "status": "completed" }, "tags": [], "id": "adolescent-township" }, "source": [ "

Data Modeling


\n", "\n", "

Training Machine Learning models to predict survivors


" ], "id": "adolescent-township" }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.099202, "end_time": "2021-05-19T14:08:26.283145", "exception": false, "start_time": "2021-05-19T14:08:26.183943", "status": "completed" }, "tags": [], "id": "mounted-windows" }, "source": [ "

1. Logistic Regression


" ], "id": "mounted-windows" }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:26.486843Z", "iopub.status.busy": "2021-05-19T14:08:26.486406Z", "iopub.status.idle": "2021-05-19T14:08:26.616797Z", "shell.execute_reply": "2021-05-19T14:08:26.616355Z" }, "papermill": { "duration": 0.2332, "end_time": "2021-05-19T14:08:26.616918", "exception": false, "start_time": "2021-05-19T14:08:26.383718", "status": "completed" }, "tags": [], "id": "gross-scotland" }, "source": [ "from sklearn.linear_model import LogisticRegression\n", "\n", "lr = LogisticRegression()\n", "lr.fit(X_train, y_train)\n", "\n", "# accuracy score, confusion matrix and classification report of logistic regression\n", "\n", "from sklearn.metrics import accuracy_score, confusion_matrix, classification_report\n", "\n", "lr_acc = accuracy_score(y_test, lr.predict(X_test))\n", "\n", "print(f\"Training Accuracy of Logistic Regression is {accuracy_score(y_train, lr.predict(X_train))}\")\n", "print(f\"Test Accuracy of Logistic Regression is {lr_acc}\")\n", "\n", "print(f\"Confusion Matrix :- \\n {confusion_matrix(y_test, lr.predict(X_test))}\")\n", "print(f\"Classofocation Report : -\\n {classification_report(y_test, lr.predict(X_test))}\")" ], "id": "gross-scotland", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:26.822702Z", "iopub.status.busy": "2021-05-19T14:08:26.822115Z", "iopub.status.idle": "2021-05-19T14:08:28.570110Z", "shell.execute_reply": "2021-05-19T14:08:28.569687Z" }, "papermill": { "duration": 1.853953, "end_time": "2021-05-19T14:08:28.570217", "exception": false, "start_time": "2021-05-19T14:08:26.716264", "status": "completed" }, "tags": [], "id": "casual-orientation" }, "source": [ "# hyper parameter tuning of logistic regression\n", "from sklearn.model_selection import GridSearchCV\n", "\n", "grid_param = {\n", " 'penalty': ['l1', 'l2'],\n", " 'C' : [0.001, 
0.01, 0.1, 0.005, 0.5, 1, 10]\n", "}\n", "\n", "# the liblinear solver supports both the 'l1' and 'l2' penalties (the default lbfgs solver does not support 'l1')\n", "grid_search_lr = GridSearchCV(LogisticRegression(solver = 'liblinear'), grid_param, cv = 5, n_jobs = -1, verbose = 1)\n", "grid_search_lr.fit(X_train, y_train)" ], "id": "casual-orientation", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:28.777354Z", "iopub.status.busy": "2021-05-19T14:08:28.776665Z", "iopub.status.idle": "2021-05-19T14:08:28.780007Z", "shell.execute_reply": "2021-05-19T14:08:28.779604Z" }, "papermill": { "duration": 0.109751, "end_time": "2021-05-19T14:08:28.780116", "exception": false, "start_time": "2021-05-19T14:08:28.670365", "status": "completed" }, "tags": [], "id": "motivated-control" }, "source": [ "# best parameters and best score\n", "\n", "print(grid_search_lr.best_params_)\n", "print(grid_search_lr.best_score_)" ], "id": "motivated-control", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:28.990744Z", "iopub.status.busy": "2021-05-19T14:08:28.988940Z", "iopub.status.idle": "2021-05-19T14:08:29.002490Z", "shell.execute_reply": "2021-05-19T14:08:29.002118Z" }, "papermill": { "duration": 0.12053, "end_time": "2021-05-19T14:08:29.002595", "exception": false, "start_time": "2021-05-19T14:08:28.882065", "status": "completed" }, "tags": [], "id": "spiritual-puppy" }, "source": [ "# best estimator\n", "\n", "lr = grid_search_lr.best_estimator_\n", "\n", "# accuracy score, confusion matrix and classification report of logistic regression\n", "\n", "lr_acc = accuracy_score(y_test, lr.predict(X_test))\n", "\n", "print(f\"Training Accuracy of Logistic Regression is {accuracy_score(y_train, lr.predict(X_train))}\")\n", "print(f\"Test Accuracy of Logistic Regression is {lr_acc}\")\n", "\n", "print(f\"Confusion Matrix :- \\n {confusion_matrix(y_test, lr.predict(X_test))}\")\n", "print(f\"Classification Report :- \\n {classification_report(y_test, 
lr.predict(X_test))}\")" ], "id": "spiritual-puppy", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.153113, "end_time": "2021-05-19T14:08:41.327013", "exception": false, "start_time": "2021-05-19T14:08:41.173900", "status": "completed" }, "tags": [], "id": "parliamentary-constitution" }, "source": [ "

2. Random Forest


" ], "id": "parliamentary-constitution" }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:41.642248Z", "iopub.status.busy": "2021-05-19T14:08:41.641568Z", "iopub.status.idle": "2021-05-19T14:08:41.971499Z", "shell.execute_reply": "2021-05-19T14:08:41.970991Z" }, "papermill": { "duration": 0.489435, "end_time": "2021-05-19T14:08:41.971628", "exception": false, "start_time": "2021-05-19T14:08:41.482193", "status": "completed" }, "tags": [], "id": "dangerous-price" }, "source": [ "from sklearn.ensemble import RandomForestClassifier\n", "\n", "rd_clf = RandomForestClassifier()\n", "rd_clf.fit(X_train, y_train)\n", "\n", "# accuracy score, confusion matrix and classification report of random forest\n", "\n", "rd_clf_acc = accuracy_score(y_test, rd_clf.predict(X_test))\n", "\n", "print(f\"Training Accuracy of Decision Tree Classifier is {accuracy_score(y_train, rd_clf.predict(X_train))}\")\n", "print(f\"Test Accuracy of Decision Tree Classifier is {rd_clf_acc} \\n\")\n", "\n", "print(f\"Confusion Matrix :- \\n{confusion_matrix(y_test, rd_clf.predict(X_test))}\\n\")\n", "print(f\"Classification Report :- \\n {classification_report(y_test, rd_clf.predict(X_test))}\")" ], "id": "dangerous-price", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.097421, "end_time": "2021-05-19T14:08:59.699400", "exception": false, "start_time": "2021-05-19T14:08:59.601979", "status": "completed" }, "tags": [], "id": "vulnerable-prediction" }, "source": [ "

3. Gradient Boosting


" ], "id": "vulnerable-prediction" }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:08:59.901067Z", "iopub.status.busy": "2021-05-19T14:08:59.900604Z", "iopub.status.idle": "2021-05-19T14:08:59.997423Z", "shell.execute_reply": "2021-05-19T14:08:59.997079Z" }, "papermill": { "duration": 0.199501, "end_time": "2021-05-19T14:08:59.997521", "exception": false, "start_time": "2021-05-19T14:08:59.798020", "status": "completed" }, "tags": [], "id": "mounted-smell" }, "source": [ "from sklearn.ensemble import GradientBoostingClassifier\n", "\n", "gb = GradientBoostingClassifier()\n", "gb.fit(X_train, y_train)\n", "\n", "# accuracy score, confusion matrix and classification report of gradient boosting classifier\n", "\n", "gb_acc = accuracy_score(y_test, gb.predict(X_test))\n", "\n", "print(f\"Training Accuracy of Decision Tree Classifier is {accuracy_score(y_train, gb.predict(X_train))}\")\n", "print(f\"Test Accuracy of Decision Tree Classifier is {gb_acc} \\n\")\n", "\n", "print(f\"Confusion Matrix :- \\n{confusion_matrix(y_test, gb.predict(X_test))}\\n\")\n", "print(f\"Classification Report :- \\n {classification_report(y_test, gb.predict(X_test))}\")" ], "id": "mounted-smell", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.097718, "end_time": "2021-05-19T14:09:00.193371", "exception": false, "start_time": "2021-05-19T14:09:00.095653", "status": "completed" }, "tags": [], "id": "sonic-cliff" }, "source": [ "

4. Stochastic Gradient Boosting (SGB)


" ], "id": "sonic-cliff" }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:09:00.399307Z", "iopub.status.busy": "2021-05-19T14:09:00.398815Z", "iopub.status.idle": "2021-05-19T14:09:00.493810Z", "shell.execute_reply": "2021-05-19T14:09:00.493418Z" }, "papermill": { "duration": 0.202453, "end_time": "2021-05-19T14:09:00.493927", "exception": false, "start_time": "2021-05-19T14:09:00.291474", "status": "completed" }, "tags": [], "id": "stunning-dream" }, "source": [ "sgb = GradientBoostingClassifier(subsample = 0.90, max_features = 0.70)\n", "sgb.fit(X_train, y_train)\n", "\n", "# accuracy score, confusion matrix and classification report of stochastic gradient boosting classifier\n", "\n", "sgb_acc = accuracy_score(y_test, sgb.predict(X_test))\n", "\n", "print(f\"Training Accuracy of Decision Tree Classifier is {accuracy_score(y_train, sgb.predict(X_train))}\")\n", "print(f\"Test Accuracy of Decision Tree Classifier is {sgb_acc} \\n\")\n", "\n", "print(f\"Confusion Matrix :- \\n{confusion_matrix(y_test, sgb.predict(X_test))}\\n\")\n", "print(f\"Classification Report :- \\n {classification_report(y_test, sgb.predict(X_test))}\")" ], "id": "stunning-dream", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.107703, "end_time": "2021-05-19T14:09:00.706867", "exception": false, "start_time": "2021-05-19T14:09:00.599164", "status": "completed" }, "tags": [], "id": "intended-karma" }, "source": [ "

5. XGBoost


" ], "id": "intended-karma" }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:09:00.931572Z", "iopub.status.busy": "2021-05-19T14:09:00.931140Z", "iopub.status.idle": "2021-05-19T14:09:01.173931Z", "shell.execute_reply": "2021-05-19T14:09:01.174459Z" }, "papermill": { "duration": 0.361064, "end_time": "2021-05-19T14:09:01.174641", "exception": false, "start_time": "2021-05-19T14:09:00.813577", "status": "completed" }, "tags": [], "id": "employed-expansion" }, "source": [ "from xgboost import XGBClassifier\n", "\n", "xgb = XGBClassifier(booster = 'gbtree', learning_rate = 0.1, max_depth = 5, n_estimators = 180)\n", "xgb.fit(X_train, y_train)\n", "\n", "# accuracy score, confusion matrix and classification report of xgboost\n", "\n", "xgb_acc = accuracy_score(y_test, xgb.predict(X_test))\n", "\n", "print(f\"Training Accuracy of Decision Tree Classifier is {accuracy_score(y_train, xgb.predict(X_train))}\")\n", "print(f\"Test Accuracy of Decision Tree Classifier is {xgb_acc} \\n\")\n", "\n", "print(f\"Confusion Matrix :- \\n{confusion_matrix(y_test, xgb.predict(X_test))}\\n\")\n", "print(f\"Classification Report :- \\n {classification_report(y_test, xgb.predict(X_test))}\")" ], "id": "employed-expansion", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.108607, "end_time": "2021-05-19T14:09:06.914940", "exception": false, "start_time": "2021-05-19T14:09:06.806333", "status": "completed" }, "tags": [], "id": "cross-tourism" }, "source": [ "

Model Comparison


" ], "id": "cross-tourism" }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:09:07.140641Z", "iopub.status.busy": "2021-05-19T14:09:07.140186Z", "iopub.status.idle": "2021-05-19T14:09:07.145890Z", "shell.execute_reply": "2021-05-19T14:09:07.145469Z" }, "papermill": { "duration": 0.122396, "end_time": "2021-05-19T14:09:07.146007", "exception": false, "start_time": "2021-05-19T14:09:07.023611", "status": "completed" }, "tags": [], "id": "certain-accessory" }, "source": [ "models = pd.DataFrame({\n", " 'Model' : ['Logistic Regression', 'Gradient Boosting Classifier', 'Stochastic Gradient Boosting', 'XgBoost'], \n", " 'Score' : [lr_acc,gb_acc, sgb_acc, xgb_acc]\n", "})\n", "\n", "models.sort_values(by = 'Score', ascending = False)" ], "id": "certain-accessory", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2021-05-19T14:09:07.379088Z", "iopub.status.busy": "2021-05-19T14:09:07.378391Z", "iopub.status.idle": "2021-05-19T14:09:07.625087Z", "shell.execute_reply": "2021-05-19T14:09:07.624605Z" }, "papermill": { "duration": 0.366988, "end_time": "2021-05-19T14:09:07.625188", "exception": false, "start_time": "2021-05-19T14:09:07.258200", "status": "completed" }, "tags": [], "id": "optimum-handling" }, "source": [ "plt.figure(figsize = (15, 10))\n", "\n", "sns.barplot(x = 'Score', y = 'Model', data = models)\n", "plt.show()" ], "id": "optimum-handling", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.166379, "end_time": "2021-05-19T14:09:08.207799", "exception": false, "start_time": "2021-05-19T14:09:08.041420", "status": "completed" }, "tags": [], "id": "executive-implement" }, "source": [ "

Real Data Model Inference


" ], "id": "executive-implement" }, { "cell_type": "code", "metadata": { "id": "277219f8" }, "source": [ "pred_class = {0: 'Not Survived', 1: 'Survived'}\n", "def predict(test_input, test_target):\n", " final_results = gb.predict([test_input])\n", " test_input['Age'] = np.exp(test_input['Age'])\n", " test_input['Fare'] = np.exp(test_input['Fare'])\n", " print(f\"Features:\\n{test_input}\\n\\nClassifier prediction: {pred_class[final_results.item()]}\\nActual Target: {pred_class[test_target]}\")\n", " \n" ], "id": "277219f8", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "5f5011eb" }, "source": [ "test_input = X.iloc[0,:]\n", "test_target = y.iloc[0]\n", "predict(test_input, test_target)" ], "id": "5f5011eb", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "9d34da40" }, "source": [ "test_input = X.iloc[1,:]\n", "test_target = y.iloc[1]\n", "predict(test_input, test_target)" ], "id": "9d34da40", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "006866b9" }, "source": [ "

4. Approaching a Text (NLP) Problem



\n", "\n", "\n" ], "id": "006866b9" }, { "cell_type": "markdown", "metadata": { "id": "bc5ec3ec" }, "source": [ "## 4.1 Importance of solving NLP" ], "id": "bc5ec3ec" }, { "cell_type": "markdown", "metadata": { "id": "m4V2swA5V6t6" }, "source": [ "NLP — also known as computational linguistics — is the combination of AI and linguistics that allows us to talk to machines as if they were human.\n", "\n", "\n", "**In other words, NLP is an approach to process, analyze and understand large amount of text data.**" ], "id": "m4V2swA5V6t6" }, { "cell_type": "markdown", "metadata": { "id": "jM_IQ5lHWg5c" }, "source": [ "![](https://miro.medium.com/max/1400/0*Ql4uL-0XBOKWXHgk.jpg)" ], "id": "jM_IQ5lHWg5c" }, { "cell_type": "markdown", "metadata": { "id": "f3VVBg0hWv_X" }, "source": [ "### **1. Handling large volumes of text data**\n", "\n", "With the big data technology, NLP has entered the mainstream as this approach can now be applied to handle large volumes of text data via cloud/distributed computing at an unprecedented speed.\n" ], "id": "f3VVBg0hWv_X" }, { "cell_type": "markdown", "metadata": { "id": "ctqdYrZ9XF5M" }, "source": [ "### 2. Structuring highly unstructured data source\n", "\n", "Human language is astoundingly complex and diverse.\n", "\n", "NLP is important because it helps resolve ambiguity in language and adds useful numeric structure to the data for many downstream applications, such as speech recognition or text analytics." 
], "id": "ctqdYrZ9XF5M" }, { "cell_type": "markdown", "metadata": { "id": "pgTP_KTxYb-n" }, "source": [ "## Components of NLP\n", "\n", "NLP can be divided into two basic components.\n", "\n", "* Natural Language Understanding\n", "* Natural Language Generation\n" ], "id": "pgTP_KTxYb-n" }, { "cell_type": "markdown", "metadata": { "id": "g0vcgPzhaC0P" }, "source": [ "### Natural Language Understanding (NLU)\n", "\n", "There is a lot of ambiguity in learning or interpreting a language.\n", "\n", "* Lexical ambiguity occurs when a word carries more than one sense.\n", "* Syntactic ambiguity occurs when a sequence of words can be read with more than one meaning.\n", "* Referential ambiguity: very often a text mentions an entity (something/someone)," ], "id": "g0vcgPzhaC0P" }, { "cell_type": "markdown", "metadata": { "id": "tYtjYCaNaZKJ" }, "source": [ "### Natural Language Generation (NLG)\n", "\n", "It involves:\n", "\n", "* Text planning − retrieving the relevant content from the knowledge base.\n", "* Sentence planning − choosing the required words, forming meaningful phrases, and setting the tone of the sentence.\n", "* Text realization − mapping the sentence plan into sentence structure." 
], "id": "tYtjYCaNaZKJ" }, { "cell_type": "markdown", "metadata": { "id": "eceGeHrBdO30" }, "source": [ "## **Applications of NLP**" ], "id": "eceGeHrBdO30" }, { "cell_type": "markdown", "metadata": { "id": "vJuqd3lrdbbt" }, "source": [ "### Search Autocorrect and Autocomplete\n", "\n", "![](https://cdn.analyticsvidhya.com/wp-content/uploads/2020/07/na6-768x610.png)" ], "id": "vJuqd3lrdbbt" }, { "cell_type": "markdown", "metadata": { "id": "QGSNgu8mdjP2" }, "source": [ "### Language Translator\n", "\n", "![](https://cdn.analyticsvidhya.com/wp-content/uploads/2020/07/na7-768x283.png)" ], "id": "QGSNgu8mdjP2" }, { "cell_type": "markdown", "metadata": { "id": "GqLfpY43dr2Y" }, "source": [ "### Chatbots\n", "\n", "![](https://cdn.analyticsvidhya.com/wp-content/uploads/2020/07/na8-scaled-e1594202362955-768x1275.jpg)" ], "id": "GqLfpY43dr2Y" }, { "cell_type": "markdown", "metadata": { "id": "LWDzqBmSd3O9" }, "source": [ "### Sentiment Analysis\n", "\n", "![](https://miro.medium.com/max/469/1*JSi4ZRojnsNZBJVtGUfXZQ.png)" ], "id": "LWDzqBmSd3O9" }, { "cell_type": "markdown", "metadata": { "id": "acdab842" }, "source": [ "

4.3 Basics of NLP


\n", "\n", "

Corpus

\n", "

A corpus is a large, structured set of machine-readable texts that were produced in a natural communicative setting. Its plural is corpora. Corpora can be built from different sources, such as text that was originally electronic, transcripts of spoken language, and text recovered by optical character recognition.

\n", "\n", "

Tokens

\n", "

Raw text is a sequence of characters (bytes), but it is usually more useful to group those characters into contiguous units called tokens. In English, tokens correspond to words and numeric sequences separated by whitespace characters or punctuation.

\n", "\n", "

Tokenization

\n", "

The process of breaking a text down into tokens is called tokenization. It can become more complicated than simply splitting text on non-alphanumeric characters.

\n", "\n", "

N-grams

\n", "

N-grams are sets of co-occurring words within a given window; when computing n-grams, you typically slide the window forward one word at a time. A trigram has three tokens, a bigram has two, and a unigram has one.
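
A minimal sketch of tokenization and n-gram extraction in plain Python. The helper names `tokenize` and `make_ngrams` and the sample sentence are illustrative assumptions, not part of the workshop code, and the regular expression is a simplification of what NLTK's tokenizers do.

```python
import re

def tokenize(text):
    # Lowercase the text, then keep runs of letters and digits as tokens.
    return re.findall('[a-z0-9]+', text.lower())

def make_ngrams(tokens, n):
    # Slide a window of size n over the tokens, moving one token at a time.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize('Forest fire near La Ronge, Sask. Canada')
bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)

print(tokens)
print(bigrams[:3])
print(trigrams[:2])
```

A text with k tokens yields k - n + 1 n-grams, which is why the trigram list is one element shorter than the bigram list.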

\n", "\n", "\n", "\n" ], "id": "acdab842" }, { "cell_type": "markdown", "metadata": { "id": "7bafa6b8" }, "source": [ "

Stemming

\n", "

Stemming algorithms work by cutting off the end or the beginning of a word, using a list of common prefixes and suffixes found in inflected words. Because stemming strips characters without looking at context, it often produces stems that are misspelled or carry the wrong meaning.

\n", "\n", "

Lemmatization

\n", "

Lemmatization is similar to stemming, but it takes the context of a word into account and maps its inflected forms to a single dictionary form, the lemma.
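
To make the difference concrete, here is a toy sketch. Both `toy_stem` and the hand-written `TOY_LEMMAS` table are hypothetical stand-ins chosen for illustration; real projects would rely on a proper stemmer and a lexicon such as WordNet.

```python
SUFFIXES = ('ing', 'ies', 'ed', 'es', 's')

def toy_stem(word):
    # Strip the first matching suffix; no dictionary or context is consulted.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

# Hypothetical lookup table standing in for a real lexicon.
TOY_LEMMAS = {'studies': 'study', 'running': 'run', 'better': 'good', 'was': 'be'}

def toy_lemmatize(word):
    # Return the dictionary form if known, otherwise the word unchanged.
    return TOY_LEMMAS.get(word, word)

for w in ['studies', 'running', 'better', 'was']:
    print(w, toy_stem(w), toy_lemmatize(w))
```

The stemmer returns fragments that are not real words for some inputs, while the lookup returns valid dictionary forms, at the cost of needing a lexicon.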

" ], "id": "7bafa6b8" }, { "cell_type": "markdown", "metadata": { "id": "c48cf800" }, "source": [ "

4.4 NLP using Python


\n", "\n", "

NLTK

\n", "

It is a leading platform for building Python programs that work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, as well as wrappers for industrial-strength NLP libraries.

\n", "\n", "

spaCy

\n", "

It is an open-source software library for advanced natural language processing, written in Python and Cython. Unlike NLTK, which is widely used for teaching and research, spaCy focuses on providing software for production use. spaCy also supports deep learning workflows that connect statistical models trained with popular machine learning libraries such as TensorFlow, PyTorch or MXNet through its own machine learning library, Thinc.

" ], "id": "c48cf800" }, { "cell_type": "markdown", "metadata": { "id": "4ec1df71" }, "source": [ "

4.5 Approaching real life NLP problem


\n", "\n", "

Intro to Twitter tweets processing using Python

\n", "\n", "\n" ], "id": "4ec1df71" }, { "cell_type": "markdown", "metadata": { "id": "2b2caa89" }, "source": [ "

Problem Statement


\n", "\n", "

With so many tweets circulating every second, it is hard to tell whether the sentiment behind a specific tweet will help a company's or a person's brand by going viral (positive) or hurt profits because it strikes a negative tone. Capturing sentiment in language matters at a time when decisions and reactions are made and updated in seconds. To extract sentiment from tweets, they first need to be processed and cleaned so that they can be fed into an algorithm that performs sentiment inference. So we will take a look at a Pythonic approach to processing tweets.

" ], "id": "2b2caa89" }, { "cell_type": "code", "metadata": { "id": "41075983" }, "source": [ "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "import numpy as np\n", "from nltk.corpus import stopwords\n", "from nltk.util import ngrams\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "from wordcloud import WordCloud, STOPWORDS\n", "from nltk.stem.wordnet import WordNetLemmatizer\n", "from nltk import word_tokenize\n", "from collections import defaultdict\n", "from collections import Counter\n", "plt.style.use('ggplot')\n", "stop=set(stopwords.words('english'))\n", "import re\n", "from nltk.tokenize import word_tokenize\n", "import gensim\n", "import string\n", "from tqdm import tqdm\n" ], "id": "41075983", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "7e030a21" }, "source": [ "tweet=pd.read_csv(\"./inputs/nlp/train.csv\")\n", "test=pd.read_csv(\"./inputs/nlp/test.csv\")" ], "id": "7e030a21", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "784f9a32" }, "source": [ "tweet.head(100)" ], "id": "784f9a32", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "927e46dc" }, "source": [ "print('There are {} rows and {} columns in train'.format(tweet.shape[0],tweet.shape[1]))\n", "print('There are {} rows and {} columns in train'.format(test.shape[0],test.shape[1]))" ], "id": "927e46dc", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "5c3b6dd1" }, "source": [ "

Class Distribution

\n", "\n", "

Before we begin with anything else, let's check the class distribution. There are only two classes: 0 (not a disaster tweet) and 1 (disaster tweet).


\n" ], "id": "5c3b6dd1" }, { "cell_type": "code", "metadata": { "id": "f369049e" }, "source": [ "# count tweets per class; keyword arguments keep the call valid on recent seaborn versions\n", "x=tweet.target.value_counts()\n", "sns.barplot(x=x.index, y=x.values)\n", "plt.gca().set_ylabel('samples')" ], "id": "f369049e", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "a5dfef16" }, "source": [ "

Exploratory Data Analysis (EDA)


" ], "id": "a5dfef16" }, { "cell_type": "code", "metadata": { "id": "e9f68ec8" }, "source": [ "#First,we will do very basic analysis,that is character level,word level and sentence level analysis.\n", "fig,(ax1,ax2)=plt.subplots(1,2,figsize=(10,5))\n", "tweet_len=tweet[tweet['target']==1]['text'].str.len()\n", "ax1.hist(tweet_len,color='red')\n", "ax1.set_title('disaster tweets')\n", "tweet_len=tweet[tweet['target']==0]['text'].str.len()\n", "ax2.hist(tweet_len,color='green')\n", "ax2.set_title('Not disaster tweets')\n", "fig.suptitle('Characters in tweets')\n", "plt.show()\n", "\n", "print(\"The distribution of both seems to be almost same.120 t0 140 characters in a tweet are the most common among both.\")" ], "id": "e9f68ec8", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "15e192cb" }, "source": [ "# Number of words in tweet\n", "fig,(ax1,ax2)=plt.subplots(1,2,figsize=(10,5))\n", "tweet_len=tweet[tweet['target']==1]['text'].str.split().map(lambda x: len(x))\n", "ax1.hist(tweet_len,color='red')\n", "ax1.set_title('disaster tweets')\n", "tweet_len=tweet[tweet['target']==0]['text'].str.split().map(lambda x: len(x))\n", "ax2.hist(tweet_len,color='green')\n", "ax2.set_title('Not disaster tweets')\n", "fig.suptitle('Words in a tweet')\n", "plt.show()\n", "\n", "print(\"Most of disaster tweets are around between 10 -20 word counts\\nand Non disaster tweet areound 15 to 20\")" ], "id": "15e192cb", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "2ae980eb" }, "source": [ "# Average word length in a tweet\n", "fig,(ax1,ax2)=plt.subplots(1,2,figsize=(10,5))\n", "word=tweet[tweet['target']==1]['text'].str.split().apply(lambda x : [len(i) for i in x])\n", "sns.distplot(word.map(lambda x: np.mean(x)),ax=ax1,color='red')\n", "ax1.set_title('disaster')\n", "word=tweet[tweet['target']==0]['text'].str.split().apply(lambda x : [len(i) for i in x])\n", "sns.distplot(word.map(lambda x: 
np.mean(x)),ax=ax2,color='green')\n", "ax2.set_title('Not disaster')\n", "fig.suptitle('Average word length in each tweet')\n" ], "id": "2ae980eb", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "a26f0c7c" }, "source": [ "

Creating Corpus

\n", "\n", "

To perform further EDA, we need a corpus.


\n" ], "id": "a26f0c7c" }, { "cell_type": "code", "metadata": { "id": "7fa39f5c" }, "source": [ "#creating text corpus based on target\n", "#corpus of Disaster vs Non Disaster Tweet\n", "\n", "def build_corpus(target):\n", " corpus=[]\n", " \n", " for x in tweet[tweet['target']==target]['text'].str.split():\n", " for i in x:\n", " corpus.append(i)\n", " return corpus" ], "id": "7fa39f5c", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "4a33faed" }, "source": [ "#wordcloud of non disaster tweets\n", "comment_words = ''\n", "stopwords = set(STOPWORDS)\n", "comment_words += \" \".join(build_corpus(0))+\" \"\n", "\n", "wordcloud = WordCloud(width = 800, height = 800,\n", " background_color ='white',\n", " stopwords = stopwords,\n", " min_font_size = 10).generate(comment_words)\n", "\n", "plt.figure(figsize = (8, 8), facecolor = None)\n", "plt.imshow(wordcloud)\n", "plt.axis(\"off\")\n", "plt.tight_layout(pad = 0)\n", "plt.show()" ], "id": "4a33faed", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "4b5f89cf" }, "source": [ "#wordcloud of disaster tweets\n", "comment_words = ''\n", "stopwords = set(STOPWORDS)\n", "comment_words += \" \".join(build_corpus(1))+\" \"\n", "\n", "wordcloud = WordCloud(width = 800, height = 800,\n", " background_color ='black',\n", " stopwords = stopwords,\n", " min_font_size = 10).generate(comment_words)\n", "\n", "plt.figure(figsize = (8, 8), facecolor = None)\n", "plt.imshow(wordcloud)\n", "plt.axis(\"off\")\n", "plt.tight_layout(pad = 0)\n", " \n", "plt.show()" ], "id": "4b5f89cf", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "143d08ef" }, "source": [ "#creating corpus based of Non Disaster Tweet\n", "corpus=build_corpus(0)\n", "\n", "#build dicr based on NoN disaster tweet\n", "dic=defaultdict(int)\n", "for word in corpus:\n", " if word in stop:\n", " dic[word]+=1\n", " \n", "#choosing top tweets\n", "top=sorted(dic.items(), 
key=lambda x:x[1],reverse=True)[:10] \n", "top" ], "id": "143d08ef", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "443c251d" }, "source": [ "#bar chart showing common words in NoN disaster tweets\n", "x,y = zip(*top)\n", "plt.figure(figsize=(15,8))\n", "plt.bar(x, y);" ], "id": "443c251d", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "f776c8cf" }, "source": [ "#creating the same for Diaster Tweet\n", "corpus=build_corpus(1)\n", "\n", "dic=defaultdict(int)\n", "for word in corpus:\n", " if word in stop:\n", " dic[word]+=1\n", " \n", "top=sorted(dic.items(), key=lambda x:x[1],reverse=True)[:10] \n", " \n", "plt.figure(figsize=(15,8))\n", "x,y=zip(*top)\n", "plt.bar(x,y)\n", "plt.show()\n", "\n", "print(\"Words like 'the','in','of' dominates, but these words have not much meaning or prediction capabilities\")" ], "id": "f776c8cf", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "349c8bd4" }, "source": [ "#Analysing punctuations for Disaster Tweets\n", "plt.figure(figsize=(15,8))\n", "corpus=build_corpus(1)\n", "\n", "dic=defaultdict(int)\n", "import string\n", "special = string.punctuation\n", "for i in (corpus):\n", " if i in special:\n", " dic[i]+=1\n", " \n", "x,y=zip(*dic.items())\n", "plt.title(\"Analysing punctuations for Disaster Tweets\")\n", "plt.bar(x,y,color='orange');\n" ], "id": "349c8bd4", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "e954149c" }, "source": [ "#Analysing punctuations for Non-Disaster Tweets\n", "plt.figure(figsize=(15,8))\n", "corpus=build_corpus(0)\n", "\n", "dic=defaultdict(int)\n", "import string\n", "special = string.punctuation\n", "for i in (corpus):\n", " if i in special:\n", " dic[i]+=1\n", " \n", "x,y=zip(*dic.items())\n", "plt.title(\"Analysing punctuations for Non-Disaster Tweets\")\n", "plt.bar(x,y,color='red');" ], "id": "e954149c", "execution_count": null, "outputs": [] 
}, { "cell_type": "code", "metadata": { "id": "1c946180" }, "source": [ "#performing the same for complete corpus\n", "complete_corpus = []\n", "complete_corpus.extend(build_corpus(0))\n", "complete_corpus.extend(build_corpus(1))\n", "# count words over the complete corpus (both classes), not the last per-class corpus\n", "counter=Counter(complete_corpus)\n", "most=counter.most_common()\n", "x=[]\n", "y=[]\n", "for word,count in most[:40]:\n", "    if (word not in stop) :\n", "        x.append(word)\n", "        y.append(count)\n", "plt.figure(figsize=(15,8))\n", "sns.barplot(x=y,y=x).set_title(\"Complete corpus common words\")\n", "plt.show()" ], "id": "1c946180", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "7f68c536" }, "source": [ "

N-gram Analysis

\n", "\n", "

We can perform Bigram and Trigram analysis on this data to understand word combinations in the dataset


\n", "\n" ], "id": "7f68c536" }, { "cell_type": "code", "metadata": { "id": "94e5b1c0" }, "source": [ "def get_top_tweet_bigrams(corpus, gram=2, n=None):\n", " vec = CountVectorizer(ngram_range=(gram, gram)).fit(corpus)\n", " bag_of_words = vec.transform(corpus)\n", " sum_words = bag_of_words.sum(axis=0) \n", " words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]\n", " words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)\n", " return words_freq[:n]" ], "id": "94e5b1c0", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "ad2b33c9" }, "source": [ "#count of common bigrams\n", "plt.figure(figsize=(15,8))\n", "top_tweet_bigrams=get_top_tweet_bigrams(tweet['text'], gram=2)[:10]\n", "x,y=map(list,zip(*top_tweet_bigrams))\n", "sns.barplot(x=y,y=x).set_title(\"Common Bigram Count\")\n", "plt.show()" ], "id": "ad2b33c9", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "ddcd245e" }, "source": [ "#count of common trigrams\n", "plt.figure(figsize=(15,8))\n", "top_tweet_trigrams=get_top_tweet_bigrams(tweet['text'], gram=3)[:10]\n", "x,y=map(list,zip(*top_tweet_trigrams))\n", "sns.barplot(x=y,y=x).set_title(\"Common Trigram Count\")\n", "plt.show()" ], "id": "ddcd245e", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "7c7f5e82" }, "source": [ "

The bigram and trigram analysis clearly shows that this data needs cleaning.


" ], "id": "7c7f5e82" }, { "cell_type": "markdown", "metadata": { "id": "4199317e" }, "source": [ "

Data Cleaning

\n", "\n", "

Raw Twitter data contains a lot of noise, such as stopwords, irrelevant punctuation and emojis, which can hurt its predictive power. So let's clean it.


\n", "\n" ], "id": "4199317e" }, { "cell_type": "code", "metadata": { "id": "d91d7534" }, "source": [ "df=pd.concat([tweet,test])\n", "df.shape" ], "id": "d91d7534", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "80618224" }, "source": [ "

Lemmatization

" ], "id": "80618224" }, { "cell_type": "code", "metadata": { "id": "85f6917a" }, "source": [ "def lemma(text):\n", " text = text.lower()\n", " lmtzr = WordNetLemmatizer()\n", " text = \" \".join([lmtzr.lemmatize(word) for word in word_tokenize(text)])\n", " return text\n", "\n", "ex = \"Our Deeds are the Reason of this #earthquake\"\n", "lemma(ex)" ], "id": "85f6917a", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "57fb679f" }, "source": [ "#apply lemmatization\n", "df['text']=df['text'].apply(lambda x : lemma(x))" ], "id": "57fb679f", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "22b2f525" }, "source": [ "

Removing URL

" ], "id": "22b2f525" }, { "cell_type": "code", "metadata": { "id": "3ab272e6" }, "source": [ "#Regex to remove URL\n", "def remove_URL(text):\n", " url = re.compile(r'https?://\\S+|www\\.\\S+')\n", " return url.sub(r'',text)\n", "\n", "sample_URL = \"please find the titanic project at https://www.kaggle.com/c/titanic/overview\"\n", "remove_URL(sample_URL)" ], "id": "3ab272e6", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "1a77ce09" }, "source": [ "#Apply the cleaning process\n", "df['text']=df['text'].apply(lambda x : remove_URL(x))" ], "id": "1a77ce09", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "0df25952" }, "source": [ "df[['text','target']].head()" ], "id": "0df25952", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "3d530aa8" }, "source": [ "

Removing HTML

" ], "id": "3d530aa8" }, { "cell_type": "code", "metadata": { "id": "90cb2b2b" }, "source": [ "#remove html\n", "def remove_html(text):\n", " html=re.compile(r'<.*?>')\n", " return html.sub(r'',text)\n", "\n", "sample_HTML = '''\n", "\n", "
\n", "

Welcome to HTML\n", "kaggle\n", "

test paragraph

\n", "

\n", "\n", "'''\n", "print(remove_html(sample_HTML))" ], "id": "90cb2b2b", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "a2188b79" }, "source": [ "" ], "id": "a2188b79", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "c9ef8c65" }, "source": [ "df['text']=df['text'].apply(lambda x : remove_html(x))\n", "df[['text','target']].head()" ], "id": "c9ef8c65", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "e7e0769b" }, "source": [ "

Removing Emojis 😔

\n", "\n", "

Emojis can be strong predictors, but they need to be encoded in some numeric form first. Let's just remove them for simplicity's sake.


\n", "\n" ], "id": "e7e0769b" }, { "cell_type": "code", "metadata": { "id": "5a4ebe7d" }, "source": [ "def remove_emoji(text):\n", " emoji_pattern = re.compile(\"[\"\n", " u\"\\U0001F600-\\U0001F64F\" # emoticons\n", " u\"\\U0001F300-\\U0001F5FF\" # symbols & pictographs\n", " u\"\\U0001F680-\\U0001F6FF\" # transport & map symbols\n", " u\"\\U0001F1E0-\\U0001F1FF\" # flags (iOS)\n", " u\"\\U00002702-\\U000027B0\"\n", " u\"\\U000024C2-\\U0001F251\"\n", " \"]+\", flags=re.UNICODE)\n", " return emoji_pattern.sub(r'', text)\n" ], "id": "5a4ebe7d", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "8eaf1eb0" }, "source": [ "\n", "remove_emoji(\"its flooding i am scared 😔😔\")\n", "\n" ], "id": "8eaf1eb0", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "a6f0d4cb" }, "source": [ "df['text']=df['text'].apply(lambda x: remove_emoji(x))\n", "df[['text','target']].head()" ], "id": "a6f0d4cb", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "82c2b58a" }, "source": [ "

Removing punctuations

\n", "\n", "

Just like emojis, punctuation can also carry meaning, as in :) or :(.\n", "But it is hard to encode such punctuation in numeric form when vectorizing text, so we remove it.


\n", "\n" ], "id": "82c2b58a" }, { "cell_type": "code", "metadata": { "id": "4eec6b24" }, "source": [ "def remove_punct(text):\n", " table=str.maketrans('','',string.punctuation)\n", " return text.translate(table)\n", "\n", "example=\"There are #earthquakes\"\n", "print(remove_punct(example))" ], "id": "4eec6b24", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "3df59cc1" }, "source": [ "df['text']=df['text'].apply(lambda x : remove_punct(x))\n", "df[['text','target']].head()" ], "id": "3df59cc1", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "3313ced9" }, "source": [ "

Vectorization

\n", "\n", "

Text needs to be encoded into a numerical format before we feed it into a machine learning algorithm.


\n", "\n", "

Bag-Of-Words - The model is simple in that it throws away all of the order information in the words and focuses on the occurrence of words in a document.
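
A small sketch of the bag-of-words idea using only the standard library. The two sample sentences and the helper name `to_vector` are assumptions made up for illustration; the vocabulary is simply the sorted set of words seen in the sample documents.

```python
from collections import Counter

docs = ['flood warning issued for the city', 'the city is safe']

# Vocabulary: every distinct word across the documents, in sorted order.
vocab = sorted({word for doc in docs for word in doc.split()})

def to_vector(doc):
    # Count how often each vocabulary word occurs; word order plays no role.
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [to_vector(doc) for doc in docs]
print(vocab)
print(vectors)

# Shuffling the words of a document leaves its vector unchanged.
print(to_vector('safe is the city') == vectors[1])
```

This is the property the text describes: two documents with the same words in a different order map to the same vector.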


\n", "\n", "\n", "\n" ], "id": "3313ced9" }, { "cell_type": "markdown", "metadata": { "id": "5d9c77f5" }, "source": [ "\n", "

BOW vectorization techniques

\n", "\n", "

1. Convert text to word count vectors with CountVectorizer.
2. Convert text to TF-IDF weighted word frequency vectors with TfidfVectorizer.


\n", "\n" ], "id": "5d9c77f5" }, { "cell_type": "markdown", "metadata": { "id": "8b09721c" }, "source": [ "

Word Counts with CountVectorizer

\n", "\n", "\n", "

The CountVectorizer provides a simple way to tokenize a collection of text documents and build a vocabulary of known words, and also to encode new documents using that vocabulary.


\n", "\n" ], "id": "8b09721c" }, { "cell_type": "code", "metadata": { "id": "40024021" }, "source": [ "# list of text documents\n", "text = [\"heard about the earthquake is different city stay safe everyone and hope the city fast recovery\"]\n", "# create the transform\n", "vectorizer = CountVectorizer()\n", "# tokenize and build vocab\n", "vectorizer.fit(text)\n", "# summarize\n", "print(vectorizer.vocabulary_)\n", "# encode document\n", "vector = vectorizer.transform(text)\n", "# summarize encoded vector\n", "print(vector.shape)\n", "print(type(vector))\n", "print(vector.toarray())" ], "id": "40024021", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "4ba4c155" }, "source": [ "#vectorizing the whole tweet\n", "vectorizer = CountVectorizer()\n", "vectorizer.fit(df.text.values)\n", "vector = vectorizer.transform(df.text.values)\n", "print(vector.shape)" ], "id": "4ba4c155", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "69ef10ea" }, "source": [ "#to see the vocab dictionary \n", "# print(vectorizer.vocabulary_)\n" ], "id": "69ef10ea", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "3f078b9d" }, "source": [ "

Word Frequencies with TfidfVectorizer

\n", "\n", "

TF-IDF scores are word frequency scores that try to highlight the more interesting words, e.g. words that are frequent in a document but not across documents.


\n", "\n", "

* Term Frequency: This summarizes how often a given word appears within a document.
* Inverse Document Frequency: This downscales words that appear a lot across documents.
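
The two terms can be worked through by hand. The sketch below assumes the smoothed formula that scikit-learn uses by default, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation of each document vector; the three toy documents and the function names are illustrative assumptions.

```python
import math
from collections import Counter

# Three tiny, already tokenized documents (illustrative data).
docs = [['the', 'quick', 'fox'], ['the', 'dog'], ['the', 'fox']]
n_docs = len(docs)

# Document frequency: in how many documents each word appears.
df = Counter(word for doc in docs for word in set(doc))

def idf(word):
    # Smoothed inverse document frequency (assumed scikit-learn default).
    return math.log((1 + n_docs) / (1 + df[word])) + 1

def tfidf(doc):
    # Term frequency times idf, then scale the vector to unit length.
    counts = Counter(doc)
    raw = {w: c * idf(w) for w, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: v / norm for w, v in raw.items()}

weights = tfidf(docs[0])
for word, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(word, round(weight, 3))
```

In the first document the rare word gets the largest weight and the word that appears in every document gets the smallest, which is the downscaling effect described above.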


\n", "\n" ], "id": "3f078b9d" }, { "cell_type": "code", "metadata": { "id": "fc48325d" }, "source": [ "# list of text documents\n", "text = [\"The quick brown fox jumped over the lazy dog.\",\"The dog.\",\"The fox\"]\n", "# create the transform\n", "vectorizer = TfidfVectorizer()\n", "# tokenize and build vocab\n", "vectorizer.fit(text)\n", "# summarize\n", "print(vectorizer.vocabulary_)\n", "print(vectorizer.idf_)\n", "# encode document\n", "vector = vectorizer.transform([text[0]])\n", "# summarize encoded vector\n", "print(vector.shape)\n", "print(vector.toarray())" ], "id": "fc48325d", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "ad881af0" }, "source": [ "vectorizer = TfidfVectorizer()\n", "# tokenize and build vocab\n", "vectorizer.fit(df.text.values);" ], "id": "ad881af0", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "3767faee" }, "source": [ "# print(vectorizer.vocabulary_)" ], "id": "3767faee", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "3f05a6f7" }, "source": [ "print(vectorizer.idf_.shape)\n", "print(vectorizer.idf_)" ], "id": "3f05a6f7", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "2ee8db5c" }, "source": [ "vector = vectorizer.transform([text[0]])\n", "# summarize encoded vector\n", "print(vector.shape)\n", "print(vector.toarray())" ], "id": "2ee8db5c", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "_2Pcu3EIMtQL" }, "source": [ "------" ], "id": "_2Pcu3EIMtQL" }, { "cell_type": "markdown", "metadata": { "id": "YLNGW22-lOYv" }, "source": [ "# **What is Computer Vision**\n", "\n", "Two definitions of computer vision:\n", "\n", "**Computer vision can be defined as\n", "a scientific field that extracts information out of digital images**. 
The type of\n", "information gained from an image can vary from identification, space\n", "measurements for navigation, or augmented reality applications.\n", "

\n", "Another way to define computer vision is through its applications.\n", "\n", "**Computer vision is building algorithms that can understand the content\n", "of images and use it for other applications**.\n", "\n", "\n", "
\n", "
\n", "Computer vision at\n", "the intersection of multiple\n", "scientific fields\n", "
\n", "\n", "Source: Computer Vision: Foundation and Applications by Ranjay Krishna - Stanford University\n" ], "id": "YLNGW22-lOYv" }, { "cell_type": "markdown", "metadata": { "id": "MM3XTRZFLSFC" }, "source": [ "
\n", "
\n", "\n", " Source: Lecture 1: Introduction to Computer Vision by Prof. Fei-Fei Li\n", "Stanford Vision Lab\n", "\n" ], "id": "MM3XTRZFLSFC" }, { "cell_type": "markdown", "metadata": { "id": "r1QD2_khN2QK" }, "source": [ "
\n", "
\n", "\n", " Source: Lecture 1: Introduction to Computer Vision by Prof. Fei-Fei Li\n", "Stanford Vision Lab" ], "id": "r1QD2_khN2QK" }, { "cell_type": "markdown", "metadata": { "id": "E9qfgEIfUMSW" }, "source": [ "

Why Computer Vision Matters

\n", "\n", "\n", "
\n", "\n", "
" ], "id": "E9qfgEIfUMSW" }, { "cell_type": "markdown", "metadata": { "id": "19MIfcFkng7a" }, "source": [ "# **How is computer vision used today?**" ], "id": "19MIfcFkng7a" }, { "cell_type": "markdown", "metadata": { "id": "8bDHgRMNo8aE" }, "source": [ "

Optical character recognition (OCR)

\n", "\n", "Technology to convert scanned docs to text\n", "\n", "
\n", "
\n", "License plate readers\n", "
\n", "\n" ], "id": "8bDHgRMNo8aE" }, { "cell_type": "markdown", "metadata": { "id": "PyFL7J36YRdr" }, "source": [ "

Face detection

\n", "\n", "
\n", "
\n", "
\n" ], "id": "PyFL7J36YRdr" }, { "cell_type": "markdown", "metadata": { "id": "CsvjHYaNYRXD" }, "source": [ "

Face analysis and recognition

\n", "\n", "
\n", "\n", "
" ], "id": "CsvjHYaNYRXD" }, { "cell_type": "markdown", "metadata": { "id": "13YvD2iqZRLV" }, "source": [ "

Login without a password

\n", "\n", "
\n", "\n", "
" ], "id": "13YvD2iqZRLV" }, { "cell_type": "markdown", "metadata": { "id": "6o0LGU0caCb2" }, "source": [ "

Special effects: shape capture

\n", "\n", "
\n", "\n", "
" ], "id": "6o0LGU0caCb2" }, { "cell_type": "markdown", "metadata": { "id": "ILBgNeRXaCku" }, "source": [ "

Special effects: motion capture

\n", "\n", "
\n", "\n", "
" ], "id": "ILBgNeRXaCku" }, { "cell_type": "markdown", "metadata": { "id": "9tlPJSA_aCt-" }, "source": [ "

3D face tracking with consumer cameras

\n", "\n", "
\n", "\n", "
" ], "id": "9tlPJSA_aCt-" }, { "cell_type": "markdown", "metadata": { "id": "M9oJ5P7paC4E" }, "source": [ "

Image synthesis

\n", "\n", "
\n", "\n", "
" ], "id": "M9oJ5P7paC4E" }, { "cell_type": "markdown", "metadata": { "id": "tmw6L4NBb01d" }, "source": [ "

Smart Cars

\n", "\n", "
\n", "\n", "
" ], "id": "tmw6L4NBb01d" }, { "cell_type": "markdown", "metadata": { "id": "xLm5oI6ib0vn" }, "source": [ "

Self Driving Cars

\n", "\n", "
\n", "\n", "
" ], "id": "xLm5oI6ib0vn" }, { "cell_type": "markdown", "metadata": { "id": "j8vvL2U2b08Q" }, "source": [ "

Image synthesis

\n", "\n", "
\n", "\n", "
" ], "id": "j8vvL2U2b08Q" }, { "cell_type": "markdown", "metadata": { "id": "9r8OFeY2mgVm" }, "source": [ "# **Challenges**\n", "\n", "
\n", "
" ], "id": "9r8OFeY2mgVm" }, { "cell_type": "markdown", "metadata": { "id": "AJdmshPZmy6s" }, "source": [ "
" ], "id": "AJdmshPZmy6s" }, { "cell_type": "markdown", "metadata": { "id": "6JN1V7LgM8xa" }, "source": [ "---" ], "id": "6JN1V7LgM8xa" }, { "cell_type": "markdown", "metadata": { "id": "m_eHPt4V0wxK" }, "source": [ "# **Image Processing**" ], "id": "m_eHPt4V0wxK" }, { "cell_type": "markdown", "metadata": { "id": "1peauNei7FPt" }, "source": [ "## **Point Operators**\n", "\n", "The simplest kind of image processing transforms are point operators, where each output pixel value depends on only the corresponding input pixel value (plus some globally collected information or parameters)" ], "id": "1peauNei7FPt" }, { "cell_type": "markdown", "metadata": { "id": "E8sx171T7oTE" }, "source": [ "### **Pixel Transforms**\n", "\n", "Pixel transforms are applying a funtion over each pixel value in the images.\n", "It can be a linear function with a gain and bias, which are used in brightness and contrast control, or can be a non-linear function which are used in gamma corrections in digital cameras\n" ], "id": "E8sx171T7oTE" }, { "cell_type": "markdown", "metadata": { "id": "JbFnGjoY7oEo" }, "source": [ "### **Color transforms**\n", "\n", "Brightening a picture by adding a constant value to all three color channels created undesired side effects like affecting its hue and saturation.\n", "\n", "chromaticity coordinates or even simpler color ratios\n", "can first be computed and then used after manipulating (e.g., brightening) the luminance Y to re-compute a valid RGB image with the same hue and saturation." ], "id": "JbFnGjoY7oEo" }, { "cell_type": "markdown", "metadata": { "id": "3CjLqIgwnlNp" }, "source": [ "### **Compositing and matting**\n", "\n", "The process of extracting the object from the original image is often called matting. \n", "
\n", "The process of inserting it into another image (without visible artifacts) is called compositing.\n", "
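The compositing step can be sketched with the standard over operator, `C = alpha * F + (1 - alpha) * B`, where `F` is the matted foreground, `B` is the new background and `alpha` is the matte. The snippet below is a minimal NumPy sketch rather than a production implementation; the function name `composite_over` and the tiny synthetic images are assumptions made for illustration.

```python
import numpy as np

def composite_over(foreground, background, alpha):
    '''Blend foreground over background using a per-pixel alpha matte in [0, 1].'''
    fg = foreground.astype(np.float32)
    bg = background.astype(np.float32)
    # Add a channel axis so the matte broadcasts over the color channels
    a = alpha.astype(np.float32)[..., np.newaxis]
    blended = a * fg + (1.0 - a) * bg
    return np.clip(blended, 0, 255).astype(np.uint8)

# Hypothetical 2x2 images: a white foreground over a black background
foreground = np.full((2, 2, 3), 255, dtype=np.uint8)
background = np.zeros((2, 2, 3), dtype=np.uint8)
alpha = np.array([[1.0, 0.5], [0.0, 0.25]])

result = composite_over(foreground, background, alpha)
print(result[..., 0])
```

An `alpha` of 1 keeps the foreground pixel, 0 keeps the background pixel, and fractional values blend the two, which is what hides the seams around the matted object.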

\n", "\n", "
\n", "\n", "(a) Original image\n", "(b) Alpha-matted color image\n", "(c) Alpha channel\n", "(d) Composited image" ], "id": "3CjLqIgwnlNp" }, { "cell_type": "markdown", "metadata": { "id": "oGBCVKLPpULi" }, "source": [ "### **Histogram equalization**\n", "\n", "How can we determine the best values for the brightness and gain controls?\n", "\n", "One answer is to plot the histogram of the individual color channels and luminance values. From this distribution, we can compute relevant statistics\n", "such as the minimum, maximum, and average intensity values. Histogram equalization goes one step further and remaps the intensities so that the resulting histogram is approximately flat.\n", "\n", "
\n", "Example: Histogram equalization\n", "
\n", "
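As a rough illustration of the idea, the sketch below equalizes an 8-bit grayscale image by using its cumulative histogram as a lookup table. It is a simplified NumPy version written for this notebook, with a hypothetical helper name `equalize_histogram` and synthetic data; OpenCV offers the same kind of operation for 8-bit single-channel images as `cv2.equalizeHist`.

```python
import numpy as np

def equalize_histogram(gray):
    '''Equalize an 8-bit grayscale image using its cumulative histogram.'''
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]          # first non-zero value of the CDF
    total = gray.size
    if total == cdf_min:               # constant image: nothing to equalize
        return gray.copy()
    # Map each intensity so that the output CDF becomes roughly linear
    lut = np.round((cdf - cdf_min) / (total - cdf_min) * 255)
    lut = np.clip(lut, 0, 255).astype(np.uint8)
    return lut[gray]

# Synthetic low-contrast image (assumed data): values squeezed into [100, 139]
rng = np.random.default_rng(0)
low_contrast = rng.integers(100, 140, size=(64, 64), dtype=np.uint8)
equalized = equalize_histogram(low_contrast)

print('before:', low_contrast.min(), low_contrast.max())
print('after: ', equalized.min(), equalized.max())
```

The lookup table stretches the occupied intensity range over the full 0 to 255 scale, so the darkest input value maps to 0 and the brightest to 255.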
\n", "\n", "\n" ], "id": "oGBCVKLPpULi" }, { "cell_type": "markdown", "metadata": { "id": "ExiddWSMZ_YS" }, "source": [ "## **Linear Filtering**\n", "\n", "A neighborhood operator (or local operator) uses a collection of pixel values in the vicinity of a given pixel to determine its final output value.\n", "\n", "In addition to performing local tone adjustment, neighborhood operators can be used to filter images to add soft blur, sharpen details, accentuate edges, or remove noise." ], "id": "ExiddWSMZ_YS" }, { "cell_type": "markdown", "metadata": { "id": "-tzQru1Iewz6" }, "source": [ "### **1. Separable Filtering**\n", "\n", "Convolution can be sped up when the kernel is separable, i.e., when the 2D kernel can be written as the outer product of a vertical and a horizontal 1D kernel. The 2D convolution can then be replaced by two cheaper 1D convolutions.\n", "\n", "### **2. Band-pass and steerable filters**\n", "\n", "Band-pass filters filter out both low and high frequencies. Such a kernel can be created by first smoothing the image with a Gaussian filter and then taking the first or second derivatives." ], "id": "-tzQru1Iewz6" }, { "cell_type": "markdown", "metadata": { "id": "5UVlzxXojSRy" }, "source": [ "## **More neighborhood operators**\n", "\n", "\n", "\n", "\n", "\n", "\n" ], "id": "5UVlzxXojSRy" }, { "cell_type": "markdown", "metadata": { "id": "HinLkjNrqPEl" }, "source": [ "### **1. Non-linear filtering**\n", "\n", "Consider an image with 'shot noise', i.e., noise with occasionally very large pixel values. Using a linear filter, such as regular blurring with a Gaussian filter, will only turn those large values into softer spots.\n", "\n", "**Median filtering**\n", "
\n", "The median filter selects the median value from each pixel’s neighborhood. Since the shot\n", "noise value usually lies well outside the true values in the neighborhood, the median filter is able to filter away such bad pixels." ], "id": "HinLkjNrqPEl" }, { "cell_type": "markdown", "metadata": { "id": "CVr8l1F2qRJt" }, "source": [ "### **2. Bilateral filtering**\n", "\n", "In the bilateral filter, the output pixel value depends on a weighted combination of neighboring pixel values. The idea is to simply reject (in a soft way) pixels whose values differ too much from the central pixel value." ], "id": "CVr8l1F2qRJt" }, { "cell_type": "markdown", "metadata": { "id": "uuRGmzljqTYs" }, "source": [ "### **Binary image processing**\n", "\n", "Binary images are thresholded grayscale images in which every pixel value is either 0 or 1.\n", "
\n", "#### **Morphology**\n", "In morphological operations, we first convolve the binary image with a binary structuring element and then select a binary output value depending on the thresholded result of the convolution.\n", "
\n", "Examples are **dilation, erosion, majority, opening, closing**\n", "
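The morphological operations listed above can be sketched by counting the ones under a box structuring element and thresholding that count. This is a minimal NumPy illustration under stated assumptions (a square structuring element, zero padding at the borders, hypothetical helper names); OpenCV provides optimized versions such as `cv2.erode`, `cv2.dilate` and `cv2.morphologyEx`.

```python
import numpy as np

def neighbor_count(binary, size=3):
    '''Count the ones under a size x size box structuring element (zero padding).'''
    pad = size // 2
    padded = np.pad(binary.astype(np.int32), pad)
    rows, cols = binary.shape
    count = np.zeros((rows, cols), dtype=np.int32)
    # Summing shifted copies is equivalent to convolving with a box of ones
    for dy in range(size):
        for dx in range(size):
            count += padded[dy:dy + rows, dx:dx + cols]
    return count

def dilate(binary, size=3):
    # Output is 1 if at least one pixel under the element is 1
    return (neighbor_count(binary, size) >= 1).astype(np.uint8)

def erode(binary, size=3):
    # Output is 1 only if every pixel under the element is 1
    return (neighbor_count(binary, size) == size * size).astype(np.uint8)

def majority(binary, size=3):
    # Output is 1 if more than half of the pixels under the element are 1
    return (neighbor_count(binary, size) > (size * size) / 2).astype(np.uint8)

def opening(binary, size=3):
    return dilate(erode(binary, size), size)

def closing(binary, size=3):
    return erode(dilate(binary, size), size)

# Assumed test image: a 3x3 square plus one isolated noise pixel
image = np.zeros((7, 7), dtype=np.uint8)
image[2:5, 2:5] = 1
image[0, 6] = 1

print(opening(image))
```

Opening removes the isolated pixel while restoring the square, which is why it is often used to clean up small specks after thresholding.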
\n", "\n", "### **Distance Transforms**\n", "\n", "A distance transform, also known as a distance map or distance field, is a derived representation of a digital image.\n", "
\n", "The map labels each pixel of the image with the distance to the nearest obstacle pixel. The most common type of obstacle pixel is a boundary pixel in a binary image." ], "id": "uuRGmzljqTYs" }, { "cell_type": "markdown", "metadata": { "id": "lkIntz82rax1" }, "source": [ "## **Fourier Transforms**" ], "id": "lkIntz82rax1" }, { "cell_type": "markdown", "metadata": { "id": "P1qyn6uE5QJg" }, "source": [ "The Fourier transform is used to decompose an image into its sine and cosine components. The output of the transformation represents the image in the Fourier or frequency domain, while the input image is the spatial domain equivalent.\n", "\n", "The **Fast Fourier Transform** (FFT) can be used to perform large-kernel operations in time that is independent of the kernel size." ], "id": "P1qyn6uE5QJg" }, { "cell_type": "markdown", "metadata": { "id": "2kXGT9tX7nKo" }, "source": [ "### **Two-dimensional Fourier transforms**\n", "\n", "#### **Wiener filtering***\n", "The Fourier transform can be used to analyze the frequency spectrum of a whole class of images.
\n", "\n", "This linear filtering technique assumes that any noise present in the system is additive white Gaussian noise (AWGN). (*Not used in practice today)" ], "id": "2kXGT9tX7nKo" }, { "cell_type": "markdown", "metadata": { "id": "NvgIDyUm9DYJ" }, "source": [ "#### **Discrete cosine transform**\n", "\n", "The discrete cosine transform (DCT) is a variant of the Fourier transform particularly well suited to compressing images in a block-wise fashion.\n", "\n", "The DCT is widely used in today’s image and video compression algorithms." ], "id": "NvgIDyUm9DYJ" }, { "cell_type": "markdown", "metadata": { "id": "Sf4TBpvH9YqI" }, "source": [ "## **Pyramids and wavelets**\n", "\n", "We may need to alter the resolution of an image to speed up a processing algorithm or to match the resolution of a printer or a screen.\n" ], "id": "Sf4TBpvH9YqI" }, { "cell_type": "markdown", "metadata": { "id": "tYuWtvP39nWv" }, "source": [ "### **Interpolation**\n", "\n", "Image interpolation (or upsampling) occurs when you resize or distort your image from one pixel grid to another.\n", "
\n", "\n", "Interpolation works by using known data to estimate values at unknown points. Common methods can be grouped into two categories:\n", "\n", "#### **Adaptive methods**\n", "\n", "Adaptive methods change depending on what they are interpolating. Examples include many proprietary algorithms in licensed software such as Qimage, PhotoZoom Pro and Genuine Fractals.\n", "\n", "#### **Non-adaptive methods**\n", "Non-adaptive methods treat all pixels equally. Examples are nearest neighbor, bilinear, bicubic, spline, sinc, and Lanczos interpolation.\n", "\n", "\n", "\n" ], "id": "tYuWtvP39nWv" }, { "cell_type": "markdown", "metadata": { "id": "dMiH8vEcARbg" }, "source": [ "A camera performs an optical zoom by moving the zoom lens so that it increases the magnification of light. However, a digital zoom degrades quality by simply interpolating the image. Even though the photo with digital zoom contains the same number of pixels, the detail is clearly far less than with optical zoom." ], "id": "dMiH8vEcARbg" }, { "cell_type": "markdown", "metadata": { "id": "LiB-Cbu1AWzB" }, "source": [ "### **Decimation**\n", "\n", "Decimation reduces the resolution of an image (downsampling). This is done by convolving the image with a low-pass filter and then keeping every r-th sample. 
Some examples of decimation filters are direct subsampling, block averaging, and the sinc function.\n", "\n", "Applications include image compression, image transmission over limited bandwidth, etc.\n" ], "id": "LiB-Cbu1AWzB" }, { "cell_type": "markdown", "metadata": { "id": "M3fhbJspClL5" }, "source": [ "### **Multi-resolution representations**\n", "\n", "Consider the task of finding a face in an image. Since we don't know the size of the face in advance, we can construct a pyramid of differently sized images and scan each one for possible faces.\n" ], "id": "M3fhbJspClL5" }, { "cell_type": "markdown", "metadata": { "id": "2cDDgYT3EEni" }, "source": [ "#### **Laplacian pyramid**\n", "To construct the pyramid, we first blur and subsample the original image by a factor of two and store this in the next level of the pyramid. Because adjacent levels differ by a factor of two, this is also known as an *octave pyramid*. The Laplacian pyramid then stores, at each level, the difference between that level and an upsampled version of the next coarser level.\n", "\n", "Image pyramids are extremely useful for performing multi-scale editing operations such as blending images while maintaining details." 
], "id": "2cDDgYT3EEni" }, { "cell_type": "markdown", "metadata": { "id": "gWFO-z3mEFi7" }, "source": [ "### **Wavelets**\n", "\n", "Wavelets provide a smooth way to decompose a signal into frequency components without blocking and are closely related to pyramids.\n", "\n", "The three main wavelet transforms are the continuous wavelet transform, the inverse continuous wavelet transform, and the discrete wavelet transform.\n", "\n", "Applications of wavelets include de-noising, compression and image fusion.\n", "\n" ], "id": "gWFO-z3mEFi7" }, { "cell_type": "markdown", "metadata": { "id": "Yf6Ox5SdF3l5" }, "source": [ "## **Geometric transformations**" ], "id": "Yf6Ox5SdF3l5" }, { "cell_type": "markdown", "metadata": { "id": "edmCMfB7F__L" }, "source": [ "### **Parametric transformations**\n", "\n", "Parametric transformations apply a global deformation to an image, where the behavior of the transformation is controlled by a small number of parameters.\n", "\n", "#### **MIP-mapping**\n", "\n", "A MIP-map is a standard image pyramid, where each level is prefiltered with a high-quality filter rather than a poorer quality approximation. 
\n", "\n", "Computer graphics rendering APIs, such as OpenGL and Direct3D, have parameters that can be used to select which variant of MIP-mapping should be used, depending on the desired tradeoff between speed and quality.\n", "\n", "#### **Anisotropic filtering**\n", "\n", "An alternative approach to filtering oriented textures, which is sometimes implemented in graphics hardware (GPUs), is to use anisotropic filtering.\n", "\n", "#### **Multi-pass transforms**\n", "\n", "The advantage of using a series of one-dimensional transforms is that they are much more efficient (in terms of basic arithmetic operations) than large, non-separable, two-dimensional filter kernels.\n", "\n", "\n" ], "id": "edmCMfB7F__L" }, { "cell_type": "markdown", "metadata": { "id": "MrYe1pjtId5r" }, "source": [ "### **Mesh-based warping**\n", "\n", "Consider changing the appearance of a face from a frown to a smile. What is needed in this case is to curve the corners of the mouth upwards while leaving the rest of the face intact. To perform such a transformation, different amounts of motion are required in different parts of the image.\n", "\n", "The mesh-warping algorithm relates features through nonuniform meshes in the source and destination images, i.e., the images are broken up into small regions that are mapped onto each other for the morph.\n" ], "id": "MrYe1pjtId5r" }, { "cell_type": "markdown", "metadata": { "id": "ftW6RQOUJkAy" }, "source": [ "---" ], "id": "ftW6RQOUJkAy" }, { "cell_type": "markdown", "metadata": { "id": "29a9e7f9" }, "source": [ "# **OpenCV**\n", "\n", "OpenCV is an open-source library supported on multiple platforms including Linux, Windows, Android and macOS. \n", "It supports a wide variety of programming languages such as C++, Python, and Java.\n", "
\n", "OpenCV-Python is the Python API for OpenCV, combining the best qualities of the OpenCV C++ API and the Python language." ], "id": "29a9e7f9" }, { "cell_type": "markdown", "metadata": { "id": "075a1dd5" }, "source": [ "### Pre-requisites\n", "\n", "1. Python 3.x\n", "2. Pip " ], "id": "075a1dd5" }, { "cell_type": "markdown", "metadata": { "id": "07abb1c1" }, "source": [ "To make sure that the Python version installed on your system is 3.x, run the command below in your system's terminal or command prompt:\n", "\n", "```powershell\n", "$ python3 --version\n", "```\n", "To make sure that the PIP package manager is installed on your system, run:\n", "\n", "```powershell\n", "$ pip3 --version\n", "```\n", "\n", "Alternatively, you can confirm both by executing the cells below.\n" ], "id": "07abb1c1" }, { "cell_type": "code", "metadata": { "id": "acfe8e85" }, "source": [ "! echo 'Python Version: '$(python3 --version)\n" ], "id": "acfe8e85", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "4323dd68" }, "source": [ "! 
echo 'PIP Version: ' $(pip3 --version)" ], "id": "4323dd68", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "iwgR_gMlEgq_" }, "source": [ "\n", "\n", "---\n", "\n" ], "id": "iwgR_gMlEgq_" }, { "cell_type": "markdown", "metadata": { "id": "unpEp_uyEXWg" }, "source": [ "## **Installing OpenCV-Python**" ], "id": "unpEp_uyEXWg" }, { "cell_type": "markdown", "metadata": { "id": "e5FdzMxPEtPu" }, "source": [ "The [OpenCV-Python](https://pypi.org/project/opencv-python/) package can be installed using the [PIP package manager](https://pip.pypa.io/en/stable/).\n", "\n", "To install it, run the command below in your terminal or command prompt:\n", "\n", "```powershell\n", "$ pip3 install opencv-python\n", "```" ], "id": "e5FdzMxPEtPu" }, { "cell_type": "markdown", "metadata": { "id": "yHWa7omzFtGS" }, "source": [ "You can do the same in this notebook by executing the cell below." ], "id": "yHWa7omzFtGS" }, { "cell_type": "code", "metadata": { "id": "Oq4UQGrpD7Fi" }, "source": [ "! pip3 install opencv-python" ], "id": "Oq4UQGrpD7Fi", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "YLw9IgCRGVsp" }, "source": [ "After the installation is done, confirm it by checking the version of the package." ], "id": "YLw9IgCRGVsp" }, { "cell_type": "code", "metadata": { "id": "68qHIL7iHDSt" }, "source": [ "import cv2\n", "print(f'OpenCV-Python Version: {cv2.__version__}')" ], "id": "68qHIL7iHDSt", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "949CM7VAIb5B" }, "source": [ "\n", "\n", "---\n", "\n" ], "id": "949CM7VAIb5B" }, { "cell_type": "markdown", "metadata": { "id": "WCl4_sWxG_nZ" }, "source": [ "## **Now, let's dive into OpenCV**" ], "id": "WCl4_sWxG_nZ" }, { "cell_type": "markdown", "metadata": { "id": "zfQ9Q0HwMact" }, "source": [ "Before diving into OpenCV, let's import a few packages that we will be using throughout this notebook:\n", "1. OpenCV\n", "2. 
[Numpy](https://numpy.org/)\n", "3. [Matplotlib](https://matplotlib.org/)\n", "4. [os](https://docs.python.org/3/library/os.html)\n", "\n", "Run the cell below to import them." ], "id": "zfQ9Q0HwMact" }, { "cell_type": "code", "metadata": { "id": "2joI4vraNJqQ" }, "source": [ "import os\n", "import cv2 # OpenCV\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline" ], "id": "2joI4vraNJqQ", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "67BZ00NcN_IQ" }, "source": [ "Let's set the path of the image directory" ], "id": "67BZ00NcN_IQ" }, { "cell_type": "code", "metadata": { "id": "cise8o8YOIgD" }, "source": [ "image_directory = 'images/'" ], "id": "cise8o8YOIgD", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "cegZOncY2SD0" }, "source": [ "Let's create a helper function to plot the images" ], "id": "cegZOncY2SD0" }, { "cell_type": "code", "metadata": { "id": "FG-Q_0r3WDJ4" }, "source": [ "# Helper to plot one or two images\n", "def plot_images(original, result=None):\n", "    \"\"\"Plot the images using the matplotlib library\"\"\"\n", "    def to_rgb(img):\n", "        # Grayscale images are 2D; color images are stored as BGR by OpenCV\n", "        code = cv2.COLOR_GRAY2RGB if img.ndim == 2 else cv2.COLOR_BGR2RGB\n", "        return cv2.cvtColor(img, code)\n", "\n", "    # Plot single image\n", "    if result is None:\n", "        plt.imshow(to_rgb(original))\n", "        plt.show()\n", "        return\n", "\n", "    # Plot two images\n", "    f, (axis1, axis2) = plt.subplots(1, 2, figsize=(15,15))\n", "    axis1.imshow(to_rgb(original))\n", "    axis1.set_title('Original Image')\n", "    axis2.imshow(to_rgb(result))\n", "    axis2.set_title('Resultant Image')\n", "    plt.show()" ], "id": "FG-Q_0r3WDJ4", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "aR1FDKJkJnzH" }, "source": [ "### **1. Changing colorspaces**" ], "id": "aR1FDKJkJnzH" }, { "cell_type": "markdown", "metadata": { "id": "yQinvt3tJ16G" }, "source": [ "There are more than 150 color-space conversion methods available in OpenCV. 
We will take a look at two of the widely used methods - \n", "
BGR ↔ Gray and BGR ↔ HSV\n", "\n", "**BGR** - Blue-Green-Red
\n", "**HSV** - Hue-Saturation-Value" ], "id": "yQinvt3tJ16G" }, { "cell_type": "markdown", "metadata": { "id": "VVxTYuJ1Kvfk" }, "source": [ "For color conversion, we can make use of the function `cv2.cvtColor(input_image, flag)` where `flag` determines the type of conversion.\n", "\n", "Here, we will try to extract a blue colored object from an image by following the steps below:\n", "\n", "1. Read the image\n", "2. Convert from BGR to HSV color-space\n", "3. Threshold the HSV image for a range of blue color\n", "4. Apply the resulting mask to the original image" ], "id": "VVxTYuJ1Kvfk" }, { "cell_type": "code", "metadata": { "id": "eCmqBwdeHiKS" }, "source": [ "# Read the input image\n", "image = cv2.imread('images/color_spaces.jpg', cv2.IMREAD_COLOR)" ], "id": "eCmqBwdeHiKS", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "hIvEX83wQn3g" }, "source": [ "# Plot the image using matplotlib\n", "plot_images(image)" ], "id": "hIvEX83wQn3g", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "ui_Yx5qDQwyw" }, "source": [ "# Convert BGR to HSV\n", "hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV)" ], "id": "ui_Yx5qDQwyw", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "v-p-EBOATUHI" }, "source": [ "# define range of blue color in HSV (for 8-bit images OpenCV stores hue in [0, 179])\n", "lower_blue = np.array([110,50,50])\n", "upper_blue = np.array([130,255,255])\n", "\n", "# Threshold the HSV image to get only blue colors\n", "mask = cv2.inRange(hsv, lower_blue, upper_blue)" ], "id": "v-p-EBOATUHI", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "AjcuweUCThCY" }, "source": [ "# Bitwise-AND mask and original image\n", "res = cv2.bitwise_and(image, image, mask=mask)" ], "id": "AjcuweUCThCY", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "2YHiXH6qTpm4" }, "source": [ "# Plot the result image using matplotlib\n", "plot_images(image, res)" ], "id": "2YHiXH6qTpm4", "execution_count": null, 
"outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "LR1c48AUYlFE" }, "source": [ "\n", "\n", "---\n", "\n", "\n" ], "id": "LR1c48AUYlFE" }, { "cell_type": "markdown", "metadata": { "id": "AeXk6NRHYngp" }, "source": [ "### **2. Geometric Transformations of Images**" ], "id": "AeXk6NRHYngp" }, { "cell_type": "markdown", "metadata": { "id": "ZSCbsLzKZBjY" }, "source": [ "#### **1. Scaling**\n", "\n", "Scaling is just resizing of the image. OpenCV comes with a function `cv.resize()` for this purpose. \n", "\n", "Preferable interpolation methods are \n", "1. `cv.INTER_AREA` for shrinking \n", "2. `cv.INTER_CUBIC` (slow) \n", "3. `cv.INTER_LINEAR` for zooming\n", "\n", "By default `cv.INTER_LINEAR` is used for all resizing purposes\n", "\n", "Let's try resizing an image;" ], "id": "ZSCbsLzKZBjY" }, { "cell_type": "code", "metadata": { "id": "6EsQQdjfYvNY" }, "source": [ "# Read the image\n", "image = cv2.imread('images/color_spaces.jpg', cv2.IMREAD_COLOR)\n", "\n", "# Plot the image using matplotlib\n", "plot_images(image)" ], "id": "6EsQQdjfYvNY", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "U6xV7EIKlH18" }, "source": [ "# Apply resize with `INTER_CUBIC` Interpolation - 10 times the size of input image\n", "res = cv2.resize(image, None, fx=10, fy=10, interpolation = cv2.INTER_CUBIC)" ], "id": "U6xV7EIKlH18", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "kwaEaET0mVaS" }, "source": [ "print(f'Original size: {image.shape} \\nResized image size: {res.shape}')\n", "plot_images(res)" ], "id": "kwaEaET0mVaS", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "Wag6HX0vnlqi" }, "source": [ "#### **2. Translation**\n", "\n", "Translation is the shifting of an image along the x- and y-axis. 
Using translation, we can shift an image up, down, left, or right, along with any combination of the above.\n", "\n", "If you want to shift an image `tx` pixels along the x-axis and `ty` pixels along the y-axis, the transformation matrix will be
`[[1, 0, tx], [0, 1, ty]]`\n", "\n", "Translation can be done by passing this transformation matrix to the method `cv2.warpAffine()`.\n" ], "id": "Wag6HX0vnlqi" }, { "cell_type": "code", "metadata": { "id": "mVDZ4HV_q2XD" }, "source": [ "# Read the image\n", "image = cv2.imread('images/color_spaces.jpg', cv2.IMREAD_COLOR)\n", "\n", "# Plot the image using matplotlib\n", "plot_images(image)" ], "id": "mVDZ4HV_q2XD", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "8m9ytX86mbEj" }, "source": [ "# Shift the image 125 pixels to the right and 150 pixels down\n", "M = np.float32([[1, 0, 125], [0, 1, 150]])\n", "shifted = cv2.warpAffine(image, M, (image.shape[1], image.shape[0]))\n", "\n", "# plot the image\n", "plot_images(image, shifted)" ], "id": "8m9ytX86mbEj", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "mISRbIifuZBD" }, "source": [ "# Shift the image 150 pixels to the left and 190 pixels up\n", "M = np.float32([[1, 0, -150], [0, 1, -190]])\n", "shifted = cv2.warpAffine(image, M, (image.shape[1], image.shape[0]))\n", "\n", "# plot the image\n", "plot_images(image, shifted)" ], "id": "mISRbIifuZBD", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "tvAfI10gvf4E" }, "source": [ "#### **3. 
Rotation**\n", "\n", "OpenCV provides a function, `cv2.getRotationMatrix2D`, to create a transformation matrix, which can be applied to the image using the method `cv2.warpAffine()`. Positive angles rotate the image counter-clockwise." ], "id": "tvAfI10gvf4E" }, { "cell_type": "code", "metadata": { "id": "JZ48bvPXvaec" }, "source": [ "# Read the image\n", "image = cv2.imread('images/color_spaces.jpg', cv2.IMREAD_COLOR)\n", "\n", "# Plot the image using matplotlib\n", "plot_images(image)" ], "id": "JZ48bvPXvaec", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "v3tARNaWunLE" }, "source": [ "# grab the dimensions of the image and calculate the center of the image\n", "(height, width) = image.shape[:2]\n", "(cX, cY) = (width // 2, height // 2)" ], "id": "v3tARNaWunLE", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "cLHX0CXZwJ57" }, "source": [ "# rotate our image by 45 degrees (counter-clockwise) around the center of the image\n", "M = cv2.getRotationMatrix2D((cX, cY), 45, 1.0)\n", "rotated = cv2.warpAffine(image, M, (width, height))\n", "plot_images(image, rotated)" ], "id": "cLHX0CXZwJ57", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "fcsEu_AMwMwT" }, "source": [ "# rotate our image by -90 degrees (clockwise) around the center of the image\n", "M = cv2.getRotationMatrix2D((cX, cY), -90, 1.0)\n", "rotated = cv2.warpAffine(image, M, (width, height))\n", "plot_images(image, rotated)" ], "id": "fcsEu_AMwMwT", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "RK-ehYTRwxFT" }, "source": [ "#### **4. Affine Transformation**\n", "\n", "To be added" ], "id": "RK-ehYTRwxFT" }, { "cell_type": "markdown", "metadata": { "id": "YC3-EQWxxRUr" }, "source": [ "#### **5. Perspective Transformation**\n", "\n", "To be added" ], "id": "YC3-EQWxxRUr" }, { "cell_type": "markdown", "metadata": { "id": "gRcQ6J7Rxjoj" }, "source": [ "### **3. 
Image Thresholding**" ], "id": "gRcQ6J7Rxjoj" }, { "cell_type": "markdown", "metadata": { "id": "B4O4i3L8yXMj" }, "source": [ "#### **1. Simple Thresholding**\n", "\n", "For every pixel, the same threshold value is applied. If the pixel value is smaller than the threshold, it is set to 0, otherwise it is set to a maximum value.\n", "\n", "The function `cv2.threshold` is used to apply the thresholding.\n", "\n", "OpenCV provides different types of thresholding:\n", "1. `cv2.THRESH_BINARY`\n", "2. `cv2.THRESH_BINARY_INV`\n", "3. `cv2.THRESH_TRUNC`\n", "4. `cv2.THRESH_TOZERO`\n", "5. `cv2.THRESH_TOZERO_INV`\n", "\n", "A detailed explanation of these methods can be found [here](https://docs.opencv.org/4.5.2/d7/d1b/group__imgproc__misc.html#gaa9e58d2860d4afa658ef70a9b1115576).\n", "\n", "*Note: The input image must be a grayscale image*" ], "id": "B4O4i3L8yXMj" }, { "cell_type": "code", "metadata": { "id": "BGdeYveCzGyE" }, "source": [ "# read the input image\n", "image = cv2.imread('images/simple_threshold.PNG',0)\n", "\n", "# Plot the images\n", "plot_images(image)" ], "id": "BGdeYveCzGyE", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "FtNiFViYyWiS" }, "source": [ "# Apply various threshold methods\n", "ret,thresh1 = cv2.threshold(image,127,255,cv2.THRESH_BINARY)\n", "ret,thresh2 = cv2.threshold(image,127,255,cv2.THRESH_BINARY_INV)\n", "ret,thresh3 = cv2.threshold(image,127,255,cv2.THRESH_TRUNC)\n", "ret,thresh4 = cv2.threshold(image,127,255,cv2.THRESH_TOZERO)\n", "ret,thresh5 = cv2.threshold(image,127,255,cv2.THRESH_TOZERO_INV)" ], "id": "FtNiFViYyWiS", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "aXIAPqfEzTNw" }, "source": [ "titles = ['Original Image','BINARY','BINARY_INV','TRUNC','TOZERO','TOZERO_INV']\n", "images = [image, thresh1, thresh2, thresh3, thresh4, thresh5]\n", "for i in range(6):\n", "    plt.subplot(2,3,i+1),plt.imshow(images[i],'gray',vmin=0,vmax=255)\n", "    plt.title(titles[i])\n", "    plt.xticks([]),plt.yticks([])\n", "plt.show()" ], "id": "aXIAPqfEzTNw", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "c_-OPy3r1Aft" }, "source": [ "#### **2. Adaptive Thresholding**\n", "\n", "Using one global value for thresholding might not be good in all cases, e.g., if the image has different lighting conditions in different areas. \n", "\n", "**Adaptive Thresholding** determines the threshold for a pixel based on a small region around it. This can produce different thresholds for different regions of the same image, which gives better results for images with varying illumination. The method `cv2.adaptiveThreshold` is used for it.\n", "\n", "There are two different methods of calculating the threshold:\n", "1. `cv2.ADAPTIVE_THRESH_MEAN_C`: The threshold value is the mean of the neighbourhood area minus the constant C.\n", "2. `cv2.ADAPTIVE_THRESH_GAUSSIAN_C`: The threshold value is a Gaussian-weighted sum of the neighbourhood values minus the constant C.\n", "\n", "The `blockSize` determines the size of the neighbourhood area and `C` is a constant that is subtracted from the mean or weighted sum of the neighbourhood pixels." 
], "id": "c_-OPy3r1Aft" }, { "cell_type": "code", "metadata": { "id": "pZzt2JkhzivS" }, "source": [ "# Read the image\n", "image = cv2.imread(f'{image_directory}calender.jpg',0)\n", "\n", "# plot the image\n", "plot_images(image)" ], "id": "pZzt2JkhzivS", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "wtzCMk954q4q" }, "source": [ "# Apply the thresholding methods to the image (blockSize=11, C=2)\n", "ret,thresh1 = cv2.threshold(image,127,255,cv2.THRESH_BINARY)\n", "thresh2 = cv2.adaptiveThreshold(image,255,cv2.ADAPTIVE_THRESH_MEAN_C,\n", "    cv2.THRESH_BINARY,11,2)\n", "thresh3 = cv2.adaptiveThreshold(image,255,cv2.ADAPTIVE_THRESH_GAUSSIAN_C,\n", "    cv2.THRESH_BINARY,11,2)" ], "id": "wtzCMk954q4q", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "hdfC66Yo-hL9" }, "source": [ "# Plot the thresholding resultant images\n", "titles = ['Original Image', 'Global Thresholding (v = 127)',\n", "    'Adaptive Mean Thresholding', 'Adaptive Gaussian Thresholding']\n", "\n", "plot_images(image, thresh1)\n", "images = [image, thresh1, thresh2, thresh3]\n", "for i in range(4):\n", "    plt.subplot(2,2,i+1),plt.imshow(images[i],'gray')\n", "    plt.title(titles[i])\n", "    plt.xticks([]),plt.yticks([])\n", "plt.show()" ], "id": "hdfC66Yo-hL9", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "OWb5aFI7DQXS" }, "source": [ "#### **3. Otsu's Binarization**\n", "\n", "Otsu's method avoids having to choose an arbitrary value as threshold and determines it automatically.\n", "\n", "It determines an optimal global threshold value from the image histogram.\n", "\n", "The `cv2.threshold()` function is used, where `cv2.THRESH_OTSU` is passed as an extra flag. The threshold argument passed in is then ignored, and the threshold chosen by Otsu's method is returned as the first output." 
], "id": "OWb5aFI7DQXS" }, { "cell_type": "code", "metadata": { "id": "b27poyFyCxNE" }, "source": [ "# Read input image as grayscale\n", "image = cv2.imread(f'{image_directory}periodic_table.png',0)\n", "\n", "# Otsu's thresholding: the threshold argument (0) is ignored and\n", "# ret2 holds the threshold value selected by Otsu's method\n", "ret2,thresh = cv2.threshold(image,0,255,cv2.THRESH_BINARY+cv2.THRESH_OTSU)\n", "plot_images(image, thresh)" ], "id": "b27poyFyCxNE", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "6sMhW78yG7I8" }, "source": [ "### **4. Smoothing Images**" ], "id": "6sMhW78yG7I8" }, { "cell_type": "markdown", "metadata": { "id": "OWRwRNEDHQSW" }, "source": [ "#### **1. 2D Convolution ( Image Filtering )**\n", "\n", "Like one-dimensional signals, images can also be filtered with various low-pass filters (LPF), high-pass filters (HPF), etc. An LPF helps in removing noise and blurring images. An HPF helps in finding edges in images.\n", "\n", "The function `cv.filter2D()` is used to convolve a kernel with an image." ], "id": "OWRwRNEDHQSW" }, { "cell_type": "markdown", "metadata": { "id": "CzKBwBH2IO_G" }, "source": [ "For example, let's try applying the 5x5 averaging filter kernel below." ], "id": "CzKBwBH2IO_G" }, { "cell_type": "code", "metadata": { "id": "LACqXPGjEpHs" }, "source": [ "# 5x5 averaging kernel: 25 equal weights that sum to 1\n", "kernel = np.ones((5,5),np.float32)/25\n", "print(kernel)" ], "id": "LACqXPGjEpHs", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "5WpIuMRbJohN" }, "source": [ "The operation works like this: centre the kernel on a pixel, add up the 25 pixels under the kernel, take the average, and replace the central pixel with that average value." ], "id": "5WpIuMRbJohN" }, { "cell_type": "code", "metadata": { "id": "3eHkxCMDIeId" }, "source": [ "# Read input image\n", "image = cv2.imread(f'{image_directory}python_logo.png')\n", "\n", "# Plot the image\n", "plot_images(image)" ], "id": "3eHkxCMDIeId", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "YKpdEB3WIyBX" }, 
"source": [ "# Apply the kernel (ddepth = -1 keeps the output depth the same as the input)\n", "result = cv2.filter2D(image,-1,kernel)\n", "plot_images(image, result)" ], "id": "YKpdEB3WIyBX", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "qhz6PWPNJ8kc" }, "source": [ "#### **2. Image Blurring (Image Smoothing)**\n", "\n", "Image blurring is achieved by convolving the image with a low-pass filter kernel. It is useful for removing noise.\n", "\n", "OpenCV provides four types of blurring techniques:\n", "\n", "1. Averaging\n", "2. Gaussian blurring\n", "3. Median blurring\n", "4. Bilateral filtering\n" ], "id": "qhz6PWPNJ8kc" }, { "cell_type": "markdown", "metadata": { "id": "PXSe1A4vOieD" }, "source": [ "##### **1. Averaging**\n", "This is done by convolving an image with a normalized box filter, as explained in the 2D convolution section above.\n", "The function `cv.blur()` can be used for averaging." ], "id": "PXSe1A4vOieD" }, { "cell_type": "code", "metadata": { "id": "Moo9vqplL_vV" }, "source": [ "# Apply a 5x5 averaging (box) kernel\n", "blur = cv2.blur(image,(5,5))" ], "id": "Moo9vqplL_vV", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "ydmdwQvVMHqu" }, "source": [ "# Plot the image\n", "plot_images(image, blur)" ], "id": "ydmdwQvVMHqu", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "gcpPyTI0L1NT" }, "source": [ "##### **2. Gaussian Blurring**\n", "In this method, a Gaussian kernel is used instead of a box filter. The function `cv.GaussianBlur()` is used for this.
\n", "Gaussian blurring is highly effective in removing Gaussian noise from an image." ], "id": "gcpPyTI0L1NT" }, { "cell_type": "code", "metadata": { "id": "3ZJ9_AFnMaoP" }, "source": [ "# Apply Gaussian blur with a 5x5 kernel\n", "# (sigma = 0 lets OpenCV derive it from the kernel size)\n", "blur = cv2.GaussianBlur(image,(5,5),0)\n", "\n", "# Plot the image\n", "plot_images(image, blur)" ], "id": "3ZJ9_AFnMaoP", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "5HYtf65EL3kj" }, "source": [ "##### **3. Median Blurring**\n", "The function `cv.medianBlur()` takes the median of all the pixels under the kernel area and replaces the central element with this median value.
\n", "It is effective at removing salt-and-pepper noise." ], "id": "5HYtf65EL3kj" }, { "cell_type": "code", "metadata": { "id": "5F_uOReDJDHN" }, "source": [ "# Add salt-and-pepper noise to the image\n", "\n", "# Draw one random number per pixel position\n", "rng = np.random.default_rng(0)\n", "noise = rng.random(image.shape[:2])\n", "\n", "# Set about 2% of the pixels to black (pepper) and 2% to white (salt)\n", "noisy_image = image.copy()\n", "noisy_image[noise < 0.02] = 0\n", "noisy_image[noise > 0.98] = 255" ], "id": "5F_uOReDJDHN", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "gr9T4X_vNQ-w" }, "source": [ "# Apply the median filter with a 5x5 kernel\n", "median = cv2.medianBlur(noisy_image,5)\n", "\n", "# Plot the image\n", "plot_images(noisy_image, median)" ], "id": "gr9T4X_vNQ-w", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "IlDqaJ-iO7o9" }, "source": [ "##### **4. Bilateral Filtering**\n", "\n", "`cv.bilateralFilter()` is highly effective in noise removal while keeping edges sharp.\n", "\n", "This filter preserves edges by also using a Gaussian function of intensity difference, which ensures that only pixels with intensities similar to the central pixel are considered for blurring." ], "id": "IlDqaJ-iO7o9" }, { "cell_type": "code", "metadata": { "id": "cPJg3E0HQ_2W" }, "source": [ "# Load the input image\n", "image = cv2.imread(f'{image_directory}pattern.PNG')\n", "\n", "# Plot the image\n", "plot_images(image)" ], "id": "cPJg3E0HQ_2W", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "_8MAOYvVOHVY" }, "source": [ "# Apply the filter (d = 25, sigmaColor = 75, sigmaSpace = 75)\n", "blur = cv2.bilateralFilter(image,25,75,75)\n", "\n", "# Plot the images\n", "plot_images(image, blur)" ], "id": "_8MAOYvVOHVY", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "yxmcLXpSS6jX" }, "source": [ "### **5. 
Morphological Transformations**\n", "\n", "Morphological operations are simple operations, applied with a kernel, that are performed on a binary image based on its shape." ], "id": "yxmcLXpSS6jX" }, { "cell_type": "code", "metadata": { "id": "0XlPJ_8qTG1N" }, "source": [ "# Read the input image\n", "image = cv2.imread(f'{image_directory}morphology.PNG')\n", "\n", "# Plot the image\n", "plot_images(image)" ], "id": "0XlPJ_8qTG1N", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "nX3mQTyDWHyw" }, "source": [ "# Create a 5x5 kernel\n", "kernel = np.ones((5,5),np.uint8)" ], "id": "nX3mQTyDWHyw", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "Kn9P0jKzTrNY" }, "source": [ "#### **1. Erosion**\n", "\n", "Erosion erodes the boundaries of a foreground object.\n", "\n", "It works as follows: a kernel slides over the image, and a pixel is kept as 1 only if all the pixels under the kernel are 1. Otherwise it is eroded (set to zero). In effect, this erodes the boundaries of an object.\n", "\n", "It is useful for removing small white noise, detaching two connected objects, etc." ], "id": "Kn9P0jKzTrNY" }, { "cell_type": "code", "metadata": { "id": "CptF1LwsVGPP" }, "source": [ "# Apply the kernel to the input image\n", "erosion = cv2.erode(image,kernel,iterations = 1)\n", "\n", "# Plot the result image\n", "plot_images(image, erosion)" ], "id": "CptF1LwsVGPP", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "2bydg1V3VnCo" }, "source": [ "#### **2. Dilation**\n", "\n", "Dilation is the opposite of erosion. 
A pixel element is 1 if at least one pixel under the kernel is 1.\n" ], "id": "2bydg1V3VnCo" }, { "cell_type": "code", "metadata": { "id": "7KsZyiOVVaH2" }, "source": [ "# Apply the kernel\n", "dilation = cv2.dilate(image, kernel,iterations = 1)\n", "\n", "# Plot the images\n", "plot_images(image, dilation)" ], "id": "7KsZyiOVVaH2", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "75eglnd8WXE-" }, "source": [ "#### **3. Opening**\n", "\n", "Opening is just another name for erosion followed by dilation. \n", "The method used is `cv.morphologyEx()`.\n", "\n", "It is useful for removing white noise." ], "id": "75eglnd8WXE-" }, { "cell_type": "code", "metadata": { "id": "_Zp1_6NKWhd8" }, "source": [ "# Add salt (white) noise to the image\n", "\n", "# Draw one random number per pixel position\n", "rng = np.random.default_rng(0)\n", "noise = rng.random(image.shape[:2])\n", "\n", "# Set about 2% of the pixels to white\n", "noisy_image = image.copy()\n", "noisy_image[noise > 0.98] = 255" ], "id": "_Zp1_6NKWhd8", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "zXFxuq9IWO8G" }, "source": [ "# Apply the kernel\n", "opening = cv2.morphologyEx(noisy_image, cv2.MORPH_OPEN, kernel)\n", "\n", "# Plot the image\n", "plot_images(noisy_image, opening)" ], "id": "zXFxuq9IWO8G", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "NXjuZgVsXIk3" }, "source": [ "#### **4. Closing**\n", "\n", "Closing is the reverse of opening: dilation followed by erosion. \n", "
\n", "It is useful in closing small holes inside the foreground objects, or small black points on the object." ], "id": "NXjuZgVsXIk3" }, { "cell_type": "code", "metadata": { "id": "rXP3wQyNW7J5" }, "source": [ "# Read the input image\n", "image = cv2.imread(f'{image_directory}morphology_closing.PNG')\n", "\n", "# Apply the kernel\n", "closing = cv2.morphologyEx(image, cv2.MORPH_CLOSE, kernel)\n", "\n", "# Plot the images\n", "plot_images(image, closing)" ], "id": "rXP3wQyNW7J5", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "TuH8jS3PZcDf" }, "source": [ "#### **5. Morphological Gradient**\n", "It is the difference between the dilation and the erosion of an image." ], "id": "TuH8jS3PZcDf" }, { "cell_type": "code", "metadata": { "id": "LUC0FIAFXlsc" }, "source": [ "# Read the image\n", "image = cv2.imread(f'{image_directory}morphology.PNG')\n", "\n", "# Apply the kernel to the image\n", "gradient = cv2.morphologyEx(image, cv2.MORPH_GRADIENT, kernel)\n", "\n", "# Plot the images\n", "plot_images(image, gradient)" ], "id": "LUC0FIAFXlsc", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "s_I3dDsLZ-Jp" }, "source": [ "### **6. Canny Edge Detection**\n", "\n", "Canny Edge Detection is a popular edge detection algorithm.\n", "It is a multistage algorithm:\n", "1. Noise reduction with a 5x5 Gaussian filter\n", "2. Finding the intensity gradient of the image\n", "3. Non-maximum suppression to remove pixels which may not constitute the edges\n", "4. Hysteresis thresholding to keep or discard candidate edges based on two threshold values." 
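, "\n", "\n",
"The last step can be illustrated with a simplified NumPy sketch. The gradient magnitudes and the thresholds (50 and 200) are made-up values for this example, and the neighbour check is done in a single pass, whereas full hysteresis keeps following chains of connected weak pixels.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Hypothetical gradient magnitudes for a small patch\n",
"magnitude = np.array([[10, 60, 250], [40, 120, 220], [5, 30, 90]])\n",
"low, high = 50, 200\n",
"\n",
"# Strong pixels are always edges; weak pixels are only candidates\n",
"strong = magnitude >= high\n",
"weak = (magnitude >= low) & (magnitude < high)\n",
"\n",
"# Mark every pixel that has a strong pixel in its 3x3 neighbourhood\n",
"rows, cols = magnitude.shape\n",
"padded = np.pad(strong, 1)\n",
"near_strong = np.zeros_like(strong)\n",
"for dr in range(3):\n",
"    for dc in range(3):\n",
"        near_strong |= padded[dr:dr + rows, dc:dc + cols]\n",
"\n",
"# Keep strong pixels plus the weak pixels that touch a strong one\n",
"edges = strong | (weak & near_strong)\n",
"print(edges.astype(np.uint8))\n",
"```\n",
"\n",
"Pixels below the lower threshold are always discarded, so isolated weak responses caused by noise do not end up in the edge map."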
], "id": "s_I3dDsLZ-Jp" }, { "cell_type": "markdown", "metadata": { "id": "7-gSLaJEjYUd" }, "source": [ "The `cv2.Canny()` method is used for Canny edge detection." ], "id": "7-gSLaJEjYUd" }, { "cell_type": "code", "metadata": { "id": "unupDFopZz2F" }, "source": [ "# Read the input image\n", "image = cv2.imread(f'{image_directory}stop_sign.jpg')\n", "\n", "# Apply Canny edge detection (lower threshold = 50, upper threshold = 200)\n", "edges = cv2.Canny(image,50,200)\n", "\n", "# Plot the images\n", "plot_images(image, edges)" ], "id": "unupDFopZz2F", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "tuErXYIgnpOK" }, "source": [ "### **7. Image Pyramids**\n", "\n", "When we need a set of versions of the same image at different resolutions, we can use image pyramids. At each layer of the pyramid, the image is smoothed and downsized.\n", "\n", "
\n", "
\n", "\n", "There are two types of image pyramids:\n", "1. Gaussian pyramids\n", "2. Laplacian pyramids\n", "\n", "We can build Gaussian pyramids using the `cv.pyrDown()` and `cv.pyrUp()` functions.\n", "\n", "Laplacian pyramids are formed from the Gaussian pyramids.\n" ], "id": "tuErXYIgnpOK" }, { "cell_type": "code", "metadata": { "id": "sp9LUN2Yl5O8" }, "source": [ "# Read input image\n", "image = cv2.imread(f'{image_directory}stop_sign.jpg')\n", "layer = image.copy()\n", "\n", "# Build and plot four pyramid levels\n", "# (each cv2.pyrDown call roughly halves the width and height)\n", "for i in range(4):\n", " layer = cv2.pyrDown(layer)\n", " plot_images(layer)" ], "id": "sp9LUN2Yl5O8", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "6HqwlF6qdbmV" }, "source": [ "### **8. Contours**\n", "\n", "A contour is simply a curve joining all the continuous points along a boundary that have the same color or intensity. Contours are a useful tool for shape analysis and for object detection and recognition.\n", "\n", "OpenCV provides the methods `cv.findContours()` and `cv.drawContours()` for this.\n", "\n", "In OpenCV, finding contours is like finding a white object on a black background.\n", "\n" ], "id": "6HqwlF6qdbmV" }, { "cell_type": "code", "metadata": { "id": "Cd3wVoYydbGj" }, "source": [ "# Read the image\n", "image = cv2.imread(f'{image_directory}stop_sign.jpg')\n", "\n", "# Keep an untouched copy, because drawing contours modifies the image in place\n", "org_image = image.copy()" ], "id": "Cd3wVoYydbGj", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "XZFr0Qotaq-t" }, "source": [ "# Convert the image from color to gray\n", "imgray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)\n", "\n", "# Apply a binary threshold (type 0 is cv2.THRESH_BINARY)\n", "ret, thresh = cv2.threshold(imgray, 127, 255, 0)" ], "id": "XZFr0Qotaq-t", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "D7VYK9Nvf2UP" }, "source": [ "# Find contours on the image\n", "contours, hierarchy = cv2.findContours(thresh, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)\n", "\n", "# Draw all the contours (index -1) in green with a thickness of 3\n", "contour_img = 
cv2.drawContours(image, contours, -1, (0,255,0), 3)\n", "\n", "# Plot the image\n", "plot_images(org_image, contour_img)" ], "id": "D7VYK9Nvf2UP", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "tKQL7dp6gU2S" }, "source": [ "### **9. Histogram**\n", "\n", "You can think of a histogram as a graph or plot that gives an overall idea of the intensity distribution of an image. It is a plot with pixel values (usually ranging from 0 to 255) on the X-axis and the corresponding number of pixels in the image on the Y-axis.\n", "\n", "
\n", "
\n", "\n", "`cv.calcHist()` is used for calculating the histogram of an image." ], "id": "tKQL7dp6gU2S" }, { "cell_type": "markdown", "metadata": { "id": "HJXD_TrdhtPO" }, "source": [ "Here, let us use the matplotlib library to calculate and plot the histogram." ], "id": "HJXD_TrdhtPO" }, { "cell_type": "code", "metadata": { "id": "Tkww9ZS2f7EQ" }, "source": [ "# Read the input image as grayscale\n", "img = cv2.imread(f'{image_directory}stop_sign.jpg',0)\n", "\n", "# Plot the image\n", "plot_images(img)\n", "\n", "# Calculate and plot the histogram with 256 bins over the range 0 to 256\n", "plt.hist(img.ravel(), bins=256, range=(0,256))\n", "plt.show()" ], "id": "Tkww9ZS2f7EQ", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "cooslghpjK5D" }, "source": [ "\n", "\n", "---\n", "\n" ], "id": "cooslghpjK5D" }, { "cell_type": "code", "metadata": { "id": "-Rtijwd4M9fx" }, "source": [ "" ], "id": "-Rtijwd4M9fx", "execution_count": null, "outputs": [] } ] }