{"id":10516,"date":"2020-11-30T21:20:17","date_gmt":"2020-11-30T21:20:17","guid":{"rendered":"https:\/\/www.askpython.com\/?p=10516"},"modified":"2022-07-13T06:10:04","modified_gmt":"2022-07-13T06:10:04","slug":"clean-csv-data-python","status":"publish","type":"post","link":"https:\/\/www.askpython.com\/python\/examples\/clean-csv-data-python","title":{"rendered":"How to clean CSV data in Python?"},"content":{"rendered":"\n<p>Effective implementation of Machine learning algorithms or even when doing some data analysis on a dataset, we require clean data. <\/p>\n\n\n\n<p><em>There&#8217;s a well-known saying about numerical modeling with data, &#8220;Trash in Trash out&#8221; we can&#8217;t expect decent results when our data isn&#8217;t clean. <\/em><\/p>\n\n\n\n<p>in this article, we&#8217;ll explore common techniques we can use to clean CSV data using the <a class=\"rank-math-link\" href=\"https:\/\/www.askpython.com\/python-modules\/pandas\/python-pandas-module-tutorial\">python pandas library<\/a>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">CSV Data Cleaning Checks<\/h2>\n\n\n\n<p>We&#8217;ll clean data based on the following:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Missing Values<\/li><li>Outliers<\/li><li>Duplicate Values<\/li><\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">1. Cleaning Missing Values in CSV File<\/h2>\n\n\n\n<p>In Pandas, a missing value is usually denoted by <code>NaN<\/code> , since it is based on the <a href=\"https:\/\/www.askpython.com\/python-modules\/numpy\/python-numpy-module\" class=\"rank-math-link\">NumPy package<\/a> it is the special floating-point NaN value particular to NumPy. <\/p>\n\n\n\n<p>You can find the dataset used in this article <a aria-label=\"here (opens in a new tab)\" href=\"https:\/\/datasets.imdbws.com\/\" target=\"_blank\" rel=\"noreferrer noopener nofollow\" class=\"rank-math-link\">here<\/a>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Finding Missing Values<\/h3>\n\n\n\n<p>Let&#8217;s first see how we can find if there&#8217;s a missing value in our data.<\/p>\n\n\n\n<p><strong>#Approach 1: visually<\/strong><\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport pandas as pd\nimport seaborn as sb\nimport matplotlib.pyplot as plt\n\n#importing Dataset\ndf = pd.read_csv(&#039;IMDB-Movie-Data.csv&#039;)\n\n#Creating the heatmap\nplt.figure(figsize = (8,6))\nsb.heatmap(df.isnull(), cbar=False , cmap = &#039;magma&#039;)\n<\/pre><\/div>\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"2400\" height=\"1800\" src=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/11\/Missing-values-using-Heatmap.jpeg\" alt=\"Missing Values Using Heatmap\" class=\"wp-image-10573\" srcset=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/11\/Missing-values-using-Heatmap.jpeg 2400w, https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/11\/Missing-values-using-Heatmap-300x225.jpeg 300w, https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/11\/Missing-values-using-Heatmap-1024x768.jpeg 1024w, https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/11\/Missing-values-using-Heatmap-768x576.jpeg 768w, https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/11\/Missing-values-using-Heatmap-1536x1152.jpeg 1536w, https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/11\/Missing-values-using-Heatmap-2048x1536.jpeg 2048w, https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/11\/Missing-values-using-Heatmap-160x120.jpeg 160w, https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/11\/Missing-values-using-Heatmap-320x240.jpeg 320w, https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/11\/Missing-values-using-Heatmap-1600x1200.jpeg 1600w\" sizes=\"auto, (max-width: 2400px) 100vw, 2400px\" \/><figcaption>Missing Values Using the Heatmap<\/figcaption><\/figure>\n\n\n\n<p>The <code>isnull()<\/code> method returns boolean values indicating if there&#8217;s a missing value in the data. <\/p>\n\n\n\n<p>However, this process could be limited to only medium to small datasets.<\/p>\n\n\n\n<p><strong>#Approach 2<\/strong><\/p>\n\n\n\n<p>We can use <code>.sum()<\/code> method after applying <code>.isnull()<\/code>, this will return the sum of missing values within each column in the data frame.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport pandas as pd\n\n#importing dataset\ndf = pd.read_csv(&#039;IMDB-Movie-Data.csv&#039;)\n\ndf.isnull().sum()\n<\/pre><\/div>\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"560\" height=\"346\" src=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/11\/finding-sum-of-missing-values.jpg\" alt=\"Finding Sum Of Missing Values\" class=\"wp-image-10577\" srcset=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/11\/finding-sum-of-missing-values.jpg 560w, https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/11\/finding-sum-of-missing-values-300x185.jpg 300w\" sizes=\"auto, (max-width: 560px) 100vw, 560px\" \/><figcaption>Finding Sum Of Missing Values<\/figcaption><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">Cleaning Missing Values from Data<\/h3>\n\n\n\n<p>We found that our dataset does have some missing values in it, what should we do next to get clean data?<\/p>\n\n\n\n<p>We can either drop the rows and columns containing missing values in them or replace the missing values with appropriate value i.e. mean, median, or mode.<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li><strong>Dropping Missing Values:<\/strong><\/li><\/ol>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\n#imporing pandas\nimport pandas as pd\n\n#Importing dataset\ndf = pd.read_csv(&#039;IMDB-Movie-Data.csv&#039;)\n\n#Size of original dataset\nprint(df.shape)\n\n#Dropping the missing rows.\ndf_dropped = df.dropna(how = &#039;any&#039;)\n\n<\/pre><\/div>\n\n\n<p>The above code will drop the rows from the dataframe having missing values.<\/p>\n\n\n\n<p>Let&#8217;s look at <code>.dropna()<\/code> method in detail:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>df.dropna()  &#8211; Drop all rows that have any NaN values<\/li><li>df.dropna(how=&#8217;all&#8217;)  &#8211; Drop only if ALL columns are NaN<\/li><li>df.dropna(thresh=2)  &#8211; Drop row if it does not have at least two values that are <strong>not<\/strong> NaN<\/li><li>df.dropna(subset=[1])  &#8211; Drop only if NaN in specific column<\/li><\/ul>\n\n\n\n<p>One must be careful when considering dropping the missing values as it might affect the quality of the dataset.<\/p>\n\n\n\n<p>2. <strong>Replacing Missing values<\/strong><\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport pandas as pd\n\n#importing the dataset\ndf = pd.read_csv(&#039;IMDB-Movie-Data.csv&#039;)\n\n#Creating a copy of dataframe\ndf_new = df\n\ndf_new&#x5B;&#039;Metascore&#039;] = df_new&#x5B;&#039;Metascore&#039;].fillna((df_new&#x5B;&#039;Metascore&#039;].mean()))\n\n#printing the dataframes after replacing null values\nprint(df_new.isna().sum())\nprint(df.isna().sum())\n<\/pre><\/div>\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"571\" height=\"238\" src=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/11\/before-and-after-filling-null-values.jpg\" alt=\"Before And After Filling Null Values\" class=\"wp-image-10586\" srcset=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/11\/before-and-after-filling-null-values.jpg 571w, https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/11\/before-and-after-filling-null-values-300x125.jpg 300w\" sizes=\"auto, (max-width: 571px) 100vw, 571px\" \/><figcaption>Before And After Filling Null Values<\/figcaption><\/figure>\n\n\n\n<p>Pandas module has the .fillna() method, which accepts a value that we want to replace in place of NaN values. We just calculated the mean of the column and passed it as an input argument to <code>fillna()<\/code> method.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">2. Dealing with Outliers<\/h2>\n\n\n\n<p>Outliers can change the course of entire predictions therefore it is essential we <a href=\"https:\/\/www.askpython.com\/python\/examples\/detection-removal-outliers-in-python\" class=\"rank-math-link\">detect and remove outliers<\/a>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Using Z-Score<\/strong><\/h3>\n\n\n\n<p>Let&#8217;s detect outliers in the Votes column in our dataset and filter the outliers using a z-score. <\/p>\n\n\n\n<p>The idea behind this method lies in the fact that values lying 3 <a class=\"rank-math-link\" href=\"https:\/\/www.askpython.com\/python\/examples\/mean-and-standard-deviation-python\">standard deviations<\/a> away from the mean will be termed an Outlier.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\n#importing required modules\nimport pandas as pd\nimport numpy as np\nfrom scipy import stats\n\n#importing dataset\ndf = pd.read_csv(&#039;IMDB-Movie-Data.csv&#039;)\n\n#filtering outliers\ndf_new = df&#x5B;(np.abs(stats.zscore(df.Votes)) &lt; 3)]\n<\/pre><\/div>\n\n\n<p>The column on which this method is applied should be a numerical variable and not categorical. <\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Using Quantiles<\/strong><\/h3>\n\n\n\n<p>By this method values falling below 0.01 quantile and above 0.99 quantiles in the series will be filtered out.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\n#importing required modules\nimport pandas as pd\nimport numpy as np\nfrom scipy import stats\n\n#importing dataset\ndf = pd.read_csv(&#039;IMDB-Movie-Data.csv&#039;)\n\n#Selecting limits\nq_low = df&#x5B;&quot;Votes&quot;].quantile(0.01)\nq_hi  = df&#x5B;&quot;Votes&quot;].quantile(0.99)\n\n#filtering outliers\ndf_filtered = df&#x5B;(df&#x5B;&quot;Votes&quot;] &lt; q_hi) &amp; (df&#x5B;&quot;Votes&quot;] &gt; q_low)]\n\n\n<\/pre><\/div>\n\n\n<h2 class=\"wp-block-heading\">3. Dealing with Duplicate entries<\/h2>\n\n\n\n<p>We can check for any duplicates in a DataFrame using <strong><code>.duplicated()<\/code><\/strong> method. This returns a Pandas Series and not a DataFrame. <\/p>\n\n\n\n<p>To check duplicate values in a specific column we can provide the column name as an input argument into the <code>.duplicated( )<\/code> method.<\/p>\n\n\n\n<p>Let&#8217;s see this in action.<\/p>\n\n\n\n<p>Luckily we have no duplicate values in our data frame, so we will append some values from the data frame itself to create duplicate values.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\n#Import the required modules\nimport pandas as pd\nimport numpy as np\n\n#importing and appending similar rows to of the dataframe\ndf = pd.read_csv(&#039;IMDB-Movie-Data.csv&#039;)\ndf1 = df.append(df.iloc&#x5B;20:30,:])\n\ndf1.duplicated().sum()\n<\/pre><\/div>\n\n\n<pre class=\"wp-block-preformatted\"><strong>Output<\/strong>:\n10<\/pre>\n\n\n\n<p>Now, <code>.drop_duplicates()<\/code> method is used to drop the duplicate values from the dataframe.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\n#Importing the required modules\n#Import the required modules\nimport pandas as pd\nimport numpy as np\n\n#importing and appending similar rows to of the dataframe\ndf = pd.read_csv(&#039;IMDB-Movie-Data.csv&#039;)\ndf1 = df.append(df.iloc&#x5B;20:30,:])\n\n#dropping the duplicates\ndf1 = df1.drop_duplicates()\n\n#checking the duplicates \ndf1.duplicated().sum()\n<\/pre><\/div>\n\n\n<pre class=\"wp-block-preformatted\"><strong>output<\/strong>:\n0<\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Summary<\/h2>\n\n\n\n<p>CSV data cleaning in Python is easy with pandas and the NumPy module. Always perform data cleaning before running some analysis over it to make sure the analysis is correct.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What&#8217;s Next?<\/h2>\n\n\n\n<ul class=\"wp-block-list\"><li><a href=\"https:\/\/www.askpython.com\/python\/examples\/data-analysis-in-python\" data-type=\"post\" data-id=\"24880\">Python Data Analysis<\/a><\/li><li><a href=\"https:\/\/www.askpython.com\/python-modules\/numpy\/python-numpy-module\" data-type=\"post\" data-id=\"7694\">NumPy module<\/a><\/li><li><a href=\"https:\/\/www.askpython.com\/python\/data-analytics-vs-data-science\" data-type=\"post\" data-id=\"24957\">Data Analytics vs Data Science<\/a><\/li><\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Resources<\/h2>\n\n\n\n<ul class=\"wp-block-list\"><li><a href=\"https:\/\/pandas.pydata.org\/\" target=\"_blank\" rel=\"noopener\">Pandas Official Site<\/a><\/li><li><a href=\"https:\/\/numpy.org\/doc\/stable\/\" target=\"_blank\" rel=\"noopener\">Numpy.org Documentation<\/a><\/li><\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Effective implementation of Machine learning algorithms or even when doing some data analysis on a dataset, we require clean data. There&#8217;s a well-known saying about numerical modeling with data, &#8220;Trash in Trash out&#8221; we can&#8217;t expect decent results when our data isn&#8217;t clean. in this article, we&#8217;ll explore common techniques we can use to clean [&hellip;]<\/p>\n","protected":false},"author":16,"featured_media":10991,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[9],"tags":[],"class_list":["post-10516","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-examples"],"blocksy_meta":[],"_links":{"self":[{"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/posts\/10516","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/users\/16"}],"replies":[{"embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/comments?post=10516"}],"version-history":[{"count":0,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/posts\/10516\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/media\/10991"}],"wp:attachment":[{"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/media?parent=10516"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/categories?post=10516"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/tags?post=10516"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}