{"id":38927,"date":"2022-12-29T13:15:41","date_gmt":"2022-12-29T13:15:41","guid":{"rendered":"https:\/\/www.askpython.com\/?p=38927"},"modified":"2023-05-12T04:17:12","modified_gmt":"2023-05-12T04:17:12","slug":"combine-csv-files-using-python","status":"publish","type":"post","link":"https:\/\/www.askpython.com\/python-modules\/pandas\/combine-csv-files-using-python","title":{"rendered":"How to combine CSV files using Python?"},"content":{"rendered":"\n<p>Often while working with CSV files, we need to deal with large datasets. Depending on the requirements of the data analysis, we may find that all the required data is not present in a single CSV file. Then the need arises to merge multiple files to get the desired data. However, copy-pasting the required columns from one file to another and that too from large datasets is not the best way to around it. <\/p>\n\n\n\n<p>To solve this problem, we will learn how to use the <code>append<\/code>, <code>merge<\/code> and <code>concat<\/code> methods from Pandas to combine CSV files.  <\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Combining Multiple CSV Files together<\/h2>\n\n\n\n<p>To begin with, let&#8217;s create sample CSV files that we will be using.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"277\" height=\"122\" src=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2022\/12\/csv-file-1.png\" alt=\"Csv File 1\" class=\"wp-image-38928\"\/><figcaption class=\"wp-element-caption\">CSV File 1<\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"278\" height=\"122\" src=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2022\/12\/csv-file-2.png\" alt=\"Csv File 2\" class=\"wp-image-38929\"\/><figcaption class=\"wp-element-caption\">CSV File 2<\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"276\" height=\"122\" src=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2022\/12\/csv-file-3.png\" alt=\"Csv File 3\" class=\"wp-image-38930\"\/><figcaption class=\"wp-element-caption\">CSV File 3<\/figcaption><\/figure>\n\n\n\n<p>Notice that, all three files have the same columns or headers i.e. <strong>&#8216;name&#8217;<\/strong>, <strong>&#8216;age&#8217;<\/strong> and <strong>&#8216;score&#8217;<\/strong>. Also, file 1 and file 3 have a common entry for the <strong>&#8216;name&#8217;<\/strong> column which is <em>Sam<\/em>, but the rest of the values are different in these files.<\/p>\n\n\n\n<p>Note that, in the below examples we are considering that all the CSV files are in the same folder as your Python code file. If this is not the case for you, please specify the paths accordingly while trying out the examples by yourself. <br>All the examples were executed in a Jupyter notebook.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Different Ways to Combine CSV Files in Python<\/h2>\n\n\n\n<p>Before starting, we will be creating a list of the CSV files that will be used in the examples below as follows:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport glob\n\n# list all csv files only\ncsv_files = glob.glob(&#039;*.{}&#039;.format(&#039;csv&#039;))\ncsv_files\n<\/pre><\/div>\n\n\n<p><strong>Output:<\/strong><\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\n&#x5B;&#039;csv_file_1.csv&#039;, &#039;csv_file_2.csv&#039;, &#039;csv_file_3.csv&#039;]\n<\/pre><\/div>\n\n\n<h3 class=\"wp-block-heading\">Method 1: append()<\/h3>\n\n\n\n<p>Let&#8217;s look at the <a href=\"https:\/\/www.askpython.com\/python\/array\/append-an-array-in-python\" data-type=\"post\" data-id=\"6600\">append method<\/a> here to merge the three CSV files.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport pandas as pd\n\ndf_csv_append = pd.DataFrame()\n\n# append the CSV files\nfor file in csv_files:\n    df = pd.read_csv(file)\n    df_csv_append = df_csv_append.append(df, ignore_index=True)\n\ndf_csv_append\n<\/pre><\/div>\n\n\n<p><strong>Output:<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"181\" height=\"235\" src=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2022\/12\/output-append-and-concat.png\" alt=\"Output Append And Concat\" class=\"wp-image-38931\"\/><\/figure>\n\n\n\n<p>The <code>append<\/code> method, as the name suggests, appends each file&#8217;s data frame to the end of the previous one.  In the above code, we first create a data frame to store the result named <strong>df_csv_append<\/strong>. Then, we iterate through the list and read each CSV file and append it to the data frame <strong>df_csv_append<\/strong>. <\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\">Method 2: concat()<\/h3>\n\n\n\n<p>Another method used to combine CSV files is the <a href=\"https:\/\/www.askpython.com\/python\/string\/string-concatenation-in-python\" data-type=\"post\" data-id=\"1090\">Pandas <code>concat()<\/code> method<\/a>. This method requires a series of objects as a parameter, hence we first create a series of the data frame objects of each CSV file and then apply the <code>concat()<\/code> method to it.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport pandas as pd\n\ndf_csv_concat = pd.concat(&#x5B;pd.read_csv(file) for file in csv_files ], ignore_index=True)\ndf_csv_concat\n<\/pre><\/div>\n\n\n<p>An easier-to-understand way of writing this code is:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nl = &#x5B;]\n\nfor f in csv_files:\n    l.append(pd.read_csv(f))\n    \ndf_res = pd.concat(l, ignore_index=True)\ndf_res\n<\/pre><\/div>\n\n\n<p>Both the above codes when executed produce the same output as shown below.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"181\" height=\"235\" src=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2022\/12\/output-append-and-concat-1.png\" alt=\"Output Append And Concat 1\" class=\"wp-image-38932\"\/><\/figure>\n\n\n\n<p>Notice that the resulting data frame is the same as that of the <code>append()<\/code> method.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\">Method 3: merge()<\/h3>\n\n\n\n<p>The <code>merge<\/code> method is used to join very large data frames. A join can be performed on two data frames at a time. We can specify the key based on which the join is to be performed. <\/p>\n\n\n\n<p>It is a good practice to choose a key that is unique for each entry in the data frame, in order to avoid duplication of rows. We can also specify the type of join we wish to perform i.e. either of <em>&#8216;inner&#8217;<\/em>, <em>&#8216;outer&#8217;<\/em>, <em>&#8216;left&#8217;<\/em>, <em>&#8216;right&#8217;<\/em> or <em>&#8216;cross&#8217; <\/em>join.<\/p>\n\n\n\n<p>We need to first <a href=\"https:\/\/www.askpython.com\/python-modules\/pandas\/pandas-read-csv-with-headers\" data-type=\"post\" data-id=\"34139\">read each CSV file<\/a> into a separate data frame.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport pandas as pd\n\ndf1 = pd.read_csv(&#039;csv_file_1.csv&#039;)\ndf2 = pd.read_csv(&#039;csv_file_2.csv&#039;)\ndf3 = pd.read_csv(&#039;csv_file_3.csv&#039;)\n<\/pre><\/div>\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"202\" height=\"160\" src=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2022\/12\/df1.png\" alt=\"Df1\" class=\"wp-image-38933\"\/><figcaption class=\"wp-element-caption\">df1<\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"221\" height=\"172\" src=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2022\/12\/df2.png\" alt=\"Df2\" class=\"wp-image-38934\"\/><figcaption class=\"wp-element-caption\">df2<\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"198\" height=\"175\" src=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2022\/12\/df3.png\" alt=\"Df3\" class=\"wp-image-38935\"\/><figcaption class=\"wp-element-caption\">df3<\/figcaption><\/figure>\n\n\n\n<p><strong>Joining df1 and df2:<\/strong><\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\ndf_merged = df1.merge(df2, how=&#039;outer&#039;)\ndf_merged\n<\/pre><\/div>\n\n\n<p><strong>Output:<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"202\" height=\"187\" src=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2022\/12\/merge-output-1.png\" alt=\"Merge Output 1\" class=\"wp-image-38936\"\/><figcaption class=\"wp-element-caption\">Merge Output 1<\/figcaption><\/figure>\n\n\n\n<p>Joining df1 and df3 based on the key &#8216;name&#8217;.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\ndf_merged = df1.merge(df3, on=&quot;name&quot;, how=&#039;outer&#039;)\ndf_merged\n<\/pre><\/div>\n\n\n<p><strong>Output:<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"362\" height=\"151\" src=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2022\/12\/merge-output-2.png\" alt=\"Merge Output 2\" class=\"wp-image-38937\" srcset=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2022\/12\/merge-output-2.png 362w, https:\/\/www.askpython.com\/wp-content\/uploads\/2022\/12\/merge-output-2-300x125.png 300w\" sizes=\"auto, (max-width: 362px) 100vw, 362px\" \/><figcaption class=\"wp-element-caption\">Merge Output 2<\/figcaption><\/figure>\n\n\n\n<p>df1 and df3, both have an entry for the name &#8216;Sam&#8217; and the age and score values for both of them are different. Hence, in the resulting data frame, there are columns for representing the entries from both df1 and df3.  Since <em>John <\/em>and <em>Bob<\/em> are not common in the data frames df1 and df3, their values are NaN wherever applicable.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p>In this article, we learned about the Pandas methods namely <strong>concat<\/strong>, <strong>merge <\/strong>and <strong>append <\/strong>and how to use them to combine CSV files using Python.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">References<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Pandas append official documentation<\/li>\n\n\n\n<li><a href=\"https:\/\/pandas.pydata.org\/docs\/reference\/api\/pandas.concat.html\" target=\"_blank\" rel=\"noreferrer noopener\">Pandas concat official documentation<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/pandas.pydata.org\/docs\/reference\/api\/pandas.merge.html\" target=\"_blank\" rel=\"noreferrer noopener\">Pandas merge official documentation<\/a><\/li>\n<\/ul>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Often while working with CSV files, we need to deal with large datasets. Depending on the requirements of the data analysis, we may find that all the required data is not present in a single CSV file. Then the need arises to merge multiple files to get the desired data. However, copy-pasting the required columns [&hellip;]<\/p>\n","protected":false},"author":46,"featured_media":38939,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[94],"tags":[],"class_list":["post-38927","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-pandas"],"blocksy_meta":[],"_links":{"self":[{"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/posts\/38927","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/users\/46"}],"replies":[{"embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/comments?post=38927"}],"version-history":[{"count":0,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/posts\/38927\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/media\/38939"}],"wp:attachment":[{"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/media?parent=38927"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/categories?post=38927"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/tags?post=38927"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}