
I have a client-shared feed of 100 GB, split across 10 CSV files of 10 GB each.

Parsing all of the files to create one final feed file takes more than a day to complete.

So I parse multiple CSV files in parallel using a Python multiprocessing pool.

I tested the code below on three files totalling 30 GB, and it takes around 10 minutes to complete.

Can somebody look into my code and help me improve it to parse faster, or suggest a better way to parse the files?

# -*- coding: UTF-8 -*-
from multiprocessing import Pool
import time
import csv
import codecs

def unicode_csv_reader(csvfile,dialect=csv.excel, **kwargs):
    with open(csvfile) as f:
        for row in csv.reader(codecs.iterencode(codecs.iterdecode(f,"utf-8"), "utf-8"),quotechar='"',delimiter=',',quoting=csv.QUOTE_ALL, skipinitialspace=True,dialect=dialect, **kwargs):
            yield [e.decode("utf-8") for e in row]


def process_file(name):
    ''' Process one file:'''
    csv_reader=unicode_csv_reader(name)
    for row in csv_reader:
        if row is not None and len(row) != 0 and row[1]=="in stock" and row[18]=="Book":
            linePrint=row[0]+"\t"+row[6]+"\t"+row[12]+"\t"+row[4]+"\t"+row[17]+"\t"+row[17]+"\t"+row[10]+"\t"+row[9]+"\t"+"\t"+row[18]+"\t"+row[18]+"\t"+row[8]+"\t"+row[8]+"\t\t"
            print linePrint.encode("utf-8")


def process_files_parallel():
    ''' Process each file in parallel via Pool.map() '''
    pool=Pool(processes=4)
    results=pool.map(process_file, ["t1.csv","t2.csv","t3.csv"])
    return results


if __name__ == '__main__':

    start=time.time()
    res=process_files_parallel()
    print res

I'm running this script on my Ubuntu machine like this:

python multiprocessfiles.py > finalfeed.csv

Sample data from the client feed:

"id", "availability", "condition", "description", "image_link", "link", "title", "brand", "google_product_category", "price", "sale_price", "currency", "android_url", "android_app_name", "android_package", "discount_percentage","discount_value", "category", "super_category"
"5780705772161","in stock","new","(ise) Genetics: Analysis Of Eenes And Genomics","https://rukminim1.client.com/image/600/600/jeiukcw0/book/9/8/2/medicinal-inorganic-chemistry-original-imaf37yeyhyhzwfm.jpeg?q=90","http://www.client.com/ise-genetics-analysis-eenes-genomics/p/itmd32spserbxyhf?pid=5780705772161&marketplace=client&cmpid=content_appretar_BooksMedia_Book","(ise) Genetics: Analysis Of Eenes And Genomics","W. Jones","Books","3375","1893","INR","client://fk.dl/de_wv_CL%7Csem_--_http%3A%2F%2Fwww.client.com%2Fise-genetics-analysis-eenes-genomics%2Fp%2Fitmd32spserbxyhf~q~pid%3D5780705772161%26marketplace%3Dclient_--_cmpid_--_content_appretar_BooksMedia_Book","client","com.client.android","43","1482","BooksMedia","Book"
  • Just to confirm, the order of the lines from the various files does not matter? Commented Aug 20, 2019 at 17:23
  • @illiteratecoder Yes, the order of the lines doesn't matter; I just want all the lines to end up in one final feed file, so line 10 from t1.csv or line 5 from t2.csv can land anywhere in the final file. Commented Aug 20, 2019 at 17:30
  • @Chethu Can you post a short example of what your data looks like? Commented Aug 20, 2019 at 17:31
  • @SebastianWaldbauer Added sample client data Commented Aug 20, 2019 at 17:34
  • You may want to upgrade to Python 3 to begin with, to save on that manual UTF-8 roundtripping you're doing with the csv module. Commented Aug 20, 2019 at 17:36
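
A minimal sketch of what that reader could look like on Python 3, where the csv module works with text directly so the manual encode/decode round trip goes away (a hypothetical rewrite, assuming UTF-8 input files):

import csv

def unicode_csv_reader(csvfile, dialect=csv.excel, **kwargs):
    # Python 3: open the file as UTF-8 text and let csv handle it natively.
    with open(csvfile, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, quotechar='"', delimiter=',',
                              skipinitialspace=True, dialect=dialect, **kwargs):
            yield row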

1 Answer

While not exactly answering your question, this should be doable in dask. It processes in parallel by default. Reading multiple files in parallel is as simple as this:

import dask.dataframe as dd
df = dd.read_csv('t*.csv')

More details can be found in the dask documentation.
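
If the filtering from the question is needed as well, a rough sketch of how it could look in dask (assuming the column names from the sample header, that every file shares that header, and an illustrative, incomplete column list for the output) might be:

import dask.dataframe as dd

# Read all feed files; dtype=str keeps ids and prices as text, like the csv module does.
df = dd.read_csv('t*.csv', dtype=str, skipinitialspace=True)

# Keep only in-stock books, mirroring the row[1] / row[18] checks in the question.
books = df[(df['availability'] == 'in stock') & (df['super_category'] == 'Book')]

# Write the selected columns out as tab-separated files, one per partition.
books[['id', 'title', 'image_link', 'price', 'category']].to_csv(
    'finalfeed-*.csv', sep='\t', index=False)

The column list above is only illustrative; the real feed would need the full set of fields built in process_file.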


3 Comments

Thanks for the answer, but I can't install modules like dask on the production machine. I want to achieve this in pure Python, like the code above.
Got it, but for performance reasons I would resort to tested tools like bash: cat t*.csv > ./merged.csv combines three 10 GB files in under 2 min on Cygwin. On native Linux it would be even faster. That is still around 5 times faster than the code above.
I cannot merge the files with cat like that, because I need only certain columns from the client feed, I have to validate that all the basic product details are present for the PID, and I need to keep only products where the price is greater than 1000, excluding everything else from the final feed.
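
For reference, that extra validation could stay pure Python as a small predicate used alongside process_file; a sketch only, with the column indices assumed from the sample header (1 = availability, 9 = price, 18 = super_category):

def keep_row(row):
    # Sketch only: indices assume the sample header; adjust to the real feed layout.
    if len(row) < 19 or row[1] != "in stock" or row[18] != "Book":
        return False
    # Basic product details must be present for the PID.
    if not (row[0] and row[6] and row[9]):
        return False
    try:
        return float(row[9]) > 1000   # keep only products priced above 1000
    except ValueError:
        return False                  # non-numeric price: exclude the row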
