21

There are multiple SO questions addressing some form of this topic, but they all seem terribly inefficient for removing only a single row from a csv file (usually they involve copying the entire file). If I have a csv formatted like so:

fname,lname,age,sex
John,Doe,28,m
Sarah,Smith,27,f
Xavier,Moore,19,m

What is the most efficient way to remove Sarah's row? If possible, I would like to avoid copying the entire file.

14
  • 2
    Does this have to be in Python? There are better suited tools. Commented Apr 9, 2018 at 12:21
  • I'm open to suggestions, but would prefer to stay within python. Commented Apr 9, 2018 at 12:22
  • 1
    Agree with kabanus, if this is all you want to do, why not use sed/awk/grep. If you want to do something else, then iterating over the file in Python is probably necessary anyway, so the naive approach is fine. Commented Apr 9, 2018 at 12:32
  • 2
    BTW you can open() with r+ to read and write to the same file Commented Apr 9, 2018 at 12:32
  • 9
    By the way, this is why people put data like this in a database, not a CSV. Just saying... Commented Apr 9, 2018 at 16:10

7 Answers

28

You have a fundamental problem here. No current filesystem (that I am aware of) provides a facility to remove a bunch of bytes from the middle of a file. You can overwrite existing bytes, or write a new file. So, your options are:

  • Create a copy of the file without the offending line, delete the old one, and rename the new file in place. (This is the option you want to avoid).
  • Overwrite the bytes of the line with something that will be ignored. Depending on exactly what is going to read the file, a comment character might work, or spaces might work (or possibly even \0). If you want to be completely generic though, this is not an option with CSV files, because there is no defined comment character.
  • As a last desperate measure, you could:
    • read up to the line you want to remove
    • read the rest of the file into memory
    • overwrite the line and all subsequent lines with the data you want to keep
    • and truncate the file at the final position (filesystems usually allow this).

The last option obviously doesn't help much if you are trying to remove the first line (but it is handy if you want to remove a line near the end). It is also horribly vulnerable to crashing in the middle of the process.
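For illustration, here is a minimal sketch of that last option (the function name and the byte-string predicate are placeholders, and the answers below flesh out the same idea):

def remove_line_in_place(path, match):
    with open(path, "r+b") as f:  # binary mode keeps seek/tell arithmetic exact
        pos = f.tell()
        line = f.readline()
        while line and not match(line):  # scan up to the line to remove
            pos = f.tell()
            line = f.readline()
        if line:  # found it
            rest = f.read()  # read everything after the doomed line into memory
            f.seek(pos)      # rewind to where that line started
            f.write(rest)    # overwrite it and all subsequent lines
            f.truncate()     # cut off the leftover tail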


4 Comments

Thanks, this is very helpful. Why exactly is the last option prone to crashing? And is that the method @kabanus is employing in his answer?
@SamBG It is not prone to crashing, but if it does crash during the write (because of a power cut, for example), the results will be catastrophic. And yes, it is the method that kabanus is using.
@MartinBonner It's subtler than that. Most file systems allocate a file's space in clusters joined in a doubly linked list, so an insertion or deletion in the middle can leave the file only partially rewritten if interrupted. Databases rely on exactly this functionality, but implementing such an algorithm so that it is stable against failures is rather laborious. Memory-mapped files can sometimes do this work for you (if the virtual memory manager is smart enough).
@SamBG Please read docs.python.org/3.0/library/mmap.html - it has an example that almost covers your case.
4

Editing files in-place is a task riddled with gotchas (much like modifying an iterable while iterating over it) and is usually not worth the trouble. In most cases, writing to a temporary file (or to working memory, depending on which you have more of - storage space or RAM), then deleting the source file and replacing it with the temporary file, will be just as performant as attempting to do the same thing in-place.

But, if you insist, here's a generalized solution:

import os

def remove_line(path, comp):
    with open(path, "r+b") as f:  # open the file in rw mode
        mod_lines = 0  # hold the overwrite offset
        while True:
            last_pos = f.tell()  # keep the last line position
            line = f.readline()  # read the next line
            if not line:  # EOF
                break
            if mod_lines:  # we've already encountered what we search for
                f.seek(last_pos - mod_lines)  # move back to the beginning of the gap
                f.write(line)  # fill the gap with the current line
                f.seek(mod_lines, os.SEEK_CUR)  # move forward til the next line start
            elif comp(line):  # search for our data
                mod_lines = len(line)  # store the offset when found to create a gap
        f.seek(last_pos - mod_lines)  # seek back the extra removed characters
        f.truncate()  # truncate the rest

This will remove only the first line matching the provided comparison function, then iterate over the rest of the file, shifting the data over the 'removed' line. You won't need to load the rest of the file into your working memory, either. To test it, with test.csv containing:

fname,lname,age,sex
John,Doe,28,m
Sarah,Smith,27,f
Xavier,Moore,19,m

You can run it as:

remove_line("test.csv", lambda x: x.startswith(b"Sarah"))

And you'll get test.csv with the in-place removed Sarah line:

fname,lname,age,sex
John,Doe,28,m
Xavier,Moore,19,m

Keep in mind that we're passing a bytes comparison function as the file is opened in binary mode to keep consistent line breaks while truncating/overwriting.

UPDATE: I was interested in the actual performance of the various techniques presented here, but I didn't have time to test them yesterday, so with a bit of a delay I've created a benchmark that could shed some light on it. If you're interested only in the results, scroll all the way down. First I'll explain what I was benchmarking and how I set up the test. I'll also provide all the scripts, so you can run the same benchmark on your system.

As for the what: I've tested all of the techniques mentioned in this and other answers, namely line replacement using a temporary file (the temp_file_* functions) and in-place editing (the in_place_* functions). I have both of those set up in streaming (reading line by line, the *_stream functions) and memory (reading the rest of the file into working memory, the *_wm functions) modes. I've also added an in-place line deletion technique using the mmap module (the in_place_mmap function). The benchmarked script, containing all the functions as well as a small bit of logic to be controlled through the CLI, is as follows:

#!/usr/bin/env python

import mmap
import os
import shutil
import sys
import time

def get_temporary_path(path):  # use tempfile facilities in production
    folder, filename = os.path.split(path)
    return os.path.join(folder, "~$" + filename)

def temp_file_wm(path, comp):
    path_out = get_temporary_path(path)
    with open(path, "rb") as f_in, open(path_out, "wb") as f_out:
        while True:
            line = f_in.readline()
            if not line:
                break
            if comp(line):
                f_out.write(f_in.read())
                break
            else:
                f_out.write(line)
        f_out.flush()
        os.fsync(f_out.fileno())
    shutil.move(path_out, path)

def temp_file_stream(path, comp):
    path_out = get_temporary_path(path)
    not_found = True  # a flag to stop comparison after the first match, for fairness
    with open(path, "rb") as f_in, open(path_out, "wb") as f_out:
        while True:
            line = f_in.readline()
            if not line:
                break
            if not_found and comp(line):
                not_found = False  # only skip the first match, for fairness
                continue
            f_out.write(line)
        f_out.flush()
        os.fsync(f_out.fileno())
    shutil.move(path_out, path)

def in_place_wm(path, comp):
    with open(path, "r+b") as f:
        while True:
            last_pos = f.tell()
            line = f.readline()
            if not line:
                break
            if comp(line):
                rest = f.read()
                f.seek(last_pos)
                f.write(rest)
                break
        f.truncate()
        f.flush()
        os.fsync(f.fileno())

def in_place_stream(path, comp):
    with open(path, "r+b") as f:
        mod_lines = 0
        while True:
            last_pos = f.tell()
            line = f.readline()
            if not line:
                break
            if mod_lines:
                f.seek(last_pos - mod_lines)
                f.write(line)
                f.seek(mod_lines, os.SEEK_CUR)
            elif comp(line):
                mod_lines = len(line)
        f.seek(last_pos - mod_lines)
        f.truncate()
        f.flush()
        os.fsync(f.fileno())

def in_place_mmap(path, comp):
    with open(path, "r+b") as f:
        stream = mmap.mmap(f.fileno(), 0)
        total_size = len(stream)
        while True:
            last_pos = stream.tell()
            line = stream.readline()
            if not line:
                break
            if comp(line):
                current_pos = stream.tell()
                stream.move(last_pos, current_pos, total_size - current_pos)
                total_size -= len(line)
                break
        stream.flush()
        stream.close()
        f.truncate(total_size)
        f.flush()
        os.fsync(f.fileno())

if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("Usage: {} target_file.ext <search_string> [function_name]".format(__file__))
        exit(1)
    target_file = sys.argv[1]
    search_func = globals().get(sys.argv[3] if len(sys.argv) > 3 else None, in_place_wm)
    start_time = time.time()
    search_func(target_file, lambda x: x.startswith(sys.argv[2].encode("utf-8")))
    # some info for the test runner...
    print("python_version: " + sys.version.split()[0])
    print("python_time: {:.2f}".format(time.time() - start_time))

The next step is to build a tester that will run these functions in as isolated an environment as possible, trying to obtain a fair benchmark for each of them. My test is structured as:

  • Three sample data CSVs are generated as 1Mx10 matrices (~200MB files) of random numbers, with an identifiable line placed at the beginning, the middle, and the end respectively, thus generating test cases for three extreme scenarios.
  • The master sample data files are copied as temporary files (since line removal is destructive) before each test.
  • Various methods of file syncing and cache clearing are employed to ensure clean buffers before each test starts.
  • The tests are run at the highest priority (chrt -f 99) through /usr/bin/time for the measurements, since Python cannot really be trusted to accurately measure its own performance in scenarios like these.
  • At least three runs of each test are performed to smooth out unpredictable fluctuations.
  • The tests are also run in Python 2.7 and Python 3.6 (CPython) to see if there is performance consistency between versions.
  • All benchmark data is collected and saved as a CSV for future analysis.

Unfortunately, I didn't have a system at hand where I could run the test fully isolated, so my numbers are obtained from running it in a hypervisor. This means that the I/O performance is probably very skewed, but it should affect all the tests similarly, still providing comparable data. Either way, you're welcome to run this test on your own system to get results you can relate to.

I've set up a test script performing the aforementioned scenario as follows:

#!/usr/bin/env python

import collections
import os
import random
import shutil
import subprocess
import sys
import time

try:
    range = xrange  # cover Python 2.x
except NameError:
    pass

try:
    DEV_NULL = subprocess.DEVNULL
except AttributeError:
    DEV_NULL = open(os.devnull, "wb")  # cover Python 2.x

SAMPLE_ROWS = 10**6  # 1M lines
TEST_LOOPS = 3
CALL_SCRIPT = os.path.join(os.getcwd(), "remove_line.py")  # the above script

def get_temporary_path(path):
    folder, filename = os.path.split(path)
    return os.path.join(folder, "~$" + filename)

def generate_samples(path, data="LINE", rows=10**6, columns=10):  # 1Mx10 default matrix
    sample_beginning = os.path.join(path, "sample_beg.csv")
    sample_middle = os.path.join(path, "sample_mid.csv")
    sample_end = os.path.join(path, "sample_end.csv")
    separator = os.linesep
    middle_row = rows // 2
    with open(sample_beginning, "w") as f_b, \
            open(sample_middle, "w") as f_m, \
            open(sample_end, "w") as f_e:
        f_b.write(data)
        f_b.write(separator)
        for i in range(rows):
            if i == middle_row:  # place the marker line once, at the middle
                f_m.write(data)
                f_m.write(separator)
            for t in (f_b, f_m, f_e):
                t.write(",".join((str(random.random()) for _ in range(columns))))
                t.write(separator)
        f_e.write(data)
        f_e.write(separator)
    return ("beginning", sample_beginning), ("middle", sample_middle), ("end", sample_end)

def normalize_field(field):
    field = field.lower()
    while True:
        s_index = field.find('(')
        e_index = field.find(')')
        if s_index == -1 or e_index == -1:
            break
        field = field[:s_index] + field[e_index + 1:]
    return "_".join(field.split())

def encode_csv_field(field):
    if isinstance(field, (int, float)):
        field = str(field)
    escape = False
    if '"' in field:
        escape = True
        field = field.replace('"', '""')
    elif "," in field or "\n" in field:
        escape = True
    if escape:
        return ('"' + field + '"').encode("utf-8")
    return field.encode("utf-8")

if __name__ == "__main__":
    print("Generating sample data...")
    start_time = time.time()
    samples = generate_samples(os.getcwd(), "REMOVE THIS LINE", SAMPLE_ROWS)
    print("Done, generation took: {:2} seconds.".format(time.time() - start_time))
    print("Beginning tests...")
    search_string = "REMOVE"
    header = None
    results = []
    for f in ("temp_file_stream", "temp_file_wm",
              "in_place_stream", "in_place_wm", "in_place_mmap"):
        for s, path in samples:
            for test in range(TEST_LOOPS):
                result = collections.OrderedDict((("function", f), ("sample", s),
                                                  ("test", test)))
                print("Running {function} test, {sample} #{test}...".format(**result))
                temp_sample = get_temporary_path(path)
                shutil.copy(path, temp_sample)
                print("  Clearing caches...")
                subprocess.call(["sudo", "/usr/bin/sync"], stdout=DEV_NULL)
                with open("/proc/sys/vm/drop_caches", "w") as dc:
                    dc.write("3\n")  # free pagecache, inodes, dentries...
                # you can add more cache clearing/invalidating calls here...
                print("  Removing a line starting with `{}`...".format(search_string))
                out = subprocess.check_output(["sudo", "chrt", "-f", "99",
                                               "/usr/bin/time", "--verbose",
                                               sys.executable, CALL_SCRIPT, temp_sample,
                                               search_string, f], stderr=subprocess.STDOUT)
                print("  Cleaning up...")
                os.remove(temp_sample)
                for line in out.decode("utf-8").split("\n"):
                    pair = line.strip().rsplit(": ", 1)
                    if len(pair) >= 2:
                        result[normalize_field(pair[0].strip())] = pair[1].strip()
                results.append(result)
                if not header:  # store the header for later reference
                    header = result.keys()
    print("Cleaning up sample data...")
    for s, path in samples:
        os.remove(path)
    output_file = sys.argv[1] if len(sys.argv) > 1 else "results.csv"
    output_results = os.path.join(os.getcwd(), output_file)
    print("All tests completed, writing results to: " + output_results)
    with open(output_results, "wb") as f:
        f.write(b",".join(encode_csv_field(k) for k in header) + b"\n")
        for result in results:
            f.write(b",".join(encode_csv_field(v) for v in result.values()) + b"\n")
    print("All done.")

Finally (and TL;DR): here are my results - I'm extracting only the best time and memory data from the result set, but you can get the full result sets here: Python 2.7 Raw Test Data and Python 3.6 Raw Test Data.

[Chart: Python File Line Removal - Selected Results]


Based on the data I gathered, a couple of final notes:

  • If working memory is an issue (e.g. when working with exceptionally large files), only the *_stream functions provide a small footprint. On Python 3.x, the mmap technique is a middle ground.
  • If storage is an issue, only the in_place_* functions are viable.
  • If both are scarce, the only consistent technique is in_place_stream, but at the expense of processing time and increased I/O calls (compared to the *_wm functions).
  • in_place_* functions are dangerous as they may lead to data corruption if they are stopped mid-way. temp_file_* functions (without integrity checks) are only dangerous on non-transactional file systems.
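On that last note, here's a hedged sketch of what a safer temp-file variant could look like, leaning on the standard tempfile module and the atomic os.replace() (Python 3.3+); the function name is illustrative:

import os
import tempfile

def remove_line_atomic(path, comp):
    folder = os.path.dirname(os.path.abspath(path))
    # create the temp file in the same folder so os.replace() stays on one filesystem
    fd, tmp_path = tempfile.mkstemp(dir=folder)
    try:
        with open(path, "rb") as f_in, os.fdopen(fd, "wb") as f_out:
            removed = False
            for line in f_in:
                if not removed and comp(line):
                    removed = True  # drop only the first matching line
                    continue
                f_out.write(line)
            f_out.flush()
            os.fsync(f_out.fileno())  # make sure the data is on disk before swapping
        os.replace(tmp_path, path)  # atomic rename: readers see old or new, never half
    except BaseException:
        os.remove(tmp_path)
        raise

Because the swap is a single rename, an interruption at any point leaves either the original file or the complete new one, never a torn mix.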

5 Comments

Nice solution - but note you're still reading the rest of the file, and writing it. I wonder if truncating gains any run-time advantage if done on a single line or several, like in my solution.
I'm willing to wager a single write would be faster than multiple, but upvote for a creative solution.
@kabanus - it just avoids reading everything into working memory (think 10GB files and such); the overwriting of the file contents must be performed either way. And, as I mentioned at the beginning, of the two approaches, writing everything to a temp file and then replacing the old file should, in theory, be the fastest option - if disk space is not an issue.
Ahh, didn't consider that - correct, this method is much preferable in that case.
@kabanus - You'd wager correctly ;) I just posted my benchmarks so if you're interested take a look.
3

Use sed:

sed -i "/Sarah/d" your_file

Edit: sorry, I did not fully read all the tags and comments about the need to use Python. Either way, I would probably try to solve it with some preprocessing using a shell utility to avoid all the extra code proposed in the other answers. But since I do not fully know your problem, it might not be possible?
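If you want to stay inside a Python script while still delegating the work, a hedged sketch of shelling out to sed via subprocess (GNU sed and Python 3.5+ assumed; note it still rewrites the file behind the scenes):

import subprocess

# delete every line containing "Sarah"; sed writes a temporary file under the hood
subprocess.run(["sed", "-i", "/Sarah/d", "your_file"], check=True)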

Good luck!

2 Comments

While this code may answer the question, providing additional context regarding why and/or how this code answers the question improves its long-term value.
GNU sed does this by creating a temporary file and sending output to this file rather than to the standard output: gnu.org/software/sed/manual/sed.html
3

This is one way. You do have to load the rest of the file into a buffer, but it's the best I can think of in Python:

with open('afile','r+') as fd:
    delLine = 4
    for i in range(delLine):
        pos = fd.tell()
        fd.readline()
    rest = fd.read()
    fd.seek(pos)
    fd.truncate()
    fd.write(rest)

I solved this as if you know the line number. If you want to check the text then instead of the above loop:

pos = fd.tell()
line = fd.readline()
while line and not line.startswith('Sarah'):
    pos, line = fd.tell(), fd.readline()

If 'Sarah' isn't found, line will be empty after the loop, which you should check for.

This may be more efficient if the line you are deleting is nearer to the end, but I'm not sure reading everything, dropping the line, and dumping it back will save much user time (considering this is a Tk app). It also only needs to open and flush the file once, so unless files are extremely long and Sarah is really far down, it probably won't be noticeable.

6 Comments

After reading the other answers, this may be the best any program can do, but I'm unsure.
You can do in-place editing without reading the rest of the file into the working memory. Check my answer. I'm unsure that it would be more efficient than just using a temporary file, tho.
Note that I've edited this answer to not open the file twice. I have not otherwise verified the algorithm for correctness, though.
@Robin Thanks, that was a left over from an old copy-paste.
Check my answers to this question. They essentially do the same thing (in Tcl and in bash, with and without dd), except that there is no need to read the rest of the file in one go (potential memory exhaustion); instead you only need to read blocks the same size as the line being deleted. I think this should easily be doable in Python.
2

You can do it using Pandas. If your data is saved under data.csv, the following should help:

import pandas as pd

df = pd.read_csv('data.csv')
df = df[df.fname != 'Sarah']
df.to_csv('data.csv', index=False)

1 Comment

As long as support for missing integer values is missing, I would take the utmost care with reading and writing back immediately. When the numbers are small this will not make a large difference, but with numbers too large to be exactly represented by a float, like long IDs for example, this will give you a lot of trouble.
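If you take the Pandas route anyway, one way to sidestep the coercion described above is to read every column as a string (a sketch, assuming you don't need the values as numbers afterwards):

import pandas as pd

# dtype=str prevents long IDs from being mangled through float conversion,
# and keep_default_na=False preserves empty cells as empty strings
df = pd.read_csv('data.csv', dtype=str, keep_default_na=False)
df = df[df.fname != 'Sarah']
df.to_csv('data.csv', index=False)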
1

What is the most efficient way to remove Sarah's row? If possible, I would like to avoid copying the entire file.

The most efficient way is to overwrite that row with something that the csv parser ignores. This avoids having to move the rows following the removed one.

If your csv parser can ignore empty lines, overwrite that row with \n symbols. Otherwise, if your parser strips whitespace from values, overwrite that row with (space) symbols.
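A hedged sketch of that idea (the helper name is made up; it assumes the consumer really does tolerate a blanked-out line):

def blank_line(path, match):
    with open(path, "r+b") as f:
        pos = f.tell()
        line = f.readline()
        while line and not match(line):  # find the first matching line
            pos = f.tell()
            line = f.readline()
        if line:
            body = line.rstrip(b"\r\n")  # the part to blank out
            ending = line[len(body):]    # preserve the original line ending
            f.seek(pos)
            f.write(b" " * len(body) + ending)  # same length, so nothing shifts

Since the file length never changes, no data after the blanked row has to move, which is exactly why this is the cheapest approach when the parser can tolerate it.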

3 Comments

I think how to overwrite is the point here. If you can overwrite a line in a file you can just overwrite all of it with '', can't you?
@kabanus Overwrite only a part of the file. seek + write, similar to what your solution does.
Yup, I was guessing that OP wanted an how-to, but I agree.
0

This might help:

with open("sample.csv",'r') as f:
    for line in f:
        if line.startswith('Sarah'): continue
        print(line)

1 Comment

isn't this the exact naive approach that the question wants to avoid? Also it's tagged Python 3, print is a function
