5

I need to deduplicate a large CSV file. It turns out I cannot simply use LC_ALL=C sort -u to sort and remove the duplicate lines, since my CSV uses double quotes and has embedded end-of-line characters in values.

I found several threads on the internet, but they are of no use in my case.

How do I sort & de-duplicate lines from a CSV file that uses double quotes (a csvkit solution is acceptable)?


% cat reduced.csv
"/0032,1064/*/0008,0104","T51166","HUO","","CUERPO COMPLETO"
"/0032,1064/*/0008,0104","?","?","","TAC ADDOME COMPLETO CMC"
"/0008,1032/*/0008,0104","RS0442","","","TC ADDOME COMPLETO s/c mdc
TC CRANIO ENCEFALO s/c mdc
TC TORAC"
"/0008,1032/*/0008,0104","000102","AGFA","1","GAMMAGRAFIA. ÓSEA CUERPO COMPLETO"
"/0032,1064/*/0008,0104","T51166","HUO","","CUERPO COMPLETO"
"/0032,1064/*/0008,0104","?","?","","TAC ADDOME COMPLETO CMC"
"/0032,1064/*/0008,0104","RS0442","","","TC ADDOME COMPLETO s/c mdc
TC CRANIO ENCEFALO s/c mdc
TC TORAC"
"/0032,1064/*/0008,0104","T51166","HUO","","CUERPO COMPLETO"
"/0032,1064/*/0008,0104","?","?","","TAC ADDOME COMPLETO CMC"

Where end of line is UNIX:

% cat -e reduced.csv | head -1
"/0032,1064/*/0008,0104","T51166","HUO","","CUERPO COMPLETO"$

And file reports:

% file -k reduced.csv
reduced.csv: CSV text\012- , Unicode text, UTF-8 text
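
To make the failure mode concrete (shown for illustration only): a plain line-oriented sort treats each physical line as a record, so the continuation lines of a quoted multi-line value get sorted and deduplicated independently of the record they belong to, and the result is no longer valid CSV:

% LC_ALL=C sort -u reduced.csv    # sorts/uniques physical lines, not CSV records
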
  • Not sure what you mean by "embedded end-of-line"? Commented 2 days ago
  • @jubilatious1 \n is contained in the value, making a CSV entry spread over multiple text lines. This is possible in CSV using double quotes. Commented 2 days ago
  • Okay, I would have just said "embedded newlines". Will update my answer to use Raku's Text::CSV module, which can handle them. Best Regards. Commented yesterday

6 Answers

8

miller's uniq verb can do that:

mlr -N --csv --quote-all uniq -a reduced.csv

(--quote-all here is to match the quoting style in your input, where everything is quoted; you can omit it to get more compact output).
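
For example, without it (not verified against your input) Miller should only quote the fields that need it, e.g. those containing commas, double quotes or newlines:

mlr -N --csv uniq -a reduced.csv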

$ mlr -N --quote-all --csv uniq -a reduced.csv | diff -U20 reduced.csv -
--- reduced.csv 2025-12-16 08:02:24.595343963 +0000
+++ -   2025-12-16 08:10:54.911789734 +0000
@@ -1,13 +1,9 @@
 "/0032,1064/*/0008,0104","T51166","HUO","","CUERPO COMPLETO"
 "/0032,1064/*/0008,0104","?","?","","TAC ADDOME COMPLETO CMC"
 "/0008,1032/*/0008,0104","RS0442","","","TC ADDOME COMPLETO s/c mdc
 TC CRANIO ENCEFALO s/c mdc
 TC TORAC"
 "/0008,1032/*/0008,0104","000102","AGFA","1","GAMMAGRAFIA. ÓSEA CUERPO COMPLETO"
-"/0032,1064/*/0008,0104","T51166","HUO","","CUERPO COMPLETO"
-"/0032,1064/*/0008,0104","?","?","","TAC ADDOME COMPLETO CMC"
 "/0032,1064/*/0008,0104","RS0442","","","TC ADDOME COMPLETO s/c mdc
 TC CRANIO ENCEFALO s/c mdc
 TC TORAC"
-"/0032,1064/*/0008,0104","T51166","HUO","","CUERPO COMPLETO"
-"/0032,1064/*/0008,0104","?","?","","TAC ADDOME COMPLETO CMC"

If you also need to sort records by the first, then second, ... then fifth column, as per your own answer:

mlr -N --csv --quote-all uniq -a then sort -f 1,2,3,4,5 reduced.csv

(with -N for headerless csv, columns are implicitly named 1, 2...).

  • Can you extend your solution with % mlr -N --csv sort-within-records then uniq -a reduced.csv? Having the sort makes the output predictable. Commented 2 days ago
  • @malat, AFAICT, sort-within-records doesn't make sense for csv. Commented 2 days ago
  • indeed, I am still struggling with mlr command line options... Commented 2 days ago
  • @malat, I've reverted your change. I don't agree with adding distracting quotes here because some syntax linter somewhere complains about something that is not an issue. The correct approach would be to raise an issue with that linter, as then is an expected argument to the mlr command. Commented 7 hours ago
3

With GNU awk and GNU sort you could convert the end-of-row \ns to NUL chars, sort on that output, then convert the NULs back to \ns (untested):

$ awk --csv -v ORS='\0' '1' reduced.csv |
    sort -zu |
    awk -v RS='\0' -v ORS='\n' '1'

You could do the sorting within GNU awk, but using the UNIX tool sort instead would probably be more efficient and can handle larger input.
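
If you did want to keep it all inside GNU awk, here's an untested sketch (same caveat as the pipeline above: it assumes $0 holds the full quoted record) using an array indexed by whole records for the deduplication and asorti() for the sorting:

$ awk --csv '
    { seen[$0] }                    # array indexed by whole records => duplicates collapse
    END {
        n = asorti(seen, sorted)    # sort the unique records as strings
        for (i = 1; i <= n; i++)
            print sorted[i]
    }
' reduced.csv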

Alternatively, if you want a simple solution that works with standard tools on any UNIX-like system, and you can pick a character or string that you know can't be present in your input, then you can use any awk to replace the newlines inside the quotes with that string:

$ awk -v RS='"' '!(NR%2){gsub(/\n/,"__NEWLINE__")} {printf "%s%s", sep, $0; sep=RS}' reduced.csv
"/0032,1064/*/0008,0104","T51166","HUO","","CUERPO COMPLETO"
"/0032,1064/*/0008,0104","?","?","","TAC ADDOME COMPLETO CMC"
"/0008,1032/*/0008,0104","RS0442","","","TC ADDOME COMPLETO s/c mdc__NEWLINE__TC CRANIO ENCEFALO s/c mdc__NEWLINE__TC TORAC"
"/0008,1032/*/0008,0104","000102","AGFA","1","GAMMAGRAFIA. ÓSEA CUERPO COMPLETO"
"/0032,1064/*/0008,0104","T51166","HUO","","CUERPO COMPLETO"
"/0032,1064/*/0008,0104","?","?","","TAC ADDOME COMPLETO CMC"
"/0032,1064/*/0008,0104","RS0442","","","TC ADDOME COMPLETO s/c mdc__NEWLINE__TC CRANIO ENCEFALO s/c mdc__NEWLINE__TC TORAC"
"/0032,1064/*/0008,0104","T51166","HUO","","CUERPO COMPLETO"
"/0032,1064/*/0008,0104","?","?","","TAC ADDOME COMPLETO CMC"

I used the string __NEWLINE__ to make it visible but you could pick some control character or any other string that doesn't contain quotes, commas, or, of course, newlines.

Now you can just pipe the result to any sort:

$ awk -v RS='"' '!(NR%2){gsub(/\n/,"__NEWLINE__")} {printf "%s%s", sep, $0; sep=RS}' reduced.csv |
    sort -u
"/0008,1032/*/0008,0104","000102","AGFA","1","GAMMAGRAFIA. ÓSEA CUERPO COMPLETO"
"/0008,1032/*/0008,0104","RS0442","","","TC ADDOME COMPLETO s/c mdc__NEWLINE__TC CRANIO ENCEFALO s/c mdc__NEWLINE__TC TORAC"
"/0032,1064/*/0008,0104","?","?","","TAC ADDOME COMPLETO CMC"
"/0032,1064/*/0008,0104","RS0442","","","TC ADDOME COMPLETO s/c mdc__NEWLINE__TC CRANIO ENCEFALO s/c mdc__NEWLINE__TC TORAC"
"/0032,1064/*/0008,0104","T51166","HUO","","CUERPO COMPLETO"

and then convert your replacement string back to \n again using awk:

$ awk -v RS='"' '!(NR%2){gsub(/\n/,"__NEWLINE__")} {printf "%s%s", sep, $0; sep=RS}' reduced.csv |
    sort -u |
    awk '{gsub(/__NEWLINE__/,"\n"); print}'
"/0008,1032/*/0008,0104","000102","AGFA","1","GAMMAGRAFIA. ÓSEA CUERPO COMPLETO"
"/0008,1032/*/0008,0104","RS0442","","","TC ADDOME COMPLETO s/c mdc
TC CRANIO ENCEFALO s/c mdc
TC TORAC"
"/0032,1064/*/0008,0104","?","?","","TAC ADDOME COMPLETO CMC"
"/0032,1064/*/0008,0104","RS0442","","","TC ADDOME COMPLETO s/c mdc
TC CRANIO ENCEFALO s/c mdc
TC TORAC"
"/0032,1064/*/0008,0104","T51166","HUO","","CUERPO COMPLETO"

See whats-the-most-robust-way-to-efficiently-parse-csv-using-awk for more information on processing CSVs with awk.

  • Note that it assumes there's consistent quoting in the input. For instance, it wouldn't deduplicate a,b,c and "a","b","c" rows. Commented yesterday
1

While miller seems like a good starting solution, I could not make sense of the documentation for sorting the CSV file (to allow for some stability).

Instead I went back to the csvkit documentation and found this:

% csvsql --blanks --no-header-row --query \
  "select distinct * from 'reduced' ORDER BY a,b,c,d,e;" reduced.csv | sed 1d

This gives me both sorting and uniqueness in a single pass (assuming a temporary file).

  • Strange that it lets you specify no header on input but not on output. See also tail -n+2 for a maybe more idiomatic/simpler/quicker (if not shorter) way to skip that first line. For comparison, mlr's -N is short for --implicit-csv-header --headerless-csv-output Commented yesterday
1

Another SQL-based option: DuckDB.

duckdb -noheader -csv -c 'select distinct * from reduced.csv order by all'

"/0032,1064/*/0008,0104",RS0442,NULL,NULL,"TC ADDOME COMPLETO s/c mdc
TC CRANIO ENCEFALO s/c mdc
TC TORAC"
"/0032,1064/*/0008,0104",T51166,HUO,NULL,CUERPO COMPLETO
"/0032,1064/*/0008,0104",?,?,NULL,TAC ADDOME COMPLETO CMC
"/0008,1032/*/0008,0104",000102,AGFA,1,"GAMMAGRAFIA. ÓSEA CUERPO COMPLETO"
"/0008,1032/*/0008,0104",RS0442,NULL,NULL,"TC ADDOME COMPLETO s/c mdc
TC CRANIO ENCEFALO s/c mdc
TC TORAC"
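
If you would rather have DuckDB write the result straight back to a file, here's an untested sketch using COPY and read_csv (the uniq.csv name and the explicit header=false are just example choices):

duckdb -c "COPY (
    SELECT DISTINCT *
    FROM read_csv('reduced.csv', header=false)
    ORDER BY ALL
) TO 'uniq.csv' (FORMAT CSV, HEADER false)"
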
1

If the deduplicated data plus some overhead fits in RAM¹, you can use Python's built-in csv library.

python -c '
    import csv;
    reader = csv.reader(open("reduced.csv", newline=""));
    writer = csv.writer(open("uniq.csv", "w", newline=""));
    writer.writerows(sorted(set(tuple(row) for row in reader)))'

This sorts the rows. Omit sorted to get the deduplicated rows in a random order.

The code should be mostly straightforward except for one technicality. The general idea is to read the rows into a set data structure, so duplicated rows are effectively ignored, and then write out the elements of the set. The csv library converts between file contents and rows. The technicality is that the CSV reader produces a list of fields for each row, but Python's lists aren't hashable and can't be put in a set, so we use set(tuple(row) for row in reader) instead of set(reader) to turn each row into a tuple, which can be put in a set.

¹ Or swap, but performance might be terrible.

  • FWIW, I get command not found unless I replace python with python3, then "unexpected indent" unless I remove those leading spaces, then it seems to be adding CR characters except inside the contents of multiline cells. Commented 20 hours ago
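
Addressing the points in that comment, here is an untested variant of the above: python3 explicitly, no leading indentation inside the -c string, and lineterminator="\n" (a standard csv-module option) so the writer emits LF-only line endings outside the quoted multi-line cells:

python3 -c '
import csv
# newline="" is what the csv module expects when opening files;
# lineterminator="\n" overrides the csv.writer default of "\r\n".
reader = csv.reader(open("reduced.csv", newline=""))
writer = csv.writer(open("uniq.csv", "w", newline=""), lineterminator="\n")
writer.writerows(sorted(set(tuple(row) for row in reader)))
'
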
0

Using Raku (formerly known as Perl_6)

~$ raku -MText::CSV -e 'my $csv=Text::CSV.new;  
         my @a = $csv.getline_all(open($*ARGFILES, :r, :!chomp));  
         @a.=sort: { .[0], .[1], .[2], .[3], .[4] };  
         @a.=unique(:as(*.Str)); csv(in => @a, out => $*OUT);'  file

This problem can be solved by Raku in conjunction with a CSV-parsing module from the Raku ecosystem (here Text::CSV).

A concern might be having similar lines with non-normalized accents, such as á represented by one character (U+00E1 "LATIN SMALL LETTER A WITH ACUTE"), or á represented by two characters (U+0061 "LATIN SMALL LETTER A" + U+0301 "COMBINING ACUTE ACCENT"). Raku is Unicode-ready and input text is normalized by default, solving this problem.

Sample Input:

"/0032,1064/*/0008,0104","T51166","HUO","","CUERPO COMPLETO"
"/0032,1064/*/0008,0104","?","?","","TAC ADDOME COMPLETO CMC"
"/0008,1032/*/0008,0104","RS0442","","","TC ADDOME COMPLETO s/c mdc
TC CRANIO ENCEFALO s/c mdc
TC TORAC"
"/0008,1032/*/0008,0104","000102","AGFA","1","GAMMAGRAFIA. ÓSEA CUERPO COMPLETO"
"/0032,1064/*/0008,0104","T51166","HUO","","CUERPO COMPLETO"
"/0032,1064/*/0008,0104","?","?","","TAC ADDOME COMPLETO CMC"
"/0032,1064/*/0008,0104","RS0442","","","TC ADDOME COMPLETO s/c mdc
TC CRANIO ENCEFALO s/c mdc
TC TORAC"
"/0032,1064/*/0008,0104","T51166","HUO","","CUERPO COMPLETO"
"/0032,1064/*/0008,0104","?","?","","TAC ADDOME COMPLETO CMC"

Sample Output A (redirect to a new file):

"/0008,1032/*/0008,0104",000102,AGFA,1,"GAMMAGRAFIA. ÓSEA CUERPO COMPLETO"
"/0008,1032/*/0008,0104",RS0442,,,"TC ADDOME COMPLETO s/c mdc
TC CRANIO ENCEFALO s/c mdc
TC TORAC"
"/0032,1064/*/0008,0104",?,?,,"TAC ADDOME COMPLETO CMC"
"/0032,1064/*/0008,0104",RS0442,,,"TC ADDOME COMPLETO s/c mdc
TC CRANIO ENCEFALO s/c mdc
TC TORAC"
"/0032,1064/*/0008,0104",T51166,HUO,,"CUERPO COMPLETO"

NOTE: It's often helpful to look at a language's internal representation of data. Below is Raku's internal representation of the Sample Output:

Sample Output B (replace final statement with .raku.put for @a;)

$["/0008,1032/*/0008,0104", "000102", "AGFA", "1", "GAMMAGRAFIA. ÓSEA CUERPO COMPLETO"]
$["/0008,1032/*/0008,0104", "RS0442", "", "", "TC ADDOME COMPLETO s/c mdc\nTC CRANIO ENCEFALO s/c mdc\nTC TORAC"]
$["/0032,1064/*/0008,0104", "?", "?", "", "TAC ADDOME COMPLETO CMC"]
$["/0032,1064/*/0008,0104", "RS0442", "", "", "TC ADDOME COMPLETO s/c mdc\nTC CRANIO ENCEFALO s/c mdc\nTC TORAC"]
$["/0032,1064/*/0008,0104", "T51166", "HUO", "", "CUERPO COMPLETO"]

https://raku.land/zef:Tux/Text::CSV
https://docs.raku.org/language/unicode#Normalization
https://raku.org

  • That sorts/uniqs physical lines, not CSV rows. Look where the "TC CRANIO" and "TC TORAC" continuation lines ended up. Commented 2 days ago
  • @TobySpeight Will update my answer to use Raku's Text::CSV module, which can handle them. Best Regards. Commented yesterday
