2

I need to deduplicate a large CSV file. It turns out I cannot simply use LC_ALL=C sort -u to sort and remove the duplicate lines, since my CSV uses double quotes and has embedded end-of-line characters in some values.

I found these threads on the internet, but they are of no use in my case:

How do I sort & de-duplicate lines from a CSV file using double quotes (csvkit solution acceptable)?


% cat reduced.csv
"/0032,1064/*/0008,0104","T51166","HUO","","CUERPO COMPLETO"
"/0032,1064/*/0008,0104","?","?","","TAC ADDOME COMPLETO CMC"
"/0008,1032/*/0008,0104","RS0442","","","TC ADDOME COMPLETO s/c mdc
TC CRANIO ENCEFALO s/c mdc
TC TORAC"
"/0008,1032/*/0008,0104","000102","AGFA","1","GAMMAGRAFIA. ÓSEA CUERPO COMPLETO"
"/0032,1064/*/0008,0104","T51166","HUO","","CUERPO COMPLETO"
"/0032,1064/*/0008,0104","?","?","","TAC ADDOME COMPLETO CMC"
"/0032,1064/*/0008,0104","RS0442","","","TC ADDOME COMPLETO s/c mdc
TC CRANIO ENCEFALO s/c mdc
TC TORAC"
"/0032,1064/*/0008,0104","T51166","HUO","","CUERPO COMPLETO"
"/0032,1064/*/0008,0104","?","?","","TAC ADDOME COMPLETO CMC"

The line endings are UNIX-style (LF):

% cat -e reduced.csv | head -1
"/0032,1064/*/0008,0104","T51166","HUO","","CUERPO COMPLETO"$

And file reports:

% file -k reduced.csv
reduced.csv: CSV text\012- , Unicode text, UTF-8 text
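
To make the mismatch between physical lines and CSV rows concrete, here is a minimal sketch that counts both. The function name csv_count is made up for this illustration, and it assumes every quote in the file either opens or closes a quoted field (or is a doubled quote inside one) and that the file ends with a newline.

```shell
# Count physical lines and logical CSV rows in the file given as $1.
# Hypothetical helper: splitting on '"' makes odd-numbered chunks the
# text outside quotes, so only newlines found there end a CSV row.
csv_count() {
    printf 'physical lines: %s\n' "$(wc -l < "$1" | tr -d ' ')"
    printf 'csv rows: %s\n' "$(awk -v RS='"' '
        NR % 2 { n += gsub(/\n/, "") }   # gsub returns the number of newlines removed
        END    { print n + 0 }
    ' "$1")"
}
```

On the sample above, csv_count reduced.csv should report 13 physical lines but only 9 CSV rows, which is why a line-based sort -u breaks the multi-line entries apart.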
2
  • Not sure what you mean by "embedded end-of-line"? Commented 6 hours ago
  • 2
    @jubilatious1 \n is contained in the value making a CSV entry spread on multiple text lines. This is possible in CSV using double quotes. Commented 5 hours ago

4 Answers

5

Miller's uniq verb can do that:

mlr -N --csv --quote-all uniq -a reduced.csv

(--quote-all here being to match the quoting style in your input where everything is quoted; you can omit it to have a more compact output).

$ mlr -N --quote-all --csv uniq -a reduced.csv | diff -U20 reduced.csv -
--- reduced.csv 2025-12-16 08:02:24.595343963 +0000
+++ -   2025-12-16 08:10:54.911789734 +0000
@@ -1,13 +1,9 @@
 "/0032,1064/*/0008,0104","T51166","HUO","","CUERPO COMPLETO"
 "/0032,1064/*/0008,0104","?","?","","TAC ADDOME COMPLETO CMC"
 "/0008,1032/*/0008,0104","RS0442","","","TC ADDOME COMPLETO s/c mdc
 TC CRANIO ENCEFALO s/c mdc
 TC TORAC"
 "/0008,1032/*/0008,0104","000102","AGFA","1","GAMMAGRAFIA. ÓSEA CUERPO COMPLETO"
-"/0032,1064/*/0008,0104","T51166","HUO","","CUERPO COMPLETO"
-"/0032,1064/*/0008,0104","?","?","","TAC ADDOME COMPLETO CMC"
 "/0032,1064/*/0008,0104","RS0442","","","TC ADDOME COMPLETO s/c mdc
 TC CRANIO ENCEFALO s/c mdc
 TC TORAC"
-"/0032,1064/*/0008,0104","T51166","HUO","","CUERPO COMPLETO"
-"/0032,1064/*/0008,0104","?","?","","TAC ADDOME COMPLETO CMC"

If you also need to sort records as per the first, then second, ... then fifth column as per your own answer:

mlr -N --csv --quote-all uniq -a then sort -f 1,2,3,4,5 reduced.csv

(with -N for headerless csv, columns are implicitly named 1, 2...).

3
  • Can you extend your solution with % mlr -N --csv sort-within-records then uniq -a reduced.csv? Having the sort makes the output predictable. Commented 10 hours ago
  • @malat, AFAICT, sort-within-records doesn't make sense for csv. Commented 10 hours ago
  • indeed, I am still struggling with mlr command line options... Commented 10 hours ago
2

With GNU awk (5.3 or later, for --csv) and GNU sort (for -z) you could convert the end-of-row \ns to NUL chars, sort on that output, then convert the NULs back to \ns (untested):

$ awk --csv -v ORS='\0' '1' reduced.csv |
    sort -zu |
    awk -v RS='\0' -v ORS='\n' '1'

You could do the sorting within GNU awk, but using the UNIX tool sort instead would probably be more efficient and can handle larger input.

Alternatively, if you want a simple solution that works with standard tools on any UNIX-like system, and you can pick some character or string that you know cannot occur in your input, you can use any awk to replace the newlines inside the quotes with that string:

$ awk -v RS='"' '!(NR%2){gsub(/\n/,"__NEWLINE__")} {printf "%s%s", sep, $0; sep=RS}' reduced.csv
"/0032,1064/*/0008,0104","T51166","HUO","","CUERPO COMPLETO"
"/0032,1064/*/0008,0104","?","?","","TAC ADDOME COMPLETO CMC"
"/0008,1032/*/0008,0104","RS0442","","","TC ADDOME COMPLETO s/c mdc__NEWLINE__TC CRANIO ENCEFALO s/c mdc__NEWLINE__TC TORAC"
"/0008,1032/*/0008,0104","000102","AGFA","1","GAMMAGRAFIA. ÓSEA CUERPO COMPLETO"
"/0032,1064/*/0008,0104","T51166","HUO","","CUERPO COMPLETO"
"/0032,1064/*/0008,0104","?","?","","TAC ADDOME COMPLETO CMC"
"/0032,1064/*/0008,0104","RS0442","","","TC ADDOME COMPLETO s/c mdc__NEWLINE__TC CRANIO ENCEFALO s/c mdc__NEWLINE__TC TORAC"
"/0032,1064/*/0008,0104","T51166","HUO","","CUERPO COMPLETO"
"/0032,1064/*/0008,0104","?","?","","TAC ADDOME COMPLETO CMC"

I used the string __NEWLINE__ to make it visible but you could pick some control character or any other string that doesn't contain quotes, commas, or, of course, newlines.
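
As a sketch of the control-character variant, the function below uses the ASCII unit separator (octal 037) as the placeholder and refuses to run if that byte already occurs in the file. The function name csv_flatten and the choice of byte are assumptions made for this example, not something the input format dictates.

```shell
# Replace newlines inside quoted fields with a control character so that
# each CSV row occupies exactly one physical line.
# Assumption: the unit separator (octal 037) is a safe placeholder; the
# grep guard below rejects input that already contains it.
csv_flatten() {
    ph=$(printf '\037')
    if grep -q "$ph" "$1"; then
        echo "placeholder byte already present in $1; pick another one" >&2
        return 1
    fi
    awk -v RS='"' -v ph="$ph" '
        !(NR % 2) { gsub(/\n/, ph) }          # even chunks lie inside quotes
        { printf "%s%s", sep, $0; sep = RS }  # re-join the chunks with the quote
    ' "$1"
}
```

With a single-byte placeholder like this, the newlines can later be restored with tr '\037' '\n' instead of a second awk call.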

Now you can just pipe the result to any sort:

$ awk -v RS='"' '!(NR%2){gsub(/\n/,"__NEWLINE__")} {printf "%s%s", sep, $0; sep=RS}' reduced.csv |
    sort -u
"/0008,1032/*/0008,0104","000102","AGFA","1","GAMMAGRAFIA. ÓSEA CUERPO COMPLETO"
"/0008,1032/*/0008,0104","RS0442","","","TC ADDOME COMPLETO s/c mdc__NEWLINE__TC CRANIO ENCEFALO s/c mdc__NEWLINE__TC TORAC"
"/0032,1064/*/0008,0104","?","?","","TAC ADDOME COMPLETO CMC"
"/0032,1064/*/0008,0104","RS0442","","","TC ADDOME COMPLETO s/c mdc__NEWLINE__TC CRANIO ENCEFALO s/c mdc__NEWLINE__TC TORAC"
"/0032,1064/*/0008,0104","T51166","HUO","","CUERPO COMPLETO"

and then convert your replacement string back to \n again using awk:

$ awk -v RS='"' '!(NR%2){gsub(/\n/,"__NEWLINE__")} {printf "%s%s", sep, $0; sep=RS}' reduced.csv |
    sort -u |
    awk '{gsub(/__NEWLINE__/,"\n"); print}'
"/0008,1032/*/0008,0104","000102","AGFA","1","GAMMAGRAFIA. ÓSEA CUERPO COMPLETO"
"/0008,1032/*/0008,0104","RS0442","","","TC ADDOME COMPLETO s/c mdc
TC CRANIO ENCEFALO s/c mdc
TC TORAC"
"/0032,1064/*/0008,0104","?","?","","TAC ADDOME COMPLETO CMC"
"/0032,1064/*/0008,0104","RS0442","","","TC ADDOME COMPLETO s/c mdc
TC CRANIO ENCEFALO s/c mdc
TC TORAC"
"/0032,1064/*/0008,0104","T51166","HUO","","CUERPO COMPLETO"

See whats-the-most-robust-way-to-efficiently-parse-csv-using-awk for more information on processing CSVs with awk.
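
If the original row order should be kept rather than sorted, the sort -u stage can be swapped for the common awk idiom !seen[$0]++, which prints only the first occurrence of each line. This is a different technique from the sorted pipeline above. A minimal sketch, assuming as before that the string __NEWLINE__ never appears in the data (the function name csv_uniq_keep_order is made up for this example):

```shell
# De-duplicate CSV rows while keeping the first occurrence of each row
# in its original position.
# Assumption: the literal string __NEWLINE__ does not occur in the input.
csv_uniq_keep_order() {
    awk -v RS='"' '
        !(NR % 2) { gsub(/\n/, "__NEWLINE__") }  # flatten newlines inside quotes
        { printf "%s%s", sep, $0; sep = RS }
    ' "$1" |
        awk '!seen[$0]++' |                      # keep only the first copy of each row
        awk '{ gsub(/__NEWLINE__/, "\n"); print }'
}
```

Unlike sort, the seen array holds every distinct row in memory, so this variant is better suited to inputs whose set of unique rows fits in RAM.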

0
0

While Miller seems like a good starting point, I could not work out from its documentation how to sort the CSV file (to get stable, predictable output).

Instead I went back to csvkit documentation and I found this:

% csvsql --blanks --no-header-row --query \
  "select distinct * from 'reduced' ORDER BY a,b,c,d,e;" reduced.csv | sed 1d

This gives me both sorting and uniqueness in a single pass (presumably via a temporary file).

-1

Using Raku (formerly known as Perl_6)

~$ raku -e '.put for lines.unique.sort;'  malat_csv.txt

I don't understand why this problem can't be solved by Perl or Raku. One concern would be similar lines with non-normalized accents: á is normalized to a single character (U+E1 "LATIN SMALL LETTER A WITH ACUTE"), even if the original was two (U+61 "LATIN SMALL LETTER A" + U+301 "COMBINING ACUTE ACCENT").

Raku is Unicode-ready, and input text is normalized by default:

Sample Input:

"/0032,1064/*/0008,0104","T51166","HUO","","CUERPO COMPLETO"
"/0032,1064/*/0008,0104","?","?","","TAC ADDOME COMPLETO CMC"
"/0008,1032/*/0008,0104","RS0442","","","TC ADDOME COMPLETO s/c mdc
TC CRANIO ENCEFALO s/c mdc
TC TORAC"
"/0008,1032/*/0008,0104","000102","AGFA","1","GAMMAGRAFIA. ÓSEA CUERPO COMPLETO"
"/0032,1064/*/0008,0104","T51166","HUO","","CUERPO COMPLETO"
"/0032,1064/*/0008,0104","?","?","","TAC ADDOME COMPLETO CMC"
"/0032,1064/*/0008,0104","RS0442","","","TC ADDOME COMPLETO s/c mdc
TC CRANIO ENCEFALO s/c mdc
TC TORAC"
"/0032,1064/*/0008,0104","T51166","HUO","","CUERPO COMPLETO"
"/0032,1064/*/0008,0104","?","?","","TAC ADDOME COMPLETO CMC"

Sample Output:

"/0008,1032/*/0008,0104","000102","AGFA","1","GAMMAGRAFIA. ÓSEA CUERPO COMPLETO"
"/0008,1032/*/0008,0104","RS0442","","","TC ADDOME COMPLETO s/c mdc
"/0032,1064/*/0008,0104","?","?","","TAC ADDOME COMPLETO CMC"
"/0032,1064/*/0008,0104","RS0442","","","TC ADDOME COMPLETO s/c mdc
"/0032,1064/*/0008,0104","T51166","HUO","","CUERPO COMPLETO"
TC CRANIO ENCEFALO s/c mdc
TC TORAC"

https://docs.raku.org/language/unicode#Normalization
https://raku.org

1
  • 3
    That sorts/uniqs physical lines, not CSV rows. Look where the "TC CRANIO" and "TC TORAC" continuation lines ended up. Commented 4 hours ago
