5

I need to deduplicate a large CSV file. It turns out I cannot simply use LC_ALL=C sort -u to sort and remove the duplicate lines, since my CSV uses double quotes and has embedded end-of-line characters in values.

I found several threads on the internet, but they are of no use in my case.

How do I sort & de-duplicate lines from a CSV file that uses double quotes (a csvkit solution is acceptable)?


% cat reduced.csv
"/0032,1064/*/0008,0104","T51166","HUO","","CUERPO COMPLETO"
"/0032,1064/*/0008,0104","?","?","","TAC ADDOME COMPLETO CMC"
"/0008,1032/*/0008,0104","RS0442","","","TC ADDOME COMPLETO s/c mdc
TC CRANIO ENCEFALO s/c mdc
TC TORAC"
"/0008,1032/*/0008,0104","000102","AGFA","1","GAMMAGRAFIA. ÓSEA CUERPO COMPLETO"
"/0032,1064/*/0008,0104","T51166","HUO","","CUERPO COMPLETO"
"/0032,1064/*/0008,0104","?","?","","TAC ADDOME COMPLETO CMC"
"/0032,1064/*/0008,0104","RS0442","","","TC ADDOME COMPLETO s/c mdc
TC CRANIO ENCEFALO s/c mdc
TC TORAC"
"/0032,1064/*/0008,0104","T51166","HUO","","CUERPO COMPLETO"
"/0032,1064/*/0008,0104","?","?","","TAC ADDOME COMPLETO CMC"

Where end of line is UNIX:

% cat -e reduced.csv | head -1
"/0032,1064/*/0008,0104","T51166","HUO","","CUERPO COMPLETO"$

And file reports:

% file -k reduced.csv
reduced.csv: CSV text\012- , Unicode text, UTF-8 text
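
To make the failure mode concrete (shown for illustration only): a plain line-oriented sort treats each physical line as a record, so the continuation lines of a quoted multi-line value get sorted and deduplicated independently of the record they belong to, and the result is no longer valid CSV:

% LC_ALL=C sort -u reduced.csv    # sorts/uniques physical lines, not CSV records
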
  • Not sure what you mean by "embedded end-of-line"? Commented 2 days ago
  • @jubilatious1 \n is contained in the value, making a CSV entry spread over multiple text lines. This is possible in CSV using double quotes. Commented 2 days ago
  • Okay, I would have just said "embedded newlines". Will update my answer to use Raku's Text::CSV module, which can handle them. Best Regards. Commented yesterday

6 Answers

8

miller's uniq verb can do that:

mlr -N --csv --quote-all uniq -a reduced.csv

(--quote-all here is to match the quoting style in your input, where everything is quoted; you can omit it to get more compact output).
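
For example, without it (not verified against your input) Miller should only quote the fields that need it, e.g. those containing commas, double quotes or newlines:

mlr -N --csv uniq -a reduced.csv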

$ mlr -N --quote-all --csv uniq -a reduced.csv | diff -U20 reduced.csv -
--- reduced.csv 2025-12-16 08:02:24.595343963 +0000
+++ -   2025-12-16 08:10:54.911789734 +0000
@@ -1,13 +1,9 @@
 "/0032,1064/*/0008,0104","T51166","HUO","","CUERPO COMPLETO"
 "/0032,1064/*/0008,0104","?","?","","TAC ADDOME COMPLETO CMC"
 "/0008,1032/*/0008,0104","RS0442","","","TC ADDOME COMPLETO s/c mdc
 TC CRANIO ENCEFALO s/c mdc
 TC TORAC"
 "/0008,1032/*/0008,0104","000102","AGFA","1","GAMMAGRAFIA. ÓSEA CUERPO COMPLETO"
-"/0032,1064/*/0008,0104","T51166","HUO","","CUERPO COMPLETO"
-"/0032,1064/*/0008,0104","?","?","","TAC ADDOME COMPLETO CMC"
 "/0032,1064/*/0008,0104","RS0442","","","TC ADDOME COMPLETO s/c mdc
 TC CRANIO ENCEFALO s/c mdc
 TC TORAC"
-"/0032,1064/*/0008,0104","T51166","HUO","","CUERPO COMPLETO"
-"/0032,1064/*/0008,0104","?","?","","TAC ADDOME COMPLETO CMC"

If you also need to sort records by the first, then second, ... then fifth column, as per your own answer:

mlr -N --csv --quote-all uniq -a then sort -f 1,2,3,4,5 reduced.csv

(with -N for headerless csv, columns are implicitly named 1, 2...).

  • Can you extend your solution with % mlr -N --csv sort-within-records then uniq -a reduced.csv? Having the sort makes the output predictable. Commented 2 days ago
  • @malat, AFAICT, sort-within-records doesn't make sense for csv. Commented 2 days ago
  • indeed, I am still struggling with mlr command line options... Commented 2 days ago
  • @malat, I've reverted your change. I don't agree with adding distracting quotes here because some syntax linter somewhere complains about something that is not an issue. The correct approach would be to raise an issue with that linter, as then is an expected argument to the mlr command. Commented 7 hours ago
3

With GNU awk and GNU sort you could convert the end-of-row \ns to NUL chars, sort on that output, then convert the NULs back to \ns (untested):

$ awk --csv -v ORS='\0' '1' reduced.csv |
    sort -zu |
    awk -v RS='\0' -v ORS='\n' '1'

You could do the sorting within GNU awk, but using the UNIX tool sort instead would probably be more efficient and can handle larger input.
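
If you did want to keep it all inside GNU awk, here's an untested sketch (same caveat as the pipeline above: it assumes $0 holds the full quoted record) using an array indexed by whole records for the deduplication and asorti() for the sorting:

$ awk --csv '
    { seen[$0] }                    # array indexed by whole records => duplicates collapse
    END {
        n = asorti(seen, sorted)    # sort the unique records as strings
        for (i = 1; i <= n; i++)
            print sorted[i]
    }
' reduced.csv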

Alternatively, if you want a simple solution that works with standard tools on any UNIX-like system, and you can pick a character or string that you know can't be present in your input, then you can use any awk to replace the newlines inside the quotes with that string:

$ awk -v RS='"' '!(NR%2){gsub(/\n/,"__NEWLINE__")} {printf "%s%s", sep, $0; sep=RS}' reduced.csv
"/0032,1064/*/0008,0104","T51166","HUO","","CUERPO COMPLETO"
"/0032,1064/*/0008,0104","?","?","","TAC ADDOME COMPLETO CMC"
"/0008,1032/*/0008,0104","RS0442","","","TC ADDOME COMPLETO s/c mdc__NEWLINE__TC CRANIO ENCEFALO s/c mdc__NEWLINE__TC TORAC"
"/0008,1032/*/0008,0104","000102","AGFA","1","GAMMAGRAFIA. ÓSEA CUERPO COMPLETO"
"/0032,1064/*/0008,0104","T51166","HUO","","CUERPO COMPLETO"
"/0032,1064/*/0008,0104","?","?","","TAC ADDOME COMPLETO CMC"
"/0032,1064/*/0008,0104","RS0442","","","TC ADDOME COMPLETO s/c mdc__NEWLINE__TC CRANIO ENCEFALO s/c mdc__NEWLINE__TC TORAC"
"/0032,1064/*/0008,0104","T51166","HUO","","CUERPO COMPLETO"
"/0032,1064/*/0008,0104","?","?","","TAC ADDOME COMPLETO CMC"

I used the string __NEWLINE__ to make it visible but you could pick some control character or any other string that doesn't contain quotes, commas, or, of course, newlines.

Now you can just pipe the result to any sort:

$ awk -v RS='"' '!(NR%2){gsub(/\n/,"__NEWLINE__")} {printf "%s%s", sep, $0; sep=RS}' reduced.csv |
    sort -u
"/0008,1032/*/0008,0104","000102","AGFA","1","GAMMAGRAFIA. ÓSEA CUERPO COMPLETO"
"/0008,1032/*/0008,0104","RS0442","","","TC ADDOME COMPLETO s/c mdc__NEWLINE__TC CRANIO ENCEFALO s/c mdc__NEWLINE__TC TORAC"
"/0032,1064/*/0008,0104","?","?","","TAC ADDOME COMPLETO CMC"
"/0032,1064/*/0008,0104","RS0442","","","TC ADDOME COMPLETO s/c mdc__NEWLINE__TC CRANIO ENCEFALO s/c mdc__NEWLINE__TC TORAC"
"/0032,1064/*/0008,0104","T51166","HUO","","CUERPO COMPLETO"

and then convert your replacement string back to \n again using awk:

$ awk -v RS='"' '!(NR%2){gsub(/\n/,"__NEWLINE__")} {printf "%s%s", sep, $0; sep=RS}' reduced.csv |
    sort -u |
    awk '{gsub(/__NEWLINE__/,"\n"); print}'
"/0008,1032/*/0008,0104","000102","AGFA","1","GAMMAGRAFIA. ÓSEA CUERPO COMPLETO"
"/0008,1032/*/0008,0104","RS0442","","","TC ADDOME COMPLETO s/c mdc
TC CRANIO ENCEFALO s/c mdc
TC TORAC"
"/0032,1064/*/0008,0104","?","?","","TAC ADDOME COMPLETO CMC"
"/0032,1064/*/0008,0104","RS0442","","","TC ADDOME COMPLETO s/c mdc
TC CRANIO ENCEFALO s/c mdc
TC TORAC"
"/0032,1064/*/0008,0104","T51166","HUO","","CUERPO COMPLETO"

See whats-the-most-robust-way-to-efficiently-parse-csv-using-awk for more information on processing CSVs with awk.

  • Note that it assumes there's consistent quoting in the input. For instance, it wouldn't deduplicate a,b,c and "a","b","c" rows. Commented yesterday
1

While miller seems like a good starting solution, I could not make sense of the documentation for sorting the CSV file (to allow for some stability).

Instead I went back to the csvkit documentation and found this:

% csvsql --blanks --no-header-row --query \
  "select distinct * from 'reduced' ORDER BY a,b,c,d,e;" reduced.csv | sed 1d

This gives me both sorting and uniqueness in a single pass (assuming a temporary file).

  • Strange that it lets you specify no header on input but not on output. See also tail -n+2 for a maybe more idiomatic/simpler/quicker (if not shorter) way to skip that first line. For comparison, mlr's -N is short for --implicit-csv-header --headerless-csv-output Commented yesterday
1

Another SQL-based option: DuckDB.

duckdb -noheader -csv -c 'select distinct * from reduced.csv order by all'

"/0032,1064/*/0008,0104",RS0442,NULL,NULL,"TC ADDOME COMPLETO s/c mdc
TC CRANIO ENCEFALO s/c mdc
TC TORAC"
"/0032,1064/*/0008,0104",T51166,HUO,NULL,CUERPO COMPLETO
"/0032,1064/*/0008,0104",?,?,NULL,TAC ADDOME COMPLETO CMC
"/0008,1032/*/0008,0104",000102,AGFA,1,"GAMMAGRAFIA. ÓSEA CUERPO COMPLETO"
"/0008,1032/*/0008,0104",RS0442,NULL,NULL,"TC ADDOME COMPLETO s/c mdc
TC CRANIO ENCEFALO s/c mdc
TC TORAC"
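
If you would rather have DuckDB write the result straight back to a file, here's an untested sketch using COPY and read_csv (the uniq.csv name and the explicit header=false are just example choices):

duckdb -c "COPY (
    SELECT DISTINCT *
    FROM read_csv('reduced.csv', header=false)
    ORDER BY ALL
) TO 'uniq.csv' (FORMAT CSV, HEADER false)"
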
1

If the deduplicated data plus some overhead fits in RAM¹, you can use Python's built-in csv library.

python -c '
    import csv;
    reader = csv.reader(open("reduced.csv", newline=""));
    writer = csv.writer(open("uniq.csv", "w", newline=""));
    writer.writerows(sorted(set(tuple(row) for row in reader)))'

This sorts the rows. Omit sorted to get the deduplicated rows in a random order.

The code should be mostly straightforward except for one technicality. The general idea is to read the rows into a set data structure, so duplicated rows are effectively ignored, and then write out the elements of the set. The csv library converts between file contents and rows. The technicality is that the CSV reader produces a list of fields for each row, but Python's lists aren't hashable and can't be put in a set, so we use set(tuple(row) for row in reader) instead of set(reader) to turn each row into a tuple, which can be put in a set.

¹ Or swap, but performance might be terrible.

  • FWIW, I get command not found unless I replace python with python3, then "unexpected indent" unless I remove those leading spaces, then it seems to be adding CR characters except inside the contents of multiline cells. Commented 20 hours ago
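
Addressing the points in that comment, here is an untested variant of the above: python3 explicitly, no leading indentation inside the -c string, and lineterminator="\n" (a standard csv-module option) so the writer emits LF-only line endings outside the quoted multi-line cells:

python3 -c '
import csv
# newline="" is what the csv module expects when opening files;
# lineterminator="\n" overrides the csv.writer default of "\r\n".
reader = csv.reader(open("reduced.csv", newline=""))
writer = csv.writer(open("uniq.csv", "w", newline=""), lineterminator="\n")
writer.writerows(sorted(set(tuple(row) for row in reader)))
'
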
0

Using Raku (formerly known as Perl_6)

~$ raku -MText::CSV -e 'my $csv=Text::CSV.new;  
         my @a = $csv.getline_all(open($*ARGFILES, :r, :!chomp));  
         @a.=sort: { .[0], .[1], .[2], .[3], .[4] };  
         @a.=unique(:as(*.Str)); csv(in => @a, out => $*OUT);'  file

This problem can be solved by Raku in conjunction with a CSV-parsing module from the Raku ecosystem (here Text::CSV).

A concern might be having similar lines with non-normalized accents, such as á represented by one character (U+00E1 "LATIN SMALL LETTER A WITH ACUTE"), or á represented by two characters (U+0061 "LATIN SMALL LETTER A" + U+0301 "COMBINING ACUTE ACCENT"). Raku is Unicode-ready and input text is normalized by default, solving this problem.

Sample Input:

"/0032,1064/*/0008,0104","T51166","HUO","","CUERPO COMPLETO"
"/0032,1064/*/0008,0104","?","?","","TAC ADDOME COMPLETO CMC"
"/0008,1032/*/0008,0104","RS0442","","","TC ADDOME COMPLETO s/c mdc
TC CRANIO ENCEFALO s/c mdc
TC TORAC"
"/0008,1032/*/0008,0104","000102","AGFA","1","GAMMAGRAFIA. ÓSEA CUERPO COMPLETO"
"/0032,1064/*/0008,0104","T51166","HUO","","CUERPO COMPLETO"
"/0032,1064/*/0008,0104","?","?","","TAC ADDOME COMPLETO CMC"
"/0032,1064/*/0008,0104","RS0442","","","TC ADDOME COMPLETO s/c mdc
TC CRANIO ENCEFALO s/c mdc
TC TORAC"
"/0032,1064/*/0008,0104","T51166","HUO","","CUERPO COMPLETO"
"/0032,1064/*/0008,0104","?","?","","TAC ADDOME COMPLETO CMC"

Sample Output A (redirect to a new file):

"/0008,1032/*/0008,0104",000102,AGFA,1,"GAMMAGRAFIA. ÓSEA CUERPO COMPLETO"
"/0008,1032/*/0008,0104",RS0442,,,"TC ADDOME COMPLETO s/c mdc
TC CRANIO ENCEFALO s/c mdc
TC TORAC"
"/0032,1064/*/0008,0104",?,?,,"TAC ADDOME COMPLETO CMC"
"/0032,1064/*/0008,0104",RS0442,,,"TC ADDOME COMPLETO s/c mdc
TC CRANIO ENCEFALO s/c mdc
TC TORAC"
"/0032,1064/*/0008,0104",T51166,HUO,,"CUERPO COMPLETO"

NOTE: It's often helpful to look at a language's internal representation of data. Below is Raku's internal representation of the Sample Output:

Sample Output B (replace final statement with .raku.put for @a;)

$["/0008,1032/*/0008,0104", "000102", "AGFA", "1", "GAMMAGRAFIA. ÓSEA CUERPO COMPLETO"]
$["/0008,1032/*/0008,0104", "RS0442", "", "", "TC ADDOME COMPLETO s/c mdc\nTC CRANIO ENCEFALO s/c mdc\nTC TORAC"]
$["/0032,1064/*/0008,0104", "?", "?", "", "TAC ADDOME COMPLETO CMC"]
$["/0032,1064/*/0008,0104", "RS0442", "", "", "TC ADDOME COMPLETO s/c mdc\nTC CRANIO ENCEFALO s/c mdc\nTC TORAC"]
$["/0032,1064/*/0008,0104", "T51166", "HUO", "", "CUERPO COMPLETO"]

https://raku.land/zef:Tux/Text::CSV
https://docs.raku.org/language/unicode#Normalization
https://raku.org

  • That sorts/uniqs physical lines, not CSV rows. Look where the "TC CRANIO" and "TC TORAC" continuation lines ended up. Commented 2 days ago
  • @TobySpeight Will update my answer to use Raku's Text::CSV module, which can handle them. Best Regards. Commented yesterday
