
I want to remove duplicate entries from a text file, e.g:

kavitha= Tue Feb    20 14:00 19 IST 2012  (duplicate entry) 
sree=Tue Jan  20 14:05 19 IST 2012  
divya = Tue Jan  20 14:20 19 IST 2012  
anusha=Tue Jan 20 14:45 19 IST 2012 
kavitha= Tue Feb    20 14:00 19 IST 2012 (duplicate entry) 

Is there any possible way to remove the duplicate entries using a Bash script?

Desired output

kavitha= Tue Feb    20 14:00 19 IST 2012 
sree=Tue Jan  20 14:05 19 IST 2012  
divya = Tue Jan  20 14:20 19 IST 2012  
anusha=Tue Jan 20 14:45 19 IST 2012
1 Comment

Ironic that this question itself is a duplicate...

4 Answers


You can sort and deduplicate in one step with sort -u:

$ sort -u input.txt

Or use awk:

$ awk '!a[$0]++' input.txt
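The two commands differ in one important way: sort -u reorders the lines, while the awk idiom keeps the first occurrence of each line in its original position (a[$0]++ is 0, i.e. false, only the first time a line is seen). A quick illustration on a throwaway file (the /tmp path is just for the demo):

```shell
# Sample file with a non-adjacent duplicate
printf 'b\na\nb\nc\n' > /tmp/dedup_demo.txt

# sort -u removes the duplicate but sorts the output
sort -u /tmp/dedup_demo.txt
# prints: a b c (one per line)

# awk removes the duplicate and preserves input order
awk '!a[$0]++' /tmp/dedup_demo.txt
# prints: b a c (one per line)
```

For the question's desired output, which keeps the original line order, the awk form is the closer match.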

10 Comments

Testing with an 18,500 line text file: sort ... takes about 0.57s whereas awk ... takes about 0.08s because awk ... just removes duplicates without sorting.
@Hugo I can second that. Testing against 2,626,198 lines, awk beats sort, which took 5.675s. Interestingly enough, the same record set took 15.1 seconds with a MySQL DISTINCT query.
@Hugo is there an elegant way to make this work case-insensitively? Or is it better to just convert the entire doc to lowercase and then run this?
Tested with 24 million rows: awk did not produce a result within 20 minutes, while sort + uniq did the job in a few seconds.
I downvoted this because, although the poster is happy, folks could be confused by an answer that does not yield the desired output: sort -u sorts the input.
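For the case-insensitivity question above, one option (a sketch, using awk's POSIX tolower function) is to key the array on a lowercased copy of the line while still printing the original:

```shell
# Keep the first occurrence of each line, comparing case-insensitively
printf 'Foo\nfoo\nBar\n' | awk '!a[tolower($0)]++'
# prints: Foo
#         Bar
```

This avoids converting the whole document to lowercase, so the surviving lines keep their original capitalization.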

This deletes duplicate, consecutive lines from a file (emulating uniq). The first line in a run of duplicates is kept; the rest are deleted.

sed '$!N; /^\(.*\)\n\1$/!P; D'
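Because this works like uniq, it only removes *adjacent* duplicates, so it would not catch the non-consecutive kavitha lines in the question unless the input is sorted first. Reading the commands: $!N appends the next input line to the pattern space (except on the last line), !P prints up to the first newline only when the two buffered lines differ, and D deletes the printed part and restarts the cycle. A quick check (assuming GNU sed, which supports \n in the pattern-space regex):

```shell
# Adjacent duplicates are removed...
printf 'a\na\nb\n' | sed '$!N; /^\(.*\)\n\1$/!P; D'
# prints: a
#         b

# ...but non-adjacent ones are not
printf 'a\nb\na\n' | sed '$!N; /^\(.*\)\n\1$/!P; D'
# prints: a
#         b
#         a
```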

3 Comments

Worked for me. One more addition for other uses: if you want to change the file itself, the command is sed -i '$!N; /^\(.*\)\n\1$/!P; D' <FileName>
This is awesome !!
what is $!N; ?

Perl one-liner similar to @kev's awk solution:

perl -ne 'print if ! $a{$_}++' input

This variation removes trailing whitespace before comparing:

perl -lne 's/\s*$//; print if ! $a{$_}++' input

This variation edits the file in-place:

perl -i -ne 'print if ! $a{$_}++' input

This variation edits the file in-place and makes a backup, input.bak:

perl -i.bak -ne 'print if ! $a{$_}++' input

2 Comments

I like the Perl solution because it allows me to add extra conditions, e.g. only enforce uniqueness on lines matching a certain pattern.
Is perl -i -ne 'print if ! $a{$_}++' input faster (in general) than gawk -i inplace '!a[$0]++' input?

This might work for you:

cat -n file.txt |
sort -u -k2,7 |
sort -n |
sed 's/.*\t/    /;s/\([0-9]\{4\}\).*/\1/'

or this:

awk '{line=substr($0,1,match($0,/[0-9][0-9][0-9][0-9]/)+3);sub(/^/,"    ",line);if(!dup[line]++)print line}' file.txt
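Both variants follow the same decorate-sort-undecorate idea: tag each line with its line number, deduplicate on the content, restore the input order by the number, then strip the tag. A more generic sketch of that idiom (assuming GNU coreutils; -s makes the stable sort explicit so the first occurrence of each duplicate is the one kept):

```shell
printf 'b\na\nb\nc\n' |
cat -n |               # decorate: prefix each line with its number and a tab
sort -s -u -k2 |       # dedup on the content fields, keeping the first occurrence
sort -n -k1 |          # restore the original line order
cut -f2-               # undecorate: drop the line numbers
# prints: b a c (first occurrences, original order)
```

Unlike the answer above, this version does not assume anything about the line layout (such as a four-digit year), at the cost of three extra processes.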

Comments
