
I want to remove duplicate entries from a text file, e.g:

kavitha= Tue Feb    20 14:00 19 IST 2012  (duplicate entry) 
sree=Tue Jan  20 14:05 19 IST 2012  
divya = Tue Jan  20 14:20 19 IST 2012  
anusha=Tue Jan 20 14:45 19 IST 2012 
kavitha= Tue Feb    20 14:00 19 IST 2012 (duplicate entry) 

Is there any possible way to remove the duplicate entries using a Bash script?

Desired output

kavitha= Tue Feb    20 14:00 19 IST 2012 
sree=Tue Jan  20 14:05 19 IST 2012  
divya = Tue Jan  20 14:20 19 IST 2012  
anusha=Tue Jan 20 14:45 19 IST 2012
1 Comment

Ironic that this question itself is a duplicate...

4 Answers


You can sort and deduplicate in one step with sort -u:

$ sort -u input.txt

Or use awk:

$ awk '!a[$0]++' input.txt
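The two commands differ in one important way: sort -u reorders the lines, while the awk idiom keeps the first occurrence of each line in its original position (a[$0]++ is 0, i.e. false, only the first time a line is seen). A quick illustration on a throwaway file (the /tmp path is just for the demo):

```shell
# Sample file with a non-adjacent duplicate
printf 'b\na\nb\nc\n' > /tmp/dedup_demo.txt

# sort -u removes the duplicate but sorts the output
sort -u /tmp/dedup_demo.txt
# prints: a b c (one per line)

# awk removes the duplicate and preserves input order
awk '!a[$0]++' /tmp/dedup_demo.txt
# prints: b a c (one per line)
```

For the question's desired output, which keeps the original line order, the awk form is the closer match.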

10 Comments

Testing with an 18,500 line text file: sort ... takes about 0.57s whereas awk ... takes about 0.08s because awk ... just removes duplicates without sorting.
@Hugo I can second that. Testing against 2,626,198 lines, awk beats sort, which took 5.675s. Interestingly enough, the same record set took 15.1 seconds with a MySQL DISTINCT query.
@Hugo is there an elegant way to make this work case-insensitively? Or is it better to just convert the entire doc to lowercase and then run this?
Tested with 24 million rows: awk did not produce a result within 20 minutes, while sort + uniq did the job in a few seconds.
I downvoted this because, although the poster is happy, folks could be confused by an answer that does not yield the desired output: sort -u sorts the input.
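For the case-insensitivity question above, one option (a sketch, using awk's POSIX tolower function) is to key the array on a lowercased copy of the line while still printing the original:

```shell
# Keep the first occurrence of each line, comparing case-insensitively
printf 'Foo\nfoo\nBar\n' | awk '!a[tolower($0)]++'
# prints: Foo
#         Bar
```

This avoids converting the whole document to lowercase, so the surviving lines keep their original capitalization.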

This deletes duplicate, consecutive lines from a file (emulating uniq). The first line in a run of duplicates is kept; the rest are deleted.

sed '$!N; /^\(.*\)\n\1$/!P; D'
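Because this works like uniq, it only removes *adjacent* duplicates, so it would not catch the non-consecutive kavitha lines in the question unless the input is sorted first. Reading the commands: $!N appends the next input line to the pattern space (except on the last line), !P prints up to the first newline only when the two buffered lines differ, and D deletes the printed part and restarts the cycle. A quick check (assuming GNU sed, which supports \n in the pattern-space regex):

```shell
# Adjacent duplicates are removed...
printf 'a\na\nb\n' | sed '$!N; /^\(.*\)\n\1$/!P; D'
# prints: a
#         b

# ...but non-adjacent ones are not
printf 'a\nb\na\n' | sed '$!N; /^\(.*\)\n\1$/!P; D'
# prints: a
#         b
#         a
```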

3 Comments

Worked for me. One more addition for other uses: if you want to change the file itself, the command is sed -i '$!N; /^\(.*\)\n\1$/!P; D' <FileName>
This is awesome !!
what is $!N; ?

Perl one-liner similar to @kev's awk solution:

perl -ne 'print if ! $a{$_}++' input

This variation removes trailing whitespace before comparing:

perl -lne 's/\s*$//; print if ! $a{$_}++' input

This variation edits the file in-place:

perl -i -ne 'print if ! $a{$_}++' input

This variation edits the file in-place and makes a backup, input.bak:

perl -i.bak -ne 'print if ! $a{$_}++' input

2 Comments

I like the Perl solution because it allows me to add extra conditions, e.g. only enforce uniqueness on lines matching a certain pattern.
Is perl -i -ne 'print if ! $a{$_}++' input faster (in general) than gawk -i inplace '!a[$0]++' input?

This might work for you:

cat -n file.txt |
sort -u -k2,7 |
sort -n |
sed 's/.*\t/    /;s/\([0-9]\{4\}\).*/\1/'

or this:

awk '{line=substr($0,1,match($0,/[0-9][0-9][0-9][0-9]/)+3);sub(/^/,"    ",line);if(!dup[line]++)print line}' file.txt
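Both variants follow the same decorate-sort-undecorate idea: tag each line with its line number, deduplicate on the content, restore the input order by the number, then strip the tag. A more generic sketch of that idiom (assuming GNU coreutils; -s makes the stable sort explicit so the first occurrence of each duplicate is the one kept):

```shell
printf 'b\na\nb\nc\n' |
cat -n |               # decorate: prefix each line with its number and a tab
sort -s -u -k2 |       # dedup on the content fields, keeping the first occurrence
sort -n -k1 |          # restore the original line order
cut -f2-               # undecorate: drop the line numbers
# prints: b a c (first occurrences, original order)
```

Unlike the answer above, this version does not assume anything about the line layout (such as a four-digit year), at the cost of three extra processes.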

Comments
