Monday, December 24, 2007

How to remove repeated lines in a file without changing the order


Using uniq alone is not viable, because repeated lines need to be next to each other for uniq to identify them. Using sort first will just mix the lines up, losing their original placement within the file. So here's a workaround (originally from Linux Journal).

Let's say we have a document called file:

$ cat file
a
b
c
d
a
b
c
d

Here are the steps we will take:
1- Use nl (or cat -n) to number each line;
2- Use “sort -k 2” to place equal lines next to each other (we sort from the second field onward, so the line numbers are ignored);
3- “uniq -f 1” will remove the repeated lines (-f 1 skips the first field, the line number, when comparing);
4- “sort -n” will put the remaining lines back in their original order, using the numbers in the first field;
5- “sed 's/[0-9]//g'” will remove the numbers (note that it deletes every digit on the line, not only the line number).

This is what your command should look like:

$ nl file | sort -k 2 | uniq -f 1 | sort -n | sed 's/[0-9]//g'
a
b
c
d
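
The five steps can also be wrapped in a small function. The sketch below is a variation on the command above, not the original from Linux Journal: `dedupe_keep_order` is a made-up name, `nl -b a` is used so that blank lines get numbered too, a second sort key makes sure the earliest copy of each line is the one that survives, and `cut -f 2-` drops the number by relying on nl's default tab separator instead of deleting digits.

```shell
# dedupe_keep_order FILE (hypothetical helper name)
# Prints FILE with repeated lines removed, keeping the first occurrence
# of each line in its original position.
dedupe_keep_order() {
    nl -b a "$1" |                      # number every line, blank ones included
        LC_ALL=C sort -k 2 -k 1,1n |    # group equal lines, lowest number first
        uniq -f 1 |                     # ignore the number field, keep one line per group
        sort -n |                       # restore the original order
        cut -f 2-                       # drop the number (nl separates it with a tab)
}

# Demo on the same sample as above, written to a temporary file
sample=$(mktemp)
printf '%s\n' a b c d a b c d > "$sample"
dedupe_keep_order "$sample"
rm -f "$sample"
```

Forcing `LC_ALL=C` on the first sort is a design choice to keep the grouping byte-exact, so that it agrees with what uniq considers equal.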

On my machine I had a problem where nl kept adding empty fields (still trying to find why), so I had to modify my expression a little bit:

$ nl file | expand | tr -s '[:blank:]' | sed 's/^ *//g' | sort -k 2 | uniq -f 1 | sort -n | sed 's/[0-9]//g' | sed 's/^ *//g'

a
b
c
d
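
The extra blanks are most likely just nl's default layout rather than a fault: by default it right-aligns the number in a six-character column and follows it with a tab. A quick way to check this on your own machine, sketched with `sed -n l`, which prints whitespace in a visible form:

```shell
# Show exactly what nl emits for a one-line input.
# The 'l' command makes sed print a tab as \t and mark the end of line with $
printf 'a\n' | nl | sed -n l
# prints 1\ta$ preceded by five spaces

# Ask nl for a minimum width of one character and a single space as separator
printf 'a\n' | nl -w 1 -s ' '
# prints: 1 a
```

With the second layout, the expand, tr and first sed stages should no longer be needed, although that is worth verifying against your own nl.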

Now, let's say your sources.list got mixed up somehow, and all of its lines are now duplicated. We can apply the same concept like this:

$ grep -v '^#' sources.list | nl | expand | tr -s '[:blank:]' | sed 's/^ *//g' | sort -k 2 | uniq -f 1 | sort -n | sed 's/[0-9] *//g'

deb cdrom:[Ubuntu ._Gutsy Gibbon_ - Release i()]/ gutsy main restricted
deb http://ca.archive.ubuntu.com/ubuntu/ gutsy multiverse
deb-src http://ca.archive.ubuntu.com/ubuntu/ gutsy multiverse
deb http://ca.archive.ubuntu.com/ubuntu/ gutsy-updates multiverse
deb-src http://ca.archive.ubuntu.com/ubuntu/ gutsy-updates multiverse
deb http://ca.archive.ubuntu.com/ubuntu/ gutsy-backports main restricted universe multiverse
deb-src http://ca.archive.ubuntu.com/ubuntu/ gutsy-backports main restricted universe multiverse
deb http://archive.canonical.com/ubuntu gutsy partner
deb-src http://archive.canonical.com/ubuntu gutsy partner
deb http://security.ubuntu.com/ubuntu gutsy-security main restricted
deb-src http://security.ubuntu.com/ubuntu gutsy-security main restricted
deb http://security.ubuntu.com/ubuntu gutsy-security universe
deb-src http://security.ubuntu.com/ubuntu gutsy-security universe
deb http://security.ubuntu.com/ubuntu gutsy-security multiverse
deb-src http://security.ubuntu.com/ubuntu gutsy-security multiverse
deb http://archive.ubuntu.com/ubuntu gutsy universe multiverse
deb-src http://archive.ubuntu.com/ubuntu gutsy universe multiverse
deb http://wine.budgetdedicated.com/apt edgy main
deb http://ca.archive.ubuntu.com/ubuntu/ gutsy main restricted
deb-src http://ca.archive.ubuntu.com/ubuntu/ gutsy main restricted
deb http://ca.archive.ubuntu.com/ubuntu/ gutsy-updates main restricted
deb-src http://ca.archive.ubuntu.com/ubuntu/ gutsy-updates main restricted
deb http://ca.archive.ubuntu.com/ubuntu/ gutsy universe
deb-src http://ca.archive.ubuntu.com/ubuntu/ gutsy universe
deb http://ca.archive.ubuntu.com/ubuntu/ gutsy-updates universe
deb-src http://ca.archive.ubuntu.com/ubuntu/ gutsy-updates universe
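
Note that the cdrom line in this output has lost its digits, because the last sed deletes every digit it finds. Here is a sketch of the same idea that leaves the entries intact by stripping only the leading number. `dedupe_sources` and the sample entries are made up for the illustration, and the nl options assume an nl that accepts `-b`, `-w` and `-s`.

```shell
# dedupe_sources FILE (hypothetical helper name)
# Drops comment lines, then removes repeated entries while keeping
# the first occurrence of each one in place.
dedupe_sources() {
    grep -v '^#' "$1" |
        nl -b a -w 1 -s ' ' |           # "N entry": no padding, one space after N
        LC_ALL=C sort -k 2 -k 1,1n |    # group equal entries, lowest number first
        uniq -f 1 |                     # compare everything after the number
        sort -n |                       # back to the original order
        sed 's/^[0-9]* //'              # strip only the leading number and its space
}

# Demo with a small made-up file that contains digits and a repeated entry
list=$(mktemp)
cat > "$list" <<'EOF'
# example entries, not a real sources.list
deb http://example.invalid/ubuntu release-7 main
deb-src http://example.invalid/ubuntu release-7 main
deb http://example.invalid/ubuntu release-7 main
EOF
dedupe_sources "$list"
rm -f "$list"
```

The demo should print the deb line once, followed by the deb-src line, with "release-7" untouched.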


Vic.

5 comments:

TuxSax said...

Great, sort of what I was looking for, I just have a problem.
I need the same action but with an array.
If I have, for example:
array=(1 2 3 4 1 2 3 4 1 2 3 4)
and I want to create a list for every "unique" value on that array.
I tried this to no avail:
array2=( `echo ${array[@]} | sort | uniq -u` )
It doesn't work of course because the array values are printed all in a single line.
Do you have an idea of how can I achieve this? I'd prefer it to be on the fly and not saving it to a file and reading from it later...
Thanks in advance

Anonymous said...

thanks for the interesting information

Anonymous said...

Many thanks.

Anonymous said...

Just one small thing, the sed [0-9] bit won't just remove the line numbers at the start, it will remove all numbers anywhere in each line.

sed 's/^ *[0-9]*//g'

worked for me without trashing the rest of the line.

Anonymous said...

The standard way to do this, at least for old timers is

awk '!s[$0]++'
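
For readers new to this idiom, a brief unpacking of why it works: awk keeps an associative array keyed by the whole line ($0). The first time a line is seen its count is zero, so the negation is true and awk prints the line; the post-increment then makes every later copy fail the test. A minimal sketch, using a longer array name (`seen`, chosen for readability) and a made-up wrapper function:

```shell
# dedupe_stream (hypothetical name): filter stdin, keeping the first copy of each line
dedupe_stream() {
    awk '!seen[$0]++'
}

# Demo: values arrive one per line and come out in their original order
printf '%s\n' 1 2 3 4 1 2 3 4 1 2 3 4 | dedupe_stream
```

In bash, the array from the first comment could presumably be fed to it with `printf '%s\n' "${array[@]}"`, which prints one element per line. The `uniq -u` in that attempt would also have been a problem, since it prints only the lines that are never repeated.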