Monday, December 24, 2007

How to remove repeated lines in a file without changing the order

Using just uniq is not viable, because repeated lines need to be next to each other for uniq to detect them. Using sort first will just mix the lines up, losing their original placement within the file. So here's a workaround (originally from Linux Journal).

Let's say we have a document called file:

$ cat file

Here are the steps we will take:
1- Use nl (or cat -n) to add a number to each line;
2- Use “sort -k 2” to place equal lines after each other (we have to sort by the second column, since the first one now holds the number);
3- “uniq -f 1” will remove equal lines (-f 1 makes uniq skip the first field, so it also compares from the second column on);
4- “sort -n” will put the remaining lines back in their original order, as per the first field (the numbers);
5- “sed 's/[0-9]//g'” will remove the numbers.

This is what your command should look like:

$ nl file | sort -k 2 | uniq -f 1 | sort -n | sed 's/[0-9]//g'
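Here is a small sketch of the same five stages wrapped in a function, so you can watch it work on a known input. The function name dedup_keep_order and the fruit sample are made up for illustration (the contents of "file" are not shown above). Two details differ from the command above and are my assumptions: nl is called with -b a so that empty lines get numbered too, and the last stage uses cut instead of sed to drop the number column.

```shell
# dedup_keep_order FILE
# Print FILE with repeated lines removed, keeping the remaining lines
# in their original order. Hypothetical helper, not from the original post.
dedup_keep_order() {
    nl -b a "$1" |   # 1. number every line (including empty ones)
        sort -k 2 |  # 2. bring identical lines next to each other
        uniq -f 1 |  # 3. drop repeats, skipping the number field
        sort -n |    # 4. back to the original order, by line number
        cut -f 2-    # 5. remove the number (nl separates it with a tab)
}

# Hypothetical sample data.
sample=$(mktemp)
printf '%s\n' pear apple pear banana apple > "$sample"

dedup_keep_order "$sample"
# prints:
# pear
# apple
# banana

rm -f "$sample"
```

Using cut here means digits inside the lines themselves are left alone, which the sed expression above does not guarantee.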

On my machine I had a problem where nl kept adding empty fields (still trying to find out why), so I had to modify my expression a little bit:

$ nl file | expand | tr -s '[:blank:]' | sed 's/^ *//g' | sort -k 2 | uniq -f 1 | sort -n | sed 's/[0-9]//g' | sed 's/^ *//g'


Now, let's say your sources.list got mixed up somehow, and all lines are now duplicated. We can apply the same concept like this:

$ grep -v '^#' sources.list | nl | expand | tr -s '[:blank:]' | sed 's/^ *//g' | sort -k 2 | uniq -f 1 | sort -n | sed 's/[0-9] *//g'

deb cdrom:[Ubuntu ._Gutsy Gibbon_ - Release i()]/ gutsy main restricted
deb gutsy multiverse
deb-src gutsy multiverse
deb gutsy-updates multiverse
deb-src gutsy-updates multiverse
deb gutsy-backports main restricted universe multiverse
deb-src gutsy-backports main restricted universe multiverse
deb gutsy partner
deb-src gutsy partner
deb gutsy-security main restricted
deb-src gutsy-security main restricted
deb gutsy-security universe
deb-src gutsy-security universe
deb gutsy-security multiverse
deb-src gutsy-security multiverse
deb gutsy universe multiverse
deb-src gutsy universe multiverse
deb edgy main
deb gutsy main restricted
deb-src gutsy main restricted
deb gutsy-updates main restricted
deb-src gutsy-updates main restricted
deb gutsy universe
deb-src gutsy universe
deb gutsy-updates universe
deb-src gutsy-updates universe



TuxSax said...

Great, sort of what I was looking for, I just have a problem.
I need the same action but with an array.
If I have, for example:
array=(1 2 3 4 1 2 3 4 1 2 3 4)
and I want to create a list for every "unique" value on that array.
I tried this to no avail:
array2=( `echo ${array[@]} | sort | uniq -u` )
It doesn't work of course because the array values are printed all in a single line.
Do you have an idea of how I can achieve this? I'd prefer it to be on the fly, without saving it to a file and reading from it later...
Thanks in advance

Anonymous said...

thanks for the interesting information

Anonymous said...

Many thanks.

Anonymous said...

Just one small thing, the sed [0-9] bit won't just remove the line numbers at the start, it will remove all numbers anywhere in each line.

sed 's/^ *[0-9]*//g'

worked for me without trashing the rest of the line.
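To see the difference, here is a quick sketch on one made-up line shaped like nl output (padded number, a tab, then text that contains digits of its own). The helper name strip_number is hypothetical, and its expression is a slight variation on the one above: it also removes the single blank that follows the number, which the expression above leaves at the start of the line.

```shell
# A line as nl would print it: number padded to 6 columns, tab, text.
# The text is a hypothetical sample.
line=$(printf '%6d\t%s' 1 'kernel 2.6.22 on port 8080')

# Unanchored: every digit goes, including the ones in the text.
printf '%s\n' "$line" | sed 's/[0-9]//g'

# Anchored: only the leading spaces, the number and the one blank
# right after it are removed.
strip_number() {
    sed 's/^ *[0-9]*[[:blank:]]//'
}

printf '%s\n' "$line" | strip_number
# prints: kernel 2.6.22 on port 8080
```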

Anonymous said...

The standard way to do this, at least for old-timers, is

awk '!s[$0]++'
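For anyone wondering why that works, a short sketch with made-up sample input: s is an awk array indexed by the whole line, and the expression is true only the first time a given line is seen. The wrapper name dedup_awk is hypothetical.

```shell
# s[$0]   : how many times this exact line has been seen so far
# s[$0]++ : yields the old count (0 on first sight), then adds one
# !...    : true only when the old count was 0, i.e. first occurrence
# A pattern with no action makes awk print the line when it is true.
dedup_awk() {
    awk '!s[$0]++' "$@"
}

printf '%s\n' pear apple pear banana apple | dedup_awk
# prints:
# pear
# apple
# banana
```

It reads the input once and needs no numbering or sorting, at the cost of keeping every distinct line in memory.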