sandra_henrystocker
Unix Dweeb

2 ways to remove duplicate lines from Linux files

How-To
Apr 06, 2022
4 mins
Linux


There are many ways to remove duplicate lines from a text file on Linux, but here are two that involve the awk and uniq commands and that offer slightly different results.

Remove duplicate lines with awk

The first command we’ll examine in this post is an unusual-looking awk one-liner that removes every repeat of a line it has already seen. It leaves the first instance of each line intact, but “remembers” it and removes any duplicates encountered afterwards.

Here’s an example. Initially, the file looks like this:

Once upon a time, there was a lovely princess with a foul temper.
Whenever she went for a walk, she left her castle smiling,
but if she ran into anyone frowning or arguing with someone else,
she stopped and made an angry face.
Continue reading
If the princess ran into a friend who didn't want to chat with her,
she stopped and made an angry face.
Continue reading

The awk command that does this work looks like this:

$ awk '!x[$0]++' grouchy_princess
Once upon a time, there was a lovely princess with a foul temper.
Whenever she went for a walk, she left her castle smiling,
but if she ran into anyone frowning or arguing with someone else,
she stopped and made an angry face.
Continue reading
If the princess ran into a friend who didn't want to chat with her,

Note that each of the duplicated lines is now displayed only once and in its initial position.
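If the one-liner looks like line noise, it may help to see it spelled out. The sketch below is a longhand version that should behave the same way: x becomes an array indexed by the full text of each line ($0), and a line is printed only while its count is still zero. The function name dedupe_longhand and the array name seen are made up for this illustration.

```shell
# Longhand sketch of awk '!x[$0]++' (dedupe_longhand is a hypothetical name)
dedupe_longhand() {
  awk '{
    if (seen[$0] == 0) print   # count is still zero: first sighting, so print
    seen[$0]++                 # bump the count for this exact line
  }' "$@"
}

printf 'a\nb\na\nc\nb\n' | dedupe_longhand
# prints:
# a
# b
# c
```

The short form works because awk prints a line whenever the pattern is true, and !x[$0]++ is true only when the count was zero before the increment.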

In fact, if you simply want to see the duplicated lines, you only need to change the command in a minor way. Just remove the exclamation point (signifying “not”) and you will see only the repeats, that is, the second and any later occurrence of each line:

$ awk 'x[$0]++' grouchy_princess
she stopped and made an angry face.
Continue reading
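One thing to keep in mind is that a line appearing three times would be listed twice by that command. If you would rather see each duplicated line only once, a small variation should do it: print a line only when its count is exactly 1 before the increment, which happens on its second appearance. This is a sketch, and the function name show_dups_once is made up for the example.

```shell
# Print each duplicated line once, on its second appearance
# (show_dups_once is a hypothetical name)
show_dups_once() {
  awk 'seen[$0]++ == 1' "$@"
}

printf 'a\nb\na\na\nb\n' | show_dups_once
# prints:
# a
# b
```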

The only problem with the awk '!x[$0]++' command is that it’s not all that easy to remember. On the other hand, it’s also not that hard to turn the command into a simple script. Mine looks like this:

$ cat rmdups
#!/bin/bash
awk '!x[$0]++' "$1"

The awk command removes duplicate lines from whatever file is provided as an argument. If you want to save the output to a file instead of displaying it, make it look like this:

#!/bin/bash
awk '!x[$0]++' "$1" > "$1-new"

You can run the script shown using a command like “rmdups addresses”. If you use the second version, a file with “-new” added to the original file name will contain the output.
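Neither version of the script changes the original file. Redirecting awk's output straight back onto the file it is reading would not work, because the shell empties the file before awk gets a chance to read it. If you do want to replace the original, one option is to go through a temporary file. The following is only a sketch under that assumption; the function name rmdups_inplace is made up, and it relies on the mktemp command being available, as it is on most Linux systems.

```shell
# Sketch: remove duplicate lines and write the result back to the same file
# (rmdups_inplace is a hypothetical name; assumes mktemp is installed)
rmdups_inplace() {
  file=$1
  tmp=$(mktemp) || return 1
  # Write the deduplicated lines to the temp file first, then copy them back.
  # Copying with cat keeps the permissions of the original file.
  if awk '!x[$0]++' "$file" > "$tmp"; then
    cat "$tmp" > "$file"
  fi
  rm -f "$tmp"
}
```

With the function loaded into your shell, a command like “rmdups_inplace addresses” would rewrite the addresses file with its duplicates removed. Keep a backup copy until you trust it.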

Remove duplicate lines with uniq

If you don’t need to preserve the order of the lines in the file, using the sort and uniq commands will do what you need in a very straightforward way. The sort command sorts the lines in alphanumeric order. The uniq command ensures that sequential identical lines are reduced to one.

$ sort grouchy_princess | uniq
but if she ran into anyone frowning or arguing with someone else,
Continue reading
If the princess ran into a friend who didn't want to chat with her,
Once upon a time, there was a lovely princess with a foul temper.
she stopped and made an angry face.
Whenever she went for a walk, she left her castle smiling,

In addition, if sorting the contents of your file is helpful, this approach may be ideal. While this technique doesn’t work all that well with fairy tales, it works just fine for lists of meeting attendees, grocery shopping lists and so on.
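The reason for sorting first is that uniq only compares each line with the one right before it. The short demonstration below uses a made-up four-line input to show the difference. It also shows two standard variations: sort -u, which sorts and removes duplicates in one step, and uniq -d, which lists only the lines that were repeated.

```shell
# uniq alone: only the adjacent pair of "b" lines collapses
printf 'b\na\nb\nb\n' | uniq
# prints: b, a, b (one per line)

# sorting first puts all the duplicates next to each other
printf 'b\na\nb\nb\n' | sort | uniq
# prints: a, b

# sort -u does the same job in a single command
printf 'b\na\nb\nb\n' | sort -u
# prints: a, b

# uniq -d shows only the lines that appeared more than once
printf 'b\na\nb\nb\n' | sort | uniq -d
# prints: b
```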

Because the file name sits in the middle of the sort and uniq pipeline rather than at the end, the command can’t be turned into a simple alias, but it could be turned into a simple script like this:

#!/bin/bash

if [ $# -eq 1 ]; then
  if [ -f "$1" ]; then
    sort "$1" | uniq
  fi
fi

The script verifies that an argument was provided and that it’s an existing file before it sorts it and sends the output to the uniq command.
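One drawback of the script as written is that it says nothing at all when the argument is missing or isn't a file. A shell function is another way to package the same checks, and unlike an alias it can place its argument anywhere in the command. The version below is a sketch that adds error messages; the name sortuniq and the wording of the messages are made up for the example.

```shell
# Sketch: the same checks as the script, as a shell function with error messages
# (sortuniq is a hypothetical name)
sortuniq() {
  if [ "$#" -ne 1 ]; then
    echo "usage: sortuniq FILE" >&2
    return 1
  fi
  if [ ! -f "$1" ]; then
    echo "sortuniq: $1: not a regular file" >&2
    return 1
  fi
  sort "$1" | uniq
}
```

Added to a file like ~/.bashrc, the function would be available in every new shell, and you could run it with a command like “sortuniq attendees”.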

Wrap-Up

Commands like those shown can be very helpful in cleaning up or verifying the content of text files, particularly lists in which you don’t want any line to show up multiple times. Turning the commands into a script makes it convenient to call on them whenever they might be helpful.

Sandra Henry-Stocker has been administering Unix systems for more than 30 years. She describes herself as "USL" (Unix as a second language) but remembers enough English to write books and buy groceries. She lives in the mountains in Virginia where, when not working with or writing about Unix, she's chasing the bears away from her bird feeders.

The opinions expressed in this blog are those of Sandra Henry-Stocker and do not necessarily represent those of IDG Communications, Inc., its parent, subsidiary or affiliated companies.