Unix Dweeb

Counting and modifying lines, words and characters in Linux text files

How-To

Mar 21, 2023 (5 mins)

Linux

A series of commands ranging from simple to fairly complex will help you count lines, words or individual characters on the Linux command line.

Linux includes some useful commands for counting the contents of text files. This post examines options for counting lines, words and characters, and for modifying text (removing punctuation, changing case) so that the counts reflect what you actually want to measure.

Counting lines

Counting lines in a file is very easy with the wc command. Use a command like that shown below, and you’ll get a quick response.

$ wc -l myfile
132 myfile

What the wc command is actually counting is the number of newline characters in a file. So, if you had a single-line file with no newline character at the end, it would tell you the file has 0 lines.
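A quick way to see this behavior is sketched below. It uses printf, which adds no trailing newline unless asked to; the sample string and variable names are made up for illustration.

```shell
# wc -l counts newline characters, not "lines" as a reader would see them.
no_newline=$(printf 'just one line' | wc -l)      # text with no trailing newline
with_newline=$(printf 'just one line\n' | wc -l)  # same text, newline-terminated

echo "without newline: $no_newline"
echo "with newline:    $with_newline"
```

The first count should come back as 0 and the second as 1, even though both inputs look like a single line of text.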

The wc -l command can also count the lines in any text that is piped to it. In the example below, wc -l is counting the lines of ls -l output, which is the number of files and directories in the current directory plus one for the "total" line that ls -l prints first.

$ ls -l | wc -l
1184
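The following sketch compares the two counts in a scratch directory with a known number of entries. The directory and file names are arbitrary, and it assumes mktemp is available. Because ls -l starts its listing with a "total" line, its line count comes out one higher than the number of entries, while plain ls prints one entry per line when its output is piped.

```shell
# Build a scratch directory containing exactly three files.
demo_dir=$(mktemp -d)
touch "$demo_dir/a" "$demo_dir/b" "$demo_dir/c"

long_count=$(ls -l "$demo_dir" | wc -l)   # includes the leading "total" line
plain_count=$(ls "$demo_dir" | wc -l)     # one line per entry when piped

echo "ls -l lines: $long_count"
echo "ls lines:    $plain_count"

rm -r "$demo_dir"
```

Neither form counts hidden files; adding the -A option to ls would include them.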

If you pipe text to a wc command with a hyphen as its argument, wc will count the lines, words and characters.

$ echo hello to you | wc -
      1       3      13 -

The responses show the number of lines (1), words (3) and characters (13 counting the newline).

If you want to get the same information for a file, pipe the file to the wc command as shown below.

$ cat notes | wc -
     48     613    3705 -
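wc can also read a file directly, without cat. The sketch below uses a small scratch file (its contents and the variable names are illustrative only) to show the variations and how the three counts might be captured for use in a script.

```shell
# Create a small sample file: 2 lines, 7 words, 28 bytes.
sample=$(mktemp)
printf 'hello to you\nand to you too\n' > "$sample"

wc "$sample"        # lines, words, bytes, then the file name
wc < "$sample"      # same counts, no file name
wc -lw "$sample"    # options can be combined: lines and words only

# Split the three counts into named variables.
set -- $(wc < "$sample")
lines=$1
words=$2
bytes=$3
echo "lines=$lines words=$words bytes=$bytes"

rm "$sample"
```

Redirecting the file into wc keeps the file name out of the output, which makes the numbers easier to parse.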

Counting words

For just a word count, use the -w option as shown in the examples below.

$ wc -w notes
613 notes
$ date | wc -w
7

Counting characters

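To count the characters in a file, use the -c option. Strictly speaking, -c counts bytes, which is the same as the number of characters for plain ASCII text; for text containing multibyte characters, use -m instead. Keep in mind that either option will count newline characters as well as letters and punctuation marks.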

$ wc -c notes
3705 notes
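The difference between bytes (-c) and characters (-m) only shows up with non-ASCII text. Here is a minimal sketch, assuming the script is saved as UTF-8; the sample word is arbitrary.

```shell
# "café" is four characters, but the é occupies two bytes in UTF-8.
byte_count=$(printf 'café\n' | wc -c)   # bytes, newline included
char_count=$(printf 'café\n' | wc -m)   # characters, as defined by the current locale

echo "bytes: $byte_count"
echo "chars: $char_count"
```

The byte count is 6. In a UTF-8 locale the character count would be expected to be 5, while in the C locale wc -m typically falls back to counting bytes.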

Counting instances of particular words

Counting how many times a particular word appears in a file takes more work than counting how many lines contain it. The line count only requires grep and wc -l.

$ cat notes | grep the | wc -l
32
$ cat notes | grep '[Tt]he' | wc -l
40

The second command above counts lines containing “the” whether or not the first letter is capitalized. Neither command tells you how many times “the” appears overall, for two reasons: a line containing the word more than once is counted only once, and grep matches the pattern anywhere in a line, so words such as “other” and “there” are counted too.
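One possible way to count every occurrence, sketched here under the assumption that your grep supports the widely available -o option (GNU and BSD grep both do): -o prints each match on its own line, so wc -l then counts matches rather than lines. The sample text and variable names are invented for the example.

```shell
# Sample text: "the" appears 4 times as a whole word (in any case);
# "other" and "There" contain it only as a substring.
sample=$(mktemp)
printf 'The cat saw the other cat.\nThere is the dog, and the end.\n' > "$sample"

# -o: one match per output line, -w: whole words only, -i: ignore case
total=$(grep -o -w -i 'the' "$sample" | wc -l)

# For comparison: lines containing the word (each line counted once).
line_total=$(grep -c -w -i 'the' "$sample")

echo "occurrences: $total"
echo "lines:       $line_total"

rm "$sample"
```

With this sample, the occurrence count is 4 and the line count is 2.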

Ignoring punctuation and capitalization

When you build a list of the words in a file, some words (e.g., “The” and “the”) will appear in it more than once. You’re also going to see strings like “end” and “end.” listed separately, since the commands described above don’t separate words from punctuation. The examples that follow add some commands to get past these problems.

Removing punctuation

In the command below, a file containing a long string of punctuation characters is passed to a tr -d command that removes all of them from the output. Notice how everything except the “Characters ” string is removed from the output.

$ cat punct-chars
Characters .?,"!;:'{}[]():
$ cat punct-chars | tr -d '[:punct:]'
Characters

Changing text to all lowercase

A tr command can turn all characters to lowercase to ensure that words that start with a capital letter (often because they start a sentence) or are written in all capitals aren’t listed separately from those appearing in all lowercase.

$ echo "Hello to YOU" | tr '[A-Z]' '[a-z]'
hello to you
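The two tr steps can be chained. The sketch below uses the POSIX character classes '[:upper:]' and '[:lower:]' in place of the explicit ranges, which is a substitution on my part rather than the form used above; the sample sentence is made up.

```shell
# Strip punctuation, then fold to lowercase, in one pipeline.
normalized=$(echo "Hello to YOU, and to you!" | tr -d '[:punct:]' | tr '[:upper:]' '[:lower:]')
echo "$normalized"
# prints: hello to you and to you
```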

Using a script

The script below reads the name of a text file, splits its contents into one word per line, and then lists the words using three increasingly thorough strategies, so that you can see the output at each phase.

NOTE: The script passes the final collections of output to the column command to make the output a little easier to view.

#!/bin/bash

echo -n "file: "
read -r file

# separate file into word-per-line format
tr -s '[:blank:]' '[\n*]' < "$file" > "$file-2"

# list words in columnar format
sort "$file-2" | uniq -c | column

echo -n "try next command?> "
read -r ans

# removing punctuation (done before sorting so that identical words
# end up on adjacent lines, which uniq -c requires)
tr -d '[:punct:]' < "$file-2" | sort | uniq -c | column

echo -n "try next command?> "
read -r ans

# changing text to all lowercase
tr -d '[:punct:]' < "$file-2" | tr '[A-Z]' '[a-z]' | sort | uniq -c | column

The output below shows what you would see if you ran the script against the following Einstein quote:

"Two things are infinite: the universe and human stupidity; and I'm not sure about the universe."
― Albert Einstein

$ word-by-word
file: Einstein
      1 ―                     1 human                 2 the
      1 about                 1 I'm                   1 things
      1 Albert                1 infinite:             1 "Two
      2 and                   1 not                   1 universe
      1 are                   1 stupidity;            1 universe."
      1 Einstein              1 sure
try next command?> y
      1 ―                     1 human                 2 the
      1 about                 1 Im                    1 things
      1 Albert                1 infinite              1 Two
      2 and                   1 not                   2 universe
      1 are                   1 stupidity
      1 Einstein              1 sure
try next command?> y
      1 ―                     1 human                 2 the
      1 about                 1 im                    1 things
      1 albert                1 infinite              1 two
      2 and                   1 not                   2 universe
      1 are                   1 stupidity
      1 einstein              1 sure

Eliminating punctuation has a downside: it also removes the apostrophes from contractions, turning “I'm” into “Im”. Converting everything to lowercase likewise decapitalizes proper names such as “Albert” and “Einstein”.

Note that the dash (―) in front of “Albert Einstein” is not removed by the punctuation elimination command. In addition, if your text includes left- and right-leaning double quotes (“ and ”), they won’t be eliminated either. These are non-ASCII characters, and tr does not treat them as members of ‘[:punct:]’. An ordinary ASCII hyphen, by contrast, is removed.
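If those leftover characters matter, one option is to delete them explicitly before the tr steps. The sketch below is a non-interactive variant of the idea, not the script shown above: it assumes the three characters to remove are “, ” and ―, uses a scratch file for the quote, and ranks the words by frequency with sort -rn.

```shell
# Write the quote (with curly double quotes) to a scratch file.
quote=$(mktemp)
cat > "$quote" <<'EOF'
“Two things are infinite: the universe and human stupidity; and I'm not sure about the universe.”
― Albert Einstein
EOF

# 1. sed deletes each non-ASCII mark as a literal string,
# 2. tr -s splits the text into one word per line,
# 3. tr -d strips ASCII punctuation, 4. tr folds to lowercase,
# 5. grep drops any empty lines, 6. sort | uniq -c | sort -rn counts and ranks.
counts=$(sed -e 's/“//g' -e 's/”//g' -e 's/―//g' "$quote" |
  tr -s '[:space:]' '\n' |
  tr -d '[:punct:]' |
  tr '[:upper:]' '[:lower:]' |
  grep -v '^$' |
  sort | uniq -c | sort -rn)

echo "$counts"
rm "$quote"
```

For this quote the result should list 15 distinct words, with “the”, “and” and “universe” each counted twice at the top. The contraction still comes out as “im”, so the apostrophe problem described above remains.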

Wrap-up

Linux includes a number of ways to count lines, words and characters in text and to modify the text so that the word counts are more accurate. Some are just a bit more complex than others.

by Sandra Henry-Stocker

Sandra Henry-Stocker has been administering Unix systems for more than 30 years. She describes herself as "USL" (Unix as a second language) but remembers enough English to write books and buy groceries. She lives in the mountains in Virginia where, when not working with or writing about Unix, she's chasing the bears away from her bird feeders.

The opinions expressed in this blog are those of Sandra Henry-Stocker and do not necessarily represent those of IDG Communications, Inc., its parent, subsidiary or affiliated companies.
