Sandra Henry-Stocker
Unix Dweeb

Counting individual characters on Linux

How-To
Oct 26, 2022 · 5 mins
Linux

If you need to count how many of each character is included in a file or phrase, there are some handy commands you can string together to accomplish this along with scripts and aliases that can make the job easy.

Determining how many characters are in a file is easy on the Linux command line: use the ls -l command, which reports the file's size in bytes (the same as the character count for plain ASCII text).
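
The size that ls -l reports is a byte count, so it matches the character count only when every character occupies a single byte. As a rough sketch of the difference, the two helper functions below (byte_count and char_count are names made up for this illustration) wrap wc -c, which counts bytes, and wc -m, which counts characters as interpreted by the current locale.

```shell
# Hypothetical helpers (not from the article) for total counts.

# Bytes in a file: the same number ls -l shows in its size column.
byte_count() {
  wc -c < "$1" | tr -d ' '
}

# Characters in a file, as interpreted by the current locale.
char_count() {
  wc -m < "$1" | tr -d ' '
}

# Usage:
#   byte_count myfile
#   char_count myfile
```

For a plain ASCII file the two numbers should agree. They diverge only when the file contains multibyte characters and the locale recognizes them.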

On the other hand, if you want to get a count of how many times each character appears in your file, you’re going to need a considerably more complicated command or a script. This post covers several different options.

Counting how many times each character appears in a file

To count how many times each character appears in a file, you need to string together a series of commands that split the file into individual characters, sort them and then count how many of each there are.

To do that, you can use a command like this one:

$ cat myfile | sed 's/\(.\)/\n\1/g' | sort | uniq -c | column
     24              58 c           112 i           132 o             7 T
    254               2 C             3 I             2 O            30 u
      1 '            50 d             4 j            29 p            23 v
     25 ,           163 e             5 k             1 P             9 w
     20 .             2 E            60 l             2 q             4 x
    142 a            21 f            48 m            90 r            36 y
      5 A            16 g             2 M             1 R             3 z
     23 b             1 G           117 n           147 s
      1 B            51 h             1 N           119 t

The sed command separates the file into single-character chunks, one per line. That output is then sorted by the sort command. After that, each group of the same character is counted by the uniq -c command and the column command is used to create the multi-column output. Since the results are based on the file content, no characters are listed besides those in the file.
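
To see what the first stage does by itself, it can help to run just the sed step on a short string. The sketch below wraps it in a function named split_chars (a name invented here). It assumes GNU sed, since other sed implementations may not treat \n in the replacement as a newline.

```shell
# The first stage of the pipeline on its own (GNU sed assumed).
# Each character is replaced by a newline followed by that character,
# so every character ends up on a line of its own.
split_chars() {
  sed 's/\(.\)/\n\1/g'
}

# Usage:
#   printf 'ab\n' | split_chars
# prints an empty line, then "a", then "b": three lines in total.
```

The empty line comes from the newline inserted before the first character of each input line, which is why the counts include one blank entry per line of input.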

Notice that the output displays the list of characters in the selected file in alphanumeric order thanks to the sort command. The first two entries look blank because they are the counts for line breaks and spaces, which are only recognizable in context.

If you want to display the characters in frequency order instead, all you need to do is add a second sort command using the -g (general numeric) option.

$ cat myfile | sed 's/\(.\)/\n\1/g' | sort | uniq -c | sort -g | column
      1 '             2 O             9 w            30 u           117 n
      1 B             2 q            16 g            36 y           119 t
      1 G             3 I            20 .            48 m           132 o
      1 N             3 z            21 f            50 d           142 a
      1 P             4 j            23 b            51 h           147 s
      1 R             4 x            23 v            58 c           163 e
      2 C             5 A            24              60 l           254
      2 E             5 k            25 ,            90 r
      2 M             7 T            29 p           112 i

To reverse the listing to show the most frequently used characters first, add an r (reverse) option to that last sort command.

$ cat myfile | sed 's/\(.\)/\n\1/g' | sort | uniq -c | sort -gr | column
    254              60 l            24               5 A             2 C
    163 e            58 c            23 v             4 x             1 R
    147 s            51 h            23 b             4 j             1 P
    142 a            50 d            21 f             3 z             1 N
    132 o            48 m            20 .             3 I             1 G
    119 t            36 y            16 g             2 q             1 B
    117 n            30 u             9 w             2 O             1 '
    112 i            29 p             7 T             2 M
     90 r            25 ,             5 k             2 E

The character at the top of the list is, as I assume you guessed, the space character. The second most often used character in the file is an “e”. No surprise there either. In addition, capital letters are listed last since they are not frequently used.
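
If only the most common characters matter, the reverse-sorted list can be cut short with head. This is a small sketch rather than part of the original commands: top_chars is a hypothetical function name, its optional argument is the number of entries to keep (5 if omitted), and GNU sed is assumed.

```shell
# Hypothetical helper: show only the N most frequent characters (default 5).
top_chars() {
  sed 's/\(.\)/\n\1/g' | sort | uniq -c | sort -gr | head -n "${1:-5}"
}

# Usage:
#   cat myfile | top_chars 3
```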

Note that if you don’t want to distinguish between uppercase and lowercase letters you can insert a tr (translate) command into the command string like this:

$ cat myfile | sed 's/\(.\)/\n\1/g' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -gr | column
    254             115 i            36 y            21 f             3 z
    165 e            91 r            30 u            20 .             2 q
    147 s            60 l            30 p            17 g             1 '
    147 a            60 c            25 ,             9 w
    134 o            51 h            24 b             5 k
    126 t            50 m            24               4 x
    118 n            50 d            23 v             4 j

Switch the positions of the “upper” and “lower” arguments to display the results all in uppercase.
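
For reference, the uppercase variant might look like the sketch below. The function name count_chars_upper is made up for this example, GNU sed is assumed, and the final column step is left out so the function's output can be piped wherever it is needed.

```shell
# Same pipeline, but folding everything to uppercase before counting.
count_chars_upper() {
  sed 's/\(.\)/\n\1/g' | tr '[:lower:]' '[:upper:]' | sort | uniq -c | sort -gr
}

# Usage:
#   cat myfile | count_chars_upper | column
```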

Counting character-by-character in a word or phrase

You can also use a command similar to those shown above to count how many times each letter appears in a single word or phrase. Here’s an example:

$ echo "Hello, World!" | sed 's/\(.\)/\n\1/g' | sort | uniq -c | sort -gr | column
      3 l             1 r             1 d             1
      2 o             1 H             1 ,             1
      1 W             1 e             1 !

Using an alias

While the commands shown above are clever, they’re not easy to remember or type. Creating an alias can help with this. Once you decide what form of output you prefer, turn the command into an alias like this:

$ alias CountChars="sed 's/\(.\)/\n\1/g' | sort | uniq -c | sort -gr | column"

Save the alias in your .bashrc file so that you can use it as needed. Then use it in commands like these:

$ cat myfile | CountChars
    254              60 l            24               5 A             2 C
    163 e            58 c            23 v             4 x             1 R
    147 s            51 h            23 b             4 j             1 P
    142 a            50 d            21 f             3 z             1 N
    132 o            48 m            20 .             3 I             1 G
    119 t            36 y            16 g             2 q             1 B
    117 n            30 u             9 w             2 O             1 '
    112 i            29 p             7 T             2 M
     90 r            25 ,             5 k             2 E
$ echo "Hello, World!" | CountChars
      3 l             1 r             1 d             1
      2 o             1 H             1 ,             1
      1 W             1 e             1 !
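
One thing to keep in mind is that bash expands aliases only in interactive shells by default, so an alias like this one will not work inside a script. A shell function is a possible alternative. The sketch below uses the made-up name count_chars, assumes GNU sed, and leaves the column step to the caller.

```shell
# A shell function with the same pipeline as the alias.
count_chars() {
  # With file arguments sed reads those files; with none it reads stdin.
  sed 's/\(.\)/\n\1/g' "$@" | sort | uniq -c | sort -gr
}

# Usage:
#   count_chars myfile | column
#   echo "Hello, World!" | count_chars | column
```

Because the function passes its arguments to sed, it can be given a file name directly instead of being fed through cat.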

Using a script

If you want to see only alphabetic characters, you can use a script like the one shown below. It first changes all the letters to lowercase, then runs through the alphabet, uses awk to count the number of times each letter appears and displays the counts only if they're larger than 0. It works only on the string provided as its argument.

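#!/bin/bash

# make argument all lowercase
string=$(echo "$1" | tr '[:upper:]' '[:lower:]')

# step through the alphabet, counting each letter
for char in {a..z}
do
  # splitting on the letter yields one more field than there are occurrences
  count=$(awk -F"${char}" '{print NF-1}' <<< "${string}")
  # report only the letters that appear at least once
  if [ "$count" -gt 0 ]; then
    echo "$char:$count"
  fi
done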

Run it like this:

$ CountByChar "Hello, World!"
d:1
e:1
h:1
l:3
o:2
r:1
w:1

Note that characters will always be listed in alphabetical order. You can pipe the output to the column command if you want fewer lines of output.

$ CountByChar "Hello, World!" | column
d:1     e:1     h:1     l:3     o:2     r:1     w:1
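
The same per-letter counts can also be produced without awk. The sketch below is a different technique from the script above, not a copy of it: it deletes everything except the current letter with tr -cd and counts what is left with wc -c. The function name count_by_char is made up for this example.

```shell
# Alternative sketch: count each letter with tr and wc instead of awk.
# The body runs in a subshell (parentheses) so its variables stay local.
count_by_char() (
  # make argument all lowercase
  string=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  # letters are spelled out so the loop also works in plain sh
  for char in a b c d e f g h i j k l m n o p q r s t u v w x y z
  do
    # keep only the current letter, then count the remaining bytes
    count=$(printf '%s' "$string" | tr -cd "$char" | wc -c | tr -d ' ')
    if [ "$count" -gt 0 ]; then
      echo "$char:$count"
    fi
  done
)

# Usage:
#   count_by_char "Hello, World!" | column
```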

Wrap-up

Whether you’re looking for character counts in files or phrases, there are some handy options available. Turning the complex ones into aliases is probably the best way to make the task easy.

Sandra Henry-Stocker has been administering Unix systems for more than 30 years. She describes herself as "USL" (Unix as a second language) but remembers enough English to write books and buy groceries. She lives in the mountains in Virginia where, when not working with or writing about Unix, she's chasing the bears away from her bird feeders.

The opinions expressed in this blog are those of Sandra Henry-Stocker and do not necessarily represent those of IDG Communications, Inc., its parent, subsidiary or affiliated companies.