sandra_henrystocker
Unix Dweeb

Health tips for Unix systems

How-To
May 10, 2017 | 9 mins
Data Center, Linux

Periodic health checks can help ensure that your Unix systems are going to be available when they’re needed most. In this post, we’re going to look at some aspects of performance that should be included in your system check-ups and some handy commands that will provide you with some especially useful insights.

CPU load

Probably the most obvious health check for a Unix/Linux system is to take a look at the CPU load on a system. This is the heartbeat of a Unix system. A healthy system will have CPU power to spare. And one of the best commands for giving you a quick and easy view of how hard your CPU is working is the top command. There are a number of measurements to focus on when you use the command.

Providing a lot of information on your system’s performance, top manages to be surprisingly concise in how it displays the measurements that it reports. In particular, the load average measurements can give you a clear view of how busy the CPU is, though the numbers only report the last 15 minutes’ worth of activity. The load average is the average number of processes that are either running or waiting for their turn on a processor (on Linux, processes in uninterruptible I/O wait are counted too), so it tells you whether the system is working hard to keep up with demands and how hard. On a single-CPU system, a load average of .50 would mean that, on average, the processor was busy about half the time. The three figures provided show the load averages over the last one, five, and fifteen minutes, so you get some perspective and can also get a feel for whether the load is getting heavier or lighter. What counts as heavy depends on how many CPUs the system has: once these numbers climb past the number of CPUs (1.00 on a single-CPU system, 2.00 on a two-CPU system like the one in these examples), and especially once the fifteen-minute average does, processes are having to wait and the system is likely hurting. If this number increases or persists for a considerably longer time, the system’s performance will be noticeably poor. But, again, we’re only looking at 15 minutes’ worth of data.

The top command also displays the number of tasks (196 in the listing below, only one of which is actually running) and usage stats both for memory and swap space. On the system displayed below, swap is not being used at all. In fact, looking at the third line, you’ll see that the CPU is idle more than 99% of the time. This system is obviously only lightly used.

The memory and swap stats are shown on the fourth and fifth lines of top’s output. With no swap in use and significant free memory, this system is clearly having an easy day, or at least a very easy 15 minutes.

If there were any processes dominating the CPU, we’d see them in the list of tasks shown after the five summary lines. By default, top ranks its process list in order of CPU usage (highest first).

top - 20:47:17 up  4:25,  3 users,  load average: 0.54, 0.15, 0.05
Tasks: 196 total,   1 running, 195 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.3 us,  0.2 sy,  0.0 ni, 99.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  2017064 total,   662924 free,   448904 used,   905236 buff/cache
KiB Swap:  3635904 total,  3635904 free,        0 used.  1091240 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 1775 shs       20   0  273460  79668  48676 S   0.3  3.9   0:49.03 compiz
 3811 shs       20   0    9944   3640   3092 R   0.3  0.2   0:00.52 top
    1 root      20   0   27360   6592   5132 S   0.0  0.3   0:02.48 systemd
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.00 kthreadd
    4 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:+
    6 root      20   0       0      0      0 S   0.0  0.0   0:00.02 ksoftirqd/0
    7 root      20   0       0      0      0 S   0.0  0.0   0:00.49 rcu_sched
    8 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcu_bh
    9 root      rt   0       0      0      0 S   0.0  0.0   0:00.00 migration/0
   10 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 lru-add-dr+
   11 root      rt   0       0      0      0 S   0.0  0.0   0:00.00 watchdog/0
   12 root      20   0       0      0      0 S   0.0  0.0   0:00.00 cpuhp/0
   13 root      20   0       0      0      0 S   0.0  0.0   0:00.00 cpuhp/1
   14 root      rt   0       0      0      0 S   0.0  0.0   0:00.01 watchdog/1
   15 root      rt   0       0      0      0 S   0.0  0.0   0:00.12 migration/1
   16 root      20   0       0      0      0 S   0.0  0.0   0:00.03 ksoftirqd/1
   18 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/1:+
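Since the load average only means something in relation to the number of CPUs, it can help to divide one by the other. The following is a minimal sketch of that arithmetic, not anything top does for you. The function name load_per_core is made up for this example, and the lines after it assume a Linux system that provides /proc/loadavg and the nproc command.

```shell
# Sketch: express a load average relative to the number of CPU cores.
# load_per_core is a hypothetical helper, not part of top or any package.
load_per_core() {
    # $1 = load average, $2 = number of CPU cores
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# Assumption: Linux, where /proc/loadavg starts with the 1-, 5- and
# 15-minute averages and nproc (GNU coreutils) prints the processor count.
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
    read -r one five fifteen rest < /proc/loadavg
    echo "15-minute load per core: $(load_per_core "$fifteen" "$(nproc)")"
fi
```

For the one-minute figure in the top listing above (0.54 on a two-CPU system), load_per_core 0.54 2 prints 0.27. Values that stay above 1.00 per core would be the ones to worry about.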

Using the sar command, you can get an idea of whether what you see in your top output has held true for a considerably longer period of time. In the example below, sar has been collecting data every ten minutes for almost an hour and a half.

stinkbug# sar
Linux 4.10.0-19-generic (stinkbug)      05/08/2017      _i686_  (2 CPU)

19:32:20     LINUX RESTART      (2 CPU)

07:35:01 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
07:45:01 PM     all      0.15      0.00      0.02      0.02      0.00     99.80
07:55:01 PM     all      0.14      0.00      0.02      0.02      0.00     99.82
08:05:01 PM     all      0.15      0.00      0.02      0.02      0.00     99.81
08:15:01 PM     all      0.15      0.00      0.02      0.02      0.00     99.80
08:25:01 PM     all      0.15      0.00      0.02      0.02      0.00     99.82
08:35:01 PM     all      0.14      0.00      0.02      0.02      0.00     99.83
08:45:01 PM     all      0.22      0.00      0.06      0.05      0.00     99.67
08:55:01 PM     all      0.55      0.00      0.70      2.70      0.00     96.05
Average:        all      0.21      0.00      0.11      0.36      0.00     99.33

In this example, it’s clear that this system is consistently only lightly used.

One of the key benefits of sar is that it can collect information around the clock, so that you can see how your system is performing even when you’re not available to look. You can also use it to look at how the system is running right now. In the example below, we’re asking for three data samples, each 5 seconds apart.

stinkbug# sar -u 5 3
Linux 4.10.0-19-generic (stinkbug)      05/08/2017      _i686_  (2 CPU)

09:04:09 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
09:04:14 PM     all      0.20      0.00      0.20      0.00      0.00     99.60
09:04:19 PM     all      0.10      0.00      0.20      0.00      0.00     99.70
09:04:24 PM     all      0.20      0.00      0.10      0.00      0.00     99.70
Average:        all      0.17      0.00      0.17      0.00      0.00     99.67

Both the top and sar commands shown above provide data on how the CPU on the system is spending its time. While largely 99% or more idle, the CPU on this system is also spending a small amount of time running user processes (“%user” or “us”) and a small amount of time for system tasks (“%system” or “sy”). On a busy system, these numbers can help you to determine why the system is so busy.
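If you want to feed the idle figure into a script of your own, one possible approach is to pick it off the "Average:" line that sar prints at the end. This is only a sketch. The name avg_idle is invented for the example, and it assumes %idle is the last column and that the summary line is labeled "Average:", as in the listings above (other locales or sar options may differ).

```shell
# Sketch: pull the average %idle figure out of sar -u output on stdin.
# avg_idle is a hypothetical helper. It assumes %idle is the last column
# and that the summary line starts with "Average:", as in the listings above.
avg_idle() {
    awk '/^Average:/ { print $NF }'
}

# Example use, if sar (from the sysstat package) is installed:
#   sar -u 5 3 | avg_idle
```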

Memory usage

To look just at memory and swap space, the free command is the most convenient one to use. It will display the same variety of data that top provides, but just the memory stats.

stinkbug$ free
             total       used       free     shared    buffers     cached
Mem:       2074932    1837504     237428          0     523476     815368
-/+ buffers/cache:     498660    1576272
Swap:      4192956        112    4192844

If you run the free command with the -m option, the numbers will be expressed in megabytes – probably easier on the eyes!

stinkbug$ free -m
             total       used       free     shared    buffers     cached
Mem:          2026       1794        231          0        511        796
-/+ buffers/cache:        486       1539
Swap:         4094          0       4094

The take-homes for this system are that swap space is essentially unused and a good amount of memory is available: once buffers and cache are discounted (the “-/+ buffers/cache” line), roughly three-quarters of it is free for use.
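If you’d rather have a single number to watch, one option on Linux is the MemAvailable figure in /proc/meminfo. The sketch below assumes that field is present, which is not the case on older kernels. The name mem_avail_pct is just an illustrative one, not a standard command.

```shell
# Sketch: percentage of memory available, from /proc/meminfo-style input.
# mem_avail_pct is a hypothetical helper. It assumes the MemAvailable
# field is present, which is not the case on older kernels.
mem_avail_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%.0f\n", avail * 100 / total }'
}

if [ -r /proc/meminfo ]; then
    echo "Memory available: $(mem_avail_pct < /proc/meminfo)%"
fi
```

Fed the total and “avail Mem” values from the top listing earlier (2017064 and 1091240), this works out to 54.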

Paging and swapping

When the memory on a system is in high demand, the system has to use paging and swapping – the processes that move process data out of memory and off to the swap device and back when needed. This allows the system to behave as if it has more physical memory than it does, but comes at some cost in terms of performance. A system that is doing a lot of swapping will likely slow down considerably. The vmstat command reports on this activity. In its output on Linux, the columns to focus on are si (the amount of memory swapped in from disk per second) and so (the amount of memory swapped out to disk per second). These numbers are all 0 in the example below, but imagine them populated with numbers with two or three digits. Note too that the first line of vmstat output reports averages since the last reboot; the lines that follow cover each sampling interval.

stinkbug# vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 663184 132276 772964    0    0    15     2   16   53  0  0 99  0  0
stinkbug# vmstat 5 3
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 662936 132284 772996    0    0    15     2   16   53  0  0 99  0  0
 0  0      0 662928 132284 772996    0    0     0     0   27   52  0  0 100  0  0
 0  0      0 662928 132284 772996    0    0     0     1   28   55  0  0 100  0  0
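To spot swapping without eyeballing every line, you could count the samples in which si or so is above zero. This is a rough sketch that assumes the column layout shown above, with si and so as the seventh and eighth fields. The name swap_active_samples is made up for the example.

```shell
# Sketch: count vmstat samples (read from stdin) that show swap activity.
# swap_active_samples is a hypothetical helper. It assumes the column
# layout shown above, where si and so are the 7th and 8th fields.
swap_active_samples() {
    # Only lines whose first field is a number are data lines.
    awk '$1 ~ /^[0-9]+$/ && ($7 > 0 || $8 > 0) { n++ } END { print n + 0 }'
}

# Example use:
#   vmstat 5 3 | swap_active_samples
```

For the vmstat 5 3 output above, the count would be 0.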

Disk IO

The iostat command (particularly iostat -x) is useful for observing device input/output loading. Sometimes this information is used to justify changing the system configuration to better balance the load between devices. To make use of this information, you have to be able to translate the space-saving acronyms hovering over the device measurements, like rrqm/s and rkB/s.

rrqm/s, wrqm/s -- number of merged read and write requests queued per second
r/s, w/s -- number of read and write requests per second
rkB/s -- number of kilobytes read from the device per second
wkB/s -- number of kilobytes written to the device per second
avgrq-sz -- average request size (in sectors)
avgqu-sz -- average number of requests waiting in the device’s queue
await -- average time (milliseconds) for I/O requests to be served
r_await, w_await -- average time (milliseconds) for read and write requests to be served
svctm -- average time (milliseconds) spent servicing a request; the iostat man page warns that this figure should no longer be trusted
%util -- percentage of elapsed time during which requests were issued to the device (values near 100% suggest the device is saturated)

Of these, the avgqu-sz is one of the most important. A low value generally indicates that your system is not heavily loaded.

stinkbug# iostat -x 5 3
Linux 4.10.0-19-generic (stinkbug)      05/08/2017      _i686_  (2 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.24    0.01    0.09    0.33    0.00   99.33

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.67     0.17    1.75    0.11    29.38     3.46    35.45     0.02   10.78    8.94   41.38   2.93   0.54
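If you want just the queue size for each device, one approach is to look the column up by its header rather than by position. This is a sketch only. The name queue_sizes is invented here, and the second header name it checks for (aqu-sz) is my understanding of what newer sysstat releases call this column, so treat that as an assumption.

```shell
# Sketch: print each device's average queue size from iostat -x output.
# queue_sizes is a hypothetical helper. It finds the column by name because
# the header can differ between sysstat versions (avgqu-sz in the listing
# above; aqu-sz is assumed for newer releases).
queue_sizes() {
    awk 'NF == 0 { col = 0; next }          # a blank line ends a device table
         $1 ~ /^Device/ {
             col = 0
             for (i = 1; i <= NF; i++)
                 if ($i == "avgqu-sz" || $i == "aqu-sz") col = i
             next
         }
         col && NF >= col { print $1, $col }'
}

# Example use:
#   iostat -x 5 3 | queue_sizes
```

For the listing above, this prints "sda 0.02".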

Disk space

Disks can fill up fast depending on what’s happening on a system. Be aware of disks that might be getting close to filling up. I’ve often set up systems that I managed to send me warnings when the used space reached particular thresholds — like 75% full, 90% full, and 98% full. In the example below, we see a couple of disks that are getting close.

dragonfly# df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda1             78361192  23185840  51130588  32% /
/dev/sda2             24797380  22273432   1243972  95% /home
/dev/sda3             29753588  25503792   2713984  91% /data
/dev/sda4               295561     21531    258770   8% /boot
tmpfs                   257476         0    257476   0% /dev/shm
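A threshold check along these lines can be sketched in a few lines of shell. This is not the script I used, just a minimal illustration. The name over_threshold is hypothetical, and the parsing assumes mount points without spaces in their names.

```shell
# Sketch: list filesystems at or above a usage threshold.
# over_threshold is a hypothetical helper that reads df-style output on
# stdin. It assumes mount points contain no spaces.
over_threshold() {
    # $1 = threshold in percent
    awk -v limit="$1" 'NR > 1 {
        pct = $5
        sub(/%$/, "", pct)                  # strip the trailing percent sign
        if (pct + 0 >= limit + 0) print $6, $5
    }'
}

# Example use; -P asks df for the one-line-per-filesystem POSIX format:
#   df -P | over_threshold 90
```

Run against the df output above with a threshold of 90, this lists /home (95%) and /data (91%). From there, the output could be mailed or logged from a cron job.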

The hardware

Don’t depend on the command line to tell you everything you need to know to ensure that the systems you manage are in good shape. Check them from time to time in person. Look for warning lights and fans that might not be working as well as expected. Make sure that critical systems are plugged into UPS devices whenever possible.

Backups

Also remember that usable backups are an important part of system health. A system that cannot be fully resuscitated after a data disaster is not in good shape. Check your backups regularly to ensure that they are usable.
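What a backup check looks like depends entirely on the backup tool in use. As one small illustration, assuming your backups happen to be gzip-compressed tar archives, you could at least confirm that an archive can be read from start to finish. The name archive_readable and the path in the usage comment are placeholders.

```shell
# Sketch: check that a gzip-compressed tar archive can be listed end to end.
# archive_readable is a hypothetical helper. It assumes tar-based backups.
archive_readable() {
    tar -tzf "$1" > /dev/null 2>&1
}

# Example use (the path is a placeholder):
#   archive_readable /path/to/backup.tar.gz || echo "backup is not readable"
```

Listing an archive only shows that it is intact and readable. It is not a substitute for an occasional test restore.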

Wrap-up

Being proactive can help you ward off system problems long before they threaten operations. Periodic health checks can also help you to be familiar with how a system is generally performing and this can help you recognize when a system is undergoing an unusual problem.

Sandra Henry-Stocker has been administering Unix systems for more than 30 years. She describes herself as "USL" (Unix as a second language) but remembers enough English to write books and buy groceries. She lives in the mountains in Virginia where, when not working with or writing about Unix, she's chasing the bears away from her bird feeders.

The opinions expressed in this blog are those of Sandra Henry-Stocker and do not necessarily represent those of IDG Communications, Inc., its parent, subsidiary or affiliated companies.