sandra_henrystocker
Unix Dweeb

Unix tip: Monitor disk arrays with sccli commands

Analysis
Dec 04, 2008 | 4 mins
Computers and Peripherals, Data Center, Open Source

Clearly one of the best features of disk arrays is that they can keep working even when a disk has failed. One of the problems, however, is that you might not notice the failure and thus fail to replace the disk in a timely manner. Let’s take a look at what you can do to monitor your storage (StorEdge) arrays so that a bad disk doesn’t escape your notice.

First, there are two ways to determine that a disk in a StorEdge array has failed. You might notice that an amber LED on the front of the particular drive has lit up or you can use the sccli commands to view the disks contained in your array and their status.

To start sccli, log into the server to which the array is attached and type “sccli”. You should connect to the device and find yourself sitting at the sccli> prompt. To view the state of the disks on your array, type “show disks”. In the display below, one of the disks is reported to be “BAD”. The particular system was still running and still had one disk in “STAND-BY” mode, so the failure was not an emergency. Still, it’s a good idea to ensure that all disks in the array are working properly to steer clear of failures from which your array would not be able to recover without intervention.

sccli> show disks
Ch     Id      Size   Speed  LD     Status     IDs                      Rev  
----------------------------------------------------------------------------
 2(3)   0   68.37GB   200MB  ld0    ONLINE     FUJITSU MAW3073FCSUN72G  1303 
                                                   S/N 000640B0H8A7    
                                                  WWNN 500000E0130AC3A0
 2(3)   1   68.37GB   200MB  ld0    ONLINE     FUJITSU MAP3735F SUN72G  1701 
                                                   S/N 000408Q088GS
                                                  WWNN 500000E01076CB30
 2(3)   2   68.37GB   200MB  ld0    ONLINE     FUJITSU MAP3735F SUN72G  1701 
                                                   S/N 000408Q08ALW
                                                  WWNN 500000E010776B60
 2(3)   3   68.37GB   200MB  ld0    ONLINE     FUJITSU MAP3735F SUN72G  1701 
                                                   S/N 000408Q089V9
                                                  WWNN 500000E0107729D0
 2(3)   4   68.37GB   200MB  ld0    ONLINE     FUJITSU MAP3735F SUN72G  1701 
                                                   S/N 000409Q08G32
                                                  WWNN 500000E01078E900
 2(3)   5       N/A   N/A    NONE   BAD        FUJITSU MAP3735F SUN72G  1701 
                                                   S/N 000409Q08G5Y
                                                  WWNN 500000E01078EFF0
 2(3)   6   68.37GB   200MB  ld1    ONLINE     FUJITSU MAP3735F SUN72G  1701 
                                                   S/N 000408Q08FV1
                                                  WWNN 500000E01078DBF0
 2(3)   7   68.37GB   200MB  ld1    ONLINE     FUJITSU MAP3735F SUN72G  1701 
                                                   S/N 000408Q08FNM
                                                  WWNN 500000E01078CE70
 2(3)   8   68.37GB   200MB  ld1    ONLINE     FUJITSU MAP3735F SUN72G  1701 
                                                   S/N 000408Q089C1
                                                  WWNN 500000E0107711D0
 2(3)   9   68.37GB   200MB  ld1    ONLINE     FUJITSU MAP3735F SUN72G  1701 
                                                   S/N 000409Q08G99
                                                  WWNN 500000E01078F6A0
 2(3)  10   68.37GB   200MB  ld1    ONLINE     FUJITSU MAP3735F SUN72G  1701 
                                                   S/N 000409Q08G15
                                                  WWNN 500000E01078E550
 2(3)  11   68.37GB   200MB  GLOBAL STAND-BY   FUJITSU MAP3735F SUN72G  1701 
                                                   S/N 000409Q08G06
                                                  WWNN 500000E01078E370

In this array, we can see that all the disks are 72 GB Fujitsu drives. Drives in the same chassis can be different sizes, but should all be running at the same speed.

Depending on the nature of a disk failure, your disk array could be working at reduced speed in order to compensate for the missing data.
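If you save "show disks" output to a file, a small awk filter can pull out just the drives that need attention. What follows is only a sketch: the function name list_problem_disks is made up for this example, and it assumes the first six columns are always filled in the way they are in the listing above, so that the status is the sixth field. It reports anything that is neither ONLINE nor STAND-BY, which may flag some harmless states as well.

```shell
#!/bin/bash

# list_problem_disks (hypothetical helper): read "show disks" output on
# stdin and print "channel id status" for each drive whose status is
# neither ONLINE nor STAND-BY. Only lines whose first field looks like
# "2(3)" are drive lines; headers and S/N and WWNN lines are skipped.
list_problem_disks() {
    awk '$1 ~ /^[0-9]+\([0-9]+\)$/ && $6 != "ONLINE" && $6 != "STAND-BY" {
        print $1, $2, $6
    }'
}

# Demo on a few lines copied from the listing above.
list_problem_disks <<'EOF'
Ch     Id      Size   Speed  LD     Status     IDs                      Rev
 2(3)   4   68.37GB   200MB  ld0    ONLINE     FUJITSU MAP3735F SUN72G  1701
                                                   S/N 000409Q08G32
 2(3)   5       N/A   N/A    NONE   BAD        FUJITSU MAP3735F SUN72G  1701
                                                   S/N 000409Q08G5Y
 2(3)  11   68.37GB   200MB  GLOBAL STAND-BY   FUJITSU MAP3735F SUN72G  1701
EOF
# prints: 2(3) 5 BAD
```

On a system where the default awk is an older implementation, you may need to substitute nawk.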

The sccli (version 2.1.0) utility provides more than a hundred commands for reporting on various aspects of your disk array and is a good tool to get to know. The “show enclosure-status” command, for example, will report on fans and voltages as well as disk channels. The “show led-status disk 2.5” command (where the digits in 2.5 represent the channel and disk IDs) will tell you if an LED is lit:

sccli> show led-status disk 2.5
 (enclosure sn 001363) led-slot-5: off

The “show media-check” command will give you an idea of how the array is recovering after the insertion of a new disk. This process takes a long time and can be aborted if you want.

sccli> show media-check
 Ch  ID  Iteration  Status
------------------------------
  2   1  49         95% complete
  2   2  49         65% complete
  2   3  49         30% complete
  2   4  49         30% complete
  2   5  48         53% complete
  2   6  48         31% complete
  2   7  48         28% complete
  2   8  48         30% complete
  2  10  48         28% complete
  2  11  48         27% complete
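Since the media check runs for a long time, you may only care about the drive that is furthest behind. Here is a rough sketch that picks it out of saved "show media-check" output. The function name slowest_media_check is invented for this example, and the code assumes the Status column always reads "NN% complete" as it does above; lines with any other wording are simply ignored.

```shell
#!/bin/bash

# slowest_media_check (hypothetical helper): read "show media-check"
# output on stdin and print "channel.id percent" for the drive with the
# lowest completion percentage. If two drives tie, the first one wins.
slowest_media_check() {
    awk '$4 ~ /^[0-9]+%$/ {
        pct = $4 + 0                      # "95%" converts to the number 95
        if (min == "" || pct < min) {
            min = pct
            who = $1 "." $2
        }
    }
    END { if (who != "") print who, min "%" }'
}

# Demo on a few lines copied from the listing above.
slowest_media_check <<'EOF'
 Ch  ID  Iteration  Status
------------------------------
  2   1  49         95% complete
  2   5  48         53% complete
  2  11  48         27% complete
EOF
# prints: 2.11 27%
```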

To collect information non-interactively, you might try putting commands like these in a script. These commands will show the disk output (like the example above) and the logical drives as well.

sccli <<EOF
show disks
show ld
exit
EOF

Of course, you're not going to want to see this output every day. What you will want to see is this output when there's a problem. You could incorporate these commands into a script that looks for the "BAD" indicator and sends you email only when an indication of a bad disk exists.

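#!/bin/bash

EMAIL="mymail@mysite.com"
SYS=`uname -n`

# Run sccli non-interactively and save its output to a temporary file
sccli <<EOF > /tmp/$$
show disks
show ld
exit
EOF

# Mail the full report only if some disk is marked BAD
grep BAD /tmp/$$ >/dev/null 2>/dev/null && cat /tmp/$$ \
    | mailx -s "BAD disk reported on $SYS" $EMAIL
rm /tmp/$$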
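If you want to harden the script a little, one possible variation is sketched below. It is not a drop-in from any Sun documentation, just an illustration: it uses mktemp rather than /tmp/$$ for the temporary file, matches BAD only as a whole word so a serial number that happens to contain those letters won't trigger mail, and wraps the check in a function. The SCCLI and MAILER variables and the check_array name are hypothetical additions that let you dry-run the logic on a machine with no array attached.

```shell
#!/bin/bash

# Sketch only. SCCLI and MAILER are hypothetical override hooks; by
# default they point at the same commands the article's script uses.
EMAIL="${EMAIL:-mymail@mysite.com}"
SCCLI="${SCCLI:-sccli}"
MAILER="${MAILER:-mailx}"
SYS=$(uname -n)

# check_array: returns 0 if no BAD disk was reported, 1 if mail was
# sent about a BAD disk, 2 if the temporary file could not be created.
check_array() {
    local report status=0
    report=$(mktemp) || return 2

    # Feed the same three commands to sccli on standard input.
    printf '%s\n' 'show disks' 'show ld' 'exit' | "$SCCLI" > "$report"

    # -w matches BAD only as a whole word, not inside a longer string.
    if grep -w BAD "$report" >/dev/null 2>&1; then
        "$MAILER" -s "BAD disk reported on $SYS" "$EMAIL" < "$report"
        status=1
    fi

    rm -f "$report"
    return "$status"
}
```

Add a call to check_array as the last line of the script to run the check. The return code could also be used by a wrapper that wants to take some other action when a BAD disk turns up.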

Don't let a disk failure go unnoticed just because your system can keep on running!


Sandra Henry-Stocker has been administering Unix systems for more than 30 years. She describes herself as "USL" (Unix as a second language) but remembers enough English to write books and buy groceries. She lives in the mountains in Virginia where, when not working with or writing about Unix, she's chasing the bears away from her bird feeders.

The opinions expressed in this blog are those of Sandra Henry-Stocker and do not necessarily represent those of IDG Communications, Inc., its parent, subsidiary or affiliated companies.