• Articles 07-14-2010

    I believe that a computer system should be stable. In order for it to be stable it must first be predictable. Although nothing can be 100% predictable, I will define predictable as in “I predict that a flight that takes off from Atlanta to Tampa will arrive X hours later”. This level of predictability does not happen by magic. Those planes are serviced on a rigorous schedule, and that service is always(?) done on the ground. I don’t really know of any situations where aircraft had to be serviced while in flight (production mode). The opposite approach to maintenance, “If it ain’t broke, don’t fix it”, leads to many production interruptions that are sometimes very avoidable. Luckily, sysadmins do have the luxury of emergency maintenance, but let us all shoot for the skies and try to think of our systems as if we did not have that luxury.

    Let’s face it, your system will become unpredictable at some point. How the system is handled in this state makes a huge difference in my mind. Many common practices I have witnessed in use actually reduce predictability. It is generally not a good idea to make ANY changes to a system that is acting in an unpredictable manner. The fact that it is doing something unexpected means that something is happening under the hood that you don’t yet understand. Computers have no ‘mind of their own’ and definitely do not have ‘mood swings’. A lack of understanding can lead to fear and superstition, and will usually end up in the practice of ‘sysadmin voodoo’.

    The ‘uninstall/reinstall’ ritual is often practiced. It may work on some operating systems, but for applications that run on UNIX this usually causes more problems than it fixes and puts the system into a state that is even harder to understand. How would you like it if your doctor uninstalled you in an attempt to cure a disease that he did not have the right information to diagnose? Well, you wouldn’t care because you would be dead. Although there is no equivalent of dead for a computer system, you can see that this method leads to an unnecessary amount of work, keeps the system down longer, and may actually result in a configuration that is different from what was there before. Why would anyone do this?

    Another voodoo ritual is the ‘wheel-of-fixes’. This involves making a best guess as to what will correct the problem based on the symptoms. Would you like to be operated on that way? “Oh, I will remove your tonsils now because your throat is sore.” This approach can end up breaking something that was working, can introduce new problems, and may not even solve the original issue.

    So how do you handle an unpredictable system? Troubleshooting. There is a saying, “Troubleshooting is not so much about knowing what it is you need to fix as it is about knowing what NOT TO FOOL WITH!” It sounds silly at first, but if you think about the question it raises, it starts to make sense. How can you tell what to fix and what not to fool with? Well, it means you must have the right information at your disposal when you are troubleshooting. Doctors use very specialized equipment and methods to troubleshoot, and so it should be with sysadmins. Without the proper tools and methods your job will be a nightmare and you, too, might resort to occult practices.

    Over the years I have come up with a system of tools and methods that WORKS, and works well.
    I call it VIDCAPS (Visibility, Intelligence, Documentation, Communication, Automation, Processes, and Standards):
    V: visibility. Do you know the current running state of your systems? Let’s face it, you wouldn’t drive your car if the instrument cluster were not there, would you? You would not be able to tell how fast you were going, if you were about to run out of gas, if the engine was overheating, etc. I will call anything that provides real-time information to you a dashboard, so ‘top’ is a dashboard for the purposes of this discussion. With proper visibility you will be better able to predict what your system will NOT do next. I find it interesting that anyone would operate this type of equipment without the proper instrumentation.

    One of the things on your car dash is an odometer. You look at this to tell when to change the oil, although the newer models will tell you when service is due. BTW, I want automakers to design the thing to drive into the shop by itself and have a robot service it :) With the proper visibility you will be able to know when certain maintenance tasks need to be performed.

    One challenge of using the built-in dashboards is that they generally live on the machine itself. What if you have bunches of servers? Who has the time to read an email from every system every morning, or log in to each control panel to see what is going on?
    Another problem is that the dashboards provide general information and they may not be customizable. Besides, it is not a good practice to mess with system-supplied tools. You may at some point want to see how many of a certain process or network connection type are active.
    Some cars have a tachometer, which is nice, but most drivers don’t really care about engine RPM. The existing dashboards might be ‘cluttered’ with stuff you really don’t care about.
    Also, most system-supplied dashboards give point-in-time data only. What if you wanted to see a graph of your /data mountpoint over the last month?
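
    To make the point-in-time limitation concrete, here is a minimal Python sketch of collecting history yourself. It is only an illustration, not how Hyperic or any other monitoring tool works: the function names, the CSV layout, and the history file path are all assumptions made up for this example. Run from cron, it would append one sample per run, and the resulting file could be fed to any graphing tool.

```python
import csv
import shutil
import time
from pathlib import Path


def record_disk_usage(mountpoint, history_file, now=None):
    """Append one (timestamp, mountpoint, used_percent) sample to a CSV file.

    The CSV layout is a made-up convention for this sketch.
    """
    usage = shutil.disk_usage(mountpoint)  # named tuple: total, used, free (bytes)
    if usage.total == 0:
        used_percent = 0.0  # some pseudo filesystems report a size of zero
    else:
        used_percent = round(100.0 * usage.used / usage.total, 2)
    timestamp = int(now if now is not None else time.time())
    with open(history_file, "a", newline="") as handle:
        csv.writer(handle).writerow([timestamp, mountpoint, used_percent])
    return used_percent


def load_history(history_file, mountpoint):
    """Return the recorded (timestamp, used_percent) pairs for one mountpoint."""
    if not Path(history_file).exists():
        return []
    with open(history_file, newline="") as handle:
        return [
            (int(ts), float(pct))
            for ts, mp, pct in csv.reader(handle)
            if mp == mountpoint
        ]
```

    For example, a daily cron job could call record_disk_usage("/data", "/var/tmp/disk_history.csv") (both paths are placeholders), and load_history would then return the month of samples to plot.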

    There are often business-specific metrics that you may want to have available so I look for every opportunity to add a new gadget to my dashboard, especially when I’m troubleshooting a certain problem and I’m looking for a certain piece of information that is not already on a dashboard.
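
    As an example of such a gadget, the sketch below counts how many copies of each command are running. It is a rough illustration under stated assumptions: it relies on the POSIX form `ps -e -o comm=`, and the function names are invented for this article rather than taken from any monitoring product.

```python
import subprocess
from collections import Counter


def count_commands(ps_output):
    """Count processes per command name, given the text printed by `ps -e -o comm=`."""
    names = (line.strip() for line in ps_output.splitlines())
    return Counter(name for name in names if name)


def running_process_counts():
    """Run ps on this machine and return a Counter of command name -> process count."""
    result = subprocess.run(
        ["ps", "-e", "-o", "comm="],
        capture_output=True,
        text=True,
        check=True,
    )
    return count_commands(result.stdout)
```

    Calling running_process_counts()["httpd"] would then give the number of httpd processes at that moment (httpd is just an example name), which is exactly the kind of number you can feed into a custom metric.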

    The answer is Monitoring. For cost-conscious businesses, there are several free monitoring tools. Hyperic is one I have used and it works very well. They have an Enterprise version that offers better features like the ability to alert on a combination of metrics, but at least try the free one to see how many of your needs it will meet. Another thing I like about Hyperic is that it is very customizable. You will most likely have very business-specific metrics that you need to monitor. I will be writing a separate article on custom plugins for the Hyperic server.

    An important part of monitoring is Change Assessment. Do you know what files are being changed on your system on a day-to-day basis? Tripwire Enterprise is well worth the money, because on production systems you want to know exactly what has changed. If set up correctly, Tripwire will also ‘version’ the changed files, so you will not only be able to see what about a file has changed, you will have the option to put it back the way it was before. Change Assessment and Change Control (to be discussed in the Communication section) need to tie together so you can ‘reconcile’ the changes. Before making any changes, submit a Change Request. Somewhere in the Change Request there should be a list of files that you intend to modify, and, if it is possible for you to predict, the MD5 or SHA1 hash of each file after it is deployed to the server. The next day, when the change report comes in from Tripwire, you can match what the servers saw to what you intended to do.
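
    The reconciliation step can be sketched in a few lines of Python. This is a hedged illustration only: it does not use or imitate Tripwire’s report format or API. The two dictionaries (path to hex digest) are assumed to have been extracted by hand or by your own script from the Change Request and from the change report, and the function names are hypothetical.

```python
import hashlib


def sha1_of_file(path, chunk_size=65536):
    """Return the SHA1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def reconcile(planned, observed):
    """Compare planned changes with observed changes.

    planned:  path -> hash listed in the Change Request
    observed: path -> hash seen on the server after the change
    """
    return {
        # Changed exactly as intended.
        "matched": sorted(p for p in planned if observed.get(p) == planned[p]),
        # Changed, but the content is not what the Change Request promised.
        "mismatched": sorted(
            p for p in planned if p in observed and observed[p] != planned[p]
        ),
        # Listed in the Change Request but never changed on the server.
        "missing": sorted(p for p in planned if p not in observed),
        # Changed on the server without a Change Request entry.
        "unplanned": sorted(p for p in observed if p not in planned),
    }
```

    Anything that lands in the mismatched, missing, or unplanned lists is a change you need to explain before you can call the system predictable again.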

    Although more related to Documentation (discussed below), make sure Tripwire is saving the content of your production-tested configurations. Out of the box, Tripwire may not be collecting and versioning the information you may need to quickly restore your system back to the Last Known Good state. Tripwire has a database module that can version database objects, and it is really not that expensive. I highly recommend this add-on for any production database.

    There is an open source version of Tripwire, but for production systems I recommend using the real deal.

    I: intelligence. So you have a dashboard in your car. There is a little needle thingie that gives the engine temperature. It is supposed to stay under the red band. My wife has driven her car until smoke started pouring out from under the hood, because she did not notice that needle thingie creeping into the red. Well, it is because her car does not have any other means of communicating engine temperature. In her defense, I did once run out of gas at 4AM! Most of the new models will start ringing a dingbell and lights will start flashing on the dashboard. This is Intelligence. Hyperic allows you to set thresholds on a metric, and can generate an alert when that threshold is exceeded. Even the free version permits very sophisticated rule creation, although it lacks the ability to set an alert based on a combination of metrics.

    This will improve both stability and performance because you will be able to head off problems before they bring the system to a crawl, or worse yet, cause the system to crash.
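
    The idea of a threshold alert is simple enough to sketch in Python. This is a generic illustration and not Hyperic’s rule engine: the Threshold class, the metric names, and the limits are all assumptions invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str    # name of the metric in the sample, e.g. "cpu_percent"
    limit: float   # alert when the sampled value goes above this
    message: str   # what to tell the admin


def check_thresholds(sample, thresholds):
    """Return one alert string for every metric in sample that exceeds its limit."""
    alerts = []
    for rule in thresholds:
        value = sample.get(rule.metric)
        if value is not None and value > rule.limit:
            alerts.append(
                f"{rule.metric}={value} exceeds {rule.limit}: {rule.message}"
            )
    return alerts


def all_exceeded(sample, thresholds):
    """Combination rule: true only when every listed threshold is exceeded."""
    return all(
        sample.get(rule.metric) is not None and sample[rule.metric] > rule.limit
        for rule in thresholds
    )
```

    The first function is the ‘dingbell’ for a single needle. The second shows what alerting on a combination of metrics means: it only fires when, say, CPU and disk usage are both over their limits at the same time.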

    D: documentation. What is documentation? Is it just having an article describing something, or does the accuracy of that description make it Documentation? When performing a build or install, keep notes as you go along. When you’re done, create a document listing the steps, then go back and perform the build again, using that document. This way you can be certain the doc is valid. Although existing documentation may not be 100% accurate, it is still a good thing. Try as much as possible to validate any existing documents, updating as necessary.

    Code Versioning falls into this category. I use Subversion and WebSVN. Although there are enterprise code visualization tools, WebSVN works well for small code files, and it is free. I always put the SVN checkout link in my scripts as well, so when someone else sees a script on the server they know where to find it in the repository to make updates if necessary. It would also be a good thing to create a process (to be discussed shortly) whereby the scripts are not modified on the system itself, but checked out of SVN, modified, tested and committed, then deployed on the system.

    C: communication. Have you ever had someone do something that you wish they had asked you about first? Make sure that all persons responsible for a system, and all users of that system, are informed of ANY changes you are going to make. Change Control handles this piece of the puzzle. I found an excellent free web-based tool, Brage, that works very well.

    I have a habit of running ‘w’ as soon as I log into a system anyway. I want to see who else is there. I can communicate with other users on the system using ‘write’ so we don’t step on each other’s toes.

    Communication to me also means doing what you say, when you promised, and in EXACTLY the same way you told your users. I hate unpleasant surprises, and for that reason I do not impose them on others.

    A: automation. I tell folks that I’m a UNIX Systems Administrator and they say, “so you work on computers”. I don’t see myself as someone who “works on computers”. I have computers that work for me. I look for every opportunity to script things so I don’t have to do them. There are several benefits of automation. Firstly, I’m lazy. I prefer not to work, so if I can make the system take care of itself, I can sit back and drink coffee (just kidding :) ). Secondly, I make mistakes, so I don’t really trust myself to perform a task that might involve several commands in a row. What if I miss a step or mistype a command or argument? Third is that I get bored and tired easily. I don’t really care to do the same thing over and over. Fourth is I need to sleep. Who wants to be working at 3AM? Fifth is that I forget things often. Forgot your wife’s birthday, anyone?

    By automating tasks that can be automated you will dramatically improve manageability by having more stuff run in maintenance windows, and of course, get more sleep :)

    Automation, however, has its perils. Understand first what is required, create a plan to achieve that goal, keeping in mind the things that can go wrong, then start to script. Always test the script thoroughly by simulating every condition that will cause it to branch :)
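
    One way to make every branch testable is to keep the decision separate from the action, so the decision can be exercised without touching the system. The Python sketch below assumes a hypothetical log cleanup script; the function name, the action names, and the limits are invented for illustration.

```python
def plan_cleanup(used_percent, oldest_log_age_days, warn_at=80.0, max_age_days=30):
    """Decide what a hypothetical cleanup script should do, without doing it."""
    if used_percent < warn_at:
        return "nothing"            # disk is fine, leave everything alone
    if oldest_log_age_days > max_age_days:
        return "delete-old-logs"    # disk is filling and there are stale logs
    return "alert-admin"            # disk is filling but nothing is safe to delete


def test_every_branch():
    """Simulate each condition that makes plan_cleanup branch, including the edges."""
    assert plan_cleanup(50.0, 90) == "nothing"
    assert plan_cleanup(85.0, 90) == "delete-old-logs"
    assert plan_cleanup(85.0, 5) == "alert-admin"
    # Boundary values: exactly at the warning level and exactly at the age limit.
    assert plan_cleanup(80.0, 31) == "delete-old-logs"
    assert plan_cleanup(80.0, 30) == "alert-admin"


if __name__ == "__main__":
    test_every_branch()
    print("all branches exercised")
```

    Only after the plan has been proven on every branch would the real script be allowed to act on its answer.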

    P: Processes. There are some things that you just can’t automate, so a person has to do them. Well, do you do it the same way Joe does? How about Harry? Before you create a process, you should ask, “Is there a recommended way of doing this?” I will address this next in talking about Standards.

    Using the same methodology as automation, write out a ‘script’ for the process. Create a checklist for this process. This will ensure that everyone does the same task the same way every time. This will improve the manageability of your systems because every piece of every system that is supposed to be the same will be the same, and various iterations of that task on the same system will result in a predictable outcome.

    S: Standards. Standards and processes are somewhat alike, but different enough in my mind to warrant a separate discussion. Standards, Best Practices, and Vendor Recommendations are all lumped together here. The opposite of this approach, doing it your way, puts the systems in a state that is difficult for anyone else to understand. Although this practice may not be meant to cause harm, it usually does. I spoke on this above: don’t create a process for which there is already a well-established standard.

    Besides improving manageability, following the established standards will help the security and stability of the systems. It is often the case that businesses have ‘strange’ problems on their systems that can easily be corrected by following one or more established standards.

    Conclusion:
    So why did I not mention backups? Well, this was a discussion on manageability, security, stability, and performance.  Besides, if you are not already performing backups you must not care enough about your system, so VIDCAPS is going to be a waste of your time!

    Look out for upcoming articles on Hyperic installation, setting up alerts, and writing custom plugins.

    External Links
    Human Side of System Administration

    Sysadmin Voodoo

    Parm Patram is a 10-year Linux/Solaris veteran with a proven track record of improving manageability, stability, security and performance.

