Editor's Note: This article is part one in a two-part series, published under a Creative Commons License. For those who may not know, Erik Troan is one of the original authors of RPM (Red Hat Package Manager). As every HPC administrator knows, writing scripts is part of the job. Erik offers some insights into why this can lead to unexpected problems.
There are two kinds of people in the world. Those who divide the world into two kinds of people and those who don't.
Okay, an old joke, but I'm clearly the first kind of person. I try to split everything into two buckets. System automation solutions lend themselves to this two-sizes-fits-all mantra, with solutions splitting into two camps: scripting and model-based approaches.
It seems like most people think of scripting as the best approach to automation. Whether it's simple shell (or PowerShell) scripts, attaching scripts to Opsware machine definitions, or running an inscrutable Perl one-liner, scripting is king. Engineers and system administrators tend to break problems down into steps, and scripting is a way of codifying those steps and running them on lots of machines.
Scripting is easy to understand and a natural first step, but it also has serious problems. It's a great tool when nothing else is available, but scripts are difficult to write, impossible to test, unverifiable, and non-invertible.
Why are scripts difficult to write? There are a few reasons; the most obvious is that you're describing how to get to a new state, rather than just describing how things should wind up. Think about how building architects work. They draw up blueprints which completely describe how the important parts of the new building need to end up. What columns support which beams, where the electrical needs to go for code, and what plumbing needs to be put in for fire safety. The construction crews then decide how to get the steel, wires and pipes into place. Can you imagine if the architect had to write a detailed list of instructions describing how to build a building? Down to what kind of screws to use to hold up the drywall and what kind of drill should be used to put them in place?
Writing scripts for system automation means you have to describe every step. Every time. This forces scripting languages to be Turing-complete; they're powerful languages that can solve any problem. But that power cuts both ways: it's a basic result of computer science that you can't, in general, prove what a Turing-complete program will do, so the scripts are impossible to analyze for correctness.
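To make the "describe every step" point concrete, here's a minimal sketch (the file and setting names are hypothetical, made up for illustration). An imperative script that appends a setting works the first time, but a second run blindly repeats the step, because the script describes actions, not the desired end state:

```shell
# Hypothetical example: imperatively "ensuring" a setting exists.
CONF=$(mktemp)
echo "max_connections=100" >> "$CONF"
echo "max_connections=100" >> "$CONF"   # a second run repeats the step
grep -c "max_connections" "$CONF"       # prints 2 -- a duplicate line, not the 1 you wanted

# The defensive version has to check current state itself --
# every script, every setting, every time:
grep -q "^max_connections=" "$CONF" || echo "max_connections=100" >> "$CONF"
```

A model-based tool would instead be told "this line must exist in this file" once, and work out the steps (or the no-op) on its own.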
So once you've written a script, how do you test it? Install a machine and run it? Go log into a box and run it there, watching it closely? How do you know the box you're testing it on is a good enough representative of the 1,000 machines you're about to run the script on? Like it or not, machines drift. Configurations change, and software installs change. That handcrafted script has to be able to adapt to every one of these divergent machines. You can test a few cases, but are you really testing it exhaustively? Have you tested the error cases, or will things silently fail? Or even worse, will they fail in a manner which leaves the system unresponsive? Software companies pay a lot of people to test their code under every conceivable situation, and we still wind up with Vista. Does an IT staff test their scripts that carefully?

Let's say you got the script written and tested. How do you know it's doing the right thing when you run it? If it makes a configuration change, can you verify the change was made correctly? Chances are you haven't had to describe the change anywhere other than a whiteboard. So what checks those 1,000 machines to make sure the script did the right thing? How do you audit the system to make sure the script didn't break the change that the previous script was supposed to have made?
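Here's a sketch of what drift does to a "tested" script (the config lines are invented for illustration). The same simple parse that passed testing on box A silently returns nothing on box B, whose file was hand-edited into a slightly different format, and the script still exits successfully:

```shell
# Box A: the machine the script was tested on.
A=$(mktemp); echo "Port 22" > "$A"
# Box B: same setting, but a hand edit changed the separator.
B=$(mktemp); echo "Port=22" > "$B"

awk '/^Port /{print $2}' "$A"   # prints 22
awk '/^Port /{print $2}' "$B"   # prints nothing -- and still exits 0,
                                # so the calling script never notices
```

This is the "silent failure" case: nothing crashes, nothing logs an error, and the script happily reports success on a machine it never actually understood.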
Finally, if you've navigated all of those mines, what if the change was simply incorrect? The script did what it was supposed to do, but it turned out to be a bad idea. How do you undo it? Scripts are, by nature, non-invertible. You can't say "oops, let's just undo that." Instead, you're writing a new "undo" script, and testing it, and (hopefully) checking that it did the right thing. Making non-invertible changes to production systems is crazy.
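The non-invertibility is easy to see in even a one-line change (a hypothetical config, GNU sed assumed). The moment the substitution runs, the old value is gone from the box; an "undo" script would have to learn it from somewhere else entirely:

```shell
# Hypothetical sketch: a change that destroys the information needed to undo it.
CONF=$(mktemp)
echo "port=8080" > "$CONF"
sed -i 's/^port=.*/port=9090/' "$CONF"
cat "$CONF"   # port=9090 -- the old value, 8080, survives nowhere on the machine
```

A model-based system, by contrast, knows both the previous desired state and the new one, so reverting is just applying the old model again.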
Face it. Scripts suck.