Tuesday, July 27, 2010

quality control in your network

(disclaimer: i've never worked in quality control, but this is my view of it as someone who has had to work with QC)

while i sit here at 3 in the morning waiting for a server daemon to dutifully segfault so i can continue debugging, i reflect on how quality control is lacking from so many networks both large and small. a single oversight *can* mean the difference between your company losing money and going under, so you must be aware of any potential problems at all times.

first of all, what is quality control and why do you need it? well, chances are if you do your job halfway right, you're already doing it. quality control is basically double-checking that the methods you use to do your job are correct. it doesn't verify that the final product is good; it's more like tripwire in procedure form. it's making sure that things work the way you expect them to. you are already doing it when you verify what development libraries are installed. when you run unit tests against your software. when your change management system verifies a user is allowed to commit that particular piece of code, or restart that service. it's checking the tapes to make sure the backup robot is functioning. it's verifying configs are written properly on the router and the updated ones are regularly saved in version control.

typically you don't need much quality control in the average network. some product development may require strict control and observation of policies and procedures, which is usually only enforced because of the risk of random audits or inspections. depending on your environment you may be required to do very little or no quality control at all. but i'd like to tell you about the quality control you should be doing.

the quality control people aren't usually technical people. a lot of the time they'll work with someone on the team responsible for whatever they're checking out, ask questions and make notes. the first big formal procedures don't include everything. usually details get hashed out while the QC engineer talks to someone (a dev in this case) about what they do and how to check that what they did worked.

the basic principles you should keep in mind when applying quality control to your network are as follows:
1. Keep It Simple, Stupid. it doesn't have to be verbose or complex. be flexible. be easy.
2. it should be possible for someone to check the work of the quality control engineer(s).
3. you don't want to define how everything works; only how to tell if it's working as expected.
4. your goal is to make sure there are no catastrophic failures. you don't have to account for every blip along the road as long as the road is open.
5. start with the big things and move down once all the big things are covered. close off those single points of failure and move on to the other pressing issues.

hopefully this post will give you an idea of how you can apply quality control to your network to improve the overall quality of service you provide. half of this is just making sure things work right, and the other half is checking that there are records showing they've been used properly in the past. here's some stuff you can do.

developers
double-check that your software is being created correctly. check that the libraries on your development boxes match up with what's going into QA or production. use unit tests on your code. make sure everything goes into version control *before* it ever hits QA or production, and make sure you know who made what change and why. make sure the method of deployment can be reversed at any time. make sure you follow change management procedures when necessary.
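as a minimal sketch of the library check above: this assumes each box can dump its installed libraries as `name==version` lines, which is just a format picked for illustration. the function names `parse_manifest` and `diff_manifests` and the library names in the demo are hypothetical too. it only tells you *where* dev and production disagree, not which side is right.

```python
def parse_manifest(text):
    """Parse 'name==version' lines into a dict, skipping blanks and comments."""
    packages = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, version = line.partition("==")
        if not sep:
            continue  # ignore lines without a pinned version
        packages[name.strip().lower()] = version.strip()
    return packages


def diff_manifests(dev, prod):
    """Return {name: (dev_version, prod_version)} for every mismatch.

    A version of None means the library is missing on that side.
    """
    mismatches = {}
    for name in sorted(set(dev) | set(prod)):
        dev_version = dev.get(name)
        prod_version = prod.get(name)
        if dev_version != prod_version:
            mismatches[name] = (dev_version, prod_version)
    return mismatches


if __name__ == "__main__":
    # made-up manifests, purely for illustration
    dev = parse_manifest("libfoo==1.2\nlibbar==2.0\n")
    prod = parse_manifest("libfoo==1.2\nlibbar==2.1\nlibbaz==0.9\n")
    for name, (d, p) in diff_manifests(dev, prod).items():
        print(f"{name}: dev={d} prod={p}")
```

an empty result means the two boxes agree; anything else is worth a look before the code moves to QA or production.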

sysadmins
double-check that you've confirmed with everyone before you push a new piece of software to QA or production, and that you can roll it back when necessary. so check that your change management is working. it's good to have a list of the major and sub-major software that different development teams rely on (usually libraries) and get a change-management approval before ever pushing this stuff out. do it early so devs have time to test their shit with the new software. make sure your backups work correctly. you should be able to check logs and destination files regularly to confirm the backups are going well. if those or any other automated process fails it should generate an alert, and you should be able to verify those alerts are going out as expected (did /var fill up and is sendmail unable to work now?). make sure all security patches are applied in a timely manner. make sure all service-monitoring systems are working, and that failover of critical systems is in place and works as expected. make a list of all critical infrastructure and make sure all of it has hot-spare failover systems waiting in the wings. provide for methods to troubleshoot remotely in the event of total internet or system collapse. make sure any network gear you depend on also has hot failovers that work.
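here's a rough sketch of the "check the destination files" part of the backup advice. it assumes the backup lands as a single file on disk, and the function name `check_backup` plus the thresholds (24 hours, at least 1 byte) are placeholders for illustration, not recommendations. it only proves a file of some size was written recently; it doesn't prove you can restore from it.

```python
import os
import time


def check_backup(path, max_age_hours=24, min_size_bytes=1, now=None):
    """Return a list of problems with the backup file at `path`.

    An empty list means the file exists, is big enough, and is recent enough.
    `now` is a unix timestamp and can be overridden for testing.
    """
    if not os.path.isfile(path):
        return [f"{path}: backup file is missing"]

    problems = []
    info = os.stat(path)
    if info.st_size < min_size_bytes:
        problems.append(f"{path}: only {info.st_size} bytes")

    now = time.time() if now is None else now
    age_hours = (now - info.st_mtime) / 3600
    if age_hours > max_age_hours:
        problems.append(f"{path}: last written {age_hours:.1f} hours ago")
    return problems
```

whatever runs this should send the returned problems through the same alerting path as everything else, and someone should occasionally confirm that path actually delivers.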

there are more implementation-specific details you sometimes need to get into with QC. i want to get more into how to begin making procedures for these systems but it's way past my bedtime. will continue when i am not so sleepy.