The Patching Team vs Automated Patching

Does a dedicated team to handle your patching offer value for money when there are automated alternatives?

Okay, this one is a personal bugbear of mine. The Patching Team. The group of guys (and they are always guys in my experience – one day I’ll have to investigate the lack of female techs out there) who are “responsible” for applying updates or patches to the IT estate. Now I’ve put the word “responsible” in quotes because, in most of my experience, once they’ve deployed a patch their responsibility seems to end. Any issues arising from the patching seem to get pushed back to the poor old beleaguered IT support team, who have to identify both the cause and a solution.

In this case, when the sum total of human input is to push a button to deploy the updates, why isn’t the patching process automated? If these guys are there to deploy a patch and walk away, is there any point in using an interactive human being?

Now I can hear some people out there shouting “because we need control / testing / scheduling of maintenance” or those lovely “change controls”. Well no. Shush.

To clarify: here I’m talking about the major infrastructures that need regular maintenance, i.e. Windows updates, VMware ESXi and the major hardware vendor updates, all of which come with robust automated methods of patching. I’m not talking about irregular firmware fixes to critical SAN storage (although the good vendors arguably have similarly robust update systems). I’m also assuming you don’t have the luxury of a dedicated testing or pre-production environment and, with the exception of the odd spare test PC or server, you have to manage deploying updates in the live environment.

So let me answer some of the concerns raised above.

Control and Visibility

So the question goes something like this: “I need to know what is being done to the IT environment so I can alert stakeholders”. Of course you do – just configure the above systems to send a report of the outstanding updates before they are deployed. Woo. Job done.

Testing

This is a good one. Now 15 to 20 years ago I’d agree that you would avoid updates (or Service Packs in those days) because there was a very good chance they would break something. If you needed to deploy an update for a particular reason you’d make sure it was tested on a non-critical system first.

I’ve heard Microsoft employees freely admit that their testing of updates in the past left much to be desired. It was a very different time and business environment. Lead time on updates, before the widespread internet, had to include manufacturing of installation media and shipping to customers. If a problem was found there was no way to get that media back and change it. New media had to be issued and responsibility for using the correct media was with the end user. That is no longer the case. The big IT vendors now realise that their position in the market is reliant on being useful to their customers and forming long term trust relationships with them. The internet allows rapid deployment, dissemination and some control of update distribution and the related information. Testing updates is now a critical part of the vendor’s responsibility prior to release. The negative publicity of a bugged release can seriously affect the vendor’s corporate image. So to all you forty-plus-year-olds who cut your teeth in a Windows NT environment (or earlier!) – get over it! We’re not in that world any more. ☺

However, it would be incredibly naive to assume that every update will work flawlessly in each and every unique IT environment in existence and it would be equally naive to assume that each and every update has been tested by the vendor in every unique IT environment in existence. We also need to consider human error – it’s a real thing which can be mitigated but never eliminated completely because, you know, we’re all human and even the big vendors can make mistakes.

So what can you do to mitigate this risk? You can be clever with your update deployment ordering: first to test systems, then to low risk and redundant systems, combined with clever use of the update roll-out schedule. More about scheduling below, but here’s an example of a clever automated testing plan from the McAfee ePO 5.0 (that’s enterprise antivirus to the uninitiated) best practices guide.

The number one risk with deploying a new AV definition (McAfee calls them DAT files) is that it will falsely identify a critical system file as a virus and take action that breaks the system (deleting the offending file). Their test solution is to have a set of test systems that represent your environment (i.e. representatives of your laptops, PCs, servers and common software). New DATs are deployed to the test systems at very regular intervals, hours or days before the DAT is released to the rest of the estate. The test systems perform full system AV scans on a high frequency schedule. If a virus hit is triggered (because these are managed test systems the likelihood of a real virus hit is low, so it’s likely a false positive) an email is sent to tell support not to release the latest DAT to the live production estate. If there’s no AV hit in the desired time window the DAT can even be released automatically. Job done – DATs tested for false positives with little user interaction and minimal testing time. Please note, I’ve simplified my description here and the real McAfee solution is far more elegant: see page 120 of the best practices guide.
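To show the shape of that hold-or-release logic, here’s a minimal Python sketch. It is not how ePO does it and it doesn’t call any McAfee API; the ScanResult record, the 12 hour soak window and the host names are all assumptions of mine for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ScanResult:
    host: str              # hypothetical test machine name
    finished_at: datetime  # when the full system scan completed
    detections: int        # number of AV hits the scan reported


def dat_release_decision(deployed_at, scans, now, soak=timedelta(hours=12)):
    """Return 'hold', 'wait' or 'release' for a DAT pushed to the test group."""
    # Only scans that finished after the DAT reached the test systems count.
    relevant = [s for s in scans if s.finished_at >= deployed_at]
    # Any hit on a managed test system is treated as a probable false positive.
    if any(s.detections > 0 for s in relevant):
        return "hold"
    # No hits so far, but the soak window is still open or nothing has scanned yet.
    if now - deployed_at < soak or not relevant:
        return "wait"
    return "release"


deployed = datetime(2024, 1, 1, 8, 0)
scans = [ScanResult("test-pc-001", deployed + timedelta(hours=2), 0)]
print(dat_release_decision(deployed, scans, deployed + timedelta(hours=13)))
```

A "hold" result is where you’d send the email to support; "release" is the point at which the DAT could go to the live estate without anyone touching it.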

Now, not every update methodology has quite the automation that ePO can provide, but many have the option to create test groups and to deploy updates to chosen groups in batches. So use them! For example, with WSUS you can create a schedule to deploy updates to groups; so create 4 groups and you can roll out updates in a controlled manner over 4 weeks.
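To make the four-week idea concrete, here’s a small Python sketch that just works out the date each group would get its updates. It doesn’t talk to WSUS at all; the group names and the start date are made up for illustration.

```python
from datetime import date, timedelta


def weekly_schedule(start, groups):
    """Map each group to a deployment date, one week apart, in list order."""
    return {group: start + timedelta(weeks=i) for i, group in enumerate(groups)}


# Hypothetical group names; use whatever your update platform calls them.
groups = ["Test PCs", "Live Test Site", "Even PCs", "Odd PCs"]
for group, when in weekly_schedule(date(2024, 1, 8), groups).items():
    print(f"{when:%d %b %Y}  {group}")
```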

Scheduling

Make sure you use the full scheduling capabilities of your update platform, with whatever grouping methods it offers. For example, here’s a good order of roll out that attempts to mitigate risk by deploying updates in batches – these can be automated in a schedule so they occur over a period of time to allow any issues to be spotted.

For PCs

Week 1 – Test PCs (depending on your estate these can be live PCs but ones that are not critical – i.e. some IT support PCs or Live Production PCs in areas where there are spare PCs available or good IT support response).
Week 2 – Live Test Site (a site that represents the full range of applications used in the enterprise – it should also have good IT support response if there is an issue caused by the update)
Week 3 – All “even numbered PCs” – Assuming you have a naming convention that ends in a number for your PCs this will hit about half of the estate – but most sites with one PC (i.e. 001) won’t be hit.
Week 4 – All “odd numbered PCs” and any other remaining PCs. By now the updates have had a good deal of testing in the live environment and you can deploy to the rest of the estate with some confidence that any issues have already been spotted.
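If your update platform can’t group by name for you, the split above is easy to script. Here’s a rough Python sketch, assuming PC names end in a number; the example names and the two lookup sets are hypothetical, and you’d feed the result into whatever grouping your platform uses.

```python
import re


def pc_wave(name, test_pcs=frozenset(), test_site_pcs=frozenset()):
    """Return the roll-out week (1 to 4) for a PC name."""
    if name in test_pcs:
        return 1
    if name in test_site_pcs:
        return 2
    # Look for a trailing number in the name, e.g. the 014 in LON-PC-014.
    match = re.search(r"(\d+)$", name)
    if match and int(match.group(1)) % 2 == 0:
        return 3
    # Odd numbered PCs and anything without a trailing number go last.
    return 4


for pc in ["LON-PC-014", "LON-PC-001", "KIOSK"]:
    print(pc, "-> week", pc_wave(pc))
```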

For Servers:

Always try to deploy updates in a manner that will not lead to a major incident if an update breaks something. Here’s an example grouping for servers. As always the goal is to limit any update issues by deploying them to low risk servers first and moving progressively to higher risk ones. Each layer acts as an additional field test for the next:

1. Test Servers
2. Non critical Servers (i.e. non critical management and monitoring servers)
3. Any server with an automatic redundant partner (i.e. cluster member – remember to fail back services to the updated server to ensure they are working before proceeding with updating the other cluster members)
4. Any servers with manual (non clustered) fail-over partners (it’s probably sensible to update the non live partner first and fail the service over)
5. Other cluster members
6. Remaining even numbered servers (assuming your naming convention has numbers)
7. Remaining odd numbered servers
8. Any critical stand-alone servers
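The eight tiers above can be turned into a sort order. Here’s one possible Python sketch, assuming you can label each server with a “kind” from your inventory; the Server record, the kind labels and the example names are my own invention. It picks the first member of each cluster (by name) for tier 3 and doesn’t try to model which manual fail-over partner is currently live.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Server:
    name: str
    # One of: test, non_critical, clustered, manual_failover,
    # standard, critical_standalone (labels are hypothetical).
    kind: str
    cluster: Optional[str] = None  # cluster name when kind is "clustered"


def rollout_order(servers):
    """Return (tier, server name) pairs sorted into tiers 1 to 8."""
    fixed = {"test": 1, "non_critical": 2,
             "manual_failover": 4, "critical_standalone": 8}
    seen_clusters = set()
    plan = []
    for s in sorted(servers, key=lambda s: s.name):
        if s.kind in fixed:
            tier = fixed[s.kind]
        elif s.kind == "clustered":
            # The first member of each cluster goes early; its partners wait.
            tier = 5 if s.cluster in seen_clusters else 3
            seen_clusters.add(s.cluster)
        else:
            # Everything else is split by the trailing number in its name.
            digits = re.search(r"(\d+)$", s.name)
            even = bool(digits) and int(digits.group(1)) % 2 == 0
            tier = 6 if even else 7
        plan.append((tier, s.name))
    return sorted(plan)


estate = [Server("sql-02", "clustered", "sql"), Server("sql-01", "clustered", "sql"),
          Server("test-01", "test"), Server("erp-01", "critical_standalone")]
for tier, name in rollout_order(estate):
    print(tier, name)
```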

Change Management

With the above safeguards in place a standard change should be sufficient for regular updates from vendors with a good testing history. Otherwise you could employ a change system where approval is given after the update content is reviewed by a technical specialist, who can then pull out any updates that may introduce extra risk and put them through a more thorough approval and testing process. You shouldn’t have to justify the reason for doing updates at every CAB, just say when they are being done and who needs to know.

Conclusion

Whilst many aspects of patching can be automated some human input is still required to manage the deployment of updates. With a well thought out process the final deployment of updates can be effectively automated.