During my travels in the IT world over the past few years I’ve noticed a common failing in the IT processes of a large number of enterprises following ITIL. Regular maintenance and support of internal infrastructure is routinely neglected.
It seems to stem from the way ITIL-style IT support reporting and monitoring processes are followed within the support team. The problem tends to affect small to medium enterprises, but large infrastructures are in no way immune. It is worse still if your IT support staff provide support as a service to external customers.
The problem is this: regular maintenance of IT assets (or configuration items, CIs in ITIL speak) is being neglected. It’s neglected because the management of maintenance has been delegated solely to IT support and no one has set up a framework to monitor and measure it. Management are blinkered, focussing solely on the metrics for IT support to customers, especially if IT support is a revenue source for the business.
Imagine this scenario: you are a hard-working, conscientious IT support person. You get monitored on a number of metrics around successful IT support fixes. You’re busy, there’s always another support ticket in the queue. You know there’s a minor issue with a certain internal system. There’s probably a patch for it, if you had the time to look, so you decide to do some research when you have 5 minutes to spare. Of course, that research gets put back because the next spare 5 minutes isn’t really spare as it’s time for a cuppa or a chat with a colleague about how the CEO can never remember his password (and there’s nothing wrong with any of that – IT staff are human beings, even when at work!). Even if you did find time, there is the (usually laborious) request for change (RFC) form to complete, submit and justify to the change advisory board (CAB), and who needs that hassle! So time passes and that patch eventually gets forgotten. Until something goes wrong…
I’ve turned up at clients who (knowingly or not) follow this model. Investigating their infrastructure, I find LOTS of red: errors and warnings all over their estate. If I point this out to the support staff they make their (usually valid) excuses and get back to trying to reduce the support queue. I often wonder how this situation can continue without anyone outside of IT calling them on it. I also wonder how their estate keeps running when it is in such a bad state of health.
I’ll answer the second point first: it doesn’t. The estate fails all the time. Small bits of data and productivity are lost. There can even be major incidents. So why doesn’t anyone notice? Because the IT support team fix it. They reactively fix issues as they arise. Small unresolved issues are rarely, if ever, escalated by customers, and their incident tickets can be closed with a suitably non-specific cause. Major incidents are fixed by the support team, who are also responsible for reporting on the root cause and who conveniently omit the detail that the fix they used was a two-year-old patch. Most of the time the support team will even get a pat on the back for fixing the issue! You know, the issue that should have been patched two years ago, before it caused a day of customer downtime and many hours of IT support overtime costs.
Be honest: if you are a senior manager who’s been part of a critical incident fallout/review, did you do a quick Google on the details of the root cause to check how long that issue had been known?
So what’s the real issue here? Let me be clear: it is NOT the fault of the IT support team, and it’s not just that the ITIL deployment is skewed so that support staff are measured only on IT support metrics while underlying infrastructure maintenance is ignored. The critical missing piece of the puzzle lies outside of the IT support team.
Now, in a large organisation it’s easier to separate end-user support and infrastructure support or maintenance into separate areas of responsibility, each with their own monitoring metrics, to help avoid the situation above (but don’t get me started on The Patching Team).
In smaller businesses you need only three things to mitigate the situation above. The first is to move the management responsibility for maintenance outside of the support team (the support team can still do the deployment). A good candidate is someone with a bit of technical knowledge who is not part of the support team but has a vested interest in the infrastructure, e.g. a technical architect or technical project manager. It’s this person’s responsibility to identify areas requiring maintenance, raise the RFC, liaise with support for the technical details and champion the change through the CAB (with assistance from support if required). Once the change is approved and scheduled, it is this person who monitors that the change is completed by the support team. To do that, we come to the second thing required to fix the issue: auditing.
To maintain the infrastructure you need visibility of its status, and an audit will gather that information for you. It need not be costly or complicated either; most vendors have some form of free tool that manages updates and can create reports for you: Microsoft has WSUS or SCCM, VMware has Update Manager, and most hardware and software suppliers have some form of update awareness scheme, whether app-based (e.g. HP SIM) or email alerts. If no automated audit tools are available, you may need to resort to a manual, scheduled check of the deployed versions in your estate against the manufacturer’s latest versions. You may also have third-party apps that can do this in a single pane of glass or even integrate with your CMDB.
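Even that manual check can be lightly scripted. Here is a minimal sketch in Python, assuming you can export your configuration items to a CSV and that you keep a hand-maintained table of latest vendor releases – the file name, column names and version strings below are all hypothetical, not taken from any particular tool:

```python
# audit_versions.py - minimal sketch of a scheduled version audit.
# Assumes a hypothetical CSV export of configuration items with columns
# ci_name, product, deployed_version, and a manually maintained table of
# the latest vendor releases (updated from release notes or email alerts).
import csv

# Hypothetical "latest known version" table.
LATEST_VERSIONS = {
    "hypervisor": "7.0u3",
    "backup-agent": "12.4.1",
    "firewall-firmware": "9.1.8",
}

def audit(inventory_csv: str) -> list[dict]:
    """Return the CIs whose deployed version lags the latest known one."""
    out_of_date = []
    with open(inventory_csv, newline="") as fh:
        for row in csv.DictReader(fh):
            latest = LATEST_VERSIONS.get(row["product"])
            if latest and row["deployed_version"] != latest:
                out_of_date.append({
                    "ci": row["ci_name"],
                    "product": row["product"],
                    "deployed": row["deployed_version"],
                    "latest": latest,
                })
    return out_of_date

if __name__ == "__main__":
    # Hypothetical inventory export; replace with your own.
    for item in audit("ci_inventory.csv"):
        print(f"{item['ci']}: {item['product']} {item['deployed']} -> {item['latest']}")
```

Run on a schedule, even something this crude gives the person championing maintenance a report they can take to the CAB without depending on the support queue.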
The final piece of the puzzle is a set of metrics, both for defining when maintenance is required (e.g. security updates should be deployed within 30 days of release) and for measuring the performance of your maintenance scheme (e.g. how many systems have security updates that have been outstanding for 30 days or longer).
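That “30 days or longer” figure is trivial to compute once you have the audit data. A minimal sketch, again in Python and again with purely illustrative data (the system names, update IDs and dates are made up):

```python
# patch_metrics.py - minimal sketch of the "30-day" maintenance metric.
# Assumes you can list pending security updates per system together with
# the vendor release date; the data structure below is illustrative only.
from datetime import date

PENDING_UPDATES = [
    # (system, update_id, release_date) - hypothetical sample data
    ("web01", "KB5031356", date(2025, 1, 14)),
    ("web02", "KB5031356", date(2025, 1, 14)),
    ("db01",  "FW-9.1.8",  date(2025, 4, 2)),
]

def overdue_systems(pending, today=None, threshold_days=30):
    """Return the systems with at least one security update outstanding
    longer than the threshold."""
    today = today or date.today()
    return {system for system, _update, released in pending
            if (today - released).days > threshold_days}

if __name__ == "__main__":
    overdue = overdue_systems(PENDING_UPDATES)
    print(f"{len(overdue)} system(s) breach the 30-day target: {sorted(overdue)}")
```

Whether it comes from a script like this or from a vendor console, the point is that someone outside the support queue owns the number and reviews it regularly.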
Caveats
Now, I can see all you senior managers out there who are worried about this and have already picked out a champion to fix things, but please bear the following in mind. Your champion will need time to do all this: lots of time to set it up and, depending on how efficient your auditing systems are, a regular amount of time to keep things up to date. Not just the slack time you envision they already must have because you saw them chatting to a colleague for 5 minutes once… System maintenance isn’t a given and it’s not magic. If you want secure and reliable systems you’ve got to invest in them. I’m sure you already know that preventative, scheduled and planned maintenance is always better and less costly for your business than reactive incidents.
Finally, please also bear in mind that you will still need the IT support team to set up a lot of this. You’ve got yourself into a situation where your IT support staff have all the knowledge, and no one is going to be able to hold them to account until some of that knowledge is transferred. I guarantee you that in the first run there will be a few systems that get missed, or worse, deliberately remain hidden because the only person who knows about them had no incentive, from their perspective, to join in the preventative maintenance party. Hopefully, most of these will be captured and added to the audit schedule over time as part of continuous improvement. Any others will no doubt fail in spectacular fashion 6 months after you let that hard-working, conscientious IT support person resign!