Management by facts sounds fair and numbers look like facts. Unfortunately it is hard to get the right numbers. IT services are complex and difficult to measure, but the processes are fairly easy to measure. Modern tools produce a wealth of reports based on tickets, phone statistics, web clicks, visits etc. It is quite easy to fall into the trap of believing the numbers. Measuring things tends to be more difficult than we assume and therefore it is easy to misunderstand the results. There are a lot of important things that cannot be measured accurately and it is a major mistake to try to manage by numbers only.
A car’s dashboard shows the speed, engine revolutions, engine heat, gas gauge, outside temperature and the odometer, which can be replaced by some other information. None of these metrics is enough for making any management decision, except deciding when it is time to fill the gas tank.
1. Measuring service
Service consists of a service proposal, service system and service events. It is important to understand what element of the service we are measuring. A lot of typical measurements are based on discrete service events.
IT service metrics can be based on
- direct observations of events,
- documented performed activities and
- measured value of the service.
For example a customer call is an observed event. A ticket based on the answered call is an activity.
Customer satisfaction is based on customers comparing received service to what they expected. Expectations are based on many things but the service proposal is one of them.
2. Different metrics
Measuring means labeling the observations with some sort of scale.
There are four scales that can be used: nominal, ordinal, interval and ratio.
- Nominal scales only label objects, a change request can be
- Ordinal scales put things in order but the distances between the values are not known. For example, in the progress of an approved change the steps may represent minutes or months, depending on the type of change
- under planning
- under execution
- under review
Another example is customer satisfaction where the distances between steps can mean quite different things
- very dissatisfied
- very satisfied
- Interval scales have a meaningful difference between numbers but there is no true zero, which means one cannot calculate ratios. A classical example is temperature: 11 C is 10 degrees warmer than 1 C, and equally 51.8 F is 18 degrees warmer than 33.8 F. The difference holds at any starting temperature. The ratio of temperatures is meaningless: for the same two temperatures it would be 11 in Celsius and about 1.5 in Fahrenheit.
- Ratio scales are true scales with a meaningful zero; number of units, length and weight are examples.
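The temperature example can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and the Kelvin conversion is added here (it is not part of the text above) to show what a scale with a true zero looks like.

```python
def c_to_f(celsius: float) -> float:
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


def c_to_k(celsius: float) -> float:
    """Convert degrees Celsius to kelvins (a scale with a true zero)."""
    return celsius + 273.15


low_c, high_c = 1.0, 11.0
low_f, high_f = c_to_f(low_c), c_to_f(high_c)

# Differences are meaningful on an interval scale:
# the same physical gap, expressed in each unit.
diff_c = high_c - low_c
diff_f = high_f - low_f

# Ratios are not: the "answer" depends on the arbitrary zero point.
ratio_c = high_c / low_c
ratio_f = high_f / low_f

# On a ratio scale the quotient is meaningful.
ratio_k = c_to_k(high_c) / c_to_k(low_c)

print(f"difference: {diff_c:.1f} C = {diff_f:.1f} F")
print(f"ratio in Celsius:    {ratio_c:.2f}")
print(f"ratio in Fahrenheit: {ratio_f:.2f}")
print(f"ratio in kelvins:    {ratio_k:.3f}")
```

The two interval-scale ratios disagree wildly, while the kelvin ratio shows that 11 C is physically only a few percent "warmer" than 1 C.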
Event based metrics tell us, for example, how often something happened, and there can also be other event based data like time, volume, money etc. Some customer actions can be measured directly: for example, web analytics measures clicks and a telephone system records calls. Many systems collect logs and some of them report the events continuously. A telephone system can record queue times, abandoned calls, answered calls, call lengths etc.
Measuring direct customer events can be very useful. For example a game developer can see how many times players do some specific action in the game. If some stage is too difficult, it may cause players to abandon the game. In-game sales is a very specific activity which measures the value directly.
Unfortunately event metrics can be misleading. For example many people may try to open some system and fail at it, or abandon the system after a brief visit. In that case the number of hits would be a bad indicator for the value of the object.
It is not possible to measure the success of IT service management based on events only but event data can be valuable when available. Events can often be measured in ratio scales.
Activity means doing some task or procedure. Replacing a broken component or updating software are activities.
Often activities need to be measured indirectly. The process models we use in ITSM try to turn everything we do into measurable activities by using tickets. A ticket records each activity. This makes activities quite easy to measure. Opening and closing incidents are measurable activities. Changes can be measured by counting change records. However, the number of records tells little about the reality. For example, one change can mean five minutes or five days of work.
Almost all process metrics measure activities; usually these are measured indirectly, via the ticketing system. The simple logic equates value with the activities. Activities are work which is done for the customer. More activities lead to more value.
In some cases this is true. Doing something is better than doing nothing. A lot depends on how the service organization works. Some organizations desperately need more control and for them the activities are a good indicator.
Some organizations live in a continuous disaster. Services fail often, and people fight to keep them running. Customers can be used to the situation and are happy if really bad disasters are avoided. This is typical with new services in prototype mode.
When the organization achieves control, the chaos has been tamed and services work. One key ingredient in beating the chaos is usually a rigid control of changes. The downside can be that there is no flexibility in the services. Customers need to accept the service as it is delivered but are happy about the stability if they still remember the chaotic stage.
The next phase is when the services are stable and at the same time are able to adjust to changing requirements and situations. The service can handle continuous change and is able to understand the customers’ needs. Processes and process metrics seem to be less important.
According to popular maturity models the ideal stage is where the management is able to measure, control and improve processes, i.e. the control stage. It is a major mistake to think that processes are a goal; a process system is a tool. Excess focus on process metrics can be the main reason why many IT organizations fail. Processes are a useful tool to transform a service organization from chaos to control but less useful when trying to reach the adaptive stage.
The problem with activity based measuring is that it is quite difficult to separate useful activities from busywork. Bureaucratic organizations are ingenious at creating useless activities to keep themselves busy. Having an efficient and effective process in support doesn’t tell anything about the quality of the service it provides. For example a bad service desk might handle 50 incidents per person per day, solve 80 % at first contact and close 98 % within SLA limits while a good service desk might handle 20 incidents per person per day, solve 60 % at first contact and close 70 % within SLA limits. The trick would be that the good service desk has been able to eliminate 60% of the causes for the incidents and is continuously improving service, not the process.
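The service desk comparison can be put into a small Python sketch. The figures are the ones from the example above; the class and function names are invented for illustration, and the comparison assumes that both desks serve the same users with the same number of agents, which the example implies but does not state.

```python
from dataclasses import dataclass


@dataclass
class DeskStats:
    name: str
    incidents_per_agent_day: int
    first_contact_rate: float  # share of incidents solved at first contact
    within_sla_rate: float     # share of incidents closed within SLA limits


bad = DeskStats("bad desk", 50, 0.80, 0.98)
good = DeskStats("good desk", 20, 0.60, 0.70)


def process_metric_wins(a: DeskStats, b: DeskStats) -> int:
    """Count how many of the three process metrics favor desk a over desk b."""
    return sum([
        a.incidents_per_agent_day > b.incidents_per_agent_day,
        a.first_contact_rate > b.first_contact_rate,
        a.within_sla_rate > b.within_sla_rate,
    ])


def incident_reduction(before: DeskStats, after: DeskStats) -> float:
    """Share of incidents that no longer happen (assumes equal staffing)."""
    return 1 - after.incidents_per_agent_day / before.incidents_per_agent_day


wins = process_metric_wins(bad, good)
reduction = incident_reduction(bad, good)

print(f"process metrics won by the {bad.name}: {wins} of 3")
print(f"incidents the users no longer suffer with the {good.name}: {reduction:.0%}")
```

Every process metric points to the bad desk as the better one. The only number that matters to the users, the incidents they never had, does not appear on a typical process scorecard at all.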
Value is a complex concept. Robert Falkowitcz has written an interesting analysis of it in http://www.3cs.ch/is_service_value_really_delivered/
In the business world, value is usually money but often in an indirect way. Direct profit usually comes from new innovations in system design. A new information system may increase sales, cut costs or even do both. IT service management does not create new systems but can save money by preventing outages and other disruptions; it can also diminish lost work time by restoring service as fast as possible.
Service value is usually the result of the outcomes of service events; it is the reason why people use the service. IT value typically comes from automating and simplifying tasks. In most cases IT systems are a must, manual operation is not a viable option. In these cases the automation alone does not create any added value. There can be extra value in lack of friction, i.e. the system is easy to use but powerful and flexible but reliable. By friction I mean all overhead which is caused by using the system.
One important source of friction comes from the complexity of IT systems and services. The users do not know how to do things, they make mistakes and they do not use the systems in an optimal manner.
It is quite hard to measure the value of different components. A meal can be a fantastic experience but it is impossible to pinpoint the exact value of each component of the service. In the same way, the IT must be working ok if the company is a success but it is not easy to measure IT’s contribution exactly.
Bill from sales is visiting an important potential customer and learns that the customer has an emergency situation and that they will buy 1000 units for 10 M$ if Bill can promise delivery next week. Bill needs to make sure that there are enough units available, so he tries to get into the ERP system to check the status. Unfortunately he is not able to get into the system. He calls the service desk.
Scenario A. The service desk answers in 20 seconds. The agent asks Bill to give the laptop’s CI id and creates a ticket. The agent checks that the ERP system is up and that there are no networking problems, which means that the incident is about a single laptop, which gives it a low priority. Then the agent instructs Bill to reboot the laptop but this does not work as the laptop has crashed and does not respond to any command. Finally, after consulting the knowledge base, the agent instructs Bill to remove the battery from the laptop to cause a restart. This works and now Bill needs to log in, set up the VPN connection and then start the ERP application. The incident is resolved, the agent closes the ticket and notices that the low priority laptop incident was solved in 14 minutes, which was within the SLA target.
Meanwhile the customer watched Bill’s struggle with the laptop for a while but then checked the situation with another supplier and ordered the units from them.
Scenario B. The service desk answers in 20 seconds. The agent listens to Bill and asks what information he needs. Bill explains that he needs to know whether there are 1000 units available in the warehouse. The agent checks and reports that there are exactly 1250 available right now. Bill informs the customer, who decides to buy all the available units for 12.5 M$. Bill asks the agent to mark the units as reserved.
Now scenario A is perfect from the process metric point of view. Everything goes as planned and the incident is solved fast. Scenario B is far from perfect. There is no incident ticket. But in scenario B, the company makes 12.5 M$ by selling unsold units which were clogging the warehouse.
If it were this easy to measure the value of IT activities, running IT services would be far less complicated. Of course in real life things are not so clear. Value is very difficult to measure.
IT is an enabler and a cost at the same time. The value comes from the ratio of saving vs. cost. The best source for this information is the user. If the users have a choice, they will not use a system which does not create value for them. User behavior and opinions are the best ways to measure the value of an IT service.
3. Setting goals
Somebody wrote that any metric can be destroyed by using it as a goal. I agree with that. (Sorry, I don’t remember who wrote it and also not the exact words so I cannot find the source).
It is good to have measurable goals; the problem lies in finding the right way to measure the goals. The problem with activity metrics is that they measure activities, not outcomes or value.
One specific class of goals is the service level agreement, which states measurable limits or target values for various metrics. For example, availability must exceed some limit and incidents must be solved within some set time. Both are useful metrics in the sense that any adverse trend or change is worth investigating. It is important to understand the causes for the changes.
For example, an availability SLA might
- fail because maintenance work has been done over weekends, although the impact on business has been nonexistent
- be ok although there was a short break at peak time which caused a major business loss
Incident SLA limits might
- fail because the IT staff is temporarily overwhelmed with work
- be ok because the IT staff deflects incidents by asking for a lot of unrelated information from the customer and keeps the ticket on hold until every piece of information has been received
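The availability case is easy to demonstrate with a small Python sketch. All the numbers here (the 99.5 % target, the downtime hours and the business losses) are hypothetical and chosen only to illustrate the two situations in the list above.

```python
HOURS_IN_MONTH = 30 * 24   # simple 30-day month
SLA_TARGET = 0.995         # hypothetical availability target


def availability(downtime_hours: float, total_hours: float = HOURS_IN_MONTH) -> float:
    """Plain time-based availability: share of hours the service was up."""
    return 1 - downtime_hours / total_hours


def meets_sla(downtime_hours: float) -> bool:
    return availability(downtime_hours) >= SLA_TARGET


# Month A: eight hours of weekend maintenance, nobody was working.
# Month B: one hour of downtime in the middle of the peak sales period.
months = {
    "A (weekend maintenance)": {"downtime_hours": 8.0, "business_loss": 0},
    "B (peak time break)": {"downtime_hours": 1.0, "business_loss": 250_000},
}

for name, month in months.items():
    status = "ok" if meets_sla(month["downtime_hours"]) else "FAILED"
    print(
        f"month {name}: availability {availability(month['downtime_hours']):.3%}, "
        f"SLA {status}, business loss {month['business_loss']}"
    )
```

With these assumed figures the SLA report flags the harmless month and passes the expensive one, because the metric counts hours and knows nothing about what the hours were worth.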
The problem with SLA targets is that they are difficult to define. The term availability is not clear and neither is the solution time. All metrics can be defined in many ways and as IT is responsible for defining these, it usually leads to complicated definitions which favor IT and result in meaningless values from the customer’s point of view.
There are real life examples where the service provider has prioritized the SLA targets over the real customer needs. The service analysts might spend their time handling non-critical issues because their arbitrary SLA limit is about to expire, while an active top priority incident waits because it still has a few hours left on the SLA clock.
Activity targets can lead to unwanted results as the staff can control their activities. An old saying states: “You get what you measure”. The organization may start gaming the system by trying to produce excellent results while ignoring other factors. One way to prevent this is the use of several metrics. For example, increased speed may lead to lower accuracy and vice versa. Setting goals for both directs people to find a preferred balance, but it can also be confusing for staff who do not know how to act.
4. The problem with some common metrics
Here is a list of typical metrics for a service desk. My comments analyze the value of the metric.
| Metric | Comment |
|---|---|
| Percentage of phone calls to service desk answered within XX seconds | Old, telephone era metric, limited value. Can be harmful if it leads to over-emphasis on telephone communications. |
| Percentage of phone calls to service desk abandoned before they are answered | Same as previous. Abandon rate also has limited value as it depends on user behavior. |
| Percentage of incidents where user contacted the service desk to ask for an update | Useful, this is a measure of waste in the operation. People should not need to chase their requests. |
| Percentage of incidents that were reopened by the user after being closed by the service desk | Questionable because the SD can open a new ticket instead of reopening the old one. Or if the users are allowed to reopen old tickets, they will do it even if the issue is different. |
| Percentage of incidents resolved within agreed SLA targets | SLA targets have limited value as they are hard to define. SLA goals can lead to unwanted behavior when the service provider tries to reach the targets and ignores customer needs. |
| Percentage of incidents resolved using web-based self-help | Difficult to measure. The use of the self help tool does not prove that it solved the case. For example, a user might try to use the tool unsuccessfully five times before getting help from a colleague. |
| Percentage of incidents resolved during the initial customer contact | This is an old, telephone era metric and it has limited value. A customer may try to use self help, try to make contact by chat, open a ticket and finally call. |
| Percentage of service requests fulfilled within agreed SLA targets | Can be valuable as SLA targets are more meaningful with simple orders. |
| Percentage of service requests fulfilled using automation with no manual steps from IT staff | How do you measure fulfillment? People may try to use automation but fail many times. |
| Percentage of users giving a score of 4 or 5 on post-incident satisfaction survey | Never use measures like this, they destroy valuable information. All changes in the relative numbers of grades are meaningful. |
| Increased satisfaction with service desk on annual customer satisfaction survey | This has limited value. Customer satisfaction is measured on an ordinal scale where intervals are meaningless. |
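The comment on the "score of 4 or 5" metric can be illustrated with a short Python sketch. The survey answers below are invented for the purpose: two hypothetical years with 100 responses each on a 1 to 5 scale. The helper names are also made up.

```python
from collections import Counter
from statistics import median_low

# Hypothetical survey results, 100 answers per year, grades 1..5.
last_year = [1] * 5 + [2] * 5 + [3] * 10 + [4] * 60 + [5] * 20
this_year = [1] * 15 + [2] * 3 + [3] * 2 + [4] * 20 + [5] * 60


def top_two_share(scores: list[int]) -> float:
    """The popular KPI: share of respondents giving a 4 or a 5."""
    return sum(score >= 4 for score in scores) / len(scores)


def distribution(scores: list[int]) -> dict[int, int]:
    """Number of answers per grade, which keeps all the information."""
    counts = Counter(scores)
    return {grade: counts.get(grade, 0) for grade in range(1, 6)}


for label, scores in (("last year", last_year), ("this year", this_year)):
    print(
        f"{label}: top-two share {top_two_share(scores):.0%}, "
        f"median grade {median_low(scores)}, "
        f"distribution {distribution(scores)}"
    )
```

The KPI reports 80 % for both years, although the number of very dissatisfied users has tripled and so has the number of very satisfied ones. The full distribution shows both changes. The median is used here instead of the mean because it only relies on the order of the grades, which is all an ordinal scale offers.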
5. How to use metrics
The key to successful use of metrics is the understanding of the mechanics of a metric. One has to know what things affect the metric and often this understanding can be gained by following several metrics at the same time.
Too many metrics can confuse and misleading metrics are harmful. All reports create non-productive work in creating, gathering, analyzing and reporting the data. Some research firms create high quality reports based on worthless data or bad analysis. Do not trust any metric which you do not fully understand. Beware of complicated “indexes”. Avoid activity goals.
Discussions with relevant stakeholders and a careful analysis of current situation should reveal improvement targets and critical elements in the current service. These may offer some measurable goals. Some of the goals may require change and some goals may require keeping up current levels. Goals should be directly related to value and based on reliable metrics. In many cases it is just not possible to measure the contribution of a single part to the overall result of a large organization. Generally it is better to use shared goals instead of individual goals.
A car dashboard has one value metric: the odometer measures the diminishing value of the car. When buying a car one should not trust it blindly, as it can be manipulated like most metrics. The real value of an old car is what somebody is willing to pay for it.