VERSION 0.1 30.1.2014
This is a draft. Comments and criticism are welcome.
A fault is a defect in an infrastructure component that prevents a system or service from working as planned. A fault may interrupt a system or service or make it behave in an unplanned manner.
Fault management is a process which handles and repairs faults. It works closely with support management which handles customers’ problems and finds solutions to them.
Fault management can be organized in many ways. It may be a permanent team, or its members may work on system maintenance and improvement when there are no faults. There must be close cooperation between the service desk and the fault management process.
The goal of fault management is to minimize the negative effect of faults by repairing the faults efficiently and effectively.
Faults are usually identified by service monitoring. Monitoring can identify faults directly or through event filtering. Monitoring can also predict potential faults that can be treated before they happen. Monitoring can be automated.
Monitoring can repair routine standard faults automatically or manually. These operations are logged but not processed further.
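As an illustration only, the monitoring flow described above could be sketched as follows. The `Event` fields, the severity scale, the threshold and the `ROUTINE_REPAIRS` table are all hypothetical names invented for this sketch and are not part of the process definition.

```python
from dataclasses import dataclass


@dataclass
class Event:
    source: str    # hypothetical: name of the monitored component
    kind: str      # hypothetical: event type, e.g. "disk_full"
    severity: int  # hypothetical scale: 0 = informational ... 3 = critical


# Hypothetical table of routine standard faults and their standard repair.
ROUTINE_REPAIRS = {
    "disk_full": "rotate_logs",
    "service_down": "restart_service",
}


def filter_events(events, min_severity=2):
    """Event filtering: keep only events severe enough to indicate a fault."""
    return [e for e in events if e.severity >= min_severity]


def handle(events):
    """Split filtered events into logged routine repairs and faults to process."""
    repair_log = []   # routine faults: repaired and logged, not processed further
    fault_queue = []  # all other faults: passed on to fault management
    for e in filter_events(events):
        if e.kind in ROUTINE_REPAIRS:
            repair_log.append((e.source, e.kind, ROUTINE_REPAIRS[e.kind]))
        else:
            fault_queue.append(e)
    return repair_log, fault_queue
```

In this sketch a routine fault only produces a log entry, while every other filtered event is queued for the fault management process.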
Customers may also report faults to the service desk, which passes the fault ticket to fault management. In those cases the fault management process reports back to the service desk.
2.2. Fault identification
When a fault is observed, it will be logged, classified and prioritized in the fault management system.
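A minimal sketch of what a logged, classified and prioritized fault record might look like is given below. The field names and the rule of deriving priority from impact and urgency are assumptions made for illustration; the actual fault management system may use a different scheme.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class FaultRecord:
    fault_id: int
    description: str
    category: str  # classification, e.g. "network" or "storage" (hypothetical values)
    impact: int    # assumed scale: 1 = high, 2 = medium, 3 = low
    urgency: int   # assumed scale: 1 = high, 2 = medium, 3 = low
    # Logging time is recorded automatically when the record is created.
    logged_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    @property
    def priority(self) -> int:
        """Assumed rule: priority 1 (highest) to 5 (lowest) from impact and urgency."""
        return self.impact + self.urgency - 1


def by_priority(records):
    """Return the records ordered so that the highest priority comes first."""
    return sorted(records, key=lambda r: (r.priority, r.logged_at))
```

For example, a record with impact 1 and urgency 2 gets priority 2 under this assumed rule and is handled before a record with impact 3 and urgency 3.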
Observed faults or failures may be just symptoms of a hidden fault. It is important to find the real causes behind the observed fault. There are several problem-solving techniques that can be used in this phase. (A list of examples with links)
Major faults are faults that have a large impact on service quality, security or other aspects of service. The definition of a major fault should be agreed and documented. Major faults need to be handled with a specific procedure. Major fault management is normally a service desk procedure.
The result is a documented and understood fault description.
2.3. Fault analysis and evaluation
Fault analysis finds a way or ways to solve the fault. It may be easy to solve the fault with some standard procedure, but in some cases there may be different ways to solve the fault, and it is also possible that there is no solution available. Fault evaluation analyzes the costs and risks of the fault and the costs and risks of fault treatment. It selects the fault treatment method, but the final decision is made by change management.
The result of fault analysis is an analyzed and evaluated fault description and possibly a change request.
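One possible way to compare the costs and risks of the fault with those of each treatment is an expected-cost calculation, sketched below. The `Option` fields, the probability estimate and the cost model are assumptions for illustration only. The function only produces a recommendation, since the final decision belongs to change management.

```python
from dataclasses import dataclass


@dataclass
class Option:
    name: str
    treatment_cost: float   # assumed: cost of applying this treatment
    recurrence_prob: float  # assumed: estimated probability that the fault recurs


def expected_cost(option, fault_cost):
    """Treatment cost plus the risk-weighted cost of the fault happening again."""
    return option.treatment_cost + option.recurrence_prob * fault_cost


def recommend(options, fault_cost):
    """Recommend the option with the lowest expected cost (not a final decision)."""
    return min(options, key=lambda o: expected_cost(o, fault_cost))
```

With a fault cost of 1000, an option that costs 200 and leaves a 0.25 recurrence probability has an expected cost of 450, which in this model beats both doing nothing at a 0.75 probability (750) and a permanent fix costing 1500.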
2.4. Fault treatment and restoration of service
Fault treatment applies the selected method for handling the fault. The result may be a permanent solution which removes the risk of the fault repeating, a solution which reduces the risk of the fault for some time, or a permanent workaround. All workarounds are registered in the knowledge base and passed to the service desk.
The fault treatment should result in the restoration of the service. Change management will confirm this.
Sometimes it may be necessary to create a temporary or permanent workaround or repair so that the system or service starts working. Support management finds customer-specific solutions and fault management finds general solutions. When a fault has customer impact, the service desk is in charge.
The result is a closed fault.
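The phases in sections 2.2 to 2.4 can be read as a simple linear lifecycle. The sketch below is one hypothetical way to model it; the state names are chosen here to match the result of each phase and are not prescribed by the process.

```python
from enum import Enum


class State(Enum):
    OBSERVED = "observed"      # fault seen by monitoring or reported by a customer
    IDENTIFIED = "identified"  # result of 2.2: documented and understood
    ANALYZED = "analyzed"      # result of 2.3: analyzed and evaluated
    CLOSED = "closed"          # result of 2.4: treated, service restored


# Each phase moves the fault exactly one step forward.
TRANSITIONS = {
    State.OBSERVED: State.IDENTIFIED,
    State.IDENTIFIED: State.ANALYZED,
    State.ANALYZED: State.CLOSED,
}


def advance(state):
    """Return the next lifecycle state, or raise if the fault is already closed."""
    if state not in TRANSITIONS:
        raise ValueError(f"no transition from state {state.value!r}")
    return TRANSITIONS[state]
```

A real process would probably also need transitions back to an earlier phase, for example when treatment fails to restore the service; those are left out of this sketch.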
Faults create negative business value through lost business and lost productivity. The service management team must estimate the cost of failures in cooperation with the business units. Fault management will try to minimize the cost of faults.