IT teams often jump into solution mode before they have clearly understood, clarified, or diagnosed the issue. Without a proper understanding and diagnosis, resolution takes longer, the impact on the business grows, and the client experience suffers.
Issues that span multiple domains and teams often appear complex, which delays a proper diagnosis. So do issues whose symptoms can have multiple causes or contributing factors. A differential diagnosis technique, in which potential causes are ruled out one by one through a process of elimination, can speed up the identification of the cause and the subsequent resolution of the issue.
The following are the guiding principles when implementing a Root Cause Analysis (RCA) methodology:
Define the Problem
A problem is any deviation from an expected norm, that is, any event resulting in a loss or potential loss of the availability or performance of a managed IT resource or its supporting environment. (This includes errors related to systems, networks, hardware, software, and applications, as well as problems identified during the implementation of changes that failed. Problems can be recognized at any point in the environment, using a variety of automated and non-automated methods.)
Clearly identify and describe the problem:
What was affected (resource name)?
What was the impact (e.g., system down)?
Who was impacted?
When did the problem happen?
How long did the problem last?
Keep the following in mind:
1. Focus the RCA and preventative actions on specific issues.
2. Ensure the focus is not on just solving the symptoms but on getting to the actual root of the problem.
3. During the root cause analysis, multiple problems may appear. Address them separately, but capture them as action items.
4. Restate the problem and resolution in the RCA document to ensure everyone understands what the issue was when it occurred.
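The problem-definition questions above can be captured in a small record so that every RCA starts from the same facts. The following is a minimal sketch, not a prescribed format: the class name, field names, and the sample incident are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ProblemStatement:
    """Answers to the problem-definition questions (field names are illustrative)."""
    resource: str          # What was affected?
    impact: str            # What was the impact?
    impacted_parties: str  # Who was impacted?
    started: datetime      # When did the problem happen?
    ended: datetime        # When was service restored?

    @property
    def duration(self) -> timedelta:
        # How long did the problem last?
        return self.ended - self.started

    def summary(self) -> str:
        minutes = int(self.duration.total_seconds() // 60)
        return (
            f"{self.resource}: {self.impact}; impacted: {self.impacted_parties}; "
            f"started {self.started:%Y-%m-%d %H:%M}, lasted {minutes} min"
        )


# Hypothetical incident, used only to exercise the structure.
statement = ProblemStatement(
    resource="order-db-01",
    impact="system down",
    impacted_parties="online ordering customers",
    started=datetime(2024, 3, 5, 9, 15),
    ended=datetime(2024, 3, 5, 10, 45),
)
print(statement.summary())
```

Deriving the duration from the start and end times, rather than entering it by hand, avoids a common source of inconsistency when the problem is restated in the RCA document.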
List Presumptive Causes
Presumptive causes are identified at the beginning of the investigation. They are the initial suppositions about the root cause of the problem. Thorough root cause analysis may later show that they are only symptoms or contributing factors. (Example of a presumptive cause: When the cable from the cable modem is plugged into a workstation, the green light does not come on. The same cable is plugged into another workstation and works, so the cable can be ruled out. The problem then appears to be the network adapter card, which would be a hardware issue. However, further discussion reveals that the device driver had been changed while a faulty router was being diagnosed. Once the original device driver was restored, the light came on and connectivity was obtained. The device driver was determined to be the root cause. The network adapter and the cable were presumptive causes.)
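The elimination in the cable modem example can be tracked explicitly by recording, for each candidate cause, the evidence that ruled it out. This is only a sketch under assumptions: the `CandidateCause` class and the `remaining` helper are hypothetical names, and the evidence strings paraphrase the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CandidateCause:
    name: str
    ruled_out_by: Optional[str] = None  # evidence that eliminated this cause, if any

    @property
    def eliminated(self) -> bool:
        return self.ruled_out_by is not None


def remaining(causes: List[CandidateCause]) -> List[CandidateCause]:
    """Return the causes that no evidence has eliminated yet."""
    return [c for c in causes if not c.eliminated]


# The two presumptive causes from the example.
cable = CandidateCause("cable")
adapter = CandidateCause("network adapter card")
causes = [cable, adapter]

# The cable works on another workstation, so it is ruled out.
cable.ruled_out_by = "cable works on another workstation"

# Further discussion adds a new candidate: the changed device driver.
driver = CandidateCause("device driver")
causes.append(driver)

# Restoring the original driver brought the link back, which clears the adapter hardware.
adapter.ruled_out_by = "link restored after reverting to the original driver"

print([c.name for c in remaining(causes)])
```

Keeping the evidence next to each eliminated cause makes it easy to show later why a presumptive cause was dismissed.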
5-WHY Decomposition Methodology
The ‘Five Whys’ is the simplest method for root cause analysis. Take each presumptive cause and ask ‘why’ repeatedly until you exhaust that line of questioning. (Five Whys is the most commonly used method for determining root cause. The root cause is usually found by the fifth ‘why’, but it can take more or fewer iterations, depending on the problem.)
Note: If there are multiple presumptive causes, you should complete the ‘five whys’ for each one.
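A ‘five whys’ line of questioning can be recorded as a chain in which each statement points to the answer given when the team asked why it happened. The sketch below assumes a simple dictionary of answers; the function name, the `max_depth` parameter, and the sample chain (loosely based on the cable modem example, with the last answer invented for illustration) are hypothetical.

```python
def five_whys(presumptive_cause, answers, max_depth=5):
    """Follow a chain of 'why' answers starting from a presumptive cause.

    `answers` maps a statement to the answer given when asking why it happened.
    The walk stops when the line of questioning is exhausted (no further
    answer is recorded) or when `max_depth` questions have been asked.
    """
    chain = [presumptive_cause]
    current = presumptive_cause
    while len(chain) - 1 < max_depth and current in answers:
        current = answers[current]
        chain.append(current)
    return chain


# Hypothetical answers collected during the RCA meeting.
answers = {
    "The green link light does not come on":
        "The network adapter is not initializing",
    "The network adapter is not initializing":
        "The installed device driver does not match the adapter",
    "The installed device driver does not match the adapter":
        "The driver was changed while a faulty router was being diagnosed",
    "The driver was changed while a faulty router was being diagnosed":
        "The change was not reverted after troubleshooting",
}

chain = five_whys("The green link light does not come on", answers)
for depth, statement in enumerate(chain):
    print(f"{'  ' * depth}{'Why? ' if depth else ''}{statement}")
```

With multiple presumptive causes, the same function would be called once per cause, each with its own starting statement. `max_depth` is adjustable because the number of iterations depends on the problem.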
Identify Root Cause(s) and Contributing Factor(s)
A root cause is a cause that, if eliminated or changed, will prevent the recurrence of the given problem or a similar one. A contributing factor alone would not have caused the problem but is important enough to need corrective action to improve the quality of the process. Contributing factors can also be items that made problem determination or recovery more difficult.
For each item identified during the ‘five whys’ decomposition, ask these questions to determine whether it is a root cause or only a contributing factor:
If this item is fixed, will it prevent the problem from recurring? If yes, then Root Cause Yes and Contributing Factor No.
Did this item delay restoration of service or recovery? If yes, then Root Cause No and Contributing Factor Yes.
Did this item delay problem determination or make it more difficult? If yes, then Root Cause No and Contributing Factor Yes.
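These questions can be expressed as a small decision function. The sketch below follows the definitions given earlier in this section: an item whose fix prevents recurrence is a root cause, and an item that made problem determination or recovery more difficult is a contributing factor. The `Finding` class, its field names, and the sample findings are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    description: str
    fix_prevents_recurrence: bool  # If fixed, will the problem stop recurring?
    delayed_restoration: bool      # Did it delay restoration of service or recovery?
    delayed_determination: bool    # Did it delay or complicate problem determination?


def classify(finding: Finding) -> str:
    """Classify a finding as a root cause, a contributing factor, or neither."""
    if finding.fix_prevents_recurrence:
        return "root cause"
    if finding.delayed_restoration or finding.delayed_determination:
        return "contributing factor"
    return "neither"


# Hypothetical findings from a 'five whys' session.
findings = [
    Finding("Device driver was changed and not reverted", True, False, False),
    Finding("Driver change was not recorded anywhere", False, False, True),
    Finding("Cable was suspected but tested fine", False, False, False),
]

for f in findings:
    print(f"{f.description}: {classify(f)}")
```

The recurrence question is checked first, so an item that is both a root cause and a source of delay is reported as a root cause.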
Identify Action Plans
What must be changed to prevent recurrence of each root cause and contributing factor? Consider hardware, software, procedures, and resources.
Circumvention – Immediate action taken to restore function.
Future Prevention – Additional actions required to ensure the problem does not recur. May require Change Management.
Action plans are used for future prevention. The immediate action (circumvention) may turn out to be the only action needed for future prevention. However, if it is deemed a temporary solution, or if other items need to be addressed to eliminate contributing factors, an action plan should be put in place.
Each action item should contain a description of the activity, a target completion date, and the person responsible for implementation.
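An action item with the three required fields can be represented as a simple record, which also makes overdue items easy to find. This is a minimal sketch: the `ActionItem` class, the `completed` flag, the `is_overdue` helper, and the sample items are hypothetical additions for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    description: str       # Description of the activity
    target_date: date      # Target completion date
    owner: str             # Person responsible for implementation
    completed: bool = False

    def is_overdue(self, today: date) -> bool:
        # An item is overdue only if it is still open past its target date.
        return not self.completed and today > self.target_date


# Hypothetical action plan.
plan = [
    ActionItem("Record driver changes in the change log", date(2024, 3, 15), "A. Rivera"),
    ActionItem("Add driver check to troubleshooting checklist", date(2024, 4, 30), "J. Chen"),
]

today = date(2024, 4, 1)
overdue = [item.description for item in plan if item.is_overdue(today)]
print(overdue)
```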
Communicate Lessons Learned
Share lessons learned through channels such as RCA meetings, team meetings, email (Outlook), team newsletters, updated procedures and checklists, and data repositories.