DWS Service Desk Rescue Series: Improving MTTR

July 11, 2023

The Mean Time To Restore (MTTR) Analysis process examines how effectively a disrupted service is restored, with the goal of minimizing the impact on the business. MTTR is the average time it takes to recover from a failure. It covers the full duration of the outage, from the time the failure was reported to the time the service was operational again (resolved in the system). The MTTR Analysis aims to understand historical behaviour and to identify both the areas where the process is behaving as designed and the areas where improvements can be made.

MTTR is calculated by measuring, in hours, the difference between the time the failure was reported (Opened Date) and the time the failure was resolved/restored (Resolved Date), and then averaging that difference across all failures in the period under review.
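
As a minimal sketch of the calculation described above, the following Python snippet averages the hours between Opened Date and Resolved Date. The function name `mttr_hours` and the sample timestamps are hypothetical and for illustration only; a real analysis would read these dates from the ticketing system.

```python
from datetime import datetime


def mttr_hours(incidents):
    """Return the mean hours between Opened Date and Resolved Date.

    `incidents` is a list of (opened, resolved) datetime pairs.
    """
    durations = [
        (resolved - opened).total_seconds() / 3600
        for opened, resolved in incidents
    ]
    if not durations:
        raise ValueError("no resolved incidents to average")
    return sum(durations) / len(durations)


# Hypothetical sample data: three incidents lasting 2 h, 4 h, and 1 h.
incidents = [
    (datetime(2023, 7, 1, 9, 0), datetime(2023, 7, 1, 11, 0)),
    (datetime(2023, 7, 2, 14, 0), datetime(2023, 7, 2, 18, 0)),
    (datetime(2023, 7, 3, 8, 30), datetime(2023, 7, 3, 9, 30)),
]

print(round(mttr_hours(incidents), 2))  # 2.33 (7 hours over 3 incidents)
```

Only resolved incidents should be passed in; open tickets have no Resolved Date and would otherwise distort the average.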

The Objective of the Process:

The objective of the MTTR Analysis process is to ensure that response and resolution times for identified failures are within expectations, and that improvement opportunities are continually reviewed to minimize failures.

Obtain an understanding of MTBF (Mean Time Between Failures), MTTR (Mean Time To Repair/Recover/Restore/Resolve/Respond), MTTA (Mean Time To Acknowledge), and MTTF (Mean Time To Failure)
Establish data collection and calculation methods
Identify the Impact (Outage Duration x Number of Users x Labour Rate)
Identify where the problem lies in the process
Address issues with Monitoring and Alerting
Address delays in restoring service by teams
Address issues with response times
Address delays between failure and alert
Address issues with alerts taking longer than they should to reach the right person
Address issues with the process of diagnostics
Address issues with delays in the repair process
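
The impact formula in the list above (Outage Duration x Number of Users x Labour Rate) can be sketched as follows. The function name `outage_impact`, the choice of hours as the unit, and the sample figures are assumptions for illustration; they are not taken from a real engagement.

```python
def outage_impact(outage_hours, users_affected, labour_rate_per_hour):
    """Estimate the cost of an outage.

    Impact = Outage Duration x Number of Users x Labour Rate,
    with duration in hours and the labour rate per user per hour.
    """
    if outage_hours < 0 or users_affected < 0 or labour_rate_per_hour < 0:
        raise ValueError("inputs must be non-negative")
    return outage_hours * users_affected * labour_rate_per_hour


# Hypothetical example: a 2-hour outage affecting 150 users at a rate of 40 per hour.
print(outage_impact(2.0, 150, 40.0))  # 12000.0
```

This treats every affected user as fully unproductive for the whole outage, so it is best read as an upper-bound estimate.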

Sample List of Benefits:

Accuracy of measurements
Identification of process gaps
Improved effectiveness of Monitoring and Alerting and Event Management
Improved Response Times
Improved identification of and responsiveness to alerts so that failures can be avoided
Improved Resolution/repair times
Improved diagnostics

Sample List of Observations:

Skills, capability, and capacity issues on technical teams
Issues with Monitoring and Alerting
Issues with Performance and Capacity
Issues with MIO Coordination
Issues with Repeat and Chronic issues
Issues with a lack of proper Problem Management process

Sample List of Recommendations:

Strengthen the MIO team’s leadership and technical skills
Involve Architects during Client Impacting Events (Major Incident and MIO Invoked = TRUE)
Implement Business Recovery Managers
Conduct Skill Gaps Analysis of technical teams
Ensure On-call process is up to date and compliant
Ensure Tower leadership on all CIE calls
Ensure Problem Management has visibility and is effective

Sample List of Areas to Probe:

Collect and review the MTTR YTD performance with Opened and Resolved/Closed flow rates
Identify any patterns, signals, anomalies
Identify if there are issues with MIO skills, capability, and capacity
Identify if there are issues with response times
Identify if there are issues with restoration times
Identify if there are issues with coordination on MIO bridge calls
Identify if there are skills and capabilities issues
Identify if there are capacity issues
Identify if there are issues with vendor engagement
Identify if there are issues with L2/L3 engagement
Identify if there are issues with recovery procedures
Identify which teams are contributing to higher MTTR
Identify which agents are contributing to higher MTTR
Identify which Applications/Infra components are contributing to MTTR
Identify via a time series analysis whether MTTR varies by time of day, day of week, week of the month, or month of the year
Identify if MTTR is impacted if issues span multiple teams
Identify if MTTR varies on repeat/chronic issues
Identify if there are issues with monitoring and alerting
Identify if there are issues with event management
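
Several of the probes above come down to slicing MTTR by a dimension such as team, or day of week. The following is a minimal sketch of that grouping, assuming each incident is a dictionary with hypothetical `team`, `opened`, and `resolved` fields; the field names and sample records are illustrative only.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def mttr_by(incidents, key):
    """Group incidents by key(incident) and return the mean hours to restore per group."""
    buckets = defaultdict(list)
    for inc in incidents:
        hours = (inc["resolved"] - inc["opened"]).total_seconds() / 3600
        buckets[key(inc)].append(hours)
    return {group: mean(values) for group, values in buckets.items()}


# Hypothetical sample records (3 July 2023 was a Monday).
incidents = [
    {"team": "Network", "opened": datetime(2023, 7, 3, 9, 0), "resolved": datetime(2023, 7, 3, 12, 0)},
    {"team": "Network", "opened": datetime(2023, 7, 4, 22, 0), "resolved": datetime(2023, 7, 5, 3, 0)},
    {"team": "Database", "opened": datetime(2023, 7, 3, 14, 0), "resolved": datetime(2023, 7, 3, 15, 0)},
]

# MTTR per team.
by_team = mttr_by(incidents, key=lambda inc: inc["team"])

# MTTR per day of week the incident was opened (0 = Monday ... 6 = Sunday).
by_weekday = mttr_by(incidents, key=lambda inc: inc["opened"].weekday())

print(by_team)     # {'Network': 4.0, 'Database': 1.0}
print(by_weekday)  # {0: 2.0, 1: 5.0}
```

The same helper can be reused with other keys, for example `inc["opened"].hour` for time of day or `inc["opened"].month` for month of the year.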