DWS Service Desk Rescue Series: Improving MTTR

July 11, 2023

The Mean Time To Restore (MTTR) Analysis process focuses on understanding how effectively a disrupted service is restored, so that the impact on the business is minimized. MTTR is the average time it takes to recover from a failure. It covers the full duration of the outage, from the time the failure was reported to the time the service was operational again (resolved in the system). The MTTR Analysis aims to understand historical behaviour and to identify areas where the process is behaving as designed and areas where improvements can be made.

 

MTTR is calculated by measuring, in hours, the difference between the time the failure was reported (Opened Date) and the time the service was resolved/restored (Resolved Date), and averaging that difference across all failures in the period under review.
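The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the function name `mttr_hours` and the sample timestamps are assumptions made for the example, and a real analysis would read Opened Date and Resolved Date from the ticketing system's export.

```python
from datetime import datetime


def mttr_hours(incidents):
    """Return the mean of (Resolved Date - Opened Date) in hours.

    `incidents` is a list of (opened, resolved) datetime pairs.
    The name and input shape are hypothetical, chosen for this sketch.
    """
    if not incidents:
        raise ValueError("at least one incident is required")
    durations = [
        (resolved - opened).total_seconds() / 3600
        for opened, resolved in incidents
    ]
    return sum(durations) / len(durations)


# Made-up sample data: three outages lasting 2, 4, and 6 hours.
sample = [
    (datetime(2023, 7, 1, 9, 0), datetime(2023, 7, 1, 11, 0)),
    (datetime(2023, 7, 2, 14, 0), datetime(2023, 7, 2, 18, 0)),
    (datetime(2023, 7, 3, 8, 0), datetime(2023, 7, 3, 14, 0)),
]

print(mttr_hours(sample))  # 4.0
```

Because a mean is sensitive to a few very long outages, it may be worth reporting the median alongside it when reviewing historical data.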

 

The Objective of The Process:

 

The Objective of the MTTR Analysis process is to ensure that response and resolution times for identified failures are within expectations, and that improvement opportunities are continually reviewed to minimize failures.

 

  • Obtain an understanding of MTBF (Mean Time Between Failures), MTTR (Mean Time To Repair/Recover/Restore/Resolve/Respond), MTTA (Mean Time To Acknowledge), and MTTF (Mean Time To Failure)
  • Establish data collection and calculation methods
  • Identify the Impact (Outage Duration x Number of Users x Labour Rate)
  • Identify where the problem lies in the process
  • Address issues with Monitoring and Alerting
  • Address delays in restoring service by teams
  • Address issues with response times
  • Address delays between failure and alert
  • Address issues with alerts taking longer than they should to reach the right person
  • Address issues with the process of diagnostics
  • Address issues with delays in the repair process
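Two of the items above lend themselves to a short worked example: the Impact formula (Outage Duration x Number of Users x Labour Rate) and locating where in the process the time is lost. The sketch below is illustrative only. It assumes the outage duration is in hours and the labour rate is a cost per user per hour, and the stage names (`failure`, `alert`, `acknowledged`, `diagnosed`, `restored`) are hypothetical labels for the delays listed above, not fields from any particular tool.

```python
from datetime import datetime

# Hypothetical stage names, in the order an incident moves through them.
STAGES = ["failure", "alert", "acknowledged", "diagnosed", "restored"]


def outage_impact(outage_hours, users_affected, labour_rate_per_hour):
    """Impact = Outage Duration x Number of Users x Labour Rate."""
    return outage_hours * users_affected * labour_rate_per_hour


def stage_durations(timeline):
    """Return the minutes spent between each pair of consecutive stages."""
    result = {}
    for start, end in zip(STAGES, STAGES[1:]):
        delta = timeline[end] - timeline[start]
        result[f"{start} -> {end}"] = delta.total_seconds() / 60
    return result


# Made-up timeline for a single 90-minute outage.
timeline = {
    "failure": datetime(2023, 7, 3, 9, 0),
    "alert": datetime(2023, 7, 3, 9, 10),
    "acknowledged": datetime(2023, 7, 3, 9, 25),
    "diagnosed": datetime(2023, 7, 3, 10, 5),
    "restored": datetime(2023, 7, 3, 10, 30),
}

durations = stage_durations(timeline)
slowest = max(durations, key=durations.get)

print(outage_impact(1.5, 200, 40.0))  # 12000.0
print(slowest, durations[slowest])    # acknowledged -> diagnosed 40.0
```

In this made-up case the largest share of the outage is spent on diagnostics, which would point the review at diagnostic procedures rather than at monitoring or alert routing.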

 

Sample List of Benefits:

 

  • Accuracy of measurements
  • Identification of process gaps
  • Improved effectiveness of Monitoring and Alerting and Event Management
  • Improved Response Times
  • Improved identification of and responsiveness to alerts, so that failures can be avoided
  • Improved Resolution/repair times
  • Improved diagnostics

 

Sample List of Observations:

 

  • Skills, capability, and capacity issues on technical teams
  • Issues with Monitoring and Alerting
  • Issues with Performance and Capacity
  • Issues with MIO Coordination
  • Repeat and chronic issues
  • Lack of a proper Problem Management process

 

Sample List of Recommendations:

  • Strengthen the MIO team’s leadership and technical skills
  • Involve Architects during Client Impacting Events (Major Incident and MIO Invoked = TRUE)
  • Implement Business Recovery Managers
  • Conduct Skill Gaps Analysis of technical teams
  • Ensure On-call process is up to date and compliant
  • Ensure Tower leadership on all CIE calls
  • Ensure Problem Management has visibility and is effective

Sample List of Areas to Probe:

 

  • Collect and review the MTTR YTD performance with Opened and Resolved/Closed flow rates
  • Identify any patterns, signals, anomalies
  • Identify if there are issues with MIO skills, capability, and capacity
  • Identify if there are issues with response times
  • Identify if there are issues with restoration times
  • Identify if there are issues with coordination on MIO bridge calls
  • Identify if there are skills and capabilities issues
  • Identify if there are capacity issues
  • Identify if there are issues with vendor engagement
  • Identify if there are issues with L2/L3 engagement
  • Identify if there are issues with recovery procedures
  • Identify which teams are contributing to higher MTTR
  • Identify which agents are contributing to higher MTTR
  • Identify which Applications/Infra components are contributing to higher MTTR
  • Identify via a time series analysis whether MTTR varies by time of day, day of week, week of the month, or month of the year
  • Identify if MTTR is impacted if issues span multiple teams
  • Identify if MTTR varies on repeat/chronic issues
  • Identify if there are issues with monitoring and alerting
  • Identify if there are issues with event management
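Several of the probes above amount to grouping incidents by some attribute and comparing the MTTR of each group. A minimal sketch of that idea follows. The helper `mttr_by`, the record fields (`team`, `opened`, `resolved`), and the sample incidents are all assumptions made for illustration; the same grouping could be applied to agent, application, or hour of day.

```python
from collections import defaultdict
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def mttr_by(incidents, key):
    """Group incidents with `key` and return the MTTR in hours per group."""
    buckets = defaultdict(list)
    for incident in incidents:
        elapsed = incident["resolved"] - incident["opened"]
        buckets[key(incident)].append(elapsed.total_seconds() / 3600)
    return {group: sum(hours) / len(hours) for group, hours in buckets.items()}


# Made-up incidents; 3 July 2023 was a Monday.
incidents = [
    {"team": "Network",
     "opened": datetime(2023, 7, 3, 9, 0),
     "resolved": datetime(2023, 7, 3, 12, 0)},
    {"team": "Network",
     "opened": datetime(2023, 7, 4, 22, 0),
     "resolved": datetime(2023, 7, 5, 3, 0)},
    {"team": "Database",
     "opened": datetime(2023, 7, 3, 10, 0),
     "resolved": datetime(2023, 7, 3, 11, 0)},
]

by_team = mttr_by(incidents, lambda i: i["team"])
by_weekday = mttr_by(incidents, lambda i: WEEKDAYS[i["opened"].weekday()])

print(by_team)     # {'Network': 4.0, 'Database': 1.0}
print(by_weekday)  # {'Monday': 2.0, 'Tuesday': 5.0}
```

Groups with only a handful of incidents will produce unstable averages, so it helps to review the incident count per group alongside the MTTR before drawing conclusions.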
