DWS Service Desk Rescue Series: Service Desk Analytics As a Service Outline

July 11, 2023

I previously posted about an approach to conducting IT Operational Analytics using ITSM (ticket) data. My hope was that you would focus on the Key Client Criteria section and the Approach.

The Key Client Criteria comes from 30+ years of hands-on IT Operations experience working at all levels and across the globe.

The Approach comes from working with 100+ clients around the world and fine-tuning it along the way.

This is the secret sauce.

Now, I am sharing a Service Desk Centric Approach. Within the world of IT, the Service Desk is the single point of contact for all services a company provides to its employees. When your laptop breaks down, who do you call? If you are having issues with your printer, VPN, Wi-Fi, monitor, etc., whom do you call? You call the Service Desk/Help Desk.

Similarly, this approach can be applied to other Contact Centers, which provide Service Desk functions, inbound call centers (where you call them), and outbound call centers (where they call you). The challenges are similar, if not the same.

I have also added the Operational Management Reference Framework. This is your operational bible. I will touch on this in future posts.

Take a look and see if it helps you with any work you are doing.

Assessment Questions:

  1. What are the top ticket drivers?
  2. What is the time series PBA Volume view of the tickets – Year/Month/Week/Day/Hour?
  3. What is the time series PBA MTTR view of the tickets – Year/Month/Week/Day/Hour? (A minimal volume/MTTR sketch for questions 1-3 appears after this list.)
  4. Are there any observable SPC patterns in the MTTR data? (See the SPC sketch after this list.)
  5. Are high MTTR times due to agent, requestor, KB, or other issues?
  6. Are high MTTR times related to the same call driver, the same agent, or the same requestor?
  7. Which Users/Countries/Departments are generating the tickets?
  8. Who is resolving these tickets (Resolver Name/Group)?
  9. What is the priority of these tickets?
  10. How were the tickets generated (Email, Phone, Chat, Web, Fax)?
  11. What is the status of the tickets?
  12. How long have the open tickets been aging? (See the aging/repeat sketch after this list.)
  13. Are there repeat tickets for the same issue (Agents not resolving, or HW/SW/NW bugs)?
  14. Does MTTR vary by Agent, Time of Day/Week/Month, Application, Region, Business Unit, etc.? Are there any observable SPC patterns in the MTTR data?
  15. Does the quality of the documentation vary by Agent, Time of Day/Week/Month, etc.?
  16. Do the negative CSAT surveys map to Agents, Time of Day/Week/Month, Application, Resolver Group, Business Unit, Country, etc.?
  17. Is any training for the agents or requestors needed based on quality of the tickets, MTTR reviews, Resolution Codes, and L2 groups tickets are sent to?
  18. Is any update to the KB documentation needed based on quality of the tickets, MTTR reviews, Resolution Codes, and L2 groups tickets are sent to?
  19. Does the Self-Help Portal need to be updated based on quality of the tickets, MTTR reviews, Resolution Codes, and L2 groups tickets are sent to?
  20. Do we need to distribute e-newsletters or FAQs, or hold DSUG meetings, based on quality of the tickets, MTTR reviews, Resolution Codes, and L2 groups tickets are sent to?
  21. Are tickets being flagged in a way that excludes them from SLA calculations and CSAT surveys?
  22. Are Agents prolonging tickets at certain periods of the day to avoid taking additional calls?
  23. Are Agents offloading tickets at certain periods of the day (Breaks, Lunches, Dinner, Shift End)?
  24. What is the case-to-call ratio?
  25. Is ticket documentation quality varying based on agents, time of day, etc.?
  26. Are Agents closing tickets prematurely to inflate FCR?
  27. Are Agents referring a higher number of tickets at certain times of the day?
  28. Are Customers requesting L1.5/DSS teams versus working with SD?
  29. Are repeat calls confined to a certain group of agents?
  30. Were the Business requirements unclear for the implementation, ultimately leading to an outage immediately after implementation or shortly afterwards (i.e. a few business days or weeks in)?
  31. Were the Business requirements clear but not implemented properly? What was the cause (procedural, tools, human error, etc.)?
  32. For any incidents, did we make changes without consulting the business?
  33. Are the incidents (or what percentage of them) a result of failed Technical PIV, Business PIV, or both?
  34. Was PIV performed by both areas after implementation? If not, why not (i.e. not required, oversight, Business not available, Technical resource shortages, etc.)?
  35. How many incidents were related to the TEST/UAT environment not being like-for-like with Production, and thus to incomplete testing?
  36. Is the ratio of incidents in the Application or supporting Infrastructure suite higher than in the other LOB applications?
  37. Is the largest percentage of Application outages isolated to a finite group of applications? If so, what is that telling us?
  38. Is the ratio of Changes larger against a specific set of Applications/Infrastructure versus the remainder of Applications? Why?
  39. Are Applications being downgraded in Priority levels as a result of incorrect batch flow automation or human error when creating the ticket? What is the ratio of each type?
  40. Is it the same suite of Applications being incorrectly categorized with respect to the priority being downgraded?
  41. Are any incidents being reported under different Priority ratings (i.e. similar impact though reported as P1, P2, P3, or P4 at different times), and thus inconsistently?
  42. How many incidents are related to changes that were implemented to correct other errors?
  43. How many changes have been implemented to correct improperly applied changes?
  44. How many and which Applications/Infrastructure have a higher ratio of incidents related to Single Points of Failure, Technology Currency, Patching and/or Resiliency exposures?
  45. Do we have a proportionate number of incidents related to Software versus Hardware issues?
  46. Can we ascertain how many incidents are repeats and/or could have been avoided if we properly executed on the Problem Management Process to determine & execute on Root Cause/Avoidance at the first incident?
  47. For the incidents identified as repeats, was the Problem Management process performed with the first incident or not, and why?
  48. If the PM process was executed and the RCA clearly identified, why were we not able to avert the subsequent outage?
  49. How long after a Change was implemented did we suffer the first outage (i.e. when a new feature is first utilized, etc.)?
  50. What are the common incident outage themes (i.e. People, Process, Documentation, Technical, Tools) across both Application/Infrastructure?
  51. On the Infrastructure side can we ascertain outage ratios against Database-Middleware, Mainframe, Mid-Range, Windows, Virtual, Network, Storage, Unix etc. to strike a theme?  We can then dig further into this output.
  52. For incidents, how many do we learn about from the client (i.e. via the Service Desk) first, before we know and can react on the Application/Infrastructure side via alerting/monitoring?
  53. What is the typical time gap when this occurs?
  54. Where can we implement synthetic testing of application functionality or transactions to simulate a flow and put up an alert if a certain condition is met before a client calls (i.e. similar to some Online Banking test scripts)?
  55. For alerting/monitoring – if we have a reaction time gap, is this because of short staffing on the Application/Infrastructure side to deal with the volume?
  56. For alerting/monitoring – if we have a reaction time gap, could it be because we rely on a non-Command Centre 7/24 structure (eyes on glass) or Follow the Sun model to react in real time, as opposed to the alert being sent to an offsite L2/L3 support person who needs to log in to validate the error and then take action, resulting in delays?
  57. For alerting/monitoring – are we seeing a negative trend in either Infrastructure or Application alerting and reaction times?
  58. How many incidents do not have alerting/monitoring defined, either due to oversight or because we cannot monitor a service, function, alert, etc.?
  59. What is the average time lag between each step of the chain Alert -> Business Impact -> Incident Ticket Open -> Resolved -> Closed (MTTR) -> Problem Ticket Open/Close? (See the time-lag sketch after this list.)
  60. What are the trending Root Cause Codes (Infrastructure/Application)?
  61. What is the time delay to Root Cause identification for Application/Infrastructure/3rd Party-Vendor?
  62. Are the Incident trends more prevalent during the Monday-Friday time period versus Saturday-Sunday?
  63. Are the Incident trends more prevalent during the 8AM-4PM, 4PM-Midnight, or Midnight-8AM time periods, on Monday-Friday versus Saturday-Sunday?
  64. Are the Incident trends more prevalent during the Month End, First Business Day, Mid-Month, Quarter End time frames?
  65. Are the incident trends more prevalent during short staffing periods or compressed timelines for projects/processing times etc. thus resulting in a higher ratio of incidents (i.e. Rushed Work is not Quality Work)?
  66. Per industry trends, are we worse than, equal to, or better than a potential peer group (i.e. another FI or otherwise)?
  67. What is the competition doing that we are not if our Application/Infrastructure architecture-configuration, size, interfaces and dependencies are equal and we have a higher ratio of outages?
  68. How much data or tickets did we discard during this exercise? Was it material that could have altered the outcome of this report?
  69. Did you surface trends by users/groups?
  70. The automated alerting that was reported: was it more prevalent in one Application or across a portfolio of applications?
  71. Were there specific trends on a day of the week?
  72. Do we have more details on repeat trends?
  73. Were you able to report on trends relative to alert/outage/ticket open-response times and the gaps within?
  74. We need to create a Service Management road show which includes a Service Desk/Application support Incident engagement flow in order to educate the field. We have done something like this with the other Service desks.
  75. Are tickets being addressed at the appropriate layers (Service Desk, Tier 2, Management, etc.)?
  76. Proactive Trend Analysis needs to be done consistently at the Application level. How will this be introduced?
  77. Are the trends/spikes in line with the interfacing apps which feed the highlighted applications in this report?
  78. Alert Settings – are the Performance & Capacity Management settings being reviewed within the Application space with respect to Trends/Insights?
  79. Do we have more details around Change Related Event-Incident Trends?
  80. Do you have more details around Vendor related incidents to extract trends?
  81. How can we expand on the inbound quality issues (i.e. feeder applications)?
  82. What are we learning or missing in the P3-P4 trends?
  83. Why are certain Service Request volumes higher across the portfolio of applications?
  84. Did we see behaviors across the Applications that are consistent within a specific department?
  85. There was a higher number of alerts. Do we know how that happened?
  86. Did we extract any Infrastructure related Alert-Incident data to match the themes as part of the overall exercise?
  87. Are there recommendations in here that support the establishment of an Application Command Centre Model (i.e. 7/24 eyes on glass support)?
  88. Who is receiving reporting on these negative trends or addressing tickets in their queues?
  89. Who will review the Alert-to-Incident variables to ensure a sanity check has been done?
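
To make questions 1-3 concrete, below is a minimal volume/MTTR sketch in Python (pandas) showing one way to pull the top drivers and the time-series views from a ticket export. The file name and the columns (ticket_id, category, opened_at, resolved_at) are illustrative assumptions; every ITSM tool labels these fields differently.

    # Minimal sketch for questions 1-3. File and column names are assumptions;
    # adjust them to your ITSM tool's export format.
    import pandas as pd

    tickets = pd.read_csv("ticket_export.csv", parse_dates=["opened_at", "resolved_at"])

    # Q1: top ticket drivers by volume.
    print(tickets["category"].value_counts().head(10))

    # MTTR in hours, computed on resolved tickets only.
    resolved = tickets.dropna(subset=["resolved_at"]).copy()
    resolved["mttr_hours"] = (
        resolved["resolved_at"] - resolved["opened_at"]
    ).dt.total_seconds() / 3600

    # Q2/Q3: time-series views of ticket volume and average MTTR.
    for label, rule in [("month", "MS"), ("week", "W"), ("day", "D")]:
        volume = tickets.resample(rule, on="opened_at")["ticket_id"].count()
        avg_mttr = resolved.resample(rule, on="opened_at")["mttr_hours"].mean()
        print(f"--- per {label} ---")
        print(pd.DataFrame({"volume": volume, "avg_mttr_hours": avg_mttr}).tail())

    # PBA-style hour-of-day view of ticket volume.
    print(tickets.groupby(tickets["opened_at"].dt.hour)["ticket_id"].count())

The hour-of-day view is typically the one you compare against staffing schedules and shift boundaries.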
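
For the SPC patterns asked about in questions 4 and 14, one simple starting point is an individuals control chart on daily average MTTR. The sketch below applies only the basic beyond-3-sigma rule (not the full Western Electric rule set) and reuses the same assumed ticket_export.csv columns as the volume/MTTR sketch.

    # Minimal SPC sketch: individuals chart on daily average MTTR.
    import pandas as pd

    tickets = pd.read_csv("ticket_export.csv", parse_dates=["opened_at", "resolved_at"])
    resolved = tickets.dropna(subset=["resolved_at"]).copy()
    resolved["mttr_hours"] = (
        resolved["resolved_at"] - resolved["opened_at"]
    ).dt.total_seconds() / 3600

    daily_mttr = resolved.resample("D", on="opened_at")["mttr_hours"].mean().dropna()

    # Control limits estimated from the average moving range (d2 = 1.128 for n = 2).
    moving_range = daily_mttr.diff().abs().dropna()
    center = daily_mttr.mean()
    sigma_est = moving_range.mean() / 1.128
    ucl = center + 3 * sigma_est
    lcl = max(center - 3 * sigma_est, 0)

    out_of_control = daily_mttr[(daily_mttr > ucl) | (daily_mttr < lcl)]
    print(f"center={center:.1f}h  UCL={ucl:.1f}h  LCL={lcl:.1f}h")
    print(out_of_control)  # days whose average MTTR falls outside the control limits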
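
For questions 12 and 13, the aging/repeat sketch below buckets the age of open tickets and flags a crude repeat-ticket signal (same requestor and category within 14 days). The status values, the requestor column, and the 14-day window are assumptions to adapt to your environment.

    # Minimal aging and repeat-ticket sketch for questions 12-13.
    import pandas as pd

    tickets = pd.read_csv("ticket_export.csv", parse_dates=["opened_at", "resolved_at"])

    # Q12: aging buckets for tickets that are still open.
    open_tickets = tickets[tickets["status"].isin(["Open", "In Progress", "Pending"])].copy()
    open_tickets["age_days"] = (pd.Timestamp.now() - open_tickets["opened_at"]).dt.days
    aging = pd.cut(
        open_tickets["age_days"],
        bins=[0, 2, 7, 30, 90, float("inf")],
        labels=["0-2d", "3-7d", "8-30d", "31-90d", ">90d"],
        include_lowest=True,
    )
    print(aging.value_counts().sort_index())

    # Q13: crude repeat signal - same requestor and category again within 14 days.
    tickets_sorted = tickets.sort_values("opened_at")
    gap = tickets_sorted.groupby(["requestor", "category"])["opened_at"].diff()
    repeats = tickets_sorted[gap <= pd.Timedelta(days=14)]
    print(f"Possible repeat tickets: {len(repeats)} of {len(tickets)}")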
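
For question 59, the time-lag sketch below measures two segments of the chain: alert raised to incident opened, and incident opened to resolved. It assumes a hypothetical alert_export.csv in which each alert row carries the incident_id it was linked to; in practice the join key depends on how your event management tooling creates or references tickets.

    # Minimal time-lag sketch for question 59. File, column, and key names are
    # assumptions; the alert-to-ticket link depends on your event management tool.
    import pandas as pd

    tickets = pd.read_csv("ticket_export.csv", parse_dates=["opened_at", "resolved_at"])
    alerts = pd.read_csv("alert_export.csv", parse_dates=["raised_at"])

    linked = alerts.merge(
        tickets[["ticket_id", "opened_at", "resolved_at"]],
        left_on="incident_id",
        right_on="ticket_id",
        how="inner",
    )
    linked["alert_to_open_min"] = (
        linked["opened_at"] - linked["raised_at"]
    ).dt.total_seconds() / 60
    linked["open_to_resolve_hours"] = (
        linked["resolved_at"] - linked["opened_at"]
    ).dt.total_seconds() / 3600

    # Distribution of each lag; large alert-to-open gaps point back at questions 52-58.
    print(linked[["alert_to_open_min", "open_to_resolve_hours"]].describe())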

