The Analytics & Optimization process harvests various datasets to gain insights into the historical performance of the area being reviewed. The insights are used to develop foresight about how to reshape the go-forward strategy to drive operational improvements.
As the amount of collected data increases, so does the number of reports and associated metrics. As the focus shifts to gathering and presenting the data, however, many organizations fail to analyze it and extract insights.
As organizations embark on the analytics journey, they invest substantially in tools (particularly cognitive tools) and resources, and mobilize considerable internal effort: daily/weekly/monthly meetings, generation of reports, and the creation of squads and task forces. Yet they find that while they can report on the effort, the results are less than desirable.
Analytics is not about the tools, charts, or effort. It is about a mindset: the curiosity of the mind, the habit of looking at something and asking why.
The Objective of the Process:
The objective of the Analytics and Optimization process is to ensure that the data being collected is harvested, using a structured approach, to gain insights. The insights are then used to optimize the operational management system and drive improvements.
Sample List of Benefits:
Overall reduction in Incidents
Improved quality of Incident tickets
Reduced false alerts
Improvement in Change Success Rate
Improvement in SLA attainment
Reduction in Repeat issues
Reduction in Mis-Routed tickets
Reduction in Re-Opened tickets
Reduction and better management of Backlog
Improved Uptime and Availability of Applications and Infrastructure
Identification of Automation, Self-Help, and Self-Healing opportunities
Establishment of a Shift-Left Strategy
Improved management of Problem Tickets
Improved quality of RCA documents
Improved Major Incident Handling
Improved Training
Sample List of Observations:
Collection and presentation of data via reports, but no analytics performed on the data.
Analytics program not established.
Roles and responsibilities, processes, tools, meetings not established.
Training not provided to conduct analytics in a structured manner.
Analytics program not structured.
Sample List of Recommendations:
Ensure the collection of data is clearly defined: what the criteria are, who collects it, how it is collected, where it is stored, etc. (see the sketch after this list).
Provide training to identified resources.
Provide processes and procedures on how to run the analytics program.
Provide tools to conduct analytics.
Set up meetings to manage the process.
Create reports to review the analytics outputs.
Share the insights with the broader teams.
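The recommendation above to define the data collection can be made concrete. Below is a minimal sketch, in Python with pandas, of one way to record such a definition and run a first structured pass (top ticket drivers) over an exported dataset. The file name, column names, and DataSource fields are illustrative assumptions, not a prescribed schema.

```python
# Minimal sketch: a data-collection definition plus a first analytics pass.
# The file name and column names (category, opened_at, resolved_at) are
# illustrative assumptions about the ticket export.
from dataclasses import dataclass

import pandas as pd


@dataclass
class DataSource:
    """Describes how one dataset is collected for the analytics program."""
    name: str       # what the dataset is
    owner: str      # who collects it
    method: str     # how it is collected
    criteria: str   # what qualifies a record for inclusion
    location: str   # where the export is stored


incident_source = DataSource(
    name="Incident tickets",
    owner="Service Desk reporting team",
    method="Weekly CSV export from the ITSM tool",
    criteria="All tickets opened in the reporting period",
    location="incidents.csv",
)


def top_ticket_drivers(source: DataSource, top_n: int = 10) -> pd.Series:
    """Load the export and return the most frequent ticket categories."""
    tickets = pd.read_csv(source.location, parse_dates=["opened_at", "resolved_at"])
    return tickets["category"].value_counts().head(top_n)


if __name__ == "__main__":
    print(top_ticket_drivers(incident_source))
```

However small, a definition like this answers the who/what/how/where questions in one place and gives every report a traceable source.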
Assessment Questions:
What are the top ticket drivers?
What is the time series PBA Volume view of the tickets (Year/Month/Week/Day/Hour)?
What is the time series PBA MTTR view of the tickets (Year/Month/Week/Day/Hour)?
Are there any observable SPC patterns in the MTTR data? (See the control-chart sketch after this list.)
Are high MTTR times due to agent, requestor, KB, or other issues?
Are high MTTR times related to the same call driver, the same agent, or the same requestor?
Which Users/Countries/Departments are generating the tickets?
Who is resolving these tickets (Resolver Name/Group)?
What is the priority of these tickets?
How were the tickets generated (Email, Phone, Chat, Web, Fax)?
What is the status of the tickets?
How long are the open tickets aging?
Are they repeat tickets for the same issue (Agents not resolving or HW/SW/NW Bugs)?
Does MTTR vary between Agents, Time of Day/Week/Month, Application, Region, Business Unit, etc.?
Does the quality of the documentation vary between Agents, Time of Day/Week/Month, etc.?
Do the negative CSAT surveys map to Agents, Time of Day/Week/Month, Application, Resolver Group, Business Unit, Country, etc.?
Is any training for the agents or requestors needed based on the quality of the tickets, MTTR reviews, Resolution Codes, and the L2 groups tickets are sent to?
Is any update to the KB documentation needed based on the quality of the tickets, MTTR reviews, Resolution Codes, and the L2 groups tickets are sent to?
Does the Self-Help Portal need to be updated based on the quality of the tickets, MTTR reviews, Resolution Codes, and the L2 groups tickets are sent to?
Do we need to distribute e-newsletters and FAQs, or hold DSUG meetings, based on the quality of the tickets, MTTR reviews, Resolution Codes, and the L2 groups tickets are sent to?
Are Tickets being flagged in a way that excludes them from SLA calculations and CSAT surveys?
Are Agents prolonging tickets at certain periods of the day to avoid taking additional calls?
Are Agents offloading tickets at certain periods of the day (Breaks, Lunches, Dinner, Shift End)?
What is the case-to-call ratio?
Is ticket documentation quality varying based on agents, time of day, etc.?
Are Agents closing tickets prematurely to inflate FCR?
Are Agents referring a higher number of tickets at certain times of the day?
Are Customers requesting the L1.5/DSS teams rather than working with the SD?
Are repeat calls confined to a certain group of agents?
Were the Business requirements for the implementation unclear, ultimately leading to an outage immediately after implementation or shortly afterwards (i.e. a few business days or a week later)?
Were the Business requirements clear but not implemented properly? What was the cause (procedural, tools, human error, etc.)?
For any incidents, did we make changes without consulting the business?
Are the incidents (or what percentage of them) the result of failed Technical PIV, Business PIV, or both?
Was PIV performed by both areas after implementation? If not, why not (i.e. not required, oversight, Business not available, Technical resource shortages, etc.)?
How many incidents were related to the TEST/UAT environment not being like-for-like with Production, resulting in incomplete testing?
Is the ratio of incidents in the Application or supporting Infrastructure suite higher than in the other LOB applications?
Is the largest percentage of Application outages isolated to a finite group of applications? If so, what is that telling us?
Is the ratio of Changes larger against a specific set of Applications/Infrastructure versus the remainder of Applications? Why?
Are Application tickets being downgraded in Priority level as a result of incorrect batch flow automation or human error when creating the ticket? What is the ratio of each type?
Is it the same suite of Applications that is being incorrectly categorized with respect to the priority being downgraded?
Are there any incidents being reported under different Priority ratings (i.e. similar impact though reported as P1, P2, P3, or P4 at different times), and thus not consistent?
How many incidents are related to changes that were implemented to correct other errors?
How many changes have been implemented to correct improperly applied changes?
How many and which Applications/Infrastructure have a higher ratio of incidents related to Single Points of Failure, Technology Currency, Patching and/or Resiliency exposures?
Do we have a proportionate number of incidents related to Software versus Hardware issues?
Can we ascertain how many incidents are repeats and/or could have been avoided if we had properly executed the Problem Management Process to determine and execute on Root Cause/Avoidance at the first incident?
For the incidents identified as repeats, was the problem management process performed for the first incident or not, and why?
If the PM process was executed and the RCA clearly identified, why were we not able to avert the subsequent outage?
How long after a Change was implemented did we suffer the first outage (i.e. when a new feature was first utilized, etc.)?
What are the common incident outage themes (i.e. People, Process, Documentation, Technical, Tools) across both Application/Infrastructure?
On the Infrastructure side can we ascertain outage ratios against Database-Middleware, Mainframe, Mid-Range, Windows, Virtual, Network, Storage, Unix etc. to strike a theme? We can then dig further into this output.
For incidents: how many do we learn of from the client (i.e. via the Service Desk) first, before we know about them and can react on the Application/Infrastructure side via alerting/monitoring?
What is the normal time gap when this occurs?
Where can we implement synthetic testing of application functionality or transactions to simulate a flow and raise an alert if a certain condition is met before a client calls (i.e. similar to some Online Banking test scripts)? (See the synthetic-check sketch after this list.)
For alerting/monitoring: if we have a reaction time gap, is this because of short staffing on the Application/Infrastructure side relative to the volume?
For alerting/monitoring: if we have a reaction time gap, could it be because we rely on something other than a 7/24 Command Centre structure (eyes on glass) or Follow the Sun model to react in real time, so the alert is instead sent to an offsite L2/L3 support person who must log in to validate the error and then take action, resulting in delays?
For alerting/monitoring: are we seeing a negative trend in either Infrastructure or Application alerting and reaction times?
How many incidents do not have alerting/monitoring defined, either due to oversight or because we cannot monitor the service, function, alert, etc.?
What is the average time lag at each step of the chain Alert -> Business Impact -> Incident Ticket Open -> Resolved -> Closed (MTTR) -> Problem Ticket Open/Close? (See the time-lag sketch after this list.)
What are the trending Root Cause Codes (Infrastructure/Application)?
What is the time delay to Root Cause identification for Application/Infrastructure/3rd Party-Vendor?
Are the Incident trends more prevalent during the Monday-Friday time period versus Saturday-Sunday?
Are the Incident trends more prevalent during the 8AM-4PM, 4PM-Midnight, or Midnight-8AM time periods, on either Monday-Friday or Saturday-Sunday?
Are the Incident trends more prevalent during the Month End, First Business Day, Mid-Month, Quarter End time frames?
Are the incident trends more prevalent during short staffing periods or compressed timelines for projects/processing times etc. thus resulting in a higher ratio of incidents (i.e. Rushed Work is not Quality Work)?
Per industry trends, are we worse than, equal to, or better than a potential peer group (i.e. another FI or otherwise)?
What is the competition doing that we are not, if our Application/Infrastructure architecture-configuration, size, interfaces, and dependencies are equal and we have a higher ratio of outages?
How much data or tickets did we discard during this exercise? Was it material that could have altered the outcome of this report?
Did you surface trends by users/groups?
The automated alerting that was reported: was it more prevalent in one Application or in a portfolio of applications?
Were there specific trends on a day of the week?
Do we have more details on repeat trends?
Were you able to report on trends relative to alert/outage/ticket open-response times and the gaps within?
We need to create a Service Management road show which includes a Service Desk/Application support Incident engagement flow in order to educate the field. We have done something like this with the other Service desks.
Are tickets being addressed at the appropriate layers (Service Desk, Tier 2, Management, etc.)?
Proactive Trend Analysis needs to be done consistently at the Application level. How will this be introduced?
Are the trends/spikes in line with the interfacing apps that feed the highlighted applications in this report?
Alert Settings: are the Performance & Capacity Management settings being reviewed within the Application space with respect to Trends/Insights?
Do we have more details around Change Related Event-Incident Trends?
Do you have more details around Vendor related incidents to extract trends?
How can we expand on the inbound quality issues (i.e. feeder applications)?
What are we learning or missing in the P3-P4 trends?
Why are certain Service Request volumes higher across the portfolio of applications?
Did we see behaviors across the Applications that are consistent within a specific dept?
There was a higher # of alerts. Do we know why?
Did we extract any Infrastructure related Alert-Incident data to match the themes as part of the overall exercise?
Are there recommendations in here that support the establishment of an Application Command Centre Model (i.e. 7/24 eyes on glass support)?
Who is receiving reporting on these negative trends or addressing tickets in their queues?
Who will review the Alert-to-Incident variables to ensure a sanity check has been done?
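For the question about SPC patterns in the MTTR data, a minimal control-chart sketch follows, assuming a pandas DataFrame of tickets with hypothetical opened_at and resolved_at timestamp columns. It builds an individuals (XmR) chart on daily mean MTTR and flags points beyond the control limits; such points are only one SPC signal, and run rules can be layered on the same output.

```python
# Minimal individuals (XmR) control-chart sketch for spotting SPC signals in
# MTTR data. Column names (opened_at, resolved_at) and the file name are
# assumptions about the ticket export, not a prescribed schema.
import pandas as pd


def mttr_control_chart(tickets: pd.DataFrame) -> pd.DataFrame:
    """Return daily mean MTTR (hours) with individuals-chart control limits."""
    mttr_hours = (tickets["resolved_at"] - tickets["opened_at"]).dt.total_seconds() / 3600
    daily = mttr_hours.groupby(tickets["opened_at"].dt.date).mean()

    moving_range = daily.diff().abs()              # point-to-point variation
    centre = daily.mean()
    limit_width = 2.66 * moving_range.mean()       # standard XmR limit factor

    chart = pd.DataFrame({"mttr_hours": daily})
    chart["ucl"] = centre + limit_width
    chart["lcl"] = max(centre - limit_width, 0.0)  # MTTR cannot be negative
    chart["out_of_control"] = (chart["mttr_hours"] > chart["ucl"]) | (
        chart["mttr_hours"] < chart["lcl"]
    )
    return chart


if __name__ == "__main__":
    tickets = pd.read_csv("incidents.csv", parse_dates=["opened_at", "resolved_at"])
    flagged = mttr_control_chart(tickets)
    print(flagged[flagged["out_of_control"]])
```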
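For the question on synthetic testing of application functionality, the sketch below probes a health endpoint on a schedule and raises an alert when the response is unhealthy or slow. The URL, thresholds, and print-based alerting are placeholder assumptions; in practice the alert would be routed to whatever monitoring or ITSM tool is in use.

```python
# Minimal synthetic-check sketch: probe an application endpoint on a schedule
# and raise an alert before a client calls. The URL, thresholds, and the
# print-based "alert" are placeholders for the monitoring hook in use.
import time
import urllib.request

ENDPOINT = "https://example.com/health"   # hypothetical health-check URL
LATENCY_THRESHOLD_SECONDS = 2.0
CHECK_INTERVAL_SECONDS = 60


def run_synthetic_check(url: str) -> None:
    started = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            healthy = response.status == 200
    except Exception as error:            # connection failures are also alerts
        print(f"ALERT: {url} unreachable: {error}")
        return
    elapsed = time.monotonic() - started
    if not healthy or elapsed > LATENCY_THRESHOLD_SECONDS:
        print(f"ALERT: {url} degraded (status ok={healthy}, {elapsed:.1f}s)")


if __name__ == "__main__":
    while True:
        run_synthetic_check(ENDPOINT)
        time.sleep(CHECK_INTERVAL_SECONDS)
```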
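For the question about the average time lag across the Alert -> Incident -> Resolution chain, a minimal sketch follows, assuming a joined export with one row per incident and hypothetical timestamp columns for each step of the chain.

```python
# Minimal sketch for measuring the average lag at each step of the
# Alert -> Incident Open -> Resolved -> Closed chain. The file name and
# timestamp column names are assumptions about the available joined export.
import pandas as pd

STEPS = ["alert_time", "incident_opened", "incident_resolved", "incident_closed"]


def average_step_lags(events: pd.DataFrame) -> pd.Series:
    """Return the mean elapsed time (in hours) between consecutive steps."""
    lags = {}
    for earlier, later in zip(STEPS, STEPS[1:]):
        delta_hours = (events[later] - events[earlier]).dt.total_seconds() / 3600
        lags[f"{earlier} -> {later}"] = delta_hours.mean()
    return pd.Series(lags)


if __name__ == "__main__":
    events = pd.read_csv("alert_incident_chain.csv", parse_dates=STEPS)
    print(average_step_lags(events).round(2))
```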