Sumo Logic • 2021

Alert Response Platform

Addressing the "3am problem" — a centralized platform to help site reliability engineers evaluate alerts and reduce mean time to resolution.

Role: Lead UX Designer
Team: 30+ cross-functional
Company: Sumo Logic

Solving the 3am problem

A centralized platform to help SREs evaluate alerts and reduce mean time to resolution during off-hours incidents.

After initial success with the unified monitors project, Sumo Logic turned toward extending the feature's capability by addressing a canonical challenge in site reliability: the "3am problem."

Site reliability engineers responsible for service health often need to respond to alerts during off-hours. When a service monitor sends an alert in the middle of the night — through Slack, SMS, or other channels — engineers must log into the service, review available information, and triage issues for response. Sometimes they need additional log searches, metrics evaluation, or root cause analysis to understand the issue at hand.

The idea behind the Alert Response Platform was to provide a central area that collected available information for users to analyze and act upon. Sumo would automatically generate a report showing graphical charts of associated activity, collected log results or metrics information, and auto-generated suggestions for exploration. Users could quickly determine whether an alert was real or a false alarm, and whether it needed immediate attention or could be deferred.

Primary information positioned above the fold for rapid assessment

Ad-hoc investigations cost precious time

Every incident started with the same manual information-gathering process — time better spent resolving the issue.

Reducing the mean time to resolution (MTTR) for engineers was paramount. Without intervention, engineers would continue experiencing unnecessary delays and additional work in alert evaluation and triage. Every incident started with the same manual information-gathering process — time that could be better spent actually resolving the issue.

Multiple critical monitors requiring investigation — each one demanding manual context gathering

Leading a cross-functional effort

Led end-to-end design for a 30+ person cross-functional team spanning multiple engineering groups and geographies.

I was the lead UX designer and worked directly with a core leadership team that included a senior product manager (also responsible for the unified monitors feature), a director of product management for the advanced analytics team, and two directors of engineering — one for advanced analytics and one based in Poland leading the metrics team.

The broader team included six front-end engineers, more than twenty back-end engineers across various teams, six QA engineers, and two documentation experts. Over the course of the project, several team members left and were replaced, requiring ongoing knowledge transfer and alignment.

What I owned

  • End-to-end design leadership — Responsible for the complete design from research through implementation
  • Information architecture — Defined how alert context would be organized and presented for rapid evaluation
  • Feature integration — Coordinated the repurposing of existing analytics tools into the alert response workflow
  • Stakeholder management — Regular reviews with internal stakeholders and the broader UX team

Responsive design explorations across breakpoints

Research, collaboration, iteration

Competitive research, customer interviews, and constant engineering collaboration shaped an iterative design approach.

The process followed a pattern similar to the unified monitors project. I researched existing competitor products, interviewed internal and external customers, collated the research, then presented the findings to the team leads.

Competitive research

Analyzed how competitors approached alert context and incident response workflows. Identified opportunities to differentiate through deeper integration with Sumo's existing analytics capabilities.

Customer interviews

Interviewed internal and external customers to understand their alert response workflows, pain points, and what information they needed most during incident triage.

Engineering collaboration

Constant discussions with engineers to understand existing capabilities and features they were proposing. Repurposed existing features into more succinct and actionable presentations.

Stakeholder reviews

Regular team meetings coordinated by product and engineering leads. Design reviews with the larger UX team in both formal sessions and ad-hoc one-on-one meetings.

Layout exploration and information hierarchy

Navigation interaction proposals

Context at a glance

Above-the-fold information hierarchy with repurposed analytics tools for automated, contextual incident insights.

To reduce time to resolution, we set out to provide a centralized area where engineers could review available information and determine next steps more quickly than ad-hoc investigations allowed. The project's main success was in page presentation and information architecture, which let engineers quickly evaluate primary information, then dig deeper as needed.

Above-the-fold information hierarchy

We positioned primary information above the fold so engineers could quickly evaluate the essential details, with deeper context progressively disclosed as they needed it. This approach prioritized rapid assessment over comprehensive display.

Repurposed analytics tools

We spent considerable effort repurposing existing log search analysis tools for automated presentation. LogCompare, LogReduce, LogExplain, and Root Cause Explorer became the basis for generating automated insights that engineers could explore depending on the initial results displayed.

Monitor history heatmap

We added a monitor history feature with a heatmap graph displaying the frequency of alerts for the related monitor. This gave users context on how often the monitor had been alerting: was this a frequent occurrence or an unusual event?
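
As a rough illustration of the data behind a heatmap like this, the sketch below counts alert firings per weekday-and-hour cell. The function name, the input shape, and the choice of weekday/hour cells are all assumptions made for the example; this is not how Sumo Logic computes the view.

```python
from collections import Counter
from datetime import datetime


def alert_heatmap(timestamps):
    """Count alerts per (weekday, hour) cell; weekday 0 is Monday."""
    return Counter((ts.weekday(), ts.hour) for ts in timestamps)


# Hypothetical firing times for a single monitor.
fired = [
    datetime(2021, 3, 1, 3, 5),   # Monday, 3am
    datetime(2021, 3, 1, 3, 40),  # Monday, 3am
    datetime(2021, 3, 8, 3, 15),  # the following Monday, 3am
    datetime(2021, 3, 4, 14, 0),  # Thursday, 2pm
]
cells = alert_heatmap(fired)
```

A cell with a high count, such as Monday at 3am here, suggests a recurring alert, while a lone cell suggests an unusual event.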

Related alerts by entity and time

We attached related alerts by entity and time, so users could see other alert activity for entities that shared monitors and were in abnormal states, as well as alerts firing during the same time period. This helped engineers understand the broader scope of an incident.
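
The two groupings can be sketched in a few lines. Everything here is hypothetical: the `Alert` fields, the alert and entity names, and the 30-minute window are assumptions for illustration, not the product's actual matching rules.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Alert:
    name: str
    entity: str
    fired_at: datetime


def related_alerts(target, others, window=timedelta(minutes=30)):
    """Group other alerts by shared entity and by closeness in time."""
    by_entity = [a for a in others if a.entity == target.entity]
    by_time = [a for a in others if abs(a.fired_at - target.fired_at) <= window]
    return by_entity, by_time


# Hypothetical alerts around a 3am incident.
target = Alert("high-latency", "checkout-service", datetime(2021, 3, 1, 3, 5))
others = [
    Alert("error-rate", "checkout-service", datetime(2021, 3, 1, 1, 0)),
    Alert("cpu-saturation", "payments-db", datetime(2021, 3, 1, 3, 10)),
    Alert("disk-usage", "search-index", datetime(2021, 2, 27, 12, 0)),
]
by_entity, by_time = related_alerts(target, others)
```

In this example the entity grouping surfaces an earlier alert on the same service, and the time grouping surfaces an alert on a different entity that fired minutes later.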

Integrated analytics tools provided automated insights for investigation

Monitor history heatmap showing alert frequency over time

A mixed reception

Adoption was tepid, but the design leadership earned a promotion to Staff UX Designer.

Adoption rates were not satisfactory. Over the course of the project, we limited the scope of information we were showing for a variety of reasons, and reception was generally tepid.

The most common feedback was that the page was simply repackaging already available information and did not offer sufficient new insights. The advanced analytics features were not reliably providing the type of feedback that would make them useful for rapid triage.

The tepid reception was not what we had anticipated, and the project could not secure the additional resources needed for further development to improve outcomes.

Personal outcome

Despite the mixed product reception, my work on this project was recognized with a promotion to Staff UX Designer. The design leadership, cross-functional collaboration, and stakeholder management I demonstrated throughout the project contributed to this career milestone.

Responsive layouts across breakpoints

Live implementation of the Alert Response Platform

Move faster, break more things

Stronger advocacy and urgency were needed to push for a deeper, more useful product feature set.

This project taught me more about the politics of the organization, and I realized that I needed to push harder to promote a deeper product feature set. The complexity of the project and the scope of the feature set demanded more proactive insights, but factors beyond my control had a greater impact on the outcome. Sometimes the right design decision requires not just identifying the solution, but building the internal support to see it through.

Full-width design at 1440px

History panel at smaller viewport
