Problem Management

Problem Management is a core ITSM process focused on identifying, analyzing, and eliminating the underlying root causes of incidents to prevent them from recurring and to minimize their impact on business operations.

While Incident Management focuses on restoring service quickly, Problem Management takes a deeper, investigative approach. A problem is a cause, or potential cause, of one or more incidents. The process can be:

  • Reactive: Triggered after one or more incidents have occurred.
  • Proactive: Initiated by identifying trends or potential weaknesses before they cause incidents.

Benefits of Problem Management

A mature Problem Management process provides significant long-term value:

  • Reduced Incident Volume: By eliminating the root causes of recurring issues, you directly reduce the number of future incidents, freeing up your service desk to focus on more strategic work.
  • Improved Service Stability: Proactively identifying and resolving underlying issues leads to a more stable and reliable IT environment with less unplanned downtime.
  • Increased Productivity: With fewer disruptions, both end-users and IT staff can be more productive. Technicians spend less time "firefighting" the same issues repeatedly.
  • Enhanced Organizational Knowledge: The process of investigating and documenting problems, workarounds, and resolutions builds a valuable knowledge base that speeds up future diagnosis and empowers the service desk.

Problem vs. Incident Management

Understanding the distinction is key to effective service management.

| Aspect   | Problem Management                        | Incident Management            |
|----------|-------------------------------------------|--------------------------------|
| Focus    | Root cause identification and elimination | Quick service restoration      |
| Timeline | Long-term investigation and resolution    | Immediate response and recovery |
| Approach | Analytical and investigative              | Reactive and urgent            |
| Outcome  | Prevention of future incidents            | Restoration of current service |

Note: For detailed information about handling service disruptions, see Incident Management.

The Problem Management Lifecycle

The lifecycle provides a structured path from identifying a problem to implementing a permanent solution.

1. Identification

A problem is a potential underlying cause of one or more incidents. Problems can be identified reactively (e.g., from a major incident or recurring incident analysis) or proactively (e.g., through trend analysis of system performance data).

2. Categorization

Once identified, the problem is logged and categorized based on the affected service, technology, or business area. This helps in routing the problem to the correct team and in future trend analysis.

3. Prioritization

The problem is prioritized based on its business impact and urgency, which determines the resources and time allocated to its investigation. This ensures that the most critical problems are addressed first.
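
One common way to combine impact and urgency is a priority matrix. The sketch below is illustrative only: the `Level` scale, the `prioritize` function, and the priority labels are assumptions made for this example, not ServiceOps settings.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical three-step scale used for both impact and urgency."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def prioritize(impact: Level, urgency: Level) -> str:
    """Map an impact/urgency pair to a priority label (assumed labels)."""
    score = impact + urgency  # ranges from 2 (low/low) to 6 (high/high)
    if score >= 6:
        return "Critical"
    if score == 5:
        return "High"
    if score == 4:
        return "Medium"
    return "Low"


# Example: a problem with high impact but only medium urgency.
print(prioritize(Level.HIGH, Level.MEDIUM))
```

A real deployment would typically define the matrix in the tool's configuration; the additive score here is just the simplest rule that ranks high-impact, high-urgency problems first.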

4. Analysis

This is the core of the process, where a Root Cause Analysis (RCA) is performed to understand the fundamental issue. During this phase, a temporary workaround may be developed and documented in the Known Error Database (KEDB) to mitigate the impact of related incidents while a permanent solution is sought.
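
To illustrate how a documented workaround can help with related incidents, the following sketch models a KEDB as a simple in-memory list and matches incidents to known errors by keyword overlap. The `KnownError` structure, the record IDs, and the matching rule are assumptions made for the example; an actual KEDB would use the product's own records and search.

```python
import re
from dataclasses import dataclass


@dataclass
class KnownError:
    """Hypothetical KEDB entry: a problem with a documented workaround."""
    problem_id: str
    symptoms: set[str]  # lowercase keywords describing the symptoms
    workaround: str


def suggest_workarounds(summary: str, kedb: list[KnownError]) -> list[KnownError]:
    """Return known errors sharing at least one keyword with the incident summary."""
    words = set(re.findall(r"[a-z0-9]+", summary.lower()))
    matches = [ke for ke in kedb if ke.symptoms & words]
    # Entries with the most overlapping keywords come first.
    return sorted(matches, key=lambda ke: len(ke.symptoms & words), reverse=True)


kedb = [
    KnownError("PRB-1", {"vpn", "timeout"}, "Manually add a firewall exception for the user"),
    KnownError("PRB-2", {"printer", "spooler"}, "Restart the print spooler service"),
]

for ke in suggest_workarounds("User cannot connect to VPN, timeout after login", kedb):
    print(ke.problem_id, "->", ke.workaround)
```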

5. Resolution

Once the root cause is confirmed, a permanent solution is developed. This typically involves raising a Request for Change (RFC) to implement the fix in a controlled manner. The resolution is deployed and then verified to ensure it has effectively solved the problem.

6. Closure

After the resolution has been successfully implemented and verified, the problem record is formally closed. The solution is documented in the knowledge base, and all linked incidents are updated or resolved.

Common Use Cases

The example below follows a VPN access problem through each stage of the lifecycle.

  1. Identification: The service desk notices a spike in incidents related to users being unable to access the VPN. Multiple tickets are logged for the same issue.
  2. Categorization & Prioritization: A problem record is created, categorized under "Network Services," and prioritized as high due to the significant impact on remote workers.
  3. Analysis: The network team investigates and discovers that a recent firewall update is incorrectly blocking certain IP ranges. They document a workaround (manually adding an exception for affected users) and add it to the Known Error Database.
  4. Resolution: The team raises an emergency RFC to roll back the faulty firewall rule and applies a corrected patch.
  5. Closure: After monitoring the system to confirm no new VPN incidents are logged, the problem record is closed. The final solution is documented in the knowledge base.

Roles and Responsibilities

  • Problem Manager: Owns and governs the Problem Management process. They coordinate major problem investigations, ensure the process is followed, and report on key metrics.
  • Problem Analyst / Coordinator: The individual or team responsible for the day-to-day management of problems. They conduct the Root Cause Analysis, document findings, and develop workarounds.
  • Technical Subject Matter Experts (SMEs): Specialists from various IT teams (e.g., network, database, application support) who are brought into the investigation to provide deep technical expertise and help identify the root cause.

Key Capabilities

ServiceOps provides a robust set of tools for managing the entire problem lifecycle.

Problem & Known Error Management
  • Centralized Records: Create and manage problem records and a Known Error Database (KEDB). Link multiple incidents to a single problem to track trends and impact.
  • Detailed Information: Capture all relevant data, including description, priority, impact, affected services, and related CIs, directly within the problem ticket.
  • Status Tracking: Monitor problems through a customizable lifecycle (e.g., Open → Investigating → Resolved → Closed).
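
The example lifecycle above (Open → Investigating → Resolved → Closed) can be thought of as a small state machine. The sketch below is a minimal illustration under assumptions: the `ALLOWED_TRANSITIONS` table and the `advance` function are hypothetical, and the reopen path from Resolved back to Investigating is an added assumption rather than documented product behavior.

```python
# Hypothetical transition table for the example lifecycle.
ALLOWED_TRANSITIONS = {
    "Open": {"Investigating"},
    "Investigating": {"Resolved"},
    # Assumption: a problem can be reopened if the fix fails verification.
    "Resolved": {"Closed", "Investigating"},
    "Closed": set(),
}


def advance(current: str, target: str) -> str:
    """Return the new status, or raise ValueError if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"Cannot move problem from {current!r} to {target!r}")
    return target


status = "Open"
status = advance(status, "Investigating")
status = advance(status, "Resolved")
print(status)
```

Encoding the transitions as data keeps the lifecycle customizable: adding or removing a status means editing the table, not the logic.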
Analysis & Investigation
  • Root Cause Analysis: Document the symptoms, root cause, and business impact within the problem record.
  • Workaround Documentation: Define and communicate temporary workarounds to stakeholders and automatically suggest them for related incidents.
  • Data-Driven Insights: Use reporting and analytics to analyze historical incident data, identify trends, and spot potential problems proactively.
Workflow & Automation
  • Automated Workflows: Use rules to automatically create a problem when a certain threshold of similar incidents is reached. Assign problems and trigger escalations based on SLAs.
  • Knowledge Integration: Seamlessly link problems to knowledge base articles, known errors, and resolution guides.
  • Notifications: Keep stakeholders informed with automated notifications at key stages of the lifecycle.
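
As a rough illustration of the threshold rule described under Automated Workflows, the sketch below counts recent incidents per category and reports the categories that would warrant a problem record. The function name, the 24-hour window, and the threshold of five are assumptions chosen for the example, not product defaults.

```python
from collections import Counter
from datetime import datetime, timedelta


def categories_over_threshold(incidents, now, window=timedelta(hours=24), threshold=5):
    """Return categories with at least `threshold` incidents inside the window.

    `incidents` is an iterable of (category, opened_at) tuples.
    """
    cutoff = now - window
    counts = Counter(cat for cat, opened_at in incidents if opened_at >= cutoff)
    return sorted(cat for cat, n in counts.items() if n >= threshold)


now = datetime(2024, 1, 2, 12, 0)
incidents = [("VPN", now - timedelta(hours=h)) for h in range(5)]
incidents.append(("Email", now - timedelta(hours=1)))
incidents.append(("VPN", now - timedelta(days=3)))  # outside the window, ignored

for category in categories_over_threshold(incidents, now):
    print(f"Create problem record for category: {category}")
```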

Measuring Success: Key Metrics (KPIs)

Key Performance Indicators for Problem Management
  • Reduction in Recurring Incidents: The primary measure of success. A downward trend in incidents related to specific services or CIs shows that root causes are being effectively eliminated.
  • Number of Known Errors Created: Indicates that the team is successfully identifying root causes and documenting workarounds, which helps the service desk resolve future incidents faster.
  • Average Time to Identify Root Cause: Measures the efficiency of the investigation and analysis phase.
  • Backlog of Open Problems: The number of unresolved problems, which helps in understanding the team's workload and identifying potential resource gaps.
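
Two of these KPIs are simple to compute from problem records. The sketch below assumes a hypothetical `ProblemRecord` with three timestamps; the field names are illustrative and do not reflect the actual ServiceOps data model.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ProblemRecord:
    """Hypothetical record holding only the timestamps the KPIs need."""
    opened_at: datetime
    root_cause_at: Optional[datetime] = None  # when the root cause was confirmed
    closed_at: Optional[datetime] = None


def average_hours_to_root_cause(problems) -> Optional[float]:
    """Mean hours from opening to root cause, over problems that have one."""
    durations = [
        (p.root_cause_at - p.opened_at).total_seconds() / 3600
        for p in problems
        if p.root_cause_at is not None
    ]
    return sum(durations) / len(durations) if durations else None


def open_backlog(problems) -> int:
    """Number of problems that have not been closed."""
    return sum(1 for p in problems if p.closed_at is None)


problems = [
    ProblemRecord(datetime(2024, 1, 1, 8), datetime(2024, 1, 1, 18), datetime(2024, 1, 3, 8)),
    ProblemRecord(datetime(2024, 1, 2, 8), datetime(2024, 1, 3, 4)),
    ProblemRecord(datetime(2024, 1, 4, 8)),
]
print(average_hours_to_root_cause(problems), open_backlog(problems))
```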

Best Practices & Integrations

Best Practices
  • Don't Mistake a Workaround for a Solution: A workaround restores service; a solution prevents recurrence. Ensure a full Root Cause Analysis is completed to find the true underlying cause.
  • Embrace Proactive Problem Management: Regularly schedule trend analysis of incident data to identify potential problems before they cause major disruptions.
  • Build a Robust Known Error Database (KEDB): A well-maintained KEDB is your most powerful tool. Document every workaround and resolution clearly to empower your service desk and speed up incident resolution.
  • Keep Problems Separate from Incidents: Although the two processes are linked, keep Problem Management distinct from Incident Management. This gives investigators time for thorough analysis without the immediate pressure of service restoration.
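
A scheduled trend analysis can be as simple as comparing incident counts per category across two periods. The sketch below is one possible approach under assumptions: the `rising_categories` function and its `min_ratio` and `min_count` parameters are hypothetical tuning knobs, not product features.

```python
from collections import Counter


def rising_categories(previous, current, min_ratio=2.0, min_count=3):
    """Flag categories whose incident count grew by at least `min_ratio`.

    `previous` and `current` are lists of incident categories, one entry per
    incident, for two consecutive periods of equal length.
    """
    prev_counts = Counter(previous)
    curr_counts = Counter(current)
    flagged = []
    for category, n in curr_counts.items():
        baseline = max(prev_counts.get(category, 0), 1)  # avoid a zero baseline
        if n >= min_count and n >= min_ratio * baseline:
            flagged.append(category)
    return sorted(flagged)


last_week = ["VPN", "Email", "Email"]
this_week = ["VPN"] * 4 + ["Email"] * 2 + ["Printer"]
print(rising_categories(last_week, this_week))
```

The `min_count` floor keeps a jump from zero to one incident from being flagged as a trend.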
Integrations

Problem management is most effective when connected to other ITSM processes.

  • Incident Management: Link multiple incidents to a single problem for trend analysis and efficient resolution.
  • Change Management: Raise a change request directly from a problem record to implement a permanent fix. See Change Management.
  • Asset & CMDB: Associate problems with specific assets or Configuration Items (CIs) to better understand the impact.
  • Knowledge Management: Populate the knowledge base with workaround details and permanent solutions. See Knowledge Management.
Security & Compliance
  • Secure Data: Ensure problem data, which may contain sensitive information, is protected with role-based access controls.
  • Audit Trails: Maintain a complete history of all actions taken on a problem record for compliance and review.
  • Security Problems: Use specialized workflows to handle security-related problems, ensuring they are investigated by the appropriate teams.