Problem Management
Problem Management is a core ITSM process focused on identifying, analyzing, and eliminating the underlying root causes of incidents to prevent them from recurring and to minimize their impact on business operations.
While Incident Management focuses on restoring service quickly, Problem Management takes a deeper, investigative approach. A problem is the cause, or potential cause, of one or more incidents. The process can be:
- Reactive: Triggered after one or more incidents have occurred.
- Proactive: Initiated by identifying trends or potential weaknesses before they cause incidents.
Benefits of Problem Management
A mature Problem Management process provides significant long-term value:
- Reduced Incident Volume: By eliminating the root causes of recurring issues, you directly reduce the number of future incidents, freeing up your service desk to focus on more strategic work.
- Improved Service Stability: Proactively identifying and resolving underlying issues leads to a more stable and reliable IT environment with less unplanned downtime.
- Increased Productivity: With fewer disruptions, both end-users and IT staff can be more productive. Technicians spend less time "firefighting" the same issues repeatedly.
- Enhanced Organizational Knowledge: The process of investigating and documenting problems, workarounds, and resolutions builds a valuable knowledge base that speeds up future diagnosis and empowers the service desk.
Problem vs. Incident Management
Understanding the distinction between the two processes is key to effective service management.
| Aspect | Problem Management | Incident Management |
|---|---|---|
| Focus | Root cause identification and elimination | Quick service restoration |
| Timeline | Long-term investigation and resolution | Immediate response and recovery |
| Approach | Analytical and investigative | Reactive and urgent |
| Outcome | Prevention of future incidents | Restoration of current service |
For detailed information about handling service disruptions, see Incident Management.
The Problem Management Lifecycle
The lifecycle provides a structured path from identifying a problem to implementing a permanent solution.

1. Identification
A problem is a potential underlying cause of one or more incidents. Problems can be identified reactively (e.g., from a major incident or recurring incident analysis) or proactively (e.g., through trend analysis of system performance data).
2. Categorization
Once identified, the problem is logged and categorized based on the affected service, technology, or business area. This helps in routing the problem to the correct team and in future trend analysis.
3. Prioritization
The problem is prioritized based on its business impact and urgency, which determines the resources and time allocated to its investigation. This ensures that the most critical problems are addressed first.
4. Analysis
This is the core of the process, where a Root Cause Analysis (RCA) is performed to understand the fundamental issue. During this phase, a temporary workaround may be developed and documented in the Known Error Database (KEDB) to mitigate the impact of related incidents while a permanent solution is sought.
5. Resolution
Once the root cause is confirmed, a permanent solution is developed. This typically involves raising a Request for Change (RFC) to implement the fix in a controlled manner. The resolution is deployed and then verified to ensure it has effectively solved the problem.
6. Closure
After the resolution has been successfully implemented and verified, the problem record is formally closed. The solution is documented in the knowledge base, and all linked incidents are updated or resolved.
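The six stages above can be modeled as a simple ordered progression. The following is a minimal sketch, not ServiceOps' implementation: the `ProblemRecord` class, the `Level` scale, and the additive impact/urgency matrix in `derive_priority` are all hypothetical names and rules chosen for illustration.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Level(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Lifecycle stages in the order described above.
STAGES = [
    "Identification",
    "Categorization",
    "Prioritization",
    "Analysis",
    "Resolution",
    "Closure",
]


def derive_priority(impact: Level, urgency: Level) -> Level:
    """Hypothetical matrix: combine business impact and urgency."""
    score = int(impact) + int(urgency)  # ranges from 2 to 6
    if score >= 5:
        return Level.HIGH
    if score == 4:
        return Level.MEDIUM
    return Level.LOW


@dataclass
class ProblemRecord:
    title: str
    category: str
    impact: Level
    urgency: Level
    stage: str = STAGES[0]
    history: list = field(default_factory=list)

    @property
    def priority(self) -> Level:
        return derive_priority(self.impact, self.urgency)

    def advance(self) -> str:
        """Move to the next stage; refuse to go past Closure."""
        index = STAGES.index(self.stage)
        if index == len(STAGES) - 1:
            raise ValueError("problem is already closed")
        self.history.append(self.stage)
        self.stage = STAGES[index + 1]
        return self.stage
```

A record created with `Level.HIGH` impact and `Level.MEDIUM` urgency gets a high priority under this matrix, and repeated calls to `advance()` walk it from Identification to Closure while keeping a history of the stages it has passed through.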
Common Use Cases
Scenario 1: Reactive Problem Management
- Identification: The service desk notices a spike in incidents related to users being unable to access the VPN. Multiple tickets are logged for the same issue.
- Categorization & Prioritization: A problem record is created, categorized under "Network Services," and prioritized as high due to the significant impact on remote workers.
- Analysis: The network team investigates and discovers that a recent firewall update is incorrectly blocking certain IP ranges. They document a workaround (manually adding an exception for affected users) and add it to the Known Error Database.
- Resolution: The team raises an emergency RFC to roll back the faulty firewall rule and applies a corrected patch.
- Closure: After monitoring the system to confirm no new VPN incidents are logged, the problem record is closed. The final solution is documented in the knowledge base.
Scenario 2: Proactive Problem Management
- Identification: During a routine review of system monitoring dashboards, a system administrator notices that the disk space on a critical database server has been trending towards full capacity faster than expected.
- Categorization & Prioritization: A problem is logged and categorized under "Server Infrastructure" and prioritized as medium to prevent a future outage.
- Analysis: Investigation reveals that a new application is generating excessive log files that are not being purged correctly.
- Resolution: A change is implemented to modify the application's log rotation script and clear out the unnecessary old files.
- Closure: The administrator verifies that the new script is working and that disk space usage has stabilized. The problem is then closed.
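The disk-space trend in the proactive scenario can be quantified with a simple projection. This is a rough sketch under stated assumptions: the function name `days_until_full`, the daily sample format, and the use of a straight-line (least-squares) fit are illustrative choices, and real disk growth is often not linear.

```python
def days_until_full(samples, capacity_gb):
    """Estimate days until a disk fills, using a least-squares line.

    samples: list of (day_number, used_gb) tuples, oldest first.
    Returns None when usage is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")

    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("samples must span more than one day")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)

    slope = sxy / sxx  # average growth in GB per day
    if slope <= 0:
        return None

    # Project forward from the most recent observation.
    _, latest_used = samples[-1]
    return (capacity_gb - latest_used) / slope
```

For example, four daily samples growing by 10 GB per day on a 500 GB volume that currently holds 430 GB project to roughly seven days of headroom, which is the kind of figure that helps justify a priority for the problem record.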
Roles and Responsibilities
- Problem Manager: Owns and governs the Problem Management process. They coordinate major problem investigations, ensure the process is followed, and report on key metrics.
- Problem Analyst / Coordinator: The individual or team responsible for the day-to-day management of problems. They conduct the Root Cause Analysis, document findings, and develop workarounds.
- Technical Subject Matter Experts (SMEs): Specialists from various IT teams (e.g., network, database, application support) who are brought into the investigation to provide deep technical expertise and help identify the root cause.
Key Capabilities
ServiceOps provides a robust set of tools for managing the entire problem lifecycle.
Problem & Known Error Management
- Centralized Records: Create and manage problem records and a Known Error Database (KEDB). Link multiple incidents to a single problem to track trends and impact.
- Detailed Information: Capture all relevant data, including description, priority, impact, affected services, and related CIs, directly within the problem ticket.
- Status Tracking: Monitor problems through a customizable lifecycle (e.g., Open → Investigating → Resolved → Closed).
Analysis & Investigation
- Root Cause Analysis: Document the symptoms, root cause, and business impact within the problem record.
- Workaround Documentation: Define and communicate temporary workarounds to stakeholders and automatically suggest them for related incidents.
- Data-Driven Insights: Use reporting and analytics to analyze historical incident data, identify trends, and spot potential problems proactively.
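To illustrate how workarounds might be suggested for related incidents, here is a minimal keyword-matching sketch. It is not how ServiceOps matches incidents to known errors; the `KnownError` class, the `suggest_workarounds` function, and the `min_overlap` parameter are hypothetical, and a production system would likely use richer matching.

```python
import re
from dataclasses import dataclass


@dataclass
class KnownError:
    problem_id: str
    symptoms: set  # lowercase keywords describing the symptoms
    workaround: str


def suggest_workarounds(incident_text, known_errors, min_overlap=2):
    """Return known errors whose symptom keywords appear in the incident text."""
    words = set(re.findall(r"[a-z0-9]+", incident_text.lower()))
    matches = []
    for known_error in known_errors:
        overlap = len(known_error.symptoms & words)
        if overlap >= min_overlap:
            matches.append((overlap, known_error))
    # Strongest keyword overlap first.
    matches.sort(key=lambda pair: pair[0], reverse=True)
    return [known_error for _, known_error in matches]
```

Requiring at least two overlapping keywords is an arbitrary default intended to cut down on false matches from a single common word.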
Workflow & Automation
- Automated Workflows: Use rules to automatically create a problem when a certain threshold of similar incidents is reached. Assign problems and trigger escalations based on SLAs.
- Knowledge Integration: Seamlessly link problems to knowledge base articles, known errors, and resolution guides.
- Notifications: Keep stakeholders informed with automated notifications at key stages of the lifecycle.
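The threshold rule mentioned above can be sketched as a sliding-window count over similar incidents. This is an illustration only: the incident dictionary keys (`category`, `ci`, `opened_at`), the function name `find_problem_candidates`, and the default threshold and window are assumptions, not ServiceOps settings.

```python
from collections import defaultdict
from datetime import timedelta


def find_problem_candidates(incidents, threshold=5, window=timedelta(hours=24)):
    """Flag (category, ci) groups that reach `threshold` incidents within `window`.

    incidents: iterable of dicts with 'category', 'ci', and 'opened_at' keys,
    where 'opened_at' is a datetime.
    """
    groups = defaultdict(list)
    for incident in incidents:
        key = (incident["category"], incident["ci"])
        groups[key].append(incident["opened_at"])

    candidates = []
    for key, times in groups.items():
        times.sort()
        start = 0
        for end in range(len(times)):
            # Drop the oldest incidents until the span fits inside the window.
            while times[end] - times[start] > window:
                start += 1
            if end - start + 1 >= threshold:
                candidates.append(key)
                break
    return candidates
```

Each returned key identifies a group of similar incidents for which a problem record could be opened, with the matching incidents linked to it.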
Measuring Success: Key Metrics (KPIs)
- Reduction in Recurring Incidents: The primary measure of success. A downward trend in incidents related to specific services or CIs shows that root causes are being effectively eliminated.
- Number of Known Errors Created: Indicates that the team is successfully identifying root causes and documenting workarounds, which helps the service desk resolve future incidents faster.
- Average Time to Identify Root Cause: Measures the efficiency of the investigation and analysis phase.
- Backlog of Open Problems: The number of unresolved problems, which helps in understanding the team's workload and identifying potential resource gaps.
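As a rough sketch of how these metrics could be computed from exported problem data, consider the following. The record fields (`opened_at`, `root_cause_at`, `status`, `known_error`) and both function names are hypothetical; the status names follow the example lifecycle Open → Investigating → Resolved → Closed.

```python
from statistics import mean


def problem_kpis(problems):
    """Summarize problem records.

    problems: list of dicts with 'opened_at' (datetime), 'root_cause_at'
    (datetime or None), 'status' (str), and 'known_error' (bool).
    """
    # Backlog: anything not yet resolved or closed.
    open_backlog = sum(
        1 for p in problems if p["status"] not in ("Resolved", "Closed")
    )
    known_errors = sum(1 for p in problems if p["known_error"])
    rca_hours = [
        (p["root_cause_at"] - p["opened_at"]).total_seconds() / 3600
        for p in problems
        if p["root_cause_at"] is not None
    ]
    return {
        "open_backlog": open_backlog,
        "known_errors": known_errors,
        "avg_hours_to_root_cause": mean(rca_hours) if rca_hours else None,
    }


def recurring_incident_reduction(previous_count, current_count):
    """Percentage drop in incidents between two equal-length periods."""
    if previous_count == 0:
        return None
    return (previous_count - current_count) / previous_count * 100
```

Comparing equal-length periods, such as month over month for a specific service or CI, keeps the reduction figure meaningful.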
Best Practices & Integrations
Best Practices
- Don't Mistake a Workaround for a Solution: A workaround restores service; a solution prevents recurrence. Ensure a full Root Cause Analysis is completed to find the true underlying cause.
- Embrace Proactive Problem Management: Regularly schedule trend analysis of incident data to identify potential problems before they cause major disruptions.
- Build a Robust Known Error Database (KEDB): A well-maintained KEDB is your most powerful tool. Document every workaround and resolution clearly to empower your service desk and speed up incident resolution.
- Separate from Incidents: Although the two processes are linked, keep Problem Management distinct from Incident Management. This allows time for a thorough investigation without the immediate pressure of service restoration.
Integrations
Problem Management is most effective when connected to other ITSM processes.
- Incident Management: Link multiple incidents to a single problem for trend analysis and efficient resolution.
- Change Management: Raise a change request directly from a problem record to implement a permanent fix. See Change Management.
- Asset & CMDB: Associate problems with specific assets or Configuration Items (CIs) to better understand the impact.
- Knowledge Management: Populate the knowledge base with workaround details and permanent solutions. See Knowledge Management.
Security & Compliance
- Secure Data: Ensure problem data, which may contain sensitive information, is protected with role-based access controls.
- Audit Trails: Maintain a complete history of all actions taken on a problem record for compliance and review.
- Security Problems: Use specialized workflows to handle security-related problems, ensuring they are investigated by the appropriate teams.