Skip to content

Features ​

OnCallM provides comprehensive AI-powered alert analysis for Kubernetes environments.

Core Features ​

🧠 Smart Root Cause Analysis ​

OnCallM's AI engine analyzes multiple data sources to identify the true root cause of incidents:

  • Log Analysis: Parses application and system logs for error patterns
  • Metric Correlation: Correlates alerts with performance metrics
  • Cluster State: Examines Kubernetes resource states and events
  • Historical Context: Leverages past incidents for pattern recognition

Example Analysis:

Root Cause: Memory leak in user-service deployment
Evidence:
- Memory usage increased 300% over 2 hours
- OOMKilled events in pod logs
- Garbage collection failures in application logs
- Similar pattern detected in incident #2431

Recommendation:
1. Restart affected pods: kubectl delete pods -l app=user-service
2. Increase memory limits to 2Gi
3. Review code changes from PR #1234

⚑ Lightning Fast Response ​

Get detailed analysis in seconds, not minutes:

  • < 5 seconds: Alert processing and queueing
  • < 30 seconds: AI analysis completion
  • < 1 second: Report generation and delivery

Performance Metrics:

  • 99th percentile response time: 45 seconds
  • Average analysis time: 12 seconds
  • Concurrent alert processing: Up to 100 alerts

☸️ Kubernetes Native ​

Built specifically for Kubernetes with deep understanding of:

  • Workload Types: Deployments, StatefulSets, DaemonSets, Jobs
  • Networking: Services, Ingress, NetworkPolicies
  • Storage: PVCs, StorageClasses, Volume mounts
  • Security: RBAC, SecurityContexts, PodSecurityPolicies
  • Observability: Metrics, Logs, Events, Traces

🎯 Actionable Recommendations ​

Every analysis includes specific, actionable steps:

  • Immediate Actions: Quick fixes to resolve the incident
  • Root Cause Remediation: Steps to prevent recurrence
  • Monitoring Improvements: Suggestions for better observability
  • Capacity Planning: Resource optimization recommendations

πŸ’ Easy Integration ​

Simple webhook integration with existing tools:

  • AlertManager: Native webhook support
  • Prometheus: Metric correlation and analysis
  • Grafana: Dashboard integration
  • Slack/Teams: Notification integration
  • PagerDuty: Incident management integration

πŸ“Š Rich Analytics ​

Comprehensive analysis reports with:

  • Visual Timeline: Incident progression over time
  • Resource Graphs: CPU, memory, network trends
  • Event Correlation: Related Kubernetes events
  • Impact Assessment: Affected services and users
  • Historical Trends: Pattern analysis across incidents

Advanced Features ​

Multi-Cluster Support ​

Monitor multiple Kubernetes clusters:

yaml
clusters:
  - name: production
    webhook_url: /webhook/production
    priority: high
  - name: staging
    webhook_url: /webhook/staging
    priority: medium

Custom Analysis Workflows ​

Define custom analysis logic:

python
# Custom analyzer
class DatabaseAnalyzer:
    def analyze(self, alert):
        if 'database' in alert.labels:
            return self.analyze_database_incident(alert)

Alert Correlation ​

Intelligent alert grouping and correlation:

  • Temporal Correlation: Related alerts within time windows
  • Service Correlation: Alerts affecting same services
  • Infrastructure Correlation: Node-level incident correlation
  • Dependency Correlation: Upstream/downstream service impacts

Trend Analysis ​

Long-term pattern recognition:

  • Seasonal Patterns: Daily, weekly, monthly trends
  • Capacity Trends: Resource usage growth patterns
  • Failure Patterns: Common failure modes and triggers
  • Performance Trends: SLA and performance degradation patterns

AI Capabilities ​

Natural Language Processing ​

  • Log Parsing: Extract meaning from unstructured logs
  • Error Classification: Categorize errors by type and severity
  • Intent Recognition: Understand alert context and urgency
  • Summary Generation: Create human-readable incident summaries

Machine Learning ​

  • Anomaly Detection: Identify unusual patterns in metrics
  • Predictive Analysis: Forecast potential issues
  • Classification: Automatically categorize incidents
  • Clustering: Group similar incidents for pattern analysis

Knowledge Graph ​

  • Service Dependencies: Map service relationships
  • Infrastructure Topology: Understand cluster architecture
  • Historical Knowledge: Learn from past incidents
  • Best Practices: Apply industry knowledge to recommendations

Integration Features ​

API Ecosystem ​

Comprehensive API coverage:

  • REST API: Full CRUD operations for all resources
  • GraphQL: Flexible query interface
  • Webhook API: Event-driven integrations
  • Metrics API: Prometheus-compatible metrics

Data Export ​

Export analysis data in multiple formats:

  • JSON: Structured data for API consumption
  • CSV: Tabular data for spreadsheet analysis
  • PDF: Executive reports and documentation
  • Markdown: Documentation-friendly format

Authentication & Authorization ​

Enterprise-grade security:

  • RBAC: Role-based access control
  • SSO Integration: SAML, OAuth, OIDC support
  • API Keys: Programmatic access control
  • Audit Logging: Complete action audit trail

Monitoring & Observability ​

Self-Monitoring ​

OnCallM monitors its own health:

  • Performance Metrics: Response times, throughput, errors
  • Resource Usage: CPU, memory, storage consumption
  • Queue Health: Alert processing queue status
  • AI Service Health: OpenAI API connectivity and usage

Alerting ​

Get notified about OnCallM issues:

  • Service Degradation: Performance below thresholds
  • Queue Backlog: Alert processing delays
  • AI Service Issues: OpenAI API failures
  • Resource Exhaustion: High CPU/memory usage

Dashboards ​

Pre-built monitoring dashboards:

  • Operational Overview: System health and performance
  • Alert Analysis: Incident trends and patterns
  • Resource Utilization: Capacity planning metrics
  • User Activity: Usage patterns and adoption metrics

Enterprise Features ​

High Availability ​

Production-ready deployment options:

  • Multi-Instance: Load-balanced deployment
  • Auto-Scaling: Dynamic scaling based on load
  • Disaster Recovery: Cross-region failover
  • Data Replication: Persistent data backup

Compliance & Security ​

Meet enterprise requirements:

  • SOC 2 Type II: Security and availability compliance
  • GDPR Compliance: Data privacy and protection
  • Encryption: Data encryption at rest and in transit
  • Network Security: VPC, firewall, and network policies

Support & SLA ​

Enterprise support options:

  • 24/7 Support: Round-the-clock technical assistance
  • SLA Guarantees: 99.9% uptime commitment
  • Dedicated Success Manager: Personalized support
  • Custom Training: Team onboarding and training

Roadmap ​

Upcoming Features ​

  • Multi-Cloud Support: AWS, GCP, Azure integration
  • AI Model Selection: Choose from multiple AI providers
  • Custom Dashboards: Build personalized analysis dashboards
  • Mobile App: iOS and Android mobile access
  • Advanced Automation: Auto-remediation capabilities

Coming Soon ​

  • Incident Simulation: Test your incident response
  • Cost Analysis: Infrastructure cost optimization
  • Security Analysis: Security-focused incident analysis
  • Compliance Reporting: Automated compliance reports

Getting Started ​

Ready to explore these features?

  1. Deploy OnCallM
  2. Configure AlertManager
  3. Set up monitoring
  4. Explore the API

Questions? ​

Released under the MIT License.