Features
OnCallM provides comprehensive AI-powered alert analysis for Kubernetes environments.
Core Features
🧠 Smart Root Cause Analysis
OnCallM's AI engine analyzes multiple data sources to identify the true root cause of incidents:
- Log Analysis: Parses application and system logs for error patterns
- Metric Correlation: Correlates alerts with performance metrics
- Cluster State: Examines Kubernetes resource states and events
- Historical Context: Leverages past incidents for pattern recognition
Example Analysis:
Root Cause: Memory leak in user-service deployment
Evidence:
- Memory usage increased 300% over 2 hours
- OOMKilled events in pod logs
- Garbage collection failures in application logs
- Similar pattern detected in incident #2431
Recommendation:
1. Restart affected pods: kubectl delete pods -l app=user-service
2. Increase memory limits to 2Gi
3. Review code changes from PR #1234
⚡ Lightning Fast Response
Get detailed analysis in seconds, not minutes. Typical time per stage:
- < 5 seconds: Alert processing and queueing
- < 30 seconds: AI analysis completion
- < 1 second: Report generation and delivery
Performance Metrics:
- 99th percentile response time: 45 seconds
- Average analysis time: 12 seconds
- Concurrent alert processing: Up to 100 alerts
☸️ Kubernetes Native
Built specifically for Kubernetes, with a deep understanding of:
- Workload Types: Deployments, StatefulSets, DaemonSets, Jobs
- Networking: Services, Ingress, NetworkPolicies
- Storage: PVCs, StorageClasses, Volume mounts
- Security: RBAC, SecurityContexts, PodSecurityPolicies (removed in Kubernetes 1.25; superseded by Pod Security Admission)
- Observability: Metrics, Logs, Events, Traces
🎯 Actionable Recommendations
Every analysis includes specific, actionable steps:
- Immediate Actions: Quick fixes to resolve the incident
- Root Cause Remediation: Steps to prevent recurrence
- Monitoring Improvements: Suggestions for better observability
- Capacity Planning: Resource optimization recommendations
Easy Integration
Simple webhook integration with existing tools:
- Alertmanager: Native webhook support
- Prometheus: Metric correlation and analysis
- Grafana: Dashboard integration
- Slack/Teams: Notification integration
- PagerDuty: Incident management integration
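As a rough illustration of the webhook integration, the sketch below parses an Alertmanager webhook body into a list of alerts. The field names (`alerts`, `labels`, `status`, `startsAt`) follow Alertmanager's webhook payload; the `IncomingAlert` class and `parse_alertmanager_payload` function are hypothetical names used for illustration, not part of OnCallM's API.

```python
import json
from dataclasses import dataclass


@dataclass
class IncomingAlert:
    """Minimal view of one alert from an Alertmanager webhook payload."""
    name: str
    status: str
    labels: dict
    starts_at: str


def parse_alertmanager_payload(body: str) -> list:
    """Extract the alerts from an Alertmanager webhook JSON body."""
    payload = json.loads(body)
    alerts = []
    for item in payload.get("alerts", []):
        labels = item.get("labels", {})
        alerts.append(
            IncomingAlert(
                name=labels.get("alertname", "unknown"),
                status=item.get("status", "unknown"),
                labels=labels,
                starts_at=item.get("startsAt", ""),
            )
        )
    return alerts


# Sample body in the shape Alertmanager sends to webhook receivers.
# The alert itself is made up for this example.
SAMPLE_BODY = json.dumps({
    "version": "4",
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "HighMemoryUsage", "app": "user-service"},
            "annotations": {"summary": "Memory usage above 90%"},
            "startsAt": "2024-01-01T10:00:00Z",
        }
    ],
})

parsed = parse_alertmanager_payload(SAMPLE_BODY)
print(parsed[0].name, parsed[0].status)
```

Missing fields fall back to defaults rather than raising, so a partially filled payload still yields a usable alert list.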
Rich Analytics
Comprehensive analysis reports with:
- Visual Timeline: Incident progression over time
- Resource Graphs: CPU, memory, network trends
- Event Correlation: Related Kubernetes events
- Impact Assessment: Affected services and users
- Historical Trends: Pattern analysis across incidents
Advanced Features
Multi-Cluster Support
Monitor multiple Kubernetes clusters:
clusters:
  - name: production
    webhook_url: /webhook/production
    priority: high
  - name: staging
    webhook_url: /webhook/staging
    priority: medium
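One way such a configuration might be used is to map an incoming webhook path to its cluster and to handle higher-priority clusters first. The sketch below assumes the configuration has already been loaded into a Python list; `CLUSTERS`, `PRIORITY_RANK`, `cluster_for_path`, and `order_by_priority` are hypothetical names, and the ranking of `high` before `medium` is an assumption rather than documented behavior.

```python
from typing import Optional

# Hypothetical in-memory copy of the cluster configuration.
CLUSTERS = [
    {"name": "production", "webhook_url": "/webhook/production", "priority": "high"},
    {"name": "staging", "webhook_url": "/webhook/staging", "priority": "medium"},
]

# Assumed ordering: a lower rank is handled first.
PRIORITY_RANK = {"high": 0, "medium": 1, "low": 2}


def cluster_for_path(path: str) -> Optional[dict]:
    """Return the cluster whose webhook_url matches the request path, if any."""
    for cluster in CLUSTERS:
        if cluster["webhook_url"] == path:
            return cluster
    return None


def order_by_priority(pending):
    """Sort (path, alert_name) pairs so higher-priority clusters come first.

    Pairs with an unknown path or priority are placed last.
    """
    def rank(item):
        cluster = cluster_for_path(item[0])
        if cluster is None:
            return len(PRIORITY_RANK)
        return PRIORITY_RANK.get(cluster["priority"], len(PRIORITY_RANK))

    return sorted(pending, key=rank)


queue = [
    ("/webhook/staging", "DiskPressure"),
    ("/webhook/production", "HighMemoryUsage"),
]
for path, alert_name in order_by_priority(queue):
    print(cluster_for_path(path)["name"], alert_name)
```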
Custom Analysis Workflows
Define custom analysis logic:
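# Custom analyzer
class DatabaseAnalyzer:
    def analyze(self, alert):
        # Handle only alerts that carry a 'database' label.
        if 'database' in alert.labels:
            return self.analyze_database_incident(alert)
        # Not a database alert: return None so other analyzers can handle it.
        return None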
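To show how an analyzer like this might be exercised, here is a self-contained sketch. The `Alert` dataclass, the `run_analyzers` helper, and the body of `analyze_database_incident` are all hypothetical placeholders; the actual alert type and registration mechanism are not described on this page.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    """Hypothetical alert shape: a name plus a dict of labels."""
    name: str
    labels: dict = field(default_factory=dict)


class DatabaseAnalyzer:
    def analyze(self, alert):
        # Handle only alerts that carry a 'database' label.
        if "database" in alert.labels:
            return self.analyze_database_incident(alert)
        return None

    def analyze_database_incident(self, alert):
        # Placeholder logic: a real analyzer would inspect logs and metrics.
        return f"database incident on {alert.labels['database']}: {alert.name}"


def run_analyzers(alert, analyzers):
    """Return the first non-None result, or None when no analyzer matches."""
    for analyzer in analyzers:
        result = analyzer.analyze(alert)
        if result is not None:
            return result
    return None


db_alert = Alert("SlowQueries", {"database": "orders-db"})
other_alert = Alert("HighCPU", {"app": "user-service"})

print(run_analyzers(db_alert, [DatabaseAnalyzer()]))
print(run_analyzers(other_alert, [DatabaseAnalyzer()]))
```

Returning `None` for alerts an analyzer does not recognize lets several analyzers be tried in sequence.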
Alert Correlation
Intelligent alert grouping and correlation:
- Temporal Correlation: Related alerts within time windows
- Service Correlation: Alerts affecting same services
- Infrastructure Correlation: Node-level incident correlation
- Dependency Correlation: Upstream/downstream service impacts
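Temporal correlation can be sketched as a simple gap-based grouping: an alert joins the current group when it starts within a fixed window of the previous alert, and otherwise starts a new group. This is an illustrative technique under assumed inputs, not a description of OnCallM's internal algorithm; the alert dictionaries, the `group_by_time_window` function, and the five-minute default are assumptions.

```python
from datetime import datetime, timedelta


def group_by_time_window(alerts, window=timedelta(minutes=5)):
    """Group alerts by start time.

    Alerts are sorted by start time; each alert joins the current group if it
    starts within `window` of the previous alert, otherwise it opens a new group.
    """
    groups = []
    for alert in sorted(alerts, key=lambda a: a["starts_at"]):
        if groups and alert["starts_at"] - groups[-1][-1]["starts_at"] <= window:
            groups[-1].append(alert)
        else:
            groups.append([alert])
    return groups


# Made-up sample alerts for illustration.
SAMPLE_ALERTS = [
    {"name": "PodCrashLooping", "starts_at": datetime(2024, 1, 1, 10, 3)},
    {"name": "HighMemoryUsage", "starts_at": datetime(2024, 1, 1, 10, 0)},
    {"name": "DiskPressure", "starts_at": datetime(2024, 1, 1, 10, 30)},
]

for group in group_by_time_window(SAMPLE_ALERTS):
    print([alert["name"] for alert in group])
```

A gap-based window lets a slow cascade of related alerts stay in one group, at the cost of occasionally chaining unrelated alerts that happen to fire close together.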
Trend Analysis
Long-term pattern recognition:
- Seasonal Patterns: Daily, weekly, monthly trends
- Capacity Trends: Resource usage growth patterns
- Failure Patterns: Common failure modes and triggers
- Performance Trends: SLA and performance degradation patterns
AI Capabilities
Natural Language Processing
- Log Parsing: Extract meaning from unstructured logs
- Error Classification: Categorize errors by type and severity
- Intent Recognition: Understand alert context and urgency
- Summary Generation: Create human-readable incident summaries
Machine Learning
- Anomaly Detection: Identify unusual patterns in metrics
- Predictive Analysis: Forecast potential issues
- Classification: Automatically categorize incidents
- Clustering: Group similar incidents for pattern analysis
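As a minimal example of what anomaly detection on a metric can look like, the sketch below flags values that lie far from the mean in units of standard deviation (a z-score test). This is a generic textbook technique chosen for illustration; the page does not say which models OnCallM uses. The `find_anomalies` function, the 2.5 threshold, and the sample values are assumptions.

```python
from statistics import mean, stdev


def find_anomalies(values, threshold=2.5):
    """Return indices of values more than `threshold` sample standard
    deviations away from the mean."""
    if len(values) < 2:
        return []
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        # All values are identical, so nothing stands out.
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Made-up memory usage samples in MiB; the last one is a spike.
MEMORY_MIB = [510, 505, 498, 502, 507, 495, 500, 503, 499, 1500]

print(find_anomalies(MEMORY_MIB))
```

Because the spike itself inflates the mean and standard deviation, a z-score test on short series needs a fairly low threshold; comparing against a baseline window that excludes the newest points is a common refinement.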
Knowledge Graph
- Service Dependencies: Map service relationships
- Infrastructure Topology: Understand cluster architecture
- Historical Knowledge: Learn from past incidents
- Best Practices: Apply industry knowledge to recommendations
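A service dependency map can be modeled as a directed graph, and the services affected by a failure found with a breadth-first search over the reversed edges. The sketch below illustrates the idea; the `DEPENDS_ON` map, the service names in it, and the `impacted_services` function are hypothetical.

```python
from collections import deque

# Hypothetical map: service -> services it depends on.
DEPENDS_ON = {
    "frontend": ["user-service", "cart-service"],
    "user-service": ["postgres"],
    "cart-service": ["redis", "postgres"],
    "postgres": [],
    "redis": [],
}


def impacted_services(failed, depends_on):
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the edges: dependency -> set of services that depend on it.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first search outward from the failed service.
    seen = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service not in seen:
                seen.add(service)
                queue.append(service)
    return sorted(seen)


print(impacted_services("postgres", DEPENDS_ON))
print(impacted_services("redis", DEPENDS_ON))
```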
Integration Features
API Ecosystem
Comprehensive API coverage:
- REST API: Full CRUD operations for all resources
- GraphQL: Flexible query interface
- Webhook API: Event-driven integrations
- Metrics API: Prometheus-compatible metrics
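"Prometheus-compatible" generally means metrics are served in the Prometheus text exposition format, where each metric has `# HELP` and `# TYPE` comment lines followed by a sample line. The sketch below renders that format by hand; the metric names (`oncallm_queue_depth`, `oncallm_analysis_seconds_avg`) and their values are invented for illustration and may not match what OnCallM actually exposes.

```python
def render_metrics(metrics):
    """Render (name, help_text, metric_type, value) tuples in the
    Prometheus text exposition format."""
    lines = []
    for name, help_text, metric_type, value in metrics:
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {metric_type}")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metrics; names and values are placeholders.
SAMPLE_METRICS = [
    ("oncallm_queue_depth", "Alerts waiting for analysis.", "gauge", 3),
    ("oncallm_analysis_seconds_avg", "Average analysis time in seconds.", "gauge", 12),
]

print(render_metrics(SAMPLE_METRICS), end="")
```

In practice a client library would usually generate this output; writing it out by hand only serves to show what a scraper receives.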
Data Export
Export analysis data in multiple formats:
- JSON: Structured data for API consumption
- CSV: Tabular data for spreadsheet analysis
- PDF: Executive reports and documentation
- Markdown: Documentation-friendly format
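The JSON and CSV formats can be illustrated with the standard library alone. The sketch below serializes a small list of report rows both ways; the row fields, the sample data, and the `to_json` and `to_csv` helpers are hypothetical and do not reflect OnCallM's actual export schema.

```python
import csv
import io
import json

# Made-up report rows for illustration.
REPORT_ROWS = [
    {"service": "user-service", "root_cause": "memory leak", "severity": "high"},
    {"service": "cart-service", "root_cause": "connection pool exhausted", "severity": "medium"},
]


def to_json(rows):
    """Serialize report rows as pretty-printed JSON."""
    return json.dumps(rows, indent=2)


def to_csv(rows):
    """Serialize report rows as CSV, using the first row's keys as the header."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(to_json(REPORT_ROWS))
print(to_csv(REPORT_ROWS))
```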
Authentication & Authorization
Enterprise-grade security:
- RBAC: Role-based access control
- SSO Integration: SAML, OAuth, OIDC support
- API Keys: Programmatic access control
- Audit Logging: Complete action audit trail
Monitoring & Observability
Self-Monitoring
OnCallM monitors its own health:
- Performance Metrics: Response times, throughput, errors
- Resource Usage: CPU, memory, storage consumption
- Queue Health: Alert processing queue status
- AI Service Health: OpenAI API connectivity and usage
Alerting
Get notified about OnCallM issues:
- Service Degradation: Performance below thresholds
- Queue Backlog: Alert processing delays
- AI Service Issues: OpenAI API failures
- Resource Exhaustion: High CPU/memory usage
Dashboards
Pre-built monitoring dashboards:
- Operational Overview: System health and performance
- Alert Analysis: Incident trends and patterns
- Resource Utilization: Capacity planning metrics
- User Activity: Usage patterns and adoption metrics
Enterprise Features
High Availability
Production-ready deployment options:
- Multi-Instance: Load-balanced deployment
- Auto-Scaling: Dynamic scaling based on load
- Disaster Recovery: Cross-region failover
- Data Replication: Persistent data backup
Compliance & Security
Meet enterprise requirements:
- SOC 2 Type II: Security and availability compliance
- GDPR Compliance: Data privacy and protection
- Encryption: Data encryption at rest and in transit
- Network Security: VPC, firewall, and network policies
Support & SLA
Enterprise support options:
- 24/7 Support: Round-the-clock technical assistance
- SLA Guarantees: 99.9% uptime commitment
- Dedicated Success Manager: Personalized support
- Custom Training: Team onboarding and training
Roadmap
Upcoming Features
- Multi-Cloud Support: AWS, GCP, Azure integration
- AI Model Selection: Choose from multiple AI providers
- Custom Dashboards: Build personalized analysis dashboards
- Mobile App: iOS and Android mobile access
- Advanced Automation: Auto-remediation capabilities
Coming Soon
- Incident Simulation: Test your incident response
- Cost Analysis: Infrastructure cost optimization
- Security Analysis: Security-focused incident analysis
- Compliance Reporting: Automated compliance reports
Getting Started
Ready to explore these features?
Questions?
- View documentation
- Report issues
- Enterprise support