Introduction
Elasticsearch is a powerful distributed search and analytics engine. But with great power comes great operational complexity. For Site Reliability Engineers (SREs), observability into these systems is non-negotiable. Poor visibility can lead to cascading failures, lost data, and frustrated users.
Note: OpenSearch is a community-driven fork of Elasticsearch 7.10 maintained by AWS. All monitoring concepts discussed here apply equally to OpenSearch unless explicitly stated.
This post explores how to monitor Elasticsearch from an SRE standpoint, focusing on performance, availability, and alerting while reducing toil and improving reliability.
Monitoring Priorities & Actions Cheat Sheet
Priority | Area | What to Monitor | Why It Matters | If Breached, Do This |
---|---|---|---|---|
Critical | Cluster Health | `status != green`, `unassigned_shards`, `delayed_unassigned_shards` | Red/yellow clusters = major service risk | Investigate shard allocation, check logs, rebalance manually if needed |
Critical | Node Resources | JVM heap > 75%, disk usage > 85%, CPU > 80% | Prevent OOM, GC issues, disk-full crashes | Tune heap size, clean disk, scale nodes, check indexing pressure |
High | Thread Pools | Rejected tasks in `write`, `bulk`, `search` pools | Signals overload or bottlenecks | Slow down ingestion, increase queue size, add nodes |
High | Indexing & Search | High query latency, `throttle_time_in_millis` | Impacts end-user experience | Identify hot shards, enable slow logs, optimize queries and mappings |
Medium | Logs | GC logs, slow logs, audit logs | Helps troubleshoot performance/security | Correlate with incidents, enable long-term log retention, adjust GC tuning |
Low | Node Count | Unexpected node up/down | May indicate cluster instability or scale issues | Check autoscaling events or node health, audit for configuration drift |
Tip: Integrate these actions into your incident response playbooks or runbooks.
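When one of these thresholds is breached, the `_cat` APIs give a quick human-readable snapshot of the areas in the table above before you dig into the full JSON endpoints covered below. A minimal triage sketch, assuming a cluster reachable at `localhost:9200`:

```bash
# Human-readable one-liners for first-response triage.
curl -s 'localhost:9200/_cat/health?v'          # overall status and shard counts
curl -s 'localhost:9200/_cat/allocation?v'      # disk usage and shards per node
curl -s 'localhost:9200/_cat/nodes?v&h=name,heap.percent,cpu,disk.used_percent'
curl -s 'localhost:9200/_cat/thread_pool/write,search?v&h=node_name,name,active,queue,rejected'
```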
Why Monitoring Matters for Search Systems
Search engines are often mission-critical: they power internal dashboards, customer-facing search experiences, log aggregation, and more. SREs need robust observability to:
- Detect node or cluster failures early
- Prevent performance degradation under load
- Optimize query latency and throughput
- Ensure shard distribution and replica health
- Manage storage and JVM memory usage
Key Monitoring Dimensions
1. Cluster Health
Metric | Description | Endpoint |
---|---|---|
`status` | Cluster health status: `green`, `yellow`, or `red` | `_cluster/health` |
`number_of_nodes` | Total nodes in the cluster | `_cluster/health` |
`active_shards` | Number of active shards | `_cluster/health` |
`unassigned_shards` | Number of shards not allocated | `_cluster/health` |
`delayed_unassigned_shards` | Delayed shard assignments (could indicate slow recovery) | `_cluster/health` |
```bash
curl -X GET 'localhost:9200/_cluster/health?pretty'
```
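If you check these fields programmatically, a small `jq` filter over the same endpoint is usually enough, and when shards are unassigned the allocation explain API tells you why. A sketch, assuming `jq` is installed:

```bash
# Pull only the fields worth alerting on from the cluster health response.
curl -s 'localhost:9200/_cluster/health' \
  | jq '{status, number_of_nodes, active_shards, unassigned_shards, delayed_unassigned_shards}'

# When unassigned_shards > 0, ask Elasticsearch why a shard cannot be placed.
# (Returns an error if there is currently nothing unassigned to explain.)
curl -s 'localhost:9200/_cluster/allocation/explain?pretty'
```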
2. Node Metrics
Metric | Description | Endpoint |
---|---|---|
`jvm.mem.heap_used_percent` | JVM heap memory usage percentage | `_nodes/stats/jvm` |
`fs.total.available_in_bytes` | Available disk space | `_nodes/stats/fs` |
`process.cpu.percent` | Node-level CPU utilization | `_nodes/stats/process` |
`thread_pool.*.rejected` | Rejected tasks per thread pool | `_nodes/stats/thread_pool` |
```bash
curl -X GET 'localhost:9200/_nodes/stats?pretty'
```
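The full `_nodes/stats` payload is large, so a `jq` projection of just the fields in the table keeps it readable. A sketch of the kind of per-node summary worth graphing or alerting on:

```bash
# One summary object per node: heap, CPU, free disk, and thread pool rejections.
curl -s 'localhost:9200/_nodes/stats/jvm,fs,process,thread_pool' | jq '
  .nodes[] | {
    name,
    heap_used_percent: .jvm.mem.heap_used_percent,
    cpu_percent:       .process.cpu.percent,
    disk_free_gb:      (.fs.total.available_in_bytes / 1073741824 | floor),
    write_rejected:    .thread_pool.write.rejected,
    search_rejected:   .thread_pool.search.rejected
  }'
```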
3. Indexing & Query Performance
Metric | Description | Endpoint |
---|---|---|
`indexing.index_total` | Number of indexing operations | `_stats/indexing` |
`search.query_total` | Total number of queries | `_stats/search` |
`search.query_time_in_millis` | Total time spent on queries | `_stats/search` |
`indexing.throttle_time_in_millis` | Time spent throttling indexing | `_stats/indexing` |
```bash
curl -X GET 'localhost:9200/_stats/indexing,search'
```
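These counters are cumulative since the last node restart, so dashboards should graph their rate of change. As a quick sanity check you can still derive a lifetime average query latency directly from the response; a sketch:

```bash
# Lifetime average query latency (ms) and total indexing throttle time,
# computed from the cumulative counters in the _stats response.
curl -s 'localhost:9200/_stats/indexing,search' | jq '
  ._all.total as $t | {
    avg_query_latency_ms: (if $t.search.query_total > 0
                           then ($t.search.query_time_in_millis / $t.search.query_total)
                           else 0 end),
    indexing_throttle_ms: $t.indexing.throttle_time_in_millis
  }'
```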
4. Alerts and Thresholds
Set alerts on:
Condition | Threshold |
---|---|
Cluster status | Not equal to green |
JVM heap usage | > 75% |
Disk usage | > 85% |
Thread pool rejections | Sustained increase over baseline |
Use alert grouping to avoid alert fatigue. Correlate with deployments or ingestion surges.
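In practice these thresholds usually live in Prometheus alert rules fed by `elasticsearch_exporter` (covered below), but the logic is simple enough to sketch as a cron-style shell check. The thresholds mirror the table; the `ES_URL` variable is an assumption of this sketch:

```bash
#!/usr/bin/env bash
# Minimal threshold check -- a sketch, not a substitute for a real alerting pipeline.
ES_URL="${ES_URL:-http://localhost:9200}"

status=$(curl -s "$ES_URL/_cluster/health" | jq -r '.status')
[ "$status" != "green" ] && echo "ALERT: cluster status is $status"

# Flag any node over 75% heap or 85% disk usage.
curl -s "$ES_URL/_nodes/stats/jvm,fs" | jq -r '
  .nodes[]
  | . as $n
  | ((1 - $n.fs.total.available_in_bytes / $n.fs.total.total_in_bytes) * 100) as $disk
  | select($n.jvm.mem.heap_used_percent > 75 or $disk > 85)
  | "ALERT: \($n.name) heap=\($n.jvm.mem.heap_used_percent)% disk=\($disk | floor)%"'
```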
5. Logs and Audit Trails
Collect logs via Filebeat, Fluentd, or Data Prepper:
Log Type | Contents |
---|---|
Application Logs | Indexing/search slow logs, deprecation warnings |
GC Logs | JVM garbage collection activity |
Audit Logs | Access control and security events |
Ship them to a centralized location for querying and retention.
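Slow logs are off by default and are enabled per index with dynamic settings. A sketch, using a hypothetical index name and example thresholds you would tune to your own latency expectations:

```bash
# Enable search and indexing slow logs on a hypothetical index "my-index".
curl -s -X PUT 'localhost:9200/my-index/_settings' \
  -H 'Content-Type: application/json' -d '
{
  "index.search.slowlog.threshold.query.warn": "2s",
  "index.search.slowlog.threshold.fetch.warn": "1s",
  "index.indexing.slowlog.threshold.index.warn": "5s"
}'
```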
SRE Best Practices
Use Service-Level Objectives (SLOs)
Define SLOs for:
- Query latency (e.g. 95th percentile < 300ms)
- Error rate thresholds
- Uptime and data freshness
Use Service-Level Indicators (SLIs) to track progress against these objectives and to trigger automated responses when they are at risk.
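Because the `_stats` counters are cumulative, a latency SLI has to be computed from deltas between samples; this is what a Prometheus `rate()` does for you, but the idea can be sketched in bash. The 60-second window below is an arbitrary choice for illustration:

```bash
# Average query latency over a 60-second window, from two samples of _stats/search.
sample() {
  curl -s 'localhost:9200/_stats/search' \
    | jq -r '[._all.total.search.query_time_in_millis, ._all.total.search.query_total] | @tsv'
}
read -r t1 q1 < <(sample)
sleep 60
read -r t2 q2 < <(sample)
if [ "$q2" -gt "$q1" ]; then
  echo "SLI sample: avg query latency $(( (t2 - t1) / (q2 - q1) )) ms over the last 60s"
else
  echo "SLI sample: no queries observed in the window"
fi
```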
Reduce Toil with Dashboards and Automation
Set up Grafana dashboards via Prometheus exporters (e.g., `elasticsearch_exporter`). Automate recurring tasks such as:
- Disk space checks
- Snapshot verification (see the sketch after this list)
- Shard rebalancing (via Curator or ILM policies)
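Snapshot verification, for example, can be reduced to two API calls: one that confirms the repository is functional on every node, and one that checks the state of recent snapshots. The repository name below is a placeholder:

```bash
REPO="nightly-backups"   # hypothetical snapshot repository name

# Verify that the repository is functional on all master and data nodes.
curl -s -X POST "localhost:9200/_snapshot/$REPO/_verify?pretty"

# List the most recent snapshots and their state; anything other than
# SUCCESS (e.g. PARTIAL or FAILED) deserves a page or a ticket.
curl -s "localhost:9200/_snapshot/$REPO/_all" \
  | jq -r '.snapshots[] | "\(.snapshot)\t\(.state)"' | tail -n 5
```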
Integrate with Incident Management
Tie alerts to your incident platform (PagerDuty, Opsgenie, etc.). Include runbooks for resolving common issues like:
- Red clusters
- Shard allocation failures
- Node restarts due to OOM
Common Gotchas
- Missing metrics: Ensure all nodes are exporting metrics and logging is consistent.
- Over-alerting: Calibrate thresholds to avoid noise.
- Cluster sprawl: Too many indices can degrade performance; use rollover and ILM.
- Heap sizing: Don't exceed ~31 GB of heap, or the JVM loses compressed object pointers (compressed oops); a quick check follows below.
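On a running node you can confirm whether compressed oops are actually in effect via the node info API; a quick check:

```bash
# "true" means the heap is small enough to keep compressed object pointers.
curl -s 'localhost:9200/_nodes/jvm?filter_path=nodes.*.jvm.using_compressed_ordinary_object_pointers&pretty'
```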
Tools to Know
- Prometheus + elasticsearch_exporter
- Grafana dashboards
- Elastic Stack Monitoring
- OpenSearch Dashboards Monitoring
- Curator / ILM
Final Thoughts
Effective Elasticsearch monitoring is essential to delivering reliable services. SREs should aim for proactive visibility, actionable alerts, and minimal manual intervention. Monitoring should not be a reactive afterthought but a design goal.
By investing in strong observability foundations, you empower your team to move faster without sacrificing reliability.
Want to go further? Consider implementing canary queries and chaos testing for resilience under pressure.
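A canary query can be as simple as a fixed, known-cost query run on a schedule from outside the cluster, with its end-to-end latency recorded as a separate time series. A hypothetical sketch, where the index name and query are placeholders:

```bash
# Time a known query as an external client would see it.
latency=$(curl -s -o /dev/null -w '%{time_total}' \
  -H 'Content-Type: application/json' \
  'localhost:9200/canary-index/_search' \
  -d '{"query": {"match_all": {}}, "size": 1}')
echo "canary query latency: ${latency}s"
```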