Monitoring Elasticsearch: An SRE Perspective

Introduction

Elasticsearch is a powerful distributed search and analytics engine. But with great power comes great operational complexity. For Site Reliability Engineers (SREs), observability into these systems is non-negotiable. Poor visibility can lead to cascading failures, lost data, and frustrated users.

Note: OpenSearch is a community-driven fork of Elasticsearch 7.10 maintained by AWS. All monitoring concepts discussed here apply equally to OpenSearch unless explicitly stated.

This post explores how to monitor Elasticsearch from an SRE standpoint: focusing on performance, availability, and alerting while reducing toil and improving reliability.

Monitoring Priorities & Actions Cheat Sheet

| Priority | Area | What to Monitor | Why It Matters | If Breached, Do This |
|---|---|---|---|---|
| 🟥 Critical | Cluster Health | status != green, unassigned_shards, delayed_shards | Red/yellow clusters = major service risk | Investigate shard allocation, check logs, rebalance manually if needed |
| 🟥 Critical | Node Resources | JVM heap > 75%, disk usage > 85%, CPU > 80% | Prevent OOM, GC issues, disk-full crashes | Tune heap size, clean disk, scale nodes, check indexing pressure |
| 🟧 High | Thread Pools | Rejected tasks in write, bulk, search pools | Signals overload or bottlenecks | Slow down ingestion, increase queue size, add nodes |
| 🟧 High | Indexing & Search | High query latency, throttle_time_in_millis | Impacts end-user experience | Identify hot shards, enable slow logs, optimize queries and mappings |
| 🟨 Medium | Logs | GC logs, slow logs, audit logs | Helps troubleshoot performance/security | Correlate with incidents, enable long-term log retention, adjust GC tuning |
| 🟩 Low | Node Count | Unexpected node up/down | May indicate cluster instability or scale issues | Check autoscaling events or node health, audit for configuration drift |

✅ Tip: Integrate these actions into your incident response playbooks or runbooks.
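
As a concrete example of the "investigate shard allocation" action in the critical row above, here is a minimal diagnostic sketch (assuming the cluster is reachable on localhost:9200):

```bash
# List shards with their state and, for unassigned ones, the reason
# Elasticsearch recorded (e.g. INDEX_CREATED, NODE_LEFT, ALLOCATION_FAILED).
curl -s 'localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason' \
  | grep -E 'state|UNASSIGNED'

# Ask the cluster why the first unassigned shard it finds cannot be allocated.
curl -s 'localhost:9200/_cluster/allocation/explain?pretty'
```

The allocation explanation usually tells you whether the fix is freeing disk space, bringing a node back, or adjusting allocation settings.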

Why Monitoring Matters for Search Systems

Search engines are often mission-critical: they power internal dashboards, customer-facing search experiences, log aggregation, and more. SREs need robust observability to:

  • Detect node or cluster failures early
  • Prevent performance degradation under load
  • Optimize query latency and throughput
  • Ensure shard distribution and replica health
  • Manage storage and JVM memory usage

Key Monitoring Dimensions

1. Cluster Health

| Metric | Description | Endpoint |
|---|---|---|
| status | Cluster health status: green, yellow, red | _cluster/health |
| number_of_nodes | Total nodes in the cluster | _cluster/health |
| active_shards | Number of active shards | _cluster/health |
| unassigned_shards | Number of shards not allocated | _cluster/health |
| delayed_unassigned_shards | Delayed shard assignments (could indicate slow recovery) | _cluster/health |

```bash
curl -X GET 'localhost:9200/_cluster/health?pretty'
```
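
The same endpoint is easy to turn into an automated check. A minimal sketch, assuming jq is installed and the cluster listens on localhost:9200:

```bash
#!/usr/bin/env bash
# Exit non-zero whenever the cluster is not green, so a cron job or
# monitoring wrapper can alert on the exit code.
status=$(curl -s 'localhost:9200/_cluster/health' | jq -r '.status')
if [ "$status" != "green" ]; then
  echo "cluster status is ${status}" >&2
  exit 1
fi
echo "cluster status is green"
```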

2. Node Metrics

| Metric | Description | Endpoint |
|---|---|---|
| jvm.mem.heap_used_percent | JVM heap memory usage percentage | _nodes/stats/jvm |
| fs.total.available_in_bytes | Available disk space | _nodes/stats/fs |
| process.cpu.percent | Node-level CPU utilization | _nodes/stats/process |
| thread_pool.*.rejected | Rejected tasks per thread pool | _nodes/stats/thread_pool |

```bash
curl -X GET 'localhost:9200/_nodes/stats?pretty'
```
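
The _cat APIs expose the same data in a human-friendly form, which is handy during an incident. A quick sketch (column names belong to the _cat API and can vary slightly between versions):

```bash
# Per-node heap, CPU, and load at a glance.
curl -s 'localhost:9200/_cat/nodes?v&h=name,heap.percent,cpu,load_1m'

# Per-node disk usage, which is what the allocation watermarks act on.
curl -s 'localhost:9200/_cat/allocation?v'
```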

3. Indexing & Query Performance

| Metric | Description | Endpoint |
|---|---|---|
| indexing.index_total | Number of indexing operations | _stats/indexing |
| search.query_total | Total number of queries | _stats/search |
| search.query_time_in_millis | Total time spent on queries | _stats/search |
| indexing.throttle_time_in_millis | Time spent throttling indexing | _stats/indexing |

```bash
curl -X GET 'localhost:9200/_stats/indexing,search'
```
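
These counters are cumulative since node start, so derive rates or averages from them rather than alerting on raw values. A rough sketch, assuming jq and that at least one query has run (a real pipeline would compute a rate between two scrapes):

```bash
# Average query latency in ms = total query time / total query count.
curl -s 'localhost:9200/_stats/search' \
  | jq '._all.total.search | .query_time_in_millis / .query_total'
```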

4. Alerts and Thresholds

Set alerts on:

| Condition | Threshold |
|---|---|
| Cluster status | Not equal to green |
| JVM heap usage | > 75% |
| Disk usage | > 85% |
| Thread pool rejections | Sustained increase over baseline |

Use alert grouping to avoid alert fatigue. Correlate with deployments or ingestion surges.
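
For the thread pool condition, the _cat/thread_pool API exposes the rejection counters directly; a sketch of what an alerting check might poll:

```bash
# Rejected task counts per node for the write and search pools. Rejections are
# cumulative, so alert on growth over time, not on the absolute number.
curl -s 'localhost:9200/_cat/thread_pool/write,search?v&h=node_name,name,active,queue,rejected'
```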

5. Logs and Audit Trails

Collect logs via Filebeat, Fluentd, or Data Prepper:

| Log Type | Contents |
|---|---|
| Application Logs | Indexing/search slow logs, deprecation warnings |
| GC Logs | JVM garbage collection activity |
| Audit Logs | Access control and security events |

Ship them to a centralized location for querying and retention.
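
Slow logs are disabled by default and configured per index. A minimal sketch of enabling them (my-index is a placeholder, and the thresholds should track your own latency targets):

```bash
# Dynamically enable search slow logs on an example index.
curl -X PUT 'localhost:9200/my-index/_settings' \
  -H 'Content-Type: application/json' \
  -d '{
        "index.search.slowlog.threshold.query.warn": "1s",
        "index.search.slowlog.threshold.query.info": "500ms",
        "index.search.slowlog.threshold.fetch.warn": "500ms"
      }'
```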

SRE Best Practices

Use Service-Level Objectives (SLOs)

Define SLOs for:

  • Query latency (e.g. 95th percentile < 300ms)
  • Error rate thresholds
  • Uptime and data freshness

Use SLIs (Service-Level Indicators) to track and trigger automated responses.
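
Server-side counters only give averages, so percentile SLIs such as p95 latency are usually measured from the client side. One lightweight option is a periodic canary query timed with curl (my-index and the query are placeholders); the same probe doubles as the canary query mentioned at the end of this post:

```bash
# Time a representative query end to end, in seconds, and feed the value
# into whatever system records your SLIs.
latency=$(curl -o /dev/null -s -w '%{time_total}' \
  'localhost:9200/my-index/_search?q=*&size=1')
echo "canary query latency: ${latency}s"
```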

Reduce Toil with Dashboards and Automation

Set up Grafana dashboards via Prometheus exporters (e.g., elasticsearch_exporter). Automate recurring tasks such as:

  • Disk space checks
  • Snapshot verification (see the sketch after this list)
  • Shard rebalancing (via curator or ILM policies)
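
For snapshot verification, a minimal sketch, assuming a snapshot repository is already registered under the placeholder name my_backup_repo:

```bash
# List snapshots in the repository with their status (SUCCESS, PARTIAL, FAILED).
curl -s 'localhost:9200/_cat/snapshots/my_backup_repo?v&h=id,status,end_epoch,duration'

# Drill into one snapshot in detail (the snapshot name is a placeholder).
curl -s 'localhost:9200/_snapshot/my_backup_repo/nightly-snapshot-1?pretty'
```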

Integrate with Incident Management

Tie alerts to your incident platform (PagerDuty, Opsgenie, etc.). Include runbooks for resolving common issues like:

  • Red clusters
  • Shard allocation failures
  • Node restarts due to OOM

Common Gotchas

  • Missing metrics: Ensure all nodes are exporting metrics and logging is consistent.
  • Over-alerting: Calibrate thresholds to avoid noise.
  • Cluster sprawl: Too many small indices degrade performance; use rollover and ILM policies to keep index counts in check.
  • Heap sizing: Don't exceed ~31 GB of JVM heap, or you lose compressed object pointers (compressed oops) and waste memory.

Tools to Know

  • Prometheus elasticsearch_exporter and Grafana for metrics and dashboards
  • Filebeat, Fluentd, or Data Prepper for log shipping
  • Curator and ILM policies for index lifecycle management
  • PagerDuty or Opsgenie for alert routing and incident response

Final Thoughts

Effective Elasticsearch monitoring is essential to delivering reliable services. SREs should aim for proactive visibility, actionable alerts, and minimal manual intervention. Monitoring should not be a reactive afterthought but a design goal.

By investing in strong observability foundations, you empower your team to move faster without sacrificing reliability.

Want to go further? Consider implementing canary queries and chaos testing for resilience under pressure.

